Wouldn't it be cool if you could grab content directly from a Web site and use that content in an application of your own? This idea is nothing new--it's just good old copy and paste. But what if the content you want is dynamic, and your application needs to reflect this ever-changing content? Manual copying and pasting won't suffice--you need an easy-to-use, programmatic way of obtaining this information.
Before .NET, you had to use nonstandard programming techniques to scrape other Web sites automatically. This process, known as page scraping, usually started with loading the remote Web page into your application and writing custom parsing procedures to extract the necessary data. If the target Web page changed format, you had to rewrite your parsing logic. With .NET, you can resolve this dilemma with Visual Basic .NET or C#, Web Services, Windows services, and a timer control. This article touches on techniques you can use within .NET to accomplish this.
Here's a brief overview of how I used .NET technology to create a Web page scraper. This relatively simple scraper goes out to a Web site, scrapes a certain portion of text from the page, and saves it to a file in a specified directory. Because the data on this site changes every 5 minutes or so, I use a timer control inside a Windows service; the service acts as a wrapper that calls the Web Service's methods every 5 minutes. Scraping the Web site this frequently keeps the saved data current.
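The timer-inside-a-service arrangement described above can be sketched roughly as follows. This is a minimal illustration, not the article's actual code: the class name ScraperService, the constant IntervalMs, and the placeholder method ScrapeAndSave are all hypothetical, and the sketch assumes a project that references System.ServiceProcess and is installed as a Windows service.

```vb
Imports System
Imports System.ServiceProcess
Imports System.Timers

' Hypothetical Windows service that fires a scrape every 5 minutes.
Public Class ScraperService
    Inherits ServiceBase

    ' Five minutes, expressed in milliseconds.
    Public Const IntervalMs As Double = 5 * 60 * 1000

    ' Fully qualified to avoid clashing with other Timer classes.
    Private ReadOnly scrapeTimer As New System.Timers.Timer(IntervalMs)

    Public Sub New()
        ServiceName = "ScraperService" ' assumed name
    End Sub

    Protected Overrides Sub OnStart(ByVal args() As String)
        ' Start raising Elapsed once per interval.
        AddHandler scrapeTimer.Elapsed, AddressOf OnElapsed
        scrapeTimer.AutoReset = True
        scrapeTimer.Start()
    End Sub

    Protected Overrides Sub OnStop()
        scrapeTimer.Stop()
        RemoveHandler scrapeTimer.Elapsed, AddressOf OnElapsed
    End Sub

    Private Sub OnElapsed(ByVal sender As Object, ByVal e As ElapsedEventArgs)
        ScrapeAndSave()
    End Sub

    ' Placeholder: call the Web Service proxy and write the file here.
    Private Sub ScrapeAndSave()
    End Sub

    Public Shared Sub Main()
        ServiceBase.Run(New ScraperService())
    End Sub
End Class
```

The first scrape in this sketch happens one full interval after the service starts; calling ScrapeAndSave once from OnStart would be one way to get an immediate first result.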
The following is a snippet of the Web Services Description Language (WSDL) file that was used to define the Web Service.
<output>
  <tm:text>
    <tm:match name="Start" type="" pattern="##(.*)" ignoreCase="true" />
  </tm:text>
</output>
Notice the value of the pattern XML attribute, "##(.*)". Built right into the WSDL file is the capability to define string patterns that will be used to scrape Web pages. This example looks for a string within the scraped HTML page that looks like ##[ANYTHING IN HERE]. The pattern is a regular expression rather than a file-search wildcard: the "." matches any single character, the "*" repeats it zero or more times, and the parentheses capture whatever follows "##" up to the end of the line. You could, for example, use a pattern of "<A HREF" to find the anchors in the HTML Web page. This built-in power is what frees you from writing custom parsing code for each Web page you scrape. Additionally, you can react much more quickly when the structure of the Web pages you are scraping changes, because you only need to update the pattern.
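To see what the pattern captures, here is a small stand-alone sketch that applies the same expression with the .NET Regex class. It assumes the WSDL pattern behaves like an ordinary .NET regular expression; the module name PatternDemo, the function ExtractAfterMarker, and the sample page text are hypothetical and are not part of the article's code.

```vb
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module PatternDemo
    ' Returns the text captured by "##(.*)", or Nothing if there is no match.
    Function ExtractAfterMarker(ByVal html As String) As String
        Dim m As Match = Regex.Match(html, "##(.*)", RegexOptions.IgnoreCase)
        If m.Success Then
            ' Group 1 is the part inside the parentheses.
            Return m.Groups(1).Value
        End If
        Return Nothing
    End Function

    Sub Main()
        ' Hypothetical page content; the real page is not shown here.
        Dim page As String = "<html><body>" & vbLf & _
                             "##Current value: 42" & vbLf & _
                             "</body></html>"
        Console.WriteLine(ExtractAfterMarker(page)) ' prints "Current value: 42"
    End Sub
End Module
```

Because "." does not match a line feed by default, the capture stops at the end of the line that contains the "##" marker.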
The following code shows how you can call this Web Service from Visual Basic .NET. It retrieves the results of the page scraping that the Web Service performs.
' textlookup is how we interface with the Web Service.
Dim textlookup As New localhost.RetrievText()
' match holds what is returned from the Web Service.
Dim match As localhost.GetTextDetailsMatches
Dim strText As String

' GetTextDetails is the Web Service method that scrapes the Web page.
match = textlookup.GetTextDetails()
' Assumption: the generated proxy exposes the match named "Start"
' in the WSDL as a field of the same name.
strText = match.Start

Try
    ' Store the result of the Web Service call in a text file.
    FileOpen(1, "C:\temp\scraperText.txt", OpenMode.Output)
    Write(1, strText)
    FileClose(1) ' Close file.
Catch theException As Exception
    ' Record any file I/O failure in the event log.
    Dim logthis As New WriteToEventLog.WriteToEventLog()
    logthis.Log(Application.CompanyName, theException.ToString(), Application.CompanyName)
End Try
Web page scraping can be a useful tool when a well-defined Web Service for retrieving the information is not available. With the new features in .NET, it becomes an even more attractive option because a scraper is so easy to build.
The code for this Perspective is available at http://interknowlogy.com/knowledge/articles.aspx?aid=1065