Screen Scraping

Getting the Content You Want

asp:feature

Topics: Screen Scraping

Languages: Visual Basic, C#

 

 

By Alex Lowe

 

Have you ever wondered how to grab headlines from a news site on the Internet and beautify them for display on your own personal or corporate site? Maybe you'd like to take your company's current stock price from a finance site and place it on an intranet somewhere. Well, a screen scrape could make those processes easy. [There's a discussion of the legality of screen scraping at the end of this article.-Ed.]

 

One challenge ASP Classic developers faced was retrieving information from a page or script on a remote Web server. ASP Classic had no built-in way to do this, so developers either wrote a custom component or, more often, purchased a third-party one. Microsoft's ASP.NET development team recognized the lack of built-in support for screen scraping and built screen-scraping classes into the .NET Framework.

 

As you will see in this article, ASP.NET developers can scrape a screen by using a couple of classes from the .NET Framework. This article examines three techniques: screen-scraping standard Web pages and scripts; screen-scraping forms using GET and POST requests; and screen-scraping secure (basic, digest, NTLM, and Kerberos) Web pages and scripts. GET and POST are request methods defined in the HTTP/1.1 protocol specification, located at http://www.w3c.org. All three screen-scraping techniques use the WebClient class (a member of the System.Net namespace). Essentially, WebClient acts as a wrapper for the WebRequest and WebResponse classes. The WebClient class contains methods and properties that provide much of the same functionality WebRequest and WebResponse provide, but without all the messy details.
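To give a sense of the detail WebClient hides, here is a minimal sketch of a page download written directly against WebRequest and WebResponse. The module name, the FetchPage helper, and the choice of ASCII decoding are illustrative assumptions, not code from this article's download.

```vb
Imports System
Imports System.IO
Imports System.Net
Imports System.Text

Module RawRequestSketch
    ' Hypothetical helper: fetches a page without using WebClient.
    Function FetchPage(ByVal url As String) As String
        Dim request As WebRequest = WebRequest.Create(url)
        ' GetResponse sends the request and blocks until the server replies.
        Using response As WebResponse = request.GetResponse()
            ' The response body arrives as a stream that must be read and decoded by hand.
            Using reader As New StreamReader(response.GetResponseStream(), Encoding.ASCII)
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function

    Sub Main(ByVal args As String())
        ' Pass a full URL, including http://, as the first argument.
        If args.Length > 0 Then Console.WriteLine(FetchPage(args(0)))
    End Sub
End Module
```

WebClient folds the request, the response, and the stream handling into a single method call.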

 

Screen Scraping without a Form

The first problem to address is how to screen-scrape a standard Web page or script that does not expect any values or parameters. The listing in FIGURE 1 contains the code to a screen-scraping Web application, webclient.aspx.

 

<%@ Page Language="VB" %>
<%@ Import Namespace="System.Net" %>
<%@ Import Namespace="System.IO" %>

URL To Scrape:
<asp:TextBox id="URLToScrape"
 Width="500" runat="Server" />

<asp:Button
 Text="Scrape The Screen" OnClick="GetScreen_Click"
 runat="server" />

Result:
<asp:Literal id="ReturnData" runat="server" />


FIGURE 1: Contents of webclient.aspx.

 

Let's break this code down and analyze each line. First, you need to create an instance of the WebClient class and an instance of at least one encoding class. This example uses the ASCIIEncoding class, but you can substitute another encoding class from the System.Text namespace (UTF8Encoding, for example) if the page contains characters outside the ASCII range:

 

Dim encASCIIEncoding As ASCIIEncoding = New ASCIIEncoding()

Dim wcWebClient As WebClient = New WebClient()

 

Now you are ready to call the DownloadData method of the WebClient class. The DownloadData method expects an address parameter (the URL you want to scrape) and returns a byte array containing the data retrieved from that address:

 

wcWebClient.DownloadData(strURL)

 

The byte array the DownloadData method returns must be converted to a string before you can assign it to the Text property of the Literal control ReturnData. Each encoding class, ASCIIEncoding in this case, has a GetString method that expects a byte array parameter and converts that parameter into a string. So you use the GetString method to turn your byte array into a string and then assign it to the Text property of the Literal control ReturnData:

 

ReturnData.Text = encASCIIEncoding.GetString( _

 wcWebClient.DownloadData(URLToScrape.Text))

 

Voilà! That's all it takes to create a simple screen-scraping Web application. As you can see, screen scraping is fairly simple and, amazingly, only takes three lines of code (not counting the Try...Catch error handling). To test the screen-scraping Web application, scrape aspalliance.com/aldotnet/scrapedscreen.aspx:

 


FIGURE 2: A screen-scraping test page on AspAlliance.com.
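The three statements discussed above can also be assembled into a standalone console program, which is a quick way to try the technique outside a Web page. This is only a sketch: the module name, the ScrapePage helper, and the error message are assumptions made for illustration.

```vb
Imports System
Imports System.Net
Imports System.Text

Module ScrapeSketch
    ' Hypothetical helper: downloads the page at url and decodes the bytes as ASCII text.
    Function ScrapePage(ByVal url As String) As String
        Dim encASCIIEncoding As New ASCIIEncoding()
        Using wcWebClient As New WebClient()
            Return encASCIIEncoding.GetString(wcWebClient.DownloadData(url))
        End Using
    End Function

    Sub Main(ByVal args As String())
        If args.Length = 0 Then
            Console.WriteLine("Usage: ScrapeSketch <url>")
            Return
        End If
        Try
            Console.WriteLine(ScrapePage(args(0)))
        Catch ex As WebException
            ' Network and HTTP failures surface as WebException.
            Console.WriteLine("Request failed: " & ex.Message)
        End Try
    End Sub
End Module
```

Pass a full address, including the http:// scheme, so that WebClient treats it as a Web request.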

 

Screen Scraping with a Form

Let's say you want to screen-scrape the search results from a remote Web site. The page or script that performs the actual search is going to expect search parameters of some kind. What if you allow your users to enter search parameters and have your screen scraper submit those values to the search page or script? There are some nuances to submitting form values programmatically inside your screen-scraping Web application.

 

You can use two methods to submit data to a form: GET or POST. First, let's examine how to make a GET request to aspalliance.com/aldotnet/gettest.aspx. gettest.aspx is a simple page that prints all the query string parameters it receives.

 

Just as in the first example (FIGURE 1), you need to create an instance of the WebClient class and an instance of at least one encoding class. Once again, you will use the ASCIIEncoding class for this example, but another encoding class from the System.Text namespace would also work. Once you have created an instance of the WebClient class, you need to create an instance of the NameValueCollection class, which is a member of the System.Collections.Specialized namespace. A NameValueCollection does exactly what it sounds like it does: it stores values and their names. The name, sometimes called a key, is the index used to set or read its associated value. In this example, the NameValueCollection object nvcValuesToPost will store the names and values of the form information you wish to send in your GET request to gettest.aspx:

 

   Dim nvcValuesToPost As New NameValueCollection

   nvcValuesToPost.Add("Test1", Value1.Text)

   nvcValuesToPost.Add("Test2", Value2.Text)

   nvcValuesToPost.Add("Test3", Value3.Text)

 

Once you have all your name and value pairs in the NameValueCollection object, you need to assign them to the QueryString property of wcWebClient:

 

wcWebClient.QueryString = nvcValuesToPost

 

Just as you did in the first example, you must call the DownloadData method of the WebClient class. You will convert the byte array DownloadData returns into a string and then assign the results to the Text property of your Literal control, ReturnData:

 

ReturnData.Text = encASCIIEncoding.GetString( _

 wcWebClient.DownloadData(URLToScrape.Text))

 

From there, you can download and convert the output of the Web Form just as you downloaded and converted the response in the first example.

 


FIGURE 3: Using the GET request technique.
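For reference, the GET steps can be combined into one short console program. Treat it as a sketch under assumptions: the literal values stand in for Value1.Text, Value2.Text, and Value3.Text, the http:// scheme has been added to the test address, and the module name is invented for illustration.

```vb
Imports System
Imports System.Collections.Specialized
Imports System.Net
Imports System.Text

Module GetRequestSketch
    Sub Main()
        ' Placeholder values; in the Web Form these come from the three text boxes.
        Dim nvcValuesToPost As New NameValueCollection()
        nvcValuesToPost.Add("Test1", "one")
        nvcValuesToPost.Add("Test2", "two")
        nvcValuesToPost.Add("Test3", "three")

        Dim encASCIIEncoding As New ASCIIEncoding()
        Using wcWebClient As New WebClient()
            ' The name/value pairs are appended to the address as a query string.
            wcWebClient.QueryString = nvcValuesToPost
            Try
                ' Assumed address: the article's test page with a scheme added.
                Dim bytPage As Byte() = wcWebClient.DownloadData( _
                    "http://aspalliance.com/aldotnet/gettest.aspx")
                Console.WriteLine(encASCIIEncoding.GetString(bytPage))
            Catch ex As WebException
                Console.WriteLine("Request failed: " & ex.Message)
            End Try
        End Using
    End Sub
End Module
```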

 

The code required to make a POST request is very similar to the code you used earlier to make a GET request. In the GET request example, you saw that form values were sent as part of the URL, in the query string. A POST request works a bit differently: the form values are sent in the body of the HTTP request rather than in the URL. You will be posting to aspalliance.com/aldotnet/posttest.aspx. Because much of the code required to make a POST request is the same as that to make a GET request, let's focus on what is unique to the POST request.

 

One important difference between GET and POST is that you are no longer using the DownloadData method. In order to POST values to posttest.aspx, you need to use the UploadValues method instead. The UploadValues method expects three parameters: the base URL, the method you'll use (POST in this case), and a NameValueCollection containing the names and values to be posted:

 

wcWebClient.UploadValues(URLToScrape.Text, _

 "POST", nvcValuesToPost)

 

UploadValues returns the server's response as a byte array, so there is no separate download step. You convert that byte array to a string just as you converted the response from the GET request.
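A compact sketch of the POST variant follows. As before, the literal values, the added http:// scheme, and the module name are assumptions for illustration.

```vb
Imports System
Imports System.Collections.Specialized
Imports System.Net
Imports System.Text

Module PostRequestSketch
    Sub Main()
        ' Placeholder form values.
        Dim nvcValuesToPost As New NameValueCollection()
        nvcValuesToPost.Add("Test1", "one")
        nvcValuesToPost.Add("Test2", "two")

        Dim encASCIIEncoding As New ASCIIEncoding()
        Using wcWebClient As New WebClient()
            Try
                ' UploadValues sends the pairs as a URL-encoded form body
                ' and returns the server's response as a byte array.
                Dim bytResponse As Byte() = wcWebClient.UploadValues( _
                    "http://aspalliance.com/aldotnet/posttest.aspx", _
                    "POST", nvcValuesToPost)
                Console.WriteLine(encASCIIEncoding.GetString(bytResponse))
            Catch ex As WebException
                Console.WriteLine("Request failed: " & ex.Message)
            End Try
        End Using
    End Sub
End Module
```

Note that the QueryString property is not used here; the collection is passed to UploadValues directly.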

 

Screen Scraping with Authentication

What if the page you want to screen-scrape is only accessible using some form of authentication (Basic, NTLM, Kerberos, etc.)? Let's outline what changes you must make to your original GET request to screen-scrape a Web page or script that requires authentication. A screen-scraping example that uses authentication is shown in FIGURE 4.

 

<%@ Page Language="VB" %>
<%@ Import Namespace="System.Net" %>
<%@ Import Namespace="System.IO" %>

<asp:Button
 OnClick="GetScreen_Click" runat="server" />

Result:

FIGURE 4: The webclientsecure.aspx page.

 

The code used in the authentication example (FIGURE 4) is virtually identical to the code from the first example (FIGURE 1). The authentication example, however, contains two additional lines of code that set the username and password credentials. The first line creates an instance of the NetworkCredential class (a member of the System.Net namespace) and passes the username and password into the NetworkCredential class constructor:

 

Dim ncCred As NetworkCredential = _
 New NetworkCredential("username", "password")

 

The second line of code sets the Credentials property of wcWebClient to ncCred:

 

wcWebClient.Credentials = ncCred

 

From there, you can download and convert the output of the Web Form just as you downloaded and converted the response in the previous examples.
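Here is how the two credential lines might sit in a complete console program. The address, the username, and the password are hypothetical placeholders, and the module name is invented for illustration.

```vb
Imports System
Imports System.Net
Imports System.Text

Module SecureScrapeSketch
    Sub Main()
        ' Hypothetical protected address; substitute a page you are allowed to scrape.
        Dim strURL As String = "http://example.com/secure/page.aspx"
        ' Placeholder credentials.
        Dim ncCred As New NetworkCredential("username", "password")

        Dim encASCIIEncoding As New ASCIIEncoding()
        Using wcWebClient As New WebClient()
            ' The credentials are offered when the server challenges the request.
            wcWebClient.Credentials = ncCred
            Try
                Console.WriteLine(encASCIIEncoding.GetString( _
                    wcWebClient.DownloadData(strURL)))
            Catch ex As WebException
                ' A rejected login also surfaces as a WebException.
                Console.WriteLine("Request failed: " & ex.Message)
            End Try
        End Using
    End Sub
End Module
```

For Windows-based schemes such as NTLM, NetworkCredential also offers a constructor that takes a domain as a third argument.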

 

Is Screen Scraping Legal?

Before you screen-scrape all your favorite sites on the Internet, you should consider the legal ramifications of scraping someone else's content. In a world of subscription-based content, it should be obvious there are legal and illegal forms of screen scraping. Although I'm not a lawyer, it's safe to say that a person or company should not scrape someone else's content without the owner's permission, unless that permission is already granted clearly in some kind of legally binding document. The examples in this article screen-scrape pages I have created for my column at AspAlliance.com, so you have permission to scrape them.

 

 

The files referenced in this article are available for download.

 

Alex Lowe is a technical director for SequoiaNET.com, an end-to-end solution provider of network infrastructure services and Web-based application development. Alex also founded the Grand Rapids, MI ASP.NET User Group and writes an ASP.NET-focused column on http://AspAlliance.com. Readers may contact Alex by sending e-mail to mailto:[email protected].

 

 

 

 
