Progressive Perl for Windows: Fetching Web Data with Win32::Internet

Downloads
23937.zip

Accessing Web servers on the Internet has become more than just an interesting novelty—it's become a necessity. It seems as though everyone has a need to fetch data from the Web for one reason or another. Typically, you just start a Web browser and enter a URL. But the moment you need to script the process of accessing Web data, matters become a little bit more complicated.

The wonderful world of Perl offers all sorts of ways to use HTTP to download data, ranging from manually opening and interacting with a socket to using the popular Library for WWW access in Perl (LWP) modules. However, sometimes even this abundance of functionality isn't quite enough, which is where Perl's Win32::Internet extension comes in.

The WinInet Library
All Windows versions include the WinInet library, a collection of functions that let you connect to Internet servers. WinInet provides access to HTTP, HTTP over Secure Sockets Layer (HTTPS), FTP, and even Gopher. Many Windows applications, such as Microsoft Internet Explorer (IE), use the library to connect to Internet servers.

WinInet is invaluable for applications and scripting languages such as Perl. Yes, Perl! The Perl Win32::Internet extension provides an interface to the library, thus letting Perl scripts do some pretty cool things that even the slickest Perl modules can't quite do. Here are some reasons to use Win32::Internet.

SSL. WinInet provides pretty seamless Secure Sockets Layer (SSL) functionality.

Authentication. If you supply a user ID and password, WinInet manages authentication automatically and sends your credentials to the Web server.

NTLM and automatic credential use. No Perl-based Web authentication packages support Microsoft Windows NT LAN Manager (NTLM) authentication. If the server specifies the NTLM authentication package, WinInet can automatically submit your Windows logon credentials so that you don't need to manually enter them. This technique typically works only with Web sites that participate in the NT domain that you logged on to (such as intranet Web sites).

Cookies. The WinInet library manages cookies for your application. When a client downloads Web pages, cookies are submitted to the server automatically. Likewise, when a Web server sets cookies, the WinInet library manages them. The greatest thing about this feature is that the two sets of cookies are the same cookies that IE uses.

Protocols. As I mentioned earlier, the Win32::Internet extension provides access to the HTTP, HTTPS, FTP, and Gopher protocols.

Proxies. The Win32::Internet extension automatically uses the proxy settings found in the Control Panel Internet Options applet. These settings are the same settings that IE uses.

Redirects. If the Web server issues a client or proxy redirect, the WinInet library handles it automatically and connects to the new URL or proxy server.

Cache. The WinInet library automatically places downloaded Web pages in the Internet cache and uses these cached pages when possible. Of course, the library respects all cache-control directives such as no-cache (which prevents content caching) and freshness expiration times.

As you can see, using the Perl Win32::Internet extension offers quite a few benefits. As I mentioned earlier, the extension works on all versions of Windows (Windows .NET Server, Windows XP, Windows 2000, NT, Windows Me, Windows 9x) but only on Windows platforms.

Using Win32::Internet

You can perform simple or complex tasks with the Win32::Internet extension. One of the simplest tasks is calling the extension's FetchURL() function. You pass the function a URL, and it returns the associated Web page.

First, you create a Win32::Internet object:

use Win32::Internet;
$INET = new Win32::Internet();

Then, you can call the FetchURL() function. Just pass in a URL and receive the requested data, as GetUrl.pl in Listing 1 shows.
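For readers who don't have the listing at hand, a script along the following lines would do the same job. This is a minimal sketch, not the listing itself: the fetch_page name and the error handling are assumptions made for illustration, and the extension is loaded at run time so that the file still compiles on a machine that lacks it.

```perl
use strict;
use warnings;

# Hypothetical helper: return the page at $url, or undef on any failure.
sub fetch_page {
    my ($url) = @_;

    # Load the extension at run time; it is available only on Windows.
    eval { require Win32::Internet; 1 } or return undef;

    my $inet = Win32::Internet->new() or return undef;
    return $inet->FetchURL($url);
}

# Run as: perl GetUrl.pl http://www.example.com/
if (@ARGV) {
    my $page = fetch_page($ARGV[0]);
    if (defined $page) {
        print STDOUT $page;
    }
    else {
        warn "Could not fetch $ARGV[0]\n";
    }
}
```

Keeping the fetch in its own subroutine makes it easy to reuse from a larger script; the command-line part at the bottom only runs when a URL is supplied.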

You can run GetUrl.pl by just passing in a URL, as in

perl GetUrl.pl http://www.amazon.com/

The Web page is then displayed on screen. Or you can use the command

perl GetUrl.pl http://www.amazon.com/ > c:\AmazonHomePage.htm

to redirect the output to a file.

If you need to pass in a username and password, you can simply add them to the URL before the host name and Web page, as in http://username:password@hostname/webpage. So, for example, if you have a username of me and a password of 1234 and you want to access the secure Web page mywebpage.htm on host www.mydomain.com, you would use the command

perl GetUrl.pl http://me:1234@www.mydomain.com/mywebpage.htm

This mangling of the URL with user credentials works not only in Perl but also in IE and pretty much any other Windows Internet-capable program.

A More Complex Request
As I previously pointed out, you can use the WinInet library to perform more complex tasks. Let's say that you need to not only download a Web page but also examine all the HTTP protocol headers that the Web server sends as part of that page. You can use the Win32::Internet extension to easily do both.

First, you create a Win32::Internet object. Next, you create an HTTP session, a connection that uses HTTP as opposed to FTP or Gopher. Then, you use the HTTP session to open a request object, which represents the interaction that the script will have with the Web server. After the script sends a request to the server and the server returns a response, you can interact with the request object to access the resulting Web page data, the protocol headers, any error messages and result codes, and other information. GetUrlEx.pl, which you can find in the Code Library on the Windows Scripting Solutions Web site (http://www.winscriptingsolutions.com), prints out the Web page and the server's protocol headers only.

To call GetUrlEx.pl and have it display the protocol headers and the http://www.google.com Web page, you type a command such as

perl GetUrlEx.pl -d http://www.google.com

To redirect GetUrlEx.pl's output to a file, you can use the command

perl GetUrlEx.pl -d http://www.google.com > c:\temp\Google.htm

GetUrlEx.pl still displays the protocol headers on the screen but stores the Web page data in the specified file.

GetUrlEx.pl defines two constants (INTERNET_FLAG_SECURE and INTERNET_FLAG_KEEP_CONNECTION) that not all versions of the Win32::Internet extension export. INTERNET_FLAG_SECURE forces the WinInet library to run the HTTP connection over SSL (which is the equivalent of HTTPS). INTERNET_FLAG_KEEP_CONNECTION (aka HTTP/1.1 socket keep alive) tells the WinInet library to let the HTTP socket remain open for the duration of the script's interaction with it.

Like GetUrl.pl, GetUrlEx.pl first creates a Win32::Internet object. Next, GetUrlEx.pl translates the requested URL into its canonical form (i.e., the script converts special characters such as the space character into the standard Internet URL format). In addition, the script calls the CrackURL() method to separate the URL into individual components. Cracking the URL is necessary because WinInet requires the individual components when interacting with the Web server. (WinInet doesn't require the components to fulfill a simple request for a Web page download, as you saw in GetUrl.pl.) The CrackURL() method returns an array of components, including any username and password that the URL might contain. The script assigns the returned array to a hash slice, which populates the %Url hash with one named entry per component.
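The hash-slice assignment is easier to see in isolation. The sketch below doesn't call the extension at all: crack_url_sketch is a hypothetical, simplified stand-in that splits a URL with regular expressions, and the component names and their order (scheme, host, port, username, password, path, extra) are assumptions chosen for illustration, so check them against the extension's documentation before relying on them.

```perl
use strict;
use warnings;

# Component names, in the order this sketch assumes the list comes back.
my @fields = qw(scheme host port username password path extra);

# Hypothetical stand-in for the extension's URL cracking.
sub crack_url_sketch {
    my ($url) = @_;

    # scheme :// authority, then the path, then any query or fragment.
    my ($scheme, $authority, $path, $extra) =
        $url =~ m{^(\w+)://([^/?#]*)([^?#]*)(.*)} or return;

    # The authority is [username[:password] at-sign] host [:port].
    my ($userinfo, $hostport) = $authority =~ /^(?:(.*)\@)?(.*)/;
    my ($user, $pass) = defined $userinfo ? split(/:/, $userinfo, 2) : ();
    my ($host, $port) = split /:/, $hostport, 2;

    # Fall back to the scheme's default port and to the root path.
    $port ||= lc($scheme) eq 'https' ? 443 : 80;
    $path = '/' if $path eq '';

    return ($scheme, $host, $port, $user, $pass, $path, $extra);
}

# Assigning a list to a hash slice fills in every key in one statement.
my %Url;
@Url{@fields} = crack_url_sketch('http://me:1234@www.mydomain.com/mywebpage.htm');

print "$_ = ", (defined $Url{$_} ? $Url{$_} : ''), "\n" for @fields;
```

After the assignment, the script can refer to $Url{host}, $Url{username}, and so on by name instead of by array position.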

GetUrlEx.pl does its real work in the code that Listing 2 shows. First, the script calls the HTTP() method to create an HTTP session. The method populates its first parameter (the $Http variable) with an HTTP session object. The other parameters passed into the HTTP() method are the host, port, username, and password specified in the URL.

At callout A in Listing 2, the script sets the $Flags variable to the INTERNET_FLAG_KEEP_CONNECTION constant so that NTLM, MSN, and some other authentication packages can work. If the URL specified the HTTPS protocol, the script also bitwise ORs the INTERNET_FLAG_SECURE constant into the $Flags variable to specify SSL use.
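The flag handling amounts to a couple of lines. In this sketch, the two values are the ones the Windows SDK header wininet.h assigns to these constants; the request_flags helper is a hypothetical name used only for illustration.

```perl
use strict;
use warnings;

# Values from wininet.h, defined locally because not every version of
# the extension exports these two constants.
use constant INTERNET_FLAG_SECURE          => 0x00800000;
use constant INTERNET_FLAG_KEEP_CONNECTION => 0x00400000;

# Hypothetical helper: build the request flags for a given URL scheme.
sub request_flags {
    my ($scheme) = @_;

    # Always keep the connection open so multi-step authentication works.
    my $flags = INTERNET_FLAG_KEEP_CONNECTION;

    # Bitwise OR in the SSL flag for HTTPS URLs.
    $flags |= INTERNET_FLAG_SECURE if lc($scheme) eq 'https';

    return $flags;
}

printf "http:  0x%08X\n", request_flags('http');
printf "https: 0x%08X\n", request_flags('https');
```

Because each constant occupies its own bit, ORing them together keeps both settings in a single number that can be passed along with the request.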

The script calls the HTTP session's OpenRequest() method at callout B to obtain a request object and specify the actual Web page and flags. After obtaining the request object, the script adds any protocol headers that the user specified when invoking the script. At the end of callout B, the script calls the SendRequest() method, which actually submits the request to the Web server and obtains the resulting data.
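The session and request steps might be strung together roughly as follows. This is a sketch under stated assumptions, not the code from Listing 2: send_request_sketch is a hypothetical name, the argument order for HTTP() (server, username, password, port) and the hash-reference form of OpenRequest() with path and flags keys are taken from the extension's documentation as best it can be recalled, and you should verify both against the version you have installed.

```perl
use strict;
use warnings;

# Hypothetical helper: open and send a GET request for a cracked URL.
# $url is a hash reference with host, port, username, password, and path
# keys; $flags is the numeric flag value; @headers are extra header lines.
# Returns the request object, or nothing on failure.
sub send_request_sketch {
    my ($url, $flags, @headers) = @_;

    # Load the extension at run time; it is available only on Windows.
    eval { require Win32::Internet; 1 } or return;
    my $inet = Win32::Internet->new() or return;

    # HTTP() stores the new session object in its first argument.
    # Assumed argument order: server, username, password, port.
    my $http;
    $inet->HTTP($http, $url->{host}, $url->{username}, $url->{password},
                $url->{port});
    return unless $http;

    # OpenRequest() stores the new request object in its first argument.
    # Assumed: the hash-reference form with path and flags keys.
    my $request;
    $http->OpenRequest($request, { path => $url->{path}, flags => $flags });
    return unless $request;

    # Add any caller-supplied headers, then submit the request.
    $request->AddHeader($_) for @headers;
    $request->SendRequest();

    return $request;
}
```

The caller would then use the returned request object to read the headers and the page data.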

If the user passed in the -d parameter, the script's next step is to display the received protocol headers. At callout C, the script sets the output filehandle to STDERR. Thus, even if the user has redirected the script output to a file, the protocol headers will be dumped to the standard error filehandle (typically the screen).
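Switching the default output handle is plain Perl and works the same way outside this script. The following sketch uses made-up header and page data purely to show where each print ends up.

```perl
use strict;
use warnings;

# Placeholder data standing in for what a server might return.
my %headers = ('Content-Type' => 'text/html', 'Server' => 'ExampleServer');
my $page    = "<html><body>Hello</body></html>\n";

# Make STDERR the default output handle and remember the previous one.
my $previous = select(STDERR);
print "$_: $headers{$_}\n" for sort keys %headers;   # goes to STDERR

# Restore the previous default (normally STDOUT) for the page data.
select($previous);
print $page;                                         # goes to STDOUT
```

If you run this with output redirected to a file, the two header lines still appear on screen and only the HTML line lands in the file.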

At callout D, the script calls the request object's QueryInfo() method to obtain the list of protocol headers. The script passes the method the HTTP_QUERY_RAW_HEADERS_CRLF constant, which makes the method return a string containing all the protocol headers, each terminated by a carriage return and line feed (\r\n). To find out where you can obtain a full listing of the constants that a script can pass in the OpenRequest() and QueryInfo() methods, see the sidebar "Flag Constants." A foreach loop processes each protocol header, splitting it into the protocol header name and its value and calling the write command to display the information on screen. Finally, at callout E, the script calls the request object's ReadEntireFile() method to obtain the actual Web page data that the server sent.
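The header-splitting loop can be tried without a network connection. In this sketch, $raw is made-up sample text in the shape described above (a status line followed by headers, each ending in a carriage return and line feed), and parse_headers is a hypothetical helper; it uses printf rather than the write command to keep the example short.

```perl
use strict;
use warnings;

# Sample header block; the values are invented for illustration.
my $raw = join "\r\n",
    'HTTP/1.1 200 OK',
    'Content-Type: text/html',
    'Date: Mon, 01 Jan 2001 12:00:00 GMT',
    '', '';

# Hypothetical helper: turn the raw block into [name, value] pairs.
sub parse_headers {
    my ($text) = @_;
    my @pairs;
    foreach my $line (split /\r\n/, $text) {
        next unless length $line;

        # Split on the first colon only; values may contain colons.
        my ($name, $value) = split /:\s*/, $line, 2;

        # The status line has no colon, so it has a name but no value.
        push @pairs, [$name, defined $value ? $value : ''];
    }
    return @pairs;
}

printf "%-16s %s\n", @$_ for parse_headers($raw);
```

The limit of 2 on the inner split matters: without it, a value such as the time in the Date header would be broken apart at every colon.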

Win32::Internet Rocks

I've shown you just the tip of the iceberg when it comes to the WinInet library's capabilities. If you need to interact with Internet servers from scripts running on a Win32 machine, investigating this handy extension is well worth your time. And because the extension comes with ActiveState's Win32 Perl distribution, you most likely already have it installed. This extension doesn't really get much attention, but it certainly should. Let me know what you're doing with it and what you think about it.
