Culling Web Pages with ActiveX

Downloads
4819.zip

Many scripts in the Windows Scripting Host (WSH) user community focus on managing user accounts and resources. However, WSH scripts aren't limited to these functions. WSH is a powerful tool that lets you access scripting components across the entire range of Windows applications, including Internet Explorer (IE).

With WSH's support for Microsoft's Component Object Model (COM), you can also access ActiveX components. The Internet offers a wide array of ActiveX components, so you'll likely find an ActiveX object that performs the task you need to accomplish.

In this article, I'll show you how to use an ActiveX object to create a script that controls IE and accesses content from Web pages. (If you're unfamiliar with how to use ActiveX components, see "Using ActiveX Objects to Extend WSH's Functionality," January 1999.) Along the way, I'll discuss two workhorse Visual Basic Scripting Edition (VBScript) functions, InStr and Mid, that you can use to manipulate string variables.

The Task
I access certain Web pages frequently, because their content changes daily. To save time, I decided to create a script that goes to these Web pages, pulls specific snippets of text (i.e., news headlines), and places those snippets into a customized digest page within IE.

When I started this project, I wanted to write a script that loaded each Web page directly into IE, where the script would copy and paste the headlines into the digest page. However, I ran into a problem: I could use VBScript with IE objects to access the contents of a Web page for display, but the IE objects didn't include properties and methods for manipulating the text in that page.

To solve this problem, I had to use IE objects to display the script's output and an ActiveX object to read the Web pages. The result is the WebExample.vbs script in Listing 1. (You can download Listing 1 from http://www.winntmag.com/newsletter/scripting.)

The ActiveX component I used was Microsoft's Internet Transfer Control (ITC), an Object Linking and Embedding (OLE) custom control (OCX) file. This multipurpose tool ships with the Microsoft Office 97 Developer Edition and with most versions of Visual Basic and Visual Studio. Several commercial ActiveX components also provide HTTP services. If you don't have a Microsoft package that includes ITC, a third-party ActiveX component might better meet your needs. (For information about how to find and install OCX files, see "Using ActiveX Objects to Extend WSH's Functionality.")

Whether you use ITC or a similar ActiveX component, the approach and the code are similar. You need to set the stage, create the customized digest page, and fetch and copy the text.

Setting the Stage
You begin WebExample.vbs with an Option Explicit statement, which requires that you declare your variables. You use Dim statements to declare six variables: oIE, oInetCtrl, i, n, length, and buffer. The prefix o in oIE and oInetCtrl specifies that they are object variables.

The i and n variables are shorthand for integer and number, respectively. Following the practice commonly used in C and C++, you typically use i and n variables to count or to remember where you are in an array or a loop. These variables don't have meaning in terms of the script's purpose. They simply represent a value.

The length variable represents a string's length. The buffer variable is a long string variable.
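As a rough sketch, the declarations described above might look like the following. Listing 1 isn't reproduced in this text, so the layout and the comments are assumptions; only the six variable names come from the description.

```vbscript
Option Explicit          ' Every variable must be declared before use

Dim oIE                  ' Object: the Internet Explorer automation object
Dim oInetCtrl            ' Object: the Internet Transfer Control (ITC)
Dim i, n                 ' Positions within a string
Dim length               ' Number of characters to copy
Dim buffer               ' Holds the text of a fetched Web page
```

VBScript stores every variable as a Variant, so the Dim statements don't name a type; the o prefix and the comments are the only record of how each variable is meant to be used.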

Creating the Customized Digest Page
The code to create the digest page begins with the creation of an instance of the object you want to use. Specifically, you use VBScript's Set statement with the CreateObject function to create an instance of IE's top-level object ("InternetExplorer.Application") and assign it to the oIE variable.

Next, you use several IE object methods and properties to load a Web page, make that page visible, and write a header to it. To load a Web page, you use oIE's Navigate method. In this case, you must load a blank page, as the argument "about:blank" specifies, because you need to add text to it. If you were to put a URL as the argument, the Web page at that URL would load. However, as I mentioned previously, you wouldn't be able to read the contents of this page from within your program.

To make the IE window visible to users, you need to set oIE's Visible property to 1. A value of 0 means the page is invisible to users.

To insert the header, you use oIE's Document property to access the Document object. You then use the Document object's writeln method to add the Customized Digest Page header. The writeln method automatically follows the text with a carriage return. You again use the writeln method to add the paragraph break tag (<p>) so that the digest page has a line break in it. (Web browsers ignore line breaks in HTML pages.)

When you write the header, keep in mind that the strings of text you'll write need to constitute an HTML formatted page. For the HTML purist, formatting the page involves writing in a document header between <HEAD> tags, including a <TITLE> tag, and so on. However, you don't have to observe all the formatting formalities if you don't want to. Because you won't be saving this page, the only formatting tags you must add are header tags (such as <H1>) and paragraph break tags (<p>).
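The steps in this section can be sketched as follows. This is a minimal illustration, not Listing 1 itself: the wait loop is an assumption added so that the blank page is ready before the script touches the Document object, and WScript.Sleep requires WSH 2.0 or later.

```vbscript
Option Explicit
Dim oIE

' Create the browser's top-level automation object
Set oIE = CreateObject("InternetExplorer.Application")

' Load an empty page that the script can write into
oIE.Navigate "about:blank"

' Assumption: pause until IE has finished loading the blank page
Do While oIE.Busy
    WScript.Sleep 100
Loop

' 1 shows the browser window; 0 keeps it hidden
oIE.Visible = 1

' Write the header, then a paragraph break
oIE.Document.writeln "<H1>Customized Digest Page</H1>"
oIE.Document.writeln "<p>"
```

Saved as a .vbs file and run with cscript or wscript on a machine that has IE installed, a script like this should open a browser window that shows only the header.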

Fetching and Copying the Text
Now you can create an ITC object and use it to fetch a Web page. To create the object, you create an instance of the ITC object ("InetCtls.Inet.1") and assign it to the oInetCtrl variable. Because you're working with a network-oriented process that might fail, you tell the object how long you're willing to wait before you decide that a given request has failed. In this script, you set the timeout for 200 seconds.

The ITC object offers two methods for fetching a Web page: the Execute method and the OpenURL method. Although both methods work in theory, neither I nor the others working on this problem in the WSH newsgroup at news://msnews.microsoft.com/microsoft.public.scripting.wsh could make the Execute method work. Fortunately, the OpenURL approach works, and you can use it to return the requested page as a string.
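A minimal sketch of the fetch follows, assuming the control's timeout property is RequestTimeout and that ITC is installed and licensed on the machine (CreateObject raises an error otherwise). The final Echo line is only there to show that something came back; it isn't described in the article.

```vbscript
Option Explicit
Dim oInetCtrl, buffer

' Create an instance of the Internet Transfer Control
Set oInetCtrl = CreateObject("InetCtls.Inet.1")

' Treat a request as failed after 200 seconds
oInetCtrl.RequestTimeout = 200

' OpenURL returns the requested page as a string
buffer = oInetCtrl.OpenURL("http://home.microsoft.com")

WScript.Echo "Fetched " & Len(buffer) & " characters"
```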

Next, you read the entire HTML page into the buffer. Because you loaded the page as a string, you can use InStr and Mid to pull out a selected portion of the page. InStr lets you find a substring within a larger string. Mid lets you copy a substring from a specified section of a string to another string.
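To make the two functions concrete, here is a small sketch that runs against a made-up string rather than a real Web page. The sample text and the variable names are hypothetical.

```vbscript
Option Explicit
Dim sample, pos, piece, missing

sample = "Top story: WSH ships today"

' InStr returns the 1-based position of the first match
pos = InStr(sample, "WSH")            ' pos is 12

' Mid copies a given number of characters starting at a position
piece = Mid(sample, pos, 3)           ' piece is "WSH"

' InStr returns 0 when the substring isn't present
missing = InStr(sample, "Linux")      ' missing is 0

WScript.Echo pos & " " & piece & " " & missing
```

By default, InStr performs a case-sensitive comparison, so the substring you search for must match the page's text exactly.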

In the first Web page you're culling from (i.e., "http://home.microsoft.com"), the headlines are in a section that begins with the heading MSNBC News. Thus, you're trying to find the "msnbc news</td>" substring in the buffer. (The </td> tag specifies the end of a cell in an HTML table.) To find this substring, you assign the result of InStr to the i variable and specify two required arguments: the larger string (i.e., buffer) and the substring (i.e., "msnbc news</td>"). InStr returns the location of the start of the substring as an integer, counting from 1. If the script doesn't find the substring, InStr returns a 0.

This script assumes that the Web site developers are constructing the Web pages the same way each day. In the case of WebExample.vbs, this assumption isn't too risky, because the developers use automated processes to create the two news sites you're culling from. If you're reasonably careful about what text string you choose to help find the targeted headlines, the script can usually find them each time. However, if developers don't use automated processes to create the Web pages you're culling from, this assumption might not hold true.

Next, you use VBScript's If...Then...Else statement to check for the return of 0. If the script doesn't find the substring, the script sets the buffer variable to a string that announces the problem. If the script finds the substring, the script performs these steps:

The script nudges i up 20 characters so that i points to the beginning of the headlines. To determine how far to nudge i, you count the characters between the start of the matched substring and the first headline in the page's HTML source. This step is necessary because InStr returns the position of the start of the substring rather than its end.

The script uses InStr to find the endpoint. You again use InStr, but this time you include an optional argument that specifies the start position, because you don't want InStr to begin its count from 1. In this case, you want the search to start at the headlines (a point that i now marks), so you put (i, buffer, "</td>") as the arguments.

The script determines the length of the string. You subtract the start point (the point that i marks) from the endpoint (a point that n marks) to determine the string's length.

The script copies the string. You use the Mid function with three arguments: the string to copy from (buffer), the start point (i), and the number of characters to copy (length).
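The four steps above, together with the check for a return value of 0, can be sketched as one helper function. ExtractSection, its parameter names, and the sample page string are hypothetical; Listing 1 nudges i by a fixed 20 characters, whereas this sketch computes the offset for its own sample text. Falling back to the end of the buffer when the endpoint is missing is also an assumption.

```vbscript
Option Explicit

' Hypothetical helper: returns the text that follows marker, up to terminator
Function ExtractSection(buffer, marker, skip, terminator)
    Dim i, n, length
    i = InStr(buffer, marker)                 ' Start of the marker, or 0
    If i = 0 Then
        ExtractSection = "Could not find " & marker
        Exit Function
    End If
    i = i + skip                              ' Step 1: move to the headlines
    n = InStr(i, buffer, terminator)          ' Step 2: find the endpoint
    If n = 0 Then n = Len(buffer) + 1         ' Assumption: copy to the end
    length = n - i                            ' Step 3: characters to copy
    ExtractSection = Mid(buffer, i, length)   ' Step 4: copy the substring
End Function

Dim page, offset
page = "<td>msnbc news</td><td>Headline A<br>Headline B</td>"

' Skip the marker itself plus the <td> tag that opens the next cell
offset = Len("msnbc news</td>") + Len("<td>")

WScript.Echo ExtractSection(page, "msnbc news</td>", offset, "</td>")
```

Wrapping the steps in a function makes it easier to reuse them for a second Web page with a different marker and offset.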

With the right segment of text in the buffer, you can copy the text to the digest page. You use oIE's Document property to access the Document object. You then use the Document object's write method to write the text in the buffer variable to your digest page. Unlike the writeln method, the write method doesn't add a carriage return. Finally, you use the write method to add a row of hyphens to separate the text you've culled from different Web pages.
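A short sketch of this step follows. The stand-in headline text, the 40-character width of the separator, and the <p> tags around it are assumptions, as is the wait loop (WScript.Sleep requires WSH 2.0 or later).

```vbscript
Option Explicit
Dim oIE, buffer

' Stand-in for the text that Mid copied from a fetched page
buffer = "Headline A<br>Headline B"

Set oIE = CreateObject("InternetExplorer.Application")
oIE.Navigate "about:blank"
Do While oIE.Busy          ' Assumption: wait for the blank page to load
    WScript.Sleep 100
Loop
oIE.Visible = 1

' write adds the text without a trailing carriage return
oIE.Document.write buffer

' A row of hyphens separates the sections of the digest
oIE.Document.write "<p>" & String(40, "-") & "<p>"
```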

This script uses both the write and writeln methods. You can use either method when building a digest page, because HTML doesn't read a carriage return as a sign to move to the next line. HTML moves to the next line only when it reads a tag such as <p>. However, if you plan to look at your page's source in a text editor, use writeln, because the source will be more readable.

The rest of the script repeats this process to get the headlines, which are in a table format, from a different Web page. Screen 1 shows the completed digest page when you display it in IE. The digest page isn't a true Web page, because you don't store it as a file. Instead, WSH uses OLE automation to control the browser and write the content directly to the display window.

A Handy Tool
The WebExample.vbs script culls information from the Web. You can also use this type of script to cull text from local text files or cull data from a database. I'm still tinkering with my version of this script. When I'm finished, I'll have a comprehensive snapshot of information I need waiting for me each morning. With a few tweaks of the example script, you can have an information snapshot, too.
