All this “surfing the Web with PowerShell” is pretty cool, but I must confess that I’ve avoided something important in this whole area: forms. I've shown you how to scrape screens and make sense out of JSON and XML from Web services, but an awful lot of good data on the Web can only be accessed by filling out a form, clicking a button, and then scraping the resulting screen. Now, some websites might take offense if I were to show you how to automate querying their forms, so I've created the world's simplest form-based page. You can find it at http://www.minasi.com/addit.htm.
This page presents you with two fields: addend1 and addend2. Just fill them with integers, then click the SUBMIT! button, and you will be rewarded with a message like this one, for 3 and 9:
Sum= 12
This page is useful because it actually does something and motivates two observations about accessing form-based data. (No, it's not fancy, and it blows up if you don't feed it integers but, again, we're not going to annoy some poor fool's site with this example.)
First, notice that this page uses not one but two URLs. The original form is http://www.minasi.com/addit.htm, but once you click the SUBMIT! button, you get the answer at a page that the address bar identifies as http://www.minasi.com/addit-worker.asp. That's going to be important soon.
Second, you know that to get to the page with the sum of the two proffered integers, you have to click the SUBMIT! button. But how on Earth to do that with PowerShell? I must admit that the "how to click" thing puzzled me for a few hours, but then I just did what I always do when I've got an under-the-hood network question: I ran Microsoft Network Monitor. By looking at the traffic stream, I realized that I needed to collect only a few bits of data:
- The URL that my system used when responding to my button click, which turned out to be http://www.minasi.com/addit-worker.asp
- The body of the request, which looked like addend1=3&addend2=9&B1=SUBMIT%21
- The method type POST (rather than GET, which is a bit more common)
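The captured body is percent-encoded, so it can help to decode it before going further. Here is a minimal sketch that takes the body string shown above and breaks it into its name/value pairs. The variable names ($captured, $fields) are mine, chosen for illustration, and nothing here touches the network.

```powershell
# The body exactly as it appeared in the network capture
$captured = 'addend1=3&addend2=9&B1=SUBMIT%21'

# Split on '&' into pairs, then on the first '=' into name and value,
# undoing the percent-encoding on each part
$fields = [ordered]@{}
foreach ($pair in $captured -split '&') {
    $name, $value = $pair -split '=', 2
    $key = [System.Uri]::UnescapeDataString($name)
    $fields[$key] = [System.Uri]::UnescapeDataString($value)
}

# Show the decoded pairs
$fields
```

Seeing B1 come back as SUBMIT! is a quick confirmation that %21 is nothing more exotic than an encoded exclamation point.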
Now, if you've been following my Web harvesting examples for a while, you already have the tools to take a shot at automating this, but let's take a moment and see how to assemble the Invoke-WebRequest command.
First, you need a Uniform Resource Identifier (URI). When I first started tackling a forms-based URI, I had no luck because my first tries focused on the initial "forms" page, www.minasi.com/addit.htm, rather than the "results" page, www.minasi.com/addit-worker.asp. Remember, then, that in most cases, your target URI is the results page, not the form page. My first PowerShell statement, then, defines the URI:
$URI = 'http://www.minasi.com/addit-worker.asp'
Second, take a look at the body:
addend1=3&addend2=9&B1=SUBMIT%21
That's the same kind of name/value pair string that you saw tacked onto a URI in my first swing at RESTful Web services. There it rode in the URI because the RESTful query was a simple GET; this is a POST, so it goes in the body. (Recall that POSTs are often more desirable for sensitive data. Over SSL, both the URI and the body are encrypted on the wire, so a nosy network sniffer sees little beyond the host name, but URIs routinely end up in Web server logs, proxy logs, and browser histories, and bodies generally don't.) This just breaks out to three name/value pairs:
addend1 = 3
addend2 = 9
B1 = SUBMIT%21
(and because hex character 21 is "!", there's our SUBMIT!) You can store that in your next variable, $Body:
$Body = 'addend1=3&addend2=9&B1=SUBMIT%21'
And so you can then construct this command:
$page = Invoke-WebRequest $URI -body $Body -method POST
Once it runs, recall that you can get its status code from $page.StatusCode, and we're typically hoping for a 200. To see the text that would have been displayed in the browser, just display the page's content:
$page.Content
which looks like
Sum= 12
That works, but let's try to do better. Do we really need the B1= variable? If there were more than one button on the page (maybe there would be one labeled not SUBMIT! but Add, and a different one labeled Multiply), then yes, we'd care—or, more correctly, Addit-Worker.ASP would care. But in this case, we can drop B1 altogether from the body's name/value string.
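Here's a minimal sketch of the trimmed-down request. It leans on a documented Invoke-WebRequest behavior: for a POST, a hashtable passed to -Body is sent as name=value pairs, so you don't have to build the string by hand. The helper name Get-AddItContent is hypothetical, and the sketch assumes the worker page still answers the way it did when I captured the traffic. The actual call is commented out so that nothing goes over the wire until you want it to.

```powershell
$URI = 'http://www.minasi.com/addit-worker.asp'

# Only the two fields the worker page needs; B1 is left out
$Body = @{ addend1 = 3; addend2 = 9 }

# Hypothetical helper: posts the form fields and returns the page text
function Get-AddItContent {
    param(
        [Parameter(Mandatory)] [string]    $Uri,
        [Parameter(Mandatory)] [hashtable] $Body
    )
    # For a POST, Invoke-WebRequest sends a hashtable body as name=value pairs
    $page = Invoke-WebRequest -Uri $Uri -Method Post -Body $Body
    if ($page.StatusCode -ne 200) {
        throw "Unexpected HTTP status: $($page.StatusCode)"
    }
    $page.Content
}

# Uncomment to send the request:
# Get-AddItContent -Uri $URI -Body $Body
```

A hashtable also spares you from encoding special characters yourself, which matters more on forms whose values aren't plain integers.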
Are we done yet? Not really, because we need to take the content, which in this case would be a simple Sum= 12, and extract just the 12, but that's easy enough to do with the regular expression (regex) captures we've already done, like so:
$page.content -replace "Sum= (\d+).*",'$1'
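One thing to know about -replace is that it hands back the original string untouched when the pattern doesn't match, so a changed page layout could slip by unnoticed. Here's a minimal sketch of an alternative that uses -match and the automatic $Matches variable instead. The $content value is sample text standing in for $page.Content, so the sketch runs without a network connection.

```powershell
# Sample text standing in for $page.Content
$content = 'Sum= 12'

if ($content -match 'Sum=\s*(\d+)') {
    # $Matches[1] holds the first capture group; cast it to get a number
    $sum = [int]$Matches[1]
} else {
    throw "No sum found in: $content"
}

$sum
```

Casting to [int] means the result is ready for arithmetic rather than being a string that merely looks like a number.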
So we've cracked our first form, but there's quite a bit more to do—for example, a great "sniffer" tool named Fiddler and support for an essential part of most forms: cookies! Next month, we'll take those up. See you then!