Harvesting the Web with PowerShell and Invoke-WebRequest

Let PowerShell do your surfing for you

One of the first things you probably do every day is surf the Web to see what’s going on: You check Twitter, the weather, a favorite news site, and so on. There have always been news aggregators of various kinds (remember RSS?), but sometimes it would be nice to just grab arbitrary bits of data from various websites all in one place. Sounds good, you say, but wouldn’t you have to do some coding? Well, maybe a little, but it’ll be easy with PowerShell’s Invoke-WebRequest cmdlet.

In my case, I’m writing this article in the summer, and I’m a short drive from the sea. I’m kind of a beach snob, and I really want to make the trek only when the ocean is 72 degrees or warmer, within an hour of low tide, and preferably within a few days of the new moon or full moon. (What, you don’t body surf? Just Web surf? Interesting.) Just to keep this brief, let’s see how to use Invoke-WebRequest to get the water temperature.

I’m near a place in North Carolina named Duck, and fortunately the Army Corps of Engineers runs a data-gathering site there called the Field Research Facility (FRF) that reports water temperature. That home page has a box in its upper-left corner containing several stats, including water temperature. We’ll grab that.

Figure 1: FRF Water Temperature

Get the Page

To tell PowerShell to retrieve the page and, while we’re at it, store it to a variable that I’ll call $webpage (so we can easily manipulate it later), I’d type

$webpage = Invoke-WebRequest www.frf.usace.army.mil

Now, if you were to pipe $webpage to Get-Member (always an excellent idea the first time you work with a new kind of PowerShell object), you’d see that Invoke-WebRequest has tried to parse the page into pieces: $webpage.Images lists the images on the page, $webpage.Headers contains the HTTP header info, .Forms and .InputFields are empty (the home page doesn’t have any forms), .Links collects all the hyperlinks on the page, .Content holds the actual HTML code, and .RawContent holds the response headers plus the HTML. Assuming you have a C:\scripts folder, you can save the raw content to a text file by typing

$webpage.RawContent | Out-File "c:\scripts\webpage.txt" -Encoding ASCII -Width 9999

or you can copy it to the clipboard, then paste it into Notepad by typing

$webpage.RawContent | clip.exe
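
For a quick feel for what Invoke-WebRequest parsed out of a page, a small helper along these lines might be handy. This is only a sketch: Get-PageSummary is a made-up name, not a built-in cmdlet, and it assumes the response object exposes Links, Images, Headers, and RawContent as described above.

```powershell
# Get-PageSummary is a hypothetical helper, not a built-in cmdlet.
# It reads a few of the properties that Invoke-WebRequest fills in.
function Get-PageSummary {
    param(
        [Parameter(Mandatory)]
        $Response   # the object returned by Invoke-WebRequest
    )

    [pscustomobject]@{
        LinkCount   = @($Response.Links).Count      # hyperlinks found on the page
        ImageCount  = @($Response.Images).Count     # images found on the page
        ContentType = $Response.Headers['Content-Type']
        RawLength   = $Response.RawContent.Length   # headers plus HTML, in characters
    }
}

# Usage against the live page (needs network access, so left commented out):
# $webpage = Invoke-WebRequest www.frf.usace.army.mil
# Get-PageSummary $webpage
```

Returning one object rather than printing text means the summary can be piped on to Format-List, Export-Csv, and so on.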

Parse Out the Data

At this point, the text of the Web page is a mess, just a bunch of HTML. What I want, however, is to find and extract just two digits: that Fahrenheit water temperature. To do that, I must figure out a unique pattern to the text that PowerShell can use to pull out those two digits.

I start searching for that pattern by looking at the home page. I’m guessing that Pier End shows up only once in the page. A Notepad search verifies that there is indeed only one and it lives in this line of HTML:

Water Temp Pier End 21°C    70°F

A quick count shows that my two digits start 45 characters after Pier End. My pattern, then, looks like

  • Skip all text until you see “Pier End.”
  • Skip ahead 45 characters after that.
  • Capture the next two characters and remember them.
  • Skip whatever’s left, and remember only the captured text.
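
Before any regex enters the picture, those four steps can be sketched with ordinary string methods. The sample text below is invented for illustration (45 filler characters stand in for whatever HTML sits between “Pier End” and the digits on the real page), and the variable names are my own.

```powershell
# Invented stand-in for the page text: 45 filler characters separate
# "Pier End" from the two-digit temperature.
$sample = 'Water Temp Pier End' + ('-' * 45) + '70 F and the rest of the page'

$marker      = 'Pier End'
$markerAt    = $sample.IndexOf($marker)          # step 1: find "Pier End"
$digitsAt    = $markerAt + $marker.Length + 45   # step 2: skip 45 characters past it
$temperature = $sample.Substring($digitsAt, 2)   # step 3: capture two characters
$temperature                                     # step 4: keep only the captured text
```

With this sample the last line prints 70. The approach is brittle in the same way the pattern is: if the page layout shifts by even one character, the offset of 45 no longer lands on the digits.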

I can implement that pattern in PowerShell with something called a regular expression, or regex. I’ll apply it with PowerShell’s -replace operator, which works on any string: it finds every match of a pattern in the string and replaces it with some other text. By default, it ignores case. For example, this tells PowerShell to replace pier with dock throughout a string:

PS C:\scripts> $a="Pierdockpierdock"
PS C:\scripts> $a -replace "pier","dock"
dockdockdockdock
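
Two details of -replace are worth seeing in isolation: how it treats case, and how a captured group is referenced in the replacement text. The strings below are invented examples, not taken from the FRF page.

```powershell
$text = 'Pier pier PIER'

# -replace ignores case; -creplace is the case-sensitive variant.
$insensitive = $text -replace  'pier', 'dock'   # dock dock dock
$sensitive   = $text -creplace 'pier', 'dock'   # Pier dock PIER

# Parentheses capture text; '$1' in the replacement refers to the first capture.
# Single quotes keep PowerShell from treating $1 as one of its own variables.
$captured = 'Pier End 70' -replace 'Pier End (\d\d)', 'Temp=$1'   # Temp=70

$insensitive, $sensitive, $captured
```

Note the single quotes around the replacement text. In double quotes, PowerShell would try to expand $1 as a variable before the regex engine ever saw it.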

That -replace thing is kind of weird looking, as if the string variable gets a cmdlet parameter or something, but that’s the syntax. And speaking of weird, get ready for the regex. It looks like Martian, but it’s powerful.

  • Skip all text until you see “Pier End” is written [\S\s]*Pier End. Here [\S\s] matches any character at all, line breaks included, and the asterisk (*) matches as many of them as it finds. “Pier End” just matches “Pier End.”
  • Skip ahead 45 characters after that is written .{45}. The period (.) matches any character except an end-of-line, and {45} specifies exactly 45 of them (as opposed to the asterisk, which doesn’t care how many).
  • Capture the next two characters and remember them is written (\d\d). The parentheses say to “capture” them into something named $1, and \d means “any digit.” The expression (\d{2}) would have worked also.
  • Finally, again, [\S\s]* just says to skip whatever’s left.

The result? Type the following to get the temperature, and notice how the regex looks when it’s all glued together:

$webpage = (Invoke-WebRequest www.frf.usace.army.mil).RawContent
$webpage -replace '[\S\s]*Pier End.{45}(\d\d)[\S\s]*', '$1'
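
The regex can be tried without touching the network by feeding it a stand-in string. The sketch below uses invented sample text (45 filler characters between “Pier End” and the digits) and also shows one possible alternative, the -match operator with the automatic $Matches variable, which makes it easier to notice when the pattern isn’t found.

```powershell
# Invented stand-in for $webpage: 45 filler characters between "Pier End" and the digits.
$sample = "<html>`nWater Temp Pier End" + ('-' * 45) + "70 F`n</html>"

# The article's approach: replace the entire text with the captured digits.
$viaReplace = $sample -replace '[\S\s]*Pier End.{45}(\d\d)[\S\s]*', '$1'

# If the pattern does not match, -replace hands back the input unchanged.
$noMatch = 'no temperature here' -replace '[\S\s]*Pier End.{45}(\d\d)[\S\s]*', '$1'

# Alternative: -match returns $true or $false and fills the automatic $Matches table.
if ($sample -match 'Pier End.{45}(\d\d)') {
    $viaMatch = [int]$Matches[1]   # first capture group, converted to a number
} else {
    Write-Warning 'Temperature pattern not found'
}

$viaReplace, $noMatch, $viaMatch
```

Because -replace returns the whole input when nothing matches, a page redesign would quietly hand back a screenful of HTML instead of a temperature. The -match form makes that case explicit.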

Now, understand, this was just a first attempt, and my regex was a mite rough-edged, but it gets the job done. Next month, more Web harvesting and a bit of regex-ing advice!
