This month, I want to return to my “Web harvesting” project, but I want to attack it from a different direction—one that will lead to what I think is a cleaner answer and that will let you examine something called Web services and how PowerShell enables you to do some heavy-duty Web data harvesting. With that in mind, let’s take a look at exactly what Web services are and what they can do for you.
In “Harvesting the Web with PowerShell and Invoke-WebRequest,” I talked about creating a PowerShell script to collect wind, water temperature, and tide information into one short table. Of course, I could have just fired up a Web browser and visited a few sites to get my numbers, but I wanted my PC to do it automatically for me. That let me introduce PowerShell’s Invoke-WebRequest cmdlet. That cmdlet downloaded a page containing water and wind data, and then a little work with an ugly-but-powerful regular expression plucked out the two specific digits showing the Fahrenheit temperature, and the job was done. But that really seemed like the hard way. The bottom line was that my “screen scraping” was necessitated by the fact that most websites are meant to be viewed by human eyes, not interrogated by some script.
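To recall what that kind of scraping looks like, here's a minimal sketch. The HTML fragment is made up for illustration (it isn't taken from any real site), and the pattern is a simplified stand-in rather than the regular expression from the earlier article.

```powershell
# Made-up page fragment standing in for what Invoke-WebRequest might return.
$page = '<tr><td>Water Temp</td><td>68&deg;F</td></tr>'

# Capture the two digits that sit just before the degree marker.
if ($page -match '(\d{2})\s*&deg;F') {
    $tempF = [int]$Matches[1]
}

$tempF
```

The fragility is plain to see: if the site owner changes the markup even slightly, the pattern stops matching.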
Web Services: Computer-to-Computer Web Surfing
Recognizing the need for something beyond scraping, the industry has developed several standards for what I’d call “computer-to-Web-server communications” but whose real name is Web services. Just as you could open a browser and visit particular URLs to check your stocks, get tomorrow’s weather, or perhaps do a currency conversion, there are sometimes other URLs that produce not human-pretty pages but data-rich XML documents. It would be a crime to miss out on “harvesting” from those Web services. Here, then, are a few essential basics of Web services.
First, understand that there might not be a Web service for what you’re looking for. Building readable Web pages and corresponding Web services might not be in the Web owner’s budget.
SOAP or REST?
Nearly every Web service is unique in how you query it, but most fall into one of two types: Simple Object Access Protocol (SOAP) and Representational State Transfer (REST). Web services are just another set of computer-to-computer protocols like Microsoft’s RPC, but Web services have sought from the beginning to remain platform-independent. Fortunately, the early designers already had the services of TCP/IP, HTTP, DNS, and so on. What they didn’t have was a good structure for Computer 1 to ask Computer 2, “May I have some data, and how exactly will you package it to deliver it to me?”
The original answer was an XML-packaged request/response standard called SOAP. (One of the founders did a famous SOAP talk in Barcelona in 2001 while sitting largely unclothed in a bathtub. Really.)
Here’s a completely minimal (and imaginary) SOAP request message:
<?xml version="1.0" encoding="utf-8"?>
<soap12:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                 xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                 xmlns:soap12="http://www.w3.org/2003/05/soap-envelope">
  <soap12:Body>
    <GetTime xmlns="http://timecurrentime88.com/WebServices">
      <Request>EST</Request>
    </GetTime>
  </soap12:Body>
</soap12:Envelope>
Yes, it’s ugly, but after a quick look you can see that this request asks the service for the current time, and the time zone requested is EST. Even better, XML loves hierarchies, so it can handle structured data. The Web service then responds in SOAP, as well. Further, most SOAP-based Web services publish another XML file, written in something called Web Services Description Language (WSDL), which explains how to format SOAP requests and responses for that particular Web service. So we’ll do some XML spelunking, but it’s not that bad, because another cmdlet, New-WebServiceProxy, reads the WSDL and simplifies things a bit.
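To get a feel for the XML spelunking, here's a rough sketch that loads the imaginary request above into PowerShell and picks out the operation name and the requested time zone. Nothing is sent over the network. The prefixes s and t are arbitrary labels chosen for this sketch, and the service itself remains imaginary.

```powershell
# The imaginary SOAP request from the article, held in a here-string.
$envelope = @'
<?xml version="1.0" encoding="utf-8"?>
<soap12:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                 xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                 xmlns:soap12="http://www.w3.org/2003/05/soap-envelope">
  <soap12:Body>
    <GetTime xmlns="http://timecurrentime88.com/WebServices">
      <Request>EST</Request>
    </GetTime>
  </soap12:Body>
</soap12:Envelope>
'@

# Casting to [xml] parses the text into an XmlDocument.
$doc = [xml]$envelope

# XPath needs prefixes mapped to the namespaces used in the envelope.
# 's' and 't' are arbitrary local labels, not part of the message.
$ns = New-Object System.Xml.XmlNamespaceManager($doc.NameTable)
$ns.AddNamespace('s', 'http://www.w3.org/2003/05/soap-envelope')
$ns.AddNamespace('t', 'http://timecurrentime88.com/WebServices')

# The first element inside Body names the operation being requested.
$operation = $doc.SelectSingleNode('/s:Envelope/s:Body/*', $ns).LocalName

# The Request element inherits GetTime's default namespace.
$timeZone = $doc.SelectSingleNode('/s:Envelope/s:Body/t:GetTime/t:Request', $ns).InnerText

"$operation for $timeZone"
```

A SOAP response could be picked apart the same way, just with different element names under Body.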
SOAP’s alternative, REST, essentially lets you query a server with simple URLs. In a RESTful (that’s the proper adjective) version of my imaginary SOAP example, our query might just be a simple URL like
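To make the REST idea concrete, here's a hypothetical sketch of how such a query URL could be assembled in PowerShell. The host name is borrowed from the imaginary SOAP example, and the path and the zone parameter are invented purely for illustration. No such service is known to exist, so the actual call is left as a comment.

```powershell
# Hypothetical endpoint: the host comes from the imaginary SOAP example,
# and the path and parameter name are made up for this sketch.
$baseUri = 'http://timecurrentime88.com/WebServices/GetTime'
$zone    = 'EST'

# Escape the value so spaces or symbols in a parameter can't break the URL.
$uri = '{0}?zone={1}' -f $baseUri, [System.Uri]::EscapeDataString($zone)

$uri

# Against a real REST service, the query itself would be a one-liner:
# $result = Invoke-RestMethod -Uri $uri
```

Compared with the SOAP envelope, the entire request fits in the address itself, which is much of REST's appeal.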
Onward to the Harvest!
So you’ll query most Web services via either XML SOAP messages or REST-type URLs, and you’ll get your answers as simple text, CSV files, XML, or JSON. Both SOAP and REST are useful, and both are prevalent on the Web. Fortunately, PowerShell has plenty of XML- and JSON-friendly tools. Next time, we’ll start harvesting Web services with PowerShell!