More Harvesting: File Downloads with PowerShell

Mark shows how to use simple PowerShell cmdlets to download pictures and other files from the Internet.

Last month, in “Harvesting the Web with PowerShell and Invoke-WebRequest,” I described a case in which I wanted to automatically download the contents of a Web page and use a text pattern recognition tool called a regular expression (regex) to pull out two text digits from that page. This month, I’ll show you two other “Web harvesting” tools that let you grab not just a few numbers off a Web page but entire files.

In last month’s article, I demonstrated how to use the Invoke-WebRequest cmdlet (a cmdlet which, by the way, has an alias of “IWR”) to grab a website’s home page. For example, to get the contents of the www.minasi.com home page, you could type

$page=iwr www.minasi.com

You could then type $page.rawcontent and see the raw response from my site, headers and all, including the HTML of my home page. In reality, though, “seeing the home page” is just another way of saying “downloading the text file containing the site’s home page,” a file with a name of index.html, default.htm, default.asp, or something similar. I could even save the contents to a file with the -OutFile parameter, as in

iwr "www.minasi.com" -OutFile C:\scripts\testit.txt

(This assumes that you have a folder named C:\scripts; IWR won’t create a folder automatically.) In the same fashion, you could download any Web page from a site, or, more usefully, any file of any kind on a website, as long as that file has a URL. For example, try this out:

iwr "www.minasi.com/picture.jpg" -OutFile C:\scripts\testit.jpg

Open the file at C:\scripts\testit.jpg, and you’ll see that you successfully downloaded an image file. Thus, if you have a favorite site that posts a “Picture of the Day” at www.somedomain.com/POTD.jpg, you could have a scheduled task that pulls that file down every day, and you could even use the Send-MailMessage cmdlet we’ve covered before to send it to your mailbox. Or, if the picture of the day doesn’t always have the same name, you could concoct a little script to capture the images from that page, using the fact that IWR exposes a property called Images when it downloads a Web page. If you already tried that $page=iwr www.minasi.com command I showed before, try typing

$page.images

That will offer names and information about the images on the home page. Each entry in the Images property has a property of its own named src that contains the URL of that image as it’s written in the page, which is often a relative URL. Therefore, to get a nice, compact list of the image files on the home page, you could simply type

(iwr www.minasi.com).images.src

Building upon that, a very quick and dirty one-liner to download every image on the page might look like

$n=1;(iwr www.minasi.com).images.src | %{$picname=$_;$fn="File"+$n;$n=$n+1;iwr ("www.minasi.com/"+$picname) -outfile ("c:\scripts\"+$fn+".jpg")}
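That one-liner takes a few shortcuts: it assumes that every src value is relative to the site root and that every image is a JPEG. Here is a slightly longer sketch that relaxes both assumptions. The function names Resolve-ImageUrl and Save-PageImages are my own inventions for illustration, not built-in cmdlets, and I’m assuming PowerShell 3.0 or later.

```powershell
# Hypothetical helper (not a built-in cmdlet): turn a src value, relative
# or absolute, into an absolute URL based on the page it came from.
# $PageUrl must include its http:// or https:// prefix.
function Resolve-ImageUrl {
    param(
        [Parameter(Mandatory)] [string] $PageUrl,
        [Parameter(Mandatory)] [string] $Src
    )
    # System.Uri combines a base address with a relative one and leaves
    # an already-absolute address unchanged.
    $base = [System.Uri] $PageUrl
    (New-Object System.Uri -ArgumentList $base, $Src).AbsoluteUri
}

# Hypothetical wrapper: download every image that a page references.
function Save-PageImages {
    param(
        [Parameter(Mandatory)] [string] $PageUrl,
        [Parameter(Mandatory)] [string] $Folder
    )
    # IWR won't create the folder, so create it if it's missing.
    if (-not (Test-Path $Folder)) {
        New-Item -ItemType Directory -Path $Folder | Out-Null
    }
    $page = Invoke-WebRequest $PageUrl
    $n = 1
    foreach ($src in $page.Images.src) {
        if (-not $src) { continue }   # skip images that have no src value
        $url  = Resolve-ImageUrl -PageUrl $PageUrl -Src $src
        # Keep the file's own name and extension, prefixed with a counter
        # so that two images with the same name don't overwrite each other.
        $name = [System.IO.Path]::GetFileName(([System.Uri] $url).AbsolutePath)
        $file = Join-Path $Folder ('{0:D3}_{1}' -f $n, $name)
        Invoke-WebRequest $url -OutFile $file
        $n++
    }
}
```

With those two functions loaded, you’d type something like Save-PageImages -PageUrl http://www.minasi.com/ -Folder C:\scripts to fetch the images. Note that the page URL needs its http:// prefix here, because System.Uri can’t resolve a relative address against a base that lacks one.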

IWR downloads files quite well, but it’s worth knowing that PowerShell can also direct file transfers through Background Intelligent Transfer Service (BITS). You might be familiar with BITS: It’s the tool that downloads Windows Update files. Its original purpose was to allow low-bandwidth background transfers, specifically to ensure that Patch Tuesday didn’t crash the Internet with zillions of computers simultaneously downloading large update files at top speed. Unlike IWR, BITS can transfer files over SMB as well as over HTTP and HTTPS.

You start transferring a file over BITS with the Start-BitsTransfer cmdlet, which takes a source filename and a destination folder or complete file specification as its first two positional parameters. To transfer the image file you pulled down before, you’d type

start-bitstransfer http://www.minasi.com/picture.jpg C:\scripts\picture.jpg

Alternatively, you need only specify the destination folder, as in

start-bitstransfer http://www.minasi.com/picture.jpg C:\scripts

In that case, Start-BitsTransfer just retains the file’s name in the copy that it puts in C:\scripts.

Note that although I could just type www.minasi.com for IWR, Start-BitsTransfer needs the http:// prefix so that it can tell whether you’re requesting an HTTP transfer or an SMB transfer.
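If you’d rather catch a missing prefix before BITS complains, a small guard can do it. What follows is only a sketch: Get-TransferKind and Start-CheckedBitsTransfer are hypothetical names of my own, the patterns are deliberately simple, and \\fileserver\share is a made-up server path standing in for any SMB share. The one real cmdlet here is Start-BitsTransfer, called with its -Source and -Destination parameter names spelled out.

```powershell
# Hypothetical helper: classify a source string the way the text describes.
function Get-TransferKind {
    param([Parameter(Mandatory)] [string] $Source)
    if ($Source -match '^https?://')      { return 'HTTP' }  # http:// or https://
    if ($Source -match '^\\\\[^\\]+\\')   { return 'SMB' }   # \\server\share\...
    return 'Unknown'
}

# Hypothetical wrapper: refuse sources that BITS couldn't classify,
# then hand everything else to Start-BitsTransfer.
function Start-CheckedBitsTransfer {
    param(
        [Parameter(Mandatory)] [string] $Source,
        [Parameter(Mandatory)] [string] $Destination
    )
    if ((Get-TransferKind $Source) -eq 'Unknown') {
        throw "Source '$Source' needs an http:// or https:// prefix, or a \\server\share path."
    }
    Start-BitsTransfer -Source $Source -Destination $Destination
}
```

Typing Get-TransferKind www.minasi.com/picture.jpg returns Unknown, while the same address with http:// in front returns HTTP and a path such as \\fileserver\share\picture.jpg returns SMB.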

BITS has another nice feature that IWR lacks: the ability to download files in the background. That can be useful if you want to download a large file or a large number of small files, and you don’t want to wait for the transfers to complete before you can use the command prompt again. To use this option, just add the -asynchronous option and capture the command into a variable so that you can monitor the status of the background job. For example, try

$j=(start-bitstransfer http://www.minasi.com/picture.jpg C:\scripts -asynchronous)

Then, just type

$j.JobState

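and it will respond Transferred when done. At that point, type Complete-BitsTransfer $j to finish the job; until you do, BITS keeps the download in a temporary file rather than under its final name. Next month, we’ll start working with websites designed to be harvested: Web services. See you then!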
