
Recursively Download Entire Web Sites

Use Wget to manage data from remote systems

This month, let's take a look at Wget, a time-saving tool that you can use to fetch remote files or entire Web pages from Web and FTP servers. This tool can help you audit your Web site and gain insight into how to further secure it. In many respects, Wget is similar to last month's tool, Curl: It's an open-source command-line tool, available under the GNU license, that uses HTTP, HTTP Secure (HTTPS), or FTP to download remote files. However, Wget includes unique features that Curl doesn't, such as the ability to recursively download entire Web sites rather than single files. In this article, I examine how to use Wget for common administrative tasks. It's a tool that can dramatically speed up your server buildouts and automated downloads.

Download and Install Wget
You can download the freely available Wget from the GNU site (http://www.gnu.org/software/wget/wget.html). Wget binaries are available for most Linux, UNIX, and Windows OSs, or you can compile your own from the source code. The GNU Web site also links to a separate Web site (http://xoomer.virgilio.it/hherold) that hosts the most recent Windows version (Wget 1.10.1, August 2005). As with many Linux- and UNIX-based command-line tools for which source code is available, you'll find many flavors that you can download, including Windows ports with GUIs. (Try a Web search for the keywords Wget, Windows, and GUI to find quite a few Windows GUI variants.)

In this article, I look at the basic GNU version, which is an updated and direct port of the Linux/UNIX version. As you continue to discover the tool's features, you might find other versions and interfaces that better suit your particular needs. For example, wGetGui (http://www.jensroesner.de/wgetgui) shows most of the Wget parameters in an accessible graphical window but doesn't lend itself to scripting or batch jobs as well as the command-line versions do. WinWGet (http://www.cybershade.us/winwget) extends the base Wget functionality to support download jobs, letting you create multiple jobs through a slick Windows front end.

To configure Wget to use HTTPS, you'll need the OpenSSL libraries, which are simple DLL files included in the Wget download package. Extract the Wget.zip file and add the Wget directory to your path statement, or copy the wget.exe, libeay32.dll, and ssleay32.dll files to a folder that's already in your path (e.g., C:\windows, C:\windows\system32). The Wget download also includes a Windows Help file (wget.hlp). To list the tool's numerous command-line parameters at any time, you can type

wget --help
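
To confirm that Wget is installed and in your path, you can also run

wget --version

which simply reports the installed version.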

Using Wget is straightforward. To fetch a remote Web page, open a command prompt and type the command

wget http://<www.targetwebsite.com>/index.html

If you don't know the default Web page, you can simply type

wget http://<www.targetwebsite.com>

and Wget will download the default home page for you, just as if you typed the URL into your browser. By default, Wget saves this file into the directory from which you executed the command. To specify a different path, you can use the -P path parameter. If the file is an HTML file, you can open it by using a text editor (e.g., Notepad) after you retrieve it. If you use your Web browser to open the file, you'll likely see only a partially rendered copy of the Web site because, in this example, we retrieved only the default Web page and no supporting files such as graphics or style sheets.
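
For example, a command along these lines (using the placeholder site name from the rest of this article) saves the default page to C:\downloads instead of the current directory:

wget -P C:\downloads http://<www.targetwebsite.com>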

Wget is much more powerful when you use it with its command-line parameters. For example, through the use of the -r parameter, Wget supports recursive retrieval. The command

wget -r http://<www.targetwebsite.com>

forces Wget to crawl the site and download every Web page, graphic, and linked Web page it encounters—to the default recursion depth of five levels. In other words, it gets the first Web page, finds the links on that page, retrieves those Web pages, and repeats the process until it reaches the fifth level down. To limit how deeply Wget crawls, you can use the -l n flag to specify a depth. For example, the command

wget -r -l 2 http://<www.targetwebsite.com>

searches two levels down into the Web site. Be careful how you specify recursion: You could fill your hard disk with retrieved Web pages. Also, you might irritate Web administrators because the tool will attempt to retrieve every file it finds as fast as it can, which can put a load on the site, depending on your (and the site's) available bandwidth. To specify a courtesy wait time (in seconds) between downloading pages, you can use the -w parameter.
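
For example, a command such as the following (again using the placeholder site name) pauses five seconds between requests during a two-level crawl:

wget -r -l 2 -w 5 http://<www.targetwebsite.com>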

Wget stores retrieved files in a new directory named after the Web site. In the previous example, Wget would create a directory named \www.targetwebsite.com and save a local copy of the Web site in this directory. Wget creates this directory in the same directory from which you run the tool, unless you specify an alternative destination by using the -P path parameter. For example,

wget -r -l 2 -P C:\wgetstuff http://<www.targetwebsite.com>

instructs Wget to download the Web site to C:\wgetstuff\www.targetwebsite.com. In this folder, you would see the actual contents of the Web site—for example, files such as index.html, directories named /images or /css, and any other main or supporting files for that particular site.
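
If you'd rather not have the extra directory named after the host, Wget also offers an -nH (no host directories) parameter, which drops that level so the site's files land directly under the destination folder; for example:

wget -r -l 2 -nH -P C:\wgetstuff http://<www.targetwebsite.com>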

Because Wget uses the links discovered in retrieved pages to fetch new Web pages and files, the tool doesn't require that the target Web site have directory browsing or site indexing enabled. Suppose the default page of a Web site is index.html and it contains links to three images: Wget will save a total of four files, regardless of whatever else the actual directory might contain. Wget can also recursively fetch linked files on other sites if you enable host spanning with the -H parameter. When it does so, it creates a directory for each additional host, up to the recursion limit, alongside the directory for the Web site in your original Wget request.
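
As a rough sketch, the following command lets a two-level crawl follow links onto other hosts, with the -D parameter restricting the crawl to a comma-separated list of acceptable domains (the domain names here are only placeholders):

wget -r -l 2 -H -D targetwebsite.com,partnersite.com http://<www.targetwebsite.com>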

Wget supports more advanced Web browser functions, such as authentication, cookies (both user and session cookies), and Web proxies. You can configure many browser parameters, such as the referrer and user agent. You can specify POST methods. And you can instruct Wget to obey a site's robots.txt file or Robots META tag instructions.
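
For example, the following sketch (the login URL, form field names, and member directory are made up for illustration) posts credentials to a site, saves the session cookie, and then reuses that cookie for a recursive download:

wget --save-cookies cookies.txt --keep-session-cookies --post-data="user=myname&pass=mypass" http://<www.targetwebsite.com>/login.asp

wget --load-cookies cookies.txt -r -l 2 http://<www.targetwebsite.com>/members/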

Selecting Specific Files and Directories
Using the -I directory list and -A file extension list parameters, you can instruct Wget to download from only specified directories or file types. For example, the command

wget -r -I /images http://<www.targetwebsite.com>

downloads only the contents of the /images directory. The command

wget -r -A jpg,gif http://<www.targetwebsite.com>

downloads only JPEG and GIF images. These parameters can be helpful for filtering what is actually downloaded, which saves both hard-disk space and bandwidth.
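
The reverse filters are also available: the -X parameter excludes directories, and -R rejects file extensions. For example, a command along these lines crawls the site but skips the /images directory and any ZIP archives:

wget -r -X /images -R zip http://<www.targetwebsite.com>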

Wget Caveats
In essence, Wget creates a local copy of the remote Web site, but in doing so, it reveals some limitations, particularly when you're retrieving files from sophisticated Web sites. Because Wget simply crawls a Web site for links, it probably won't retrieve all the content you would see by visiting the site in a browser. For example, Wget can't navigate scriptable events, such as clicking a button to download a file (rather than following a direct link to the file). Similarly, if a Web page is generated dynamically, Wget can fetch only the static HTML the server returns to it at that time for that browser type. Wget works best when it's retrieving simple files or downloading basic Web sites.
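
For those basic Web sites, two additional parameters can make the local copy easier to browse: -p (page requisites) also downloads the images and style sheets each page references, and -k (convert links) rewrites the links in the saved pages to point at the local copies. A minimal sketch:

wget -r -l 2 -p -k http://<www.targetwebsite.com>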

FTP Sites
Wget is also a good alternative to the built-in Windows FTP client because it supports wildcards and maintains the date and time stamp of the original file. To set up a scheduled task to download only the newest files in a remote FTP directory, you can use the command

wget -N ftp://<ftp.mytargetsite.com>/path/to/files/*.*

The -N parameter instructs Wget to retrieve only files that are newer than any previous retrieval. In this example, Wget logs on to the remote FTP server through an anonymous connection (you can use ftp://user:password@host/path to specify a user and password, if necessary), navigates to a specific directory, and attempts to download only the latest files.
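
For example, a sketch with made-up credentials that retrieves only new ZIP files from a password-protected directory might look like this:

wget -N "ftp://backupuser:secret@<ftp.mytargetsite.com>/path/to/files/*.zip"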

Automate Administrative Tasks
Wget is particularly useful when the command line is your only way to get files onto a remote server. For example, you could Telnet into a remote Windows server and paste in the command

wget -P C:\downloads "http://<URL to a Microsoft patch>"

to download a Microsoft patch to that Windows server. The quotes ensure that the URL passes correctly into the Wget program, which can be essential depending on the complexity of the URL. Also, this mechanism works only for traditional file downloads that you can access by using a static URL. For example, software downloads that require Microsoft's new interactive Genuine Windows ActiveX control won't work with tools such as Wget.

Get Wget
Use Wget to dramatically speed up the configuration of a new computer and ensure that the source file locations are well documented. For example, you can copy a command that downloads a new piece of software or patch into a configuration document and know later exactly what software is installed and where it was downloaded. If the computer ever needs to be rebuilt, you can copy the commands from the build document into a console to quickly get the installation files back onto the rebuilt computer. Wget is a simple yet useful tool that can help you manage and work with data from remote systems.
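
For example, a build document (or a simple batch file) might contain nothing more than a handful of Wget lines such as these, with the placeholder URLs standing in for your actual download locations:

wget -P C:\build "http://<URL to antivirus installer>"
wget -P C:\build "http://<URL to backup agent installer>"
wget -P C:\build "http://<URL to latest service pack>"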

Jeff Fellinge ([email protected]) is a contributing editor and the director of information security and infrastructure engineering at aQuantive.
