Toolbox: Transfer Web Files from the Command Line

Curl helps you fetch, parse, and control Web-based data

Jeff Fellinge

November 20, 2005


This month we take a look at Curl, the open-source tool that lets you send and receive Web pages from the command line. Curl makes it easy to automate many security and administrative tasks, such as fetching a Web page for analysis or downloading a security patch from the Web at the command line.

Installing Curl
Curl is included in most UNIX distributions, and binaries and source code are available for most other OSs. Even developers who use the open-source PHP: Hypertext Preprocessor (PHP) scripting language can take advantage of Curl, which offers a more secure method of accessing Web content directly from their PHP scripts.

To work with Secure Sockets Layer (SSL)-protected Web sites, Curl depends on the OpenSSL package. Curl is available in two versions: one with SSL support and one without. Because you'll probably want to work with SSL-protected data, I recommend that you install the SSL version.

If you choose the SSL version, you must download and install the OpenSSL package separately before you use Curl. You can get OpenSSL binaries for Windows from the GnuWin32 SourceForge project at http://gnuwin32.sourceforge.net/packages/openssl.htm. Check out this site; it also provides a lot of useful UNIX tools ported to Windows.

Download and install the OpenSSL package, then copy the two DLLs into your system32 directory:

copy "C:Program  FilesGnuWin32 binlibeay32.dll" %windir%system32 copy "C:Program  FilesGnuWin32 binlibssl32.dll" %windir%system32 

Now install Curl. You can find the SSL-supported Curl binaries for Windows at http://curl.haxx.se/latest.cgi?curl=win32-ssl-sspi. The most recent version, curl-7.15.0-win32-ssl-sspi.zip, contains the curl.exe file and supporting documentation you'll need.

After you've installed Curl, test that it's working by typing

curl http://isc.sans.org/infocon.txt 

at a command line. If a color word (e.g., green) appears, Curl is working. In this very basic example, Curl fetches the Infocon status from the SANS Institute's Internet Storm Center Web site. Green means that the Internet is operating normally and no significant threat is known. If instead of green you see yellow, orange, or red, put down this article and visit http://isc.sans.org to learn about the heightened Internet threat level. If you get an error, check your Curl installation.

Simply put, Curl fetches a Web page, then returns the page's HTML to the console. But Curl does more. Curl has built-in error checking. For example, typing

curl http://noserverhere 

returns the error Curl: (6) Could not resolve host: noserverhere; Host not found. You can use error return codes in your scripts to test whether a Web page is accessible or whether a Web server is responding. For example, if you use Curl to fetch a Web page nightly—say, that day's Web site statistics—you could include in your script code that looks for error codes. Then, if Curl returns with a code Curl: (7) couldn't connect to host, you could send an alert or email notification right away.
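
Because Curl also returns its error number as the program's exit code, a batch script can test whether a fetch succeeded. Here's a minimal sketch, assuming a hypothetical nightly-statistics page at http://mysite/stats.html:

curl -s -o stats.html http://mysite/stats.html
if errorlevel 1 echo Could not reach the stats server - send an alert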

Fetching Encrypted Data
One of the most important benefits of Curl is that it supports SSL. When Curl requests HTTPS pages, the data is encrypted as it traverses the network, and Curl displays the decrypted text. Curl also checks certificates (the certificate's expiration date, whether the host name matches the host name in the certificate, and whether the root certificate is trusted) and warns you if a certificate isn't fully legitimate. You can specify a particular certificate file by using the --cacert file parameter. To disable certificate checking entirely, use the -k parameter (or its long form, --insecure).
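
For example, assuming a placeholder site name and a local certificate file called myca.pem, the two options look like this:

curl --cacert myca.pem https://mysecuresite/
curl -k https://mysecuresite/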

Not Just for the WWW
Curl provides more than simple Internet file transfers. You can use Curl to get a quick directory listing of an FTP site by typing

curl ftp://myftpsite 

To list one of the site's subdirectories, type

curl ftp://myftpsite/subdir/ 

To download a file, simply specify the filename in the URL. The following example downloads a file called readme.txt directly from the command prompt and displays the file on your screen:

curl ftp://ftp.microsoft.com/deskapps/games/readme.txt 

It's often easier to script Curl for grabbing FTP files than to use an interactive FTP command.

By default, Curl displays output directly to the console, but you can redirect the output to a file by using the -o and -O parameters. (Of course, you'll want to redirect binary files to disk, unless you want to see binary code scroll across your screen.)

Specify -o followed by a filename when you want to retrieve a page and store it in a local file that you name. Specify -O to store the retrieved page in a local file named after the remote document. (If the URL doesn't include a filename, this action will fail.) If you use Curl to make a request to a URL that has no filename and you want to save the output to a file, specify a filename with -o, like this:

curl -o whoisOutputFile http://www.arin.net/whois/ 
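
For comparison, the earlier FTP example can be saved under its remote name with -O; because that URL ends in readme.txt, Curl writes the output to a local file called readme.txt:

curl -O ftp://ftp.microsoft.com/deskapps/games/readme.txt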

Authentication
Curl supports basic, integrated, and digest authentication. Generally (but not for every site), you can access pages behind forms-based authentication by using Curl's post functionality, which I'll show you in a moment. This means that you can push form data such as username and password to a remote Web site that requests this information on a Web page. You can send credentials by using the -u parameter or by inserting them into the URL, as is traditionally done in FTP, like this:

curl ftp://username:password@myftpsite 

Curl can extend this FTP-type support to HTTP, as in this example:

curl http://username:password@myhtmlsite/default.htm 
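
If you'd rather not embed the credentials in the URL, the -u parameter accomplishes the same thing (the host and account names here are placeholders):

curl -u username:password http://myhtmlsite/default.htm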

Curl also provides broad support for accessing Web pages through a proxy server, including proxies that require basic, digest, or NTLM authentication.
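
Here's a sketch of what a proxied request might look like, assuming a proxy named myproxy listening on port 8080 (the proxy name, port, and credentials are placeholders):

curl --proxy myproxy:8080 --proxy-user username:password http://www.example.com/

Add the --proxy-ntlm parameter if the proxy requires NTLM authentication.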

Read the Manual
I can't begin to detail all of Curl's many features, which include uploading files (-T), viewing just the HTTP header information (-I), viewing everything in verbose mode (-v), and silencing output (-s). To learn more about Curl's features, I recommend that you read the manual at http://curl.haxx.se/docs.
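
For instance, to see only the HTTP headers a server returns (a quick way to check the server type or caching directives) without downloading the page body, type

curl -I http://www.example.com/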

Putting It All Together
Now that you've seen some of the basics of using Curl, let's take a look at a simple example that uses Curl to fetch data from a Web site based on specified input. We'll create a simple Whois tool that shows the simplicity and utility of Curl and demonstrates posting data to a Web site using the Curl -d parameter. In this example, Curl sends an IP address to the Arin Whois Web site and then gets the results back from that site. (Whois looks up information about the owner of an IP address.)

Before beginning, it's important to examine the Web site that you want to use with Curl because every site is coded differently and Curl might not work the same way with each one. Visiting the site first gives you the information you need to run Curl against it. In this example, I use a browser to visit the Web site http://www.arin.net/whois/ and note that the site has a single data-input field where users enter the IP address they want to look up. This field is part of a Web-based form, so we need the form's details. To extract them, I use the Perl script formfind.pl (http://cool.haxx.se/cvs.cgi/curl/perl/contrib/formfind?rev=HEAD&content-type=text/vnd.viewcvs-markup), which pulls the form data into easy-to-read output and saves you from hunting through the HTML manually. Of course, you'll need to run formfind.pl on a computer that has Perl installed. (For a good Win32 Perl package, check out ActiveState ActivePerl at http://www.activestate.com.)

Let's walk through the example. First, fetch the Web site containing the form that prompts for information:

curl -o whoisOutputFile http://www.arin.net/whois/ 

This example grabs the Whois page from http://www.arin.net and saves it in the whoisOutputFile text file, which contains the same HTML that your browser would render if you visited the site.

Next, find and isolate the form data:

./formfind.pl < whoisOutputFile 

Formfind displays the form variables and their optional values. In this example, the output is pretty straightforward and looks like Figure 1.

Notice the Input form data named queryinput. This is the text field where we want Curl to send the IP address that we want to look up. (The specific IP address doesn't matter—in this example, I use a Microsoft address.) Using Curl's -d parameter, we send the IP address that we want to look up to the field queryinput, like this:

curl -d "queryinput=   207.46.133.140"   http://ws.arin.net/ cgibin/whois.pl 

The -d parameter instructs Curl to look for form data—in this case, queryinput, which is the IP address we want to look up. Also notice that Curl's target has changed; the form expects to post to a new URL, which is actually the script whois.pl. You can see this new target in the formfind output in Figure 1.

In this example, we also get the HTML of the Whois answer, but it's clouded by a bunch of HTML tags. By default, the Curl status message shows the document size, percentage complete, and transfer speed. Let's clean up the output a bit and filter for the name of the organization that owns the IP address by using the -s parameter to suppress the Curl status. We also want to run the command through grep so that only the OrgName is returned, like this:

curl -s -d "queryinput=   207.46.133.140"   http://ws.arin.net/ cgibin/whois.pl   | grep OrgName 

For this example, the output shows that OrgName is Microsoft Corp.

Finally, we'd rather not hard-code the IP address, so let's wrap this request into a simple script named arin.bat:

@echo off
curl -k -s -d "queryinput=%1" http://ws.arin.net/cgi-bin/whois.pl | grep OrgName 
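
You then call the script with the IP address you want to look up as its first argument, for example:

arin 207.46.133.140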

Figure 2 shows the output we get when we run the script from the command line.

Using Curl Is Easy
You've seen how Curl fetches and parses data from a remote Web site and lets you view and control Web-based data from the command line. Curl includes a range of command-line options. You can find particular options quickly by using --help. For example, type

curl --help | grep proxy 

to list all the proxy-related options.

Even though Curl is easy to use, it isn't always the best tool choice. For example, if you want to crawl a Web site and grab multiple files, you might want to consider instead using the command-line Internet file grabber wget, which supports wildcards. I'll examine wget in an upcoming article.
