An Approach to Parallelization, Part 1: How to Use the Process Farm

Downloads
6123.zip

Editor's note: This article is the first part of a two-part series about how to use parallelization techniques in Perl scripts. It introduces you to the concept of parallelization and discusses how to use one approach to parallelization: the Process Farm system.

In my job, I have to perform many systems administration tasks in which I'm limited by network latency rather than network bandwidth or local computer power. These tasks typically involve running a program on one machine to gather information from many other machines. To speed up the execution of these tasks, I created the Process Farm system, which provides an approach to parallelization (i.e., running several processes simultaneously).

For example, suppose you have to ping every address on a Class C subnet with 250 addresses. The simplest approach (excluding pinging the broadcast address) is to sequentially ping every address. If only 30 percent of the addresses are in use and you wait 1 second before deciding whether an address isn't in use, the pinging process takes about 3 minutes (175 seconds) to ping a subnet. The limitation, however, isn't the local CPU or network bandwidth but rather network latency.

One solution is to ping multiple addresses simultaneously—in other words, run several ping processes in parallel. For example, you can spin off 25 processes, have each process ping 10 addresses, and funnel the information back into one process for reporting. The Process Farm uses this parallelization approach but at a more sophisticated level. The system creates a parent process and a pool of child processes. After each child process finishes a task, the parent process assigns that child process a new task. The parent process also tallies the information that the child processes return. Using 25 child processes to ping a 250-address Class C subnet with 30 percent coverage takes about 16 seconds. Thus, despite the startup time for the child processes (typically between one-third and one-half of a second per process), you still achieve a tenfold improvement in execution time over pinging each address sequentially.

Process Farm Advantages and Limitations
As the Class C subnet example shows, you can achieve significant improvements in execution speed with the Process Farm. This system also offers other advantages, including

Ease of use. You simply write the function you want to run in parallel, include it in a child process, and add boilerplate code to the parent process.
Efficient handling of variable job lengths. The Process Farm assigns jobs one by one to the child processes, making it ideal for scripts that have unpredictable execution times.
Low probability of child process orphaning. The Process Farm includes code that makes orphaning a rare event.
Versatility. The Process Farm isn't limited to pinging addresses. Many systems administration tasks can benefit from this parallelization approach. For example, I use the Process Farm daily to search for machines that are available for updates that require machines without any logged-on users.

The Process Farm code is useful in many situations. However, it has two limitations:

Startup time. Although the code uses asynchronous startup, it takes time to start Perl, compile the script, and connect through the TCP socket. On a dual 200MHz Pentium Pro system with 128MB of RAM, the child process startup time is about one-third of a second. Thus, if you spin off 30 child processes, the startup time is about 10 seconds.
Memory utilization. An empty child process uses about 2.3MB of RAM; a child process that performs a ping function uses about 2.6MB of RAM. If you spin off 30 child processes that ping IP addresses, you use 78MB of RAM. Therefore, keep memory utilization in mind when selecting the number of child processes to use.

Installing the Process Farm
The requirements for the Process Farm are few: ActivePerl build 504 or later and the latest build of the Process Farm. With some code tweaking, you might be able to use the Process Farm with other builds of Perl, but for out-of-the-box compatibility, your best bet is a recent build. You can get ActivePerl for free from ActiveState Tool (http://www.activestate.com).

After you install ActivePerl, you need to obtain and install the Process Farm. You can download the files from Process Farm.zip on the Win32 Scripting Journal Web site or from ActiveState Tool at http://opensource.activestate.com/ authors/tobyeverett/proc farm/procfarm.zip. Create the directory C:\Perl\site\lib\ProcFarm (where C:\Perl is the location you installed ActivePerl), and unzip the ProcFarm.zip file into that directory. The directory will include these items:

MyRpcChild.pl, a file defining subroutines that the child process uses
MyRpcParent.pm, a module defining the ProcFarm::MyRpcParent objects that the ProcFarm::MyRpc Pool object uses to communicate with the child processes
MyRpcPool.pm, a module defining the ProcFarm::MyRpcPool object that supplies the primary interface to the thread pool
MyRpcPort.pm, a module defining the ProcFarm::MyRpcPort object that stores information about the TCP port to which the child processes connect
Demos, a directory in the ProcFarm directory

You use the downloaded file and modules with two scripts—a parent process script and a child process script—that you must create. Because creating the child process script is easier, let's look at that script first.

Creating the Child Process Script
The child process script is a freestanding Perl program that connects to the parent process script. Listing 1 contains an example of a child process script, PingChild.pl. Here's how PingChild.pl works. The script first brings in the Net::Ping module, which the script uses later to ping IP addresses. (The Net::Ping module is part of the standard Perl library.) The script then includes boilerplate code that every child process script needs. As Callout A in Listing 1 shows, the boilerplate code brings in MyRpcChild.pl, calls the init subroutine, and uses a while statement to loop around a call to the main_loop subroutine. (MyRpcChild.pl defines both the init and main_loop subroutines.) The init subroutine initiates a connection to the parent process; the main_loop subroutine waits for a command from the parent process, executes it, and returns the command's result.

You can add as many subroutines as you want to a child process, but you can't name them init or main_loop. In addition, you must make sure that the subroutines deal properly with all error conditions. The Process Farm is intolerant of child processes that terminate prematurely. Thus, you need to decide which return values you want to use to communicate problems to the parent process and use the return statement in conjunction with those values. For more complicated child processes, I frequently have the subroutine return a list with two values. The first value specifies whether the command executed successfully, and the second value is either the result or an error message.

The rest of the code in Listing 1 defines the ping subroutine. This subroutine creates a Net::Ping object ($p), uses $p to ping the specified address, and waits 1 second for a response. To improve performance, the subroutine uses a global variable for $p and creates the Net::Ping object only once.

Creating the Parent Process Script
The parent process script is a Perl program that performs many tasks. This script

Accepts parameters from the user (either from the command line, by prompting, or by reading them from a file)
Creates a pool of child processes
Builds a list of jobs to execute
Tells the pool to have the child processes execute those jobs
Outputs the accumulated data

Listing 2 contains an example of a parent process script, PingParent.pl. This script first brings in the MyRpcPool module. The script then uses the code at Callout A to parse the command-line parameters and load them into variables. Next, PingParent.pl determines how many child processes to create. Presuming that you have adequate memory, the optimum number of threads is a function of the square root of the number of jobs to run. The equation is

where n is the number of jobs to be run, t_j is the time required by a child process to execute one job, and t_s is the startup time associated with each child process. I derived this equation by taking the total time equation

where x is the number of threads and solving it for x when the first derivative with respect to x is 0. You can estimate t_j from how the job typically behaves. Typically, t_s is between one-third and one-half of a second, depending on the number and complexity of the modules that the child processes use. Thus, for the aforementioned ping process, t_j/t_s is about 2.

One way to verify that your estimate of t_j/t_s is accurate is to compare the time it takes to start the child processes with the time it takes to the execute the jobs. If you're starting the optimal number of threads, the two values will be roughly equal.

If you have a fixed amount of memory, the optimum number of threads has a hard upper limit. Starting 20 PingChild.pl processes requires about 75MB of RAM, which is a safe amount for a machine with 128MB that is also doing other work (e.g., running Microsoft Outlook).

You can use a simple procedure to determine how much RAM each child process requires. Begin this procedure by creating a parent process that spins off a fixed number of child processes and waits for user input (e.g., $_ = ;) before exiting. Start the Task Manager, and note the memory utilization value on the Performance tab. Execute the parent process, and again note the memory utilization value. Divide the difference between the memory utilization values by the number of child processes to get the estimate.

After PingParent.pl determines how many child processes to create, it prints an informative message and calls the set_timer subroutine. The set_timer and get_timer subroutines time the script's various phases. The set_timer subroutine calls the Win32::GetTickCount module (which is part of the standard Perl library) to get the current Win32 tick count (i.e., the number of milliseconds since the Win32 system started) and stores the value in a global variable ($start_count). The get_timer subroutine makes another call to Win32::GetTickCount and returns the number of seconds that have elapsed since set_timer was last called.

At Callout B in Listing 2, PingParent.pl creates a ProcFarm::MyRpcPool object. PingParent passes four parameters to the new method call:

The number of child processes to create ($poolsize)
The TCP port to listen on (9000)
The child process name (PingChild.pl)
The working directory to use (Win32::GetCwd)

When you specify the TCP port in the second parameter, keep in mind that several parent processes can't use the same TCP port if they're executing simultaneously. Even if you use separate TCP ports, executing several Process Farm-based scripts simultaneously is a recipe for disaster unless you have a lot of memory. In addition, you need to avoid TCP ports that other services or programs use.

Instead of using the Win32::GetCwd function in the fourth parameter, you can hard-code the working directory in the script. Which approach you use depends on the situation. If you intend to move the script, using Win32::GetCwd is easier, but you must be in the same directory as the script to run it. If you're accessing a standardized library of child processes, hard-coding the working directory is easier.

After creating $RpcPool, PingParent.pl prints a message informing the user that it created the pool and the amount of time it took to create it. This information is useful for impromptu benchmarking. The script then resets the timer.

The lines at Callout B in Listing 2, called the marshaling loop, load up the job queue. First, the marshaling loop uses a foreach statement to loop a scalar variable ($i) over the range of the last octet. Next, the loop concatenates the first three octets with a period and the current final octet ($i) to form the current IP address. Finally, the marshaling loop adds a waiting job to the ProcFarm::MyRpcPool object, which will store all the information needed to execute the job until it gets assigned to an execution thread.

The add_waiting_job method accepts three or more parameters: a unique identifier that identifies the return data ($ip_addr), the name of the subroutine that the child process executes ('ping'), and any parameters this subroutine needs ($ip_addr). The unique identifier is crucial because it becomes the key to the hash that will store the values returned when the subroutine executes.

Once PingParent.pl loads the job queue, it tells $RpcPool to execute all the jobs that the marshaling loop added. The parameter passed to the do_all_jobs method specifies the number of seconds to wait between passes through the pool looking for finished jobs.

When $RpcPool finishes executing all the jobs, PingParent.pl prints a report detailing which IP addresses were in use. To print this report, Ping Parent.pl obtains the return data from $RpcPool, puts the data into a hash (%ping_data), clears $RpcPool's return data area, and prints a report based on the data in %ping_data. Although the code to clear the return data area isn't necessary in this example, I included it in case you want a pool to hold several sets of jobs. Because child processes can have several subroutines, you can amortize the child processes' startup cost over multiple job sets if you execute the child processes from the same parent.

The code at Callout C in Listing 2, called the post-processing loop, prints the report. The code is self-explanatory, except for line

if ($ping_data\{$ip_addr\}->\[0\]) \{

Because a subroutine can return more than one value, the hash stores the returned values as anonymous arrays. The previously defined unique identifiers access these arrays. The ->\[0\] code pulls out the first value from the referenced array.

PingParent.pl ends with the code for the set_timer and get_timer subroutines. The parent process script is now ready to use.

Demonstrating the New Scripts
To see how PingParent.pl and PingChild.pl work together, open a command prompt and go to the C:\Perl\site\lib\ProcFarm\Demos directory. Pick a Class C subnet, and run the script with a command such as

perl PingParent.pl 135.40.94.2-254

The parameter specifies the range of IP addresses to ping. This command results in PingParent.pl creating a pool of PingChild.pl child processes, which will collectively ping the specified range of addresses. When the PingChild.pl processes finish all their jobs, the PingParent.pl will output a list of active IP addresses.

Applying the Process Farm to Existing Scripts
If you're proficient in Perl, you can incorporate the Process Farm into nonparallelized scripts that you've already written. If you don't feel comfortable writing nonparallelized Perl code, you'll have problems writing the more complicated parallelized version. If you want to improve your Perl skills, I recommend that you read Programming Perl, Second Edition by Larry Wall, Tom Christiansen, and Randal L. Schwartz (1996) or Learning Perl on Win32 Systems by Randal L. Schwartz, Erik Olson, and Tom Christiansen (1997). O'Reilly and Associates (http://www.oreilly.com) publishes both books. When you feel comfortable writing nonparallelized Perl code, follow these eight steps:

Determine whether your script can benefit from parallelization. If your script contains a loop that takes up most of the execution time and spends a lot of time waiting on external feedback, the body of that loop is a candidate for parallelization. If the loop isn't a foreach statement, rewrite it so it is.
Split the body of the foreach statement into three sections. In the first section, include the code that marshals the parameters. In the second section, include the code that calls the subroutine, the subroutine, and the code that returns the subroutine's results to the third section. In the third section, include the code that processes the subroutine's results. The subroutine can't involve any user interaction (e.g., waiting for input, printing output to the screen); it must operate solely on passed-in parameters. In addition, the subroutine can't access any global variables (e.g., arrays, hashes).
In subsequent steps, you transform the second section into a child process script and the first and third sections into a parent process script. In the parent process script, the first section becomes the marshaling loop and the third section becomes the post-processing loop.
Create the child process script. Place the second section in a separate Perl script. Add the boilerplate code (i.e., Callout A in Listing 1) and any necessary use statements.
Begin editing the parent process script. Bring in the MyRcpPool module by specifying
```
use ProcFarm::MyRpcPool;
```
at the top of the script.
Create the pool object. Use the ProcFarm::MyRpcPool->new statement to create the pool object. Before this statement, add code to determine how many threads to create (or use a fixed pool size).
Create the marshaling loop from the first section. Add the code that invokes the add_waiting_job method to the first section. Note that the loop variable for the foreach statement makes a handy key. The command is whatever you titled your subroutine, and the parameters are the marshaled parameters the command needs to do its job.
Add the code for the do_all_jobs and get_return_data methods after the marshaling loop.
Create the post-processing loop from the third section. The foreach statement in the post-processing loop will typically be identical to the foreach statement in the marshaling loop, except that this loop's body performs the post-processing of the return data.

More Script Examples
On the Win32 Scripting Journal Web site, you'll find more examples of parent process and child process scripts in the Web-exclusive sidebar "Getting Creative with Process Farm Scripts." The examples include

A version of PingParent.pl that provides feedback immediately instead of waiting until all the jobs finish
A version of PingParent.pl that starts up the child processes immediately and lets you use the pool to ping multiple ranges of IP addresses
A version of PingChild.pl that includes a more robust ping subroutine and two additional subroutines

What's Next?
In the second part of this two-part series, I'll discuss how the internals of the Process Farm system work. I'll also describe the process I followed to develop this system.

Comments

Plain text