Developing a Server Failure Notification System, Part 2

Downloads
5883.zip

Editor's note: This article is the second part of a two-part series. The first part, which appeared in the July 1999 issue, looked at the requirements, input files, main block, and one of the five modules for ServerTester.bat, a script that notifies IT staff of server failures. This second part discusses the remaining modules and how to execute ServerTester.bat.

The :SHARTEST Module
Listing 1 contains the :SHARTEST module. The :SHARTEST module begins like the :PINGTEST module: It displays a message specifying that share testing is under way, parses the ShareList.txt file for a Uniform Naming Convention (UNC) path, sets the captured information to the sharepath variable, and calls the :SHARETEST procedure, which tests to see whether that share exists. After the module applies the :SHARETEST procedure to each line in ShareList.txt, the script returns to the :MAIN block.

The :SHARETEST procedure uses an IF NOT EXIST statement, which you can use to determine whether a file or directory exists. Here's how the procedure works:

1. In case the test fails, the procedure obtains the current date and the time for the page message and log file.
2. The procedure uses an IF NOT EXIST statement with the share's UNC path to check for the existence of that share. (The quotes around the UNC path in the statement handle spaces if they exist.) If the share doesn't exist, the procedure initiates the :SHARFAIL subroutine in step 3. If the share exists, the procedure initiates the :FINSHARE subroutine in step 4.
3. If the share doesn't exist, the procedure initiates paging with the :SHARFAIL subroutine. This subroutine works the same as the :NAMEFAIL and :IPFAIL subroutines, except that it identifies the type of failure as Share is Offline in the page and in FailureLog.txt.
4. If the share exists, the procedure initiates the :FINSHARE subroutine, which sends the script to the first line in the :SHARTEST module, which parses the next line in the ShareList.txt file to obtain another share path for testing.
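
The share check in these steps can be sketched outside the batch listing. The following is a minimal Python illustration, not the code in Listing 1. The function name `find_offline_shares` is hypothetical, and the sketch assumes that ShareList.txt holds one UNC path per line. Here `os.path.exists` stands in for the IF NOT EXIST statement.

```python
import os
from datetime import datetime


def find_offline_shares(share_list_path):
    """Return (path, timestamp) for every listed share that does not exist.

    Hypothetical helper: assumes one UNC path per line in the list file.
    """
    failures = []
    with open(share_list_path, encoding="utf-8") as share_list:
        for line in share_list:
            share_path = line.strip()
            if not share_path:
                continue  # skip blank lines
            # Capture the time first, in case the test fails (step 1).
            checked_at = datetime.now()
            # Similar in spirit to IF NOT EXIST "<share path>" (step 2).
            if not os.path.exists(share_path):
                failures.append((share_path, checked_at))
    return failures
```

Each returned pair gives a caller what it would need for a Share is Offline page and log entry. The paging itself is left out of the sketch.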

The :SERVTEST Module
The :SERVTEST module uses the Netsvc utility with the /query switch to determine whether the specified services are running. (For information on how to use the Netsvc utility, see the sidebar "The Versatile Netsvc Utility," page 14.) As Figure 1, page 12, shows, the /query switch can return several possible results, including Service is running, Service is stopped, Service is paused, and Error code 53. Thus, the module tests the third word of the utility's output for three conditions: running, stopped, or paused. If the module finds no response or an error code (such as Error code 53), it assumes the service is unavailable.

Listing 2, page 12, contains the :SERVTEST module, which starts by displaying the message that service testing is under way. The module then parses the ServicesList.txt file for the server name, friendly service name, and real service name, sets them to the server, friendlyname, and servicename variables, respectively, and calls the :CHECKSVC procedure. After the module applies the :CHECKSVC procedure to each line in ServicesList.txt, the script returns to the :MAIN block.

The :CHECKSVC procedure tests whether each service is running, stopped, or paused. Here's how the procedure works:

1. The procedure obtains the current date and the time for possible use in the page message and log file.
2. The procedure initiates the Netsvc utility, capturing the third word of the utility's results. After setting the captured information to the condition variable, the procedure calls the :STATUS subroutine.
3. The :STATUS subroutine uses an IF statement to test whether the condition variable's string matches the 'running' string. The /i switch specifies that the string comparison is case-insensitive. If no match occurs (i.e., the service isn't running), the procedure continues to the IF statement in step 4. If a match occurs (i.e., the service is running), the procedure goes to the :ENDSTAT subroutine in step 9.
4. The :STATUS subroutine performs an IF statement to test whether the condition variable's string matches the 'stopped' string. If a match occurs (i.e., the service has stopped), the procedure goes to the :STOPPED subroutine in step 5. If no match occurs, the procedure continues to the next line, which sends the procedure to the :TRYPAUSE subroutine in step 6.
5. The :STOPPED subroutine pages the on-call staff members, giving them a message specifying the server name, friendly service name, type of failure (i.e., Service Stopped), and date and time of failure. After the subroutine records the recipients' pager personal identification numbers (PINs), server name, friendly service name, type of failure, and date and time of failure in the FailureLog.txt file, the script goes to the :ENDSTAT subroutine.
6. The :TRYPAUSE subroutine determines whether the service has paused. The subroutine uses an IF statement to test whether the condition variable's string matches the 'paused' string. If a match occurs (i.e., the service has paused), the procedure goes to the :PAUSED subroutine in step 7. If no match occurs, the procedure continues to the next line, which sends the script to the :TRYUNAV subroutine in step 8.
7. The :PAUSED subroutine works the same as the :STOPPED subroutine, except that it identifies the type of failure as Service Paused in the page and in FailureLog.txt. The subroutine finishes by sending the script to the :ENDSTAT subroutine.
8. The :TRYUNAV subroutine makes the assumption that the service is unavailable. Thus, it initiates a page and records the event in FailureLog.txt, identifying the type of failure as Service Unavailable. The subroutine finishes by sending the script to the :ENDSTAT subroutine.
9. The :ENDSTAT subroutine sends the script to the first line in the :SERVTEST module, which parses the next line in the ServicesList.txt file to obtain another service for testing.
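
The decision chain in steps 3 through 8 reduces to a test on the third word of the query output. The following minimal Python sketch illustrates that logic only; it isn't the batch code in Listing 2. The function name `classify_service` is hypothetical, and the sketch applies a case-insensitive comparison to all three conditions, which is an assumption (the article mentions the /i switch only for the running test).

```python
def classify_service(query_output):
    """Map service query text to a failure type, or None if running.

    Hypothetical helper that mirrors the :STATUS, :TRYPAUSE, and
    :TRYUNAV decisions described in the steps above.
    """
    words = query_output.split()
    # The third word carries the state, as in "Service is running".
    condition = words[2].lower() if len(words) >= 3 else ""
    if condition == "running":
        return None  # healthy: nothing to page or log
    if condition == "stopped":
        return "Service Stopped"
    if condition == "paused":
        return "Service Paused"
    # No response or an error code: assume the service is unavailable.
    return "Service Unavailable"
```

For example, `classify_service("Error code 53")` returns Service Unavailable because the third word, 53, matches none of the three known conditions.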

The :FILES Module
The :FILES module determines whether the four input files are available. As Listing 3 shows, the module begins by displaying the message Testing for existence of files that the script depends on for input. The module then uses the IF NOT EXIST statement to check whether each file is in the proper location. If a file isn't in the proper location, the script proceeds to the :ERROR2 module. The script's flow returns to the :MAIN block after testing for the input files.

The :ERROR2 module displays a message stating that you need to verify that the files exist, making sure that their path and syntax are correct. The module also displays the syntax and example lines for each input file.
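
As a rough illustration of this kind of preflight check, the following Python sketch reports which input files are missing. It isn't the code in Listing 3. The function names are hypothetical, and the caller supplies the file paths because the correct locations depend on your installation.

```python
import os


def missing_input_files(paths):
    """Return the paths that do not point to an existing file."""
    return [path for path in paths if not os.path.isfile(path)]


def check_inputs(paths):
    """Print a message for each missing file; return True if none are missing."""
    missing = missing_input_files(paths)
    for path in missing:
        # Stands in for the guidance that the :ERROR2 module displays.
        print(f"Input file not found, verify its path: {path}")
    return not missing
```

Checking every file before reporting, rather than stopping at the first missing one, is a design choice of this sketch; it lets you fix all the paths in one pass.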

The :GETPIN Module
The :GETPIN module in Listing 4 parses the OnCallList.txt file to determine the IS staff members who get to come in at 3:00 a.m. to get a critical server back online. Here's how the module works:

1. The module displays a message that it's obtaining the date and determining the pager PINs.
2. The module uses the Date command with the /t switch to obtain the system date, which looks like
```
Sat 06/05/1999
```
Because the day of the week isn't included in OnCallList.txt, the module parses the Date command results and captures only tokens 2, 3, and 4, setting them to the date1 variable.
3. Using the Findstr command, the module parses the OnCallList.txt file to find the string that matches the date1 variable's string. The /i switch specifies that the search is case-insensitive. The module captures the information in token 4 (i.e., the pager PINs of the rotating on-call recipients) and sets it to the recipientweekend variable. If the recipientweekend variable is empty (which the two single quotes signify), the script goes to the :ITS_A_WEEKDAY procedure in step 4. If the recipientweekend variable isn't empty, the script goes to the :ITS_A_WEEKEND procedure in step 5.
4. The :ITS_A_WEEKDAY procedure parses the OnCallList.txt file to find the string that matches the string "Weekday". The module captures the information in token 2 (i.e., the pager PINs of the default on-call recipients) and sets it to the recipient variable. The script then goes to the :TESTING subroutine.
5. The :ITS_A_WEEKEND procedure parses the OnCallList.txt file to find the string that matches the string "Weekend". The module captures the information in token 2 (i.e., the pager PINs of the default on-call recipients) and sets it to the recipient variable. The procedure adds the recipient variable to the recipientweekend variable, which the module set previously. The script proceeds to the next line, which is the :TESTING subroutine.
6. The :TESTING subroutine uses the Goto command to send the script to the :MAIN block.
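
The lookup logic in these steps can be sketched in Python. This is an illustration only, not the code in Listing 4. It assumes a hypothetical OnCallList.txt layout in which each line is a key, a comma, and the pager PINs (for example, `06/05/1999,3333333` or `Weekday,1111111`). Part 1 defines the real layout, so adjust the parsing to match. The function names, the sample PINs, and the space used to join the two recipient lists are also assumptions.

```python
def date_key(date_output):
    """Drop the day of the week from output such as 'Sat 06/05/1999'."""
    parts = date_output.split()
    return parts[-1] if parts else ""


def lookup(on_call_lines, key):
    """Return the PINs on the first line whose key matches, else ''."""
    for line in on_call_lines:
        name, _, pins = line.strip().partition(",")
        # Case-insensitive match, like Findstr with the /i switch.
        if name.strip().lower() == key.lower():
            return pins.strip()
    return ""


def get_recipients(on_call_lines, date_output):
    """Pick pager PINs for a date the way the steps above describe."""
    recipient_weekend = lookup(on_call_lines, date_key(date_output))
    if not recipient_weekend:
        # Date not listed: treat it as a weekday and use the defaults.
        return lookup(on_call_lines, "Weekday")
    # Date listed: add the default weekend staff to the rotating staff.
    recipient = lookup(on_call_lines, "Weekend")
    return " ".join(p for p in (recipient_weekend, recipient) if p)
```

With the sample lines `Weekday,1111111`, `Weekend,2222222`, and `06/05/1999,3333333`, a call for `Sat 06/05/1999` yields both the rotating and default weekend PINs, whereas a call for an unlisted date falls back to the Weekday entry.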

Running ServerTester.bat
The ServerTester.bat script already includes the necessary filenames and syntax. However, you need to customize the script's paths and paging code before you run it. The paths you need to customize are

The paths to the four input text files
The path to the FailureLog.txt file
The path to the Netsvc and sleep.exe utilities
The path to Internet Explorer (IE)

If you choose to use the AutoExNT service rather than sleep.exe, you need to rename ServerTester.bat to autoexnt.bat and place the script in the System32 directory before you install the AutoExNT service.

The ServerTester.bat script uses the Skytel paging-service provider to send text messages to pagers. ServerTester.bat sends input to a Common Gateway Interface (CGI) Perl script on Skytel's Web site (http://www.skytel.com); the CGI Perl script, in turn, initiates the page. If you use Skytel, you don't need to customize the paging code's syntax. If you use another paging-service provider, you need to adapt that syntax to the provider's system, and the provider can help you develop the syntax for sending input to it.

After you create your input text files and customize the paths and paging code in ServerTester.bat, you're ready to run the script. You need to run ServerTester.bat on a server or workstation that gives you a client's view of the target servers you plan to monitor and that has a dependable link to the Internet to handle the failure-triggered paging events. If you use the script to monitor servers across a WAN, the script detects network link failures but leaves you without accurate information about resource availability at the remote location. Running an instance at each remote location gives you better granularity of failure detection. If you have many clients geographically separated from the server room, you need to locate the script on a machine at the remote location to capture server status as your clients see it.

To ensure that the script will work in an emergency, I recommend that, at least once a week, you insert an incorrect IP address or server name to trigger a page. Just give your on-call staff advance notice. Your server failure notification system is now in place.
