American Power Conversion (APC), a global provider of UPS systems, uses SQL Server as the back end for a number of applications, including Siebel call center, sales, and field-service modules; Web infrastructure; and homegrown manufacturing software. Worldwide, the company has more than 60 SQL Servers containing databases exceeding a total of 1TB in disk storage. Each SQL Server runs approximately 10 jobs every day, so the whole system runs close to 600 jobs daily, performing a variety of functions that include database backups, script transfers for Web sites, and auditing. To perform daily routine checks, DBAs had to log into every server. Thus, a DBA would spend an average of an hour every day checking different SQL Servers for failed jobs. APC needed a centralized management tool that would keep track of failed jobs, alert DBAs worldwide, and publish the alerts on the Web. The company also needed to monitor front-end Web servers not only for failed jobs, but also for performance metrics such as CPU utilization, memory utilization, and status of services. In addition, internal and external auditors wanted to see daily reports of jobs running on all SQL Servers globally and their status. To meet all these needs, I decided that I needed to build a monitoring and alerting tool to centrally monitor all tiers of our Web and customer relationship management (CRM) infrastructure, which includes hardware load balancers, Web servers, the OS, and databases. Our solution would also let internal and external auditors see from a centralized location all jobs running on all tiers and their success or failure.
The solution wouldn't be easy, but it was possible. The tools I used to create this monitoring solution were SQL Server, Visual Studio .NET, T-SQL, Data Transformation Services (DTS), and SQL Server 2000 Reporting Services. To bring information from all these parts together, I created a centralized monitoring Web site that had several links that would display the information DBAs needed about SQL Servers, Siebel applications, APC's Web infrastructure, and OS scheduled tasks. Figure 1, shows what the monitoring site's home page looks like.
SQL Server monitoring. To create a comprehensive SQL Server monitoring solution, I used SQL Distributed Management Objects (SQL-DMO) to create a Windows Service that polls all APC SQL Servers worldwide every 10 minutes. The service stores information—such as failed jobs, time of failure, and reason for failure—in a SQL Server table. Then, per auditor requirements, the service pulls status information for all jobs running on all SQL Servers globally and stores the information in a different table. After the information is in the database, the solution groups the servers by region and sends an email message to the DBAs who are on call at that time, notifying them of the jobs' status. Finally, the service publishes the information on the central monitoring Web site.
Siebel monitoring. I created this part of the solution a couple of years ago. My Seibel monitoring tool gathers information about Siebel jobs by using a combination of Windows NT batch files that run commands to send status information to text files. I used DTS T-SQL scripts (because they weren't delimited) to parse the text files and store them in a SQL Server table. Because the data wasn't delimited, I had to take the raw data and parse it to get meaningful information from it.
Another important part of my Siebel-monitoring tool is monitoring replication of Siebel data between geographically dispersed servers in the United States, Ireland, and Singapore. The only way to do this was to monitor specific folders in the Siebel installation that contained .dx files (native Siebel replication files).
If the number of files was greater than 50 at any location, the increase would indicate a backlog in the replication process. To capture this information, I used Visual Basic .NET code that runs in a Windows Service and stores the required information in a SQL Server table, just as I did with my SQL Server monitoring tool.
Because I already had the Siebel monitoring tool, all I needed to do to integrate it with my new monitoring solution was to take the data from the SQL Server table, send alerts, and publish the information on the monitoring Web site. The Siebel page contains information not only about jobs running in Siebel, but also the status of outbound email messages to field service engineers that workflows (Siebel application processes) running within Siebel generate.
Web infrastructure monitoring. I subdivided the Web monitoring part of the solution into three categories: hardware load balancers, Web servers, and SQL Server back end. APC's six hardware load balancers maintain a workload balance for several servers in a farm (or group of servers) that have a unique farm name. Each of these servers has its own IP address, and at any time, users might be connected to servers that are active in a farm. Servers around the world are divided into two regions: EMEA (Europe) and NAM (North America). I could get all the information our DBAs needed about a hardware load balancer by using the specific Management Information Base (MIB) commands. However, the information is in separate text files. I used T-SQL to combine the text-file information into one table, which I then published on the central monitoring Web site. The Web page shows DBAs each farm, the servers in the farm (servers can be in more than one farm), and each server's status, IP address, number of attached users, and region.
The second part of Web monitoring was getting information about Web servers. I set up a Windows Service that polls all Web servers every 2 minutes. The service gathers data about CPU utilization, memory utilization, Microsoft IIS connections, and the status of Web Services and Cold Fusion Services and stores this data in a SQL Server table. The Windows Service then uses T-SQL to aggregate the information and publish it on the monitoring Web site. DBAs use the published information to calculate resource availability and usage. I also created another Windows Service that gathers similar information from the Web back end for reporting purposes. I used Reporting Services to compile the information the Windows Service collects into graphs and charts for trending. Reporting Services automatically publishes the charts on the monitoring Web site every month so that management can access them.
OS scheduled tasks. Monitoring OS tasks—such as scheduled batch-file executions and custom data-transfer jobs—was the last piece I added to APC's monitoring solution. I gathered task-status information from Windows NT scheduler, Macromedia Cold Fusion Task Scheduler, 24X7 WinPro Scheduler, and NSI Double-Take scheduler, primarily in the form of log files and COM-based access. Then, I used T-SQL to parse the information I'd gathered and compile it into scheduled DTS packages and store the information in SQL Server. The monitoring Web site gets OS information that the DBAs need from SQL Server.
Benefits of the Solution
Once we'd successfully deployed the complete monitoring solution, APC saw the benefits immediately. Support calls logged at our Help desks for problems related to Siebel, SQL Server, or the Web dropped by 80 percent because we were alerted to problems well before end users noticed them. APC could regularly meet its service level agreements (SLAs). Availability of the SQL Server and Siebel platforms increased to 99 percent. Before this tool was in place, jobs sometimes failed repeatedly, and the failures would often go unnoticed because of the sheer number of servers DBAs had to monitor. Now, failed jobs are detected and fixed immediately. Our Web commerce revenue is directly tied to the availability of our commerce Web infrastructure, so this solution has helped increase the overall revenue for the company—which, in the end, is what really matters. Overall, we had happy end users, increased revenue, and increased platform availability.
|2004 SQL Server Magazine Innovator Award Winners|
Senior SQL Server DBA,
American Power Conversion
West Kingston, Rhode Island
1ST RUNNER UP
Senior Technical Architect, Teksouth
2ND RUNNER UP
3RD RUNNER UP
Data Warehouse Developer
Data-Tier Architect, Parker Hannifin
and Michael Meier
Osterreichische Kontrollbank AG
Systems Analyst, Unisys
CITCO Technology Management
Fort Lauderdale, Florida
Senior Database Consultant
Marlow, United Kingdom
City of Aurora, Colorado