The old adage "An ounce of prevention is worth a pound of cure" holds true today in the realm of network monitoring. Monitoring your servers, the applications that run on them, and your network devices can alert you to problems and give you a chance to fix them before your users notice. By monitoring your network and keeping a history, you can draw on this data to provide accurate information to users who might have an exaggerated notion of how often a particular problem has occurred. Just as important, network monitoring lets you know exactly what's happening on your network, as well as who's accessing it and when. These goals correspond to two types of monitoring. In this article, I refer to watching for problems as operations monitoring and to tracking network activity and access as security monitoring.
Large enterprises sometimes divide these two types of monitoring into separate processes performed by operations and information security staff, but small-to-midsized businesses (SMBs) tend to implement one overall monitoring process, for several reasons. Regardless of budget and staff size, SMB networks typically don't need the level of operational monitoring that larger enterprises require. SMB networks don't run as close to capacity as enterprise networks do, and they're much simpler to maintain. Also, SMB networks aren't as highly engineered and don't need the detailed trend analysis and reporting that slower-moving enterprises require.
In this two-part series, I'll identify the various devices and systems that you should monitor in an SMB for both security and operations purposes. In Part 1, I identify the most common data-monitoring sources, including Windows event logs, Syslog, and SNMP, and in Part 2, I'll show you how to build a barebones network-monitoring solution by using free or inexpensive tools.
What to Monitor?
I refer to the data from monitored devices as telemetry. Which protocols and data formats do systems commonly use to report telemetry? How can you monitor all these separate sources of data and create alerts and reports to transform the data into real information? As you'll see, one of your most crucial tasks is deciding which data should generate real-time alerts, which data should be covered by daily or weekly reports, and which data should simply be archived.
For security purposes, you want to monitor any network device involved in your perimeter security (e.g., firewalls, gateways, VPN appliances, and wireless access points, or APs), as well as any servers that host information or processes that require confidentiality or integrity. For operations purposes, you want to monitor any device or server whose availability is vital to the business. When it comes to Windows, you need to monitor not just the OS but the important applications running on top of the OS, such as Microsoft Exchange Server, Microsoft ISA Server, Microsoft IIS, and Microsoft SQL Server. You might also want to monitor higher-level applications (e.g., Microsoft SharePoint Portal Server) if those applications are likely to detect important security- or operations-related events that might go unnoticed by the lower-level applications on which they run.
Sources of Telemetry
In terms of Windows servers, the principal source of security telemetry is the Security event log, and the most important sources of operations telemetry are the System and Application event logs. If you're a regular user of the Microsoft Management Console (MMC) Event Viewer snap-in, you know that all Windows event logs follow the same event (.evt) file format: Each event record contains the same standard fields (e.g., date, time, event source, category, event ID), followed by a description field that contains free-form data unique to the event ID in question. Any monitoring application that supports Windows event logs will let you create alerts and reports based on source, category, and event ID, but ideally you should also be able to filter records based on data within the event's description.
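To make the idea of filtering on both the standard fields and the description concrete, here is a minimal Python sketch. It doesn't read real .evt files; the EventRecord class, the filter_events function, and the sample records are hypothetical stand-ins for whatever your monitoring tool exposes.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class EventRecord:
    """Hypothetical in-memory view of one event-log record."""
    source: str
    category: str
    event_id: int
    description: str


def filter_events(
    events: Iterable[EventRecord],
    source: Optional[str] = None,
    event_id: Optional[int] = None,
    description_contains: Optional[str] = None,
) -> Iterator[EventRecord]:
    """Yield the records that match every criterion that is not None."""
    for event in events:
        if source is not None and event.source != source:
            continue
        if event_id is not None and event.event_id != event_id:
            continue
        if (description_contains is not None
                and description_contains.lower() not in event.description.lower()):
            continue
        yield event


# Made-up sample records, for illustration only.
sample = [
    EventRecord("Security", "Logon/Logoff", 529, "Logon failure for user alice"),
    EventRecord("Security", "Logon/Logoff", 529, "Logon failure for user bob"),
    EventRecord("Security", "Logon/Logoff", 528, "Successful logon for user alice"),
]

# Match on a standard field (event ID) and on text inside the description.
matches = list(filter_events(sample, event_id=529, description_contains="alice"))
```

Filtering on the event ID alone would return both failure records. The description filter is what narrows the result to a single account.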
Network devices such as routers, switches, wireless APs, and firewalls invariably report telemetry through the SNMP or Syslog protocol. SNMP was designed in the late 1980s to help manage the many devices on a burgeoning Internet. SNMP managers collect telemetry from agents over UDP: Agents listen for manager requests on UDP port 161, and managers receive Trap messages on UDP port 162. Managers can use SNMP Get commands to request specific telemetry data, called variables, or they can passively wait for agents to report any important events through Trap messages. For the purposes of operations and security monitoring, collecting Traps is sufficient. (You can graduate to polling agents with Get commands when you're collecting telemetry for heavy-duty trend analysis and capacity planning.)
Syslog is the standard for event logging in the UNIX world. The advantage of Syslog over Windows event logging is that the entire process of consolidating the event streams of multiple systems into one monitoring system is an integral part of Syslog. In fact, Syslog is a network protocol as well as a log format, and by default uses UDP port 514. Each Syslog message has date, time, priority, hostname, and message fields. Technically, the priority is a number between 0 and 191. However, most Syslog applications display priority as the two subvalues that make it up: Facility and Level. The priority equals the Facility value multiplied by 8, plus the Level value.
Facility. Syslog was originally designed for monitoring BSD UNIX, and Facility was used to identify the UNIX process that reported an event. Values 0 through 15 correspond to key UNIX processes, and values 16 through 23 (called Local0 through Local7) were created for applications and devices. Table 1 provides a list of all the Facility values and their names. Most network devices use the Local0 through Local7 values (e.g., Cisco devices use Local6 and Local7), but not all of them do. My Xincom Twin Wan router, for example, uses just about every low Facility value in the book.
Level. The other element of a Syslog message priority is the Level, which ranges from 0 through 7. The Level identifies the severity of the message, as Table 2 shows.
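Because the priority packs the Facility and the Level into one number (Facility times 8, plus Level), splitting it takes a single division. The following Python sketch shows the arithmetic. The function names are my own, the Level names are the conventional Syslog severity names and might be worded differently in Table 2, and Facility values below 16 are printed as plain numbers rather than guessing at their names.

```python
from typing import Tuple

# Conventional Syslog severity names for Levels 0 through 7.
LEVEL_NAMES = (
    "Emergency", "Alert", "Critical", "Error",
    "Warning", "Notice", "Informational", "Debug",
)


def split_priority(priority: int) -> Tuple[int, int]:
    """Split a Syslog priority (0-191) into (facility, level)."""
    if not 0 <= priority <= 191:
        raise ValueError(f"priority out of range: {priority}")
    # priority = facility * 8 + level
    return divmod(priority, 8)


def facility_name(facility: int) -> str:
    """Name Facilities 16-23 as Local0-Local7; leave the rest numeric."""
    if 16 <= facility <= 23:
        return f"Local{facility - 16}"
    return f"Facility{facility}"


def describe_priority(priority: int) -> str:
    """Render a priority the way most Syslog tools show it."""
    facility, level = split_priority(priority)
    return f"{facility_name(facility)}.{LEVEL_NAMES[level]}"


# 182 = 22 * 8 + 6, so Facility 22 (Local6) at Level 6.
example = describe_priority(182)
```

A monitoring rule can then key on the Level alone, for example by alerting on Levels 0 through 3 and archiving the rest, regardless of which Facility a device happens to use.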
Performance and Health
For complete operations monitoring, you should eventually consider using performance-object monitoring and server health checks from a separate computer or service provider. If you aren't familiar with performance objects, you can explore them with the MMC Performance snap-in. The difference between event-log monitoring and performance-object monitoring is as follows: You go to the event logs to obtain information about any part of the system experiencing a problem, and you go to the performance objects to verify that certain parameters remain within acceptable ranges. For example, you would use performance objects to monitor disk space because the system log would alert you only when the volume has gotten so close to capacity that you're already experiencing problems.
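The disk-space case boils down to comparing a measured value against a threshold you choose, well before the volume fills up. As a rough illustration, this Python sketch uses the standard library's shutil.disk_usage call as a stand-in for a Windows performance object. The function names and the 15 percent default are assumptions, not recommendations.

```python
import shutil


def free_space_percent(path: str) -> float:
    """Return the percentage of free space on the volume that holds path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some virtual file systems report no capacity; treat as no free space.
        return 0.0
    return 100.0 * usage.free / usage.total


def disk_space_ok(path: str, min_free_percent: float = 15.0) -> bool:
    """True while free space stays at or above the chosen threshold."""
    return free_space_percent(path) >= min_free_percent
```

A scheduled job could call disk_space_ok for each volume and raise an alert the first time it returns False.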
Another common Windows performance-object check is monitoring CPU utilization for certain levels over extended periods of time (e.g., above 90 percent for 10 minutes). You must use caution with CPU utilization checks, however; it's easy to mistake legitimate utilization for a runaway process and thus generate a false positive. A terrific aspect of performance objects is that other applications can create their own performance objects and publish telemetry data specific to the application. For example, Active Directory (AD), SQL Server, and Exchange Server have their own performance objects.
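One way to reduce false positives from short bursts of legitimate CPU use is to alert only when every sample in a window exceeds the threshold. Here is a minimal Python sketch of that rule. It assumes you already collect one utilization sample per minute from some other source, and the function name and sample data are hypothetical.

```python
from typing import Sequence


def sustained_above(
    samples: Sequence[float],
    threshold: float,
    min_consecutive: int,
) -> bool:
    """True if at least min_consecutive samples in a row exceed threshold."""
    run = 0
    for value in samples:
        # Extend the current run, or reset it as soon as a sample dips.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# With one sample per minute, 10 consecutive samples cover 10 minutes.
busy = [95.0] * 10
spiky = [95.0] * 9 + [40.0] + [95.0] * 9

busy_alert = sustained_above(busy, threshold=90.0, min_consecutive=10)
spiky_alert = sustained_above(spiky, threshold=90.0, min_consecutive=10)
```

The busy series triggers the alert, but the spiky series doesn't, because one dip in the middle resets the count. This rule still can't tell a runaway process from a long legitimate job, so treat the alert as a prompt to investigate.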
The absence of error events in a log and performance values within acceptable thresholds are good indicators that things are working correctly. However, you might still have problems that the indicators haven't revealed. Server health checks are the most effective way to periodically ensure that servers and applications are online and successfully processing requests. Server health checks are reliable because they perform a test transaction. Many application providers and service providers across the Internet let you set up regular test transactions against the server that occur at intervals you specify. For a Web server, you might periodically request a given Web page and make sure the page is returned successfully. Or for a SQL Server machine, you might periodically execute a query and check the results.
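For the Web server case, a test transaction can be as simple as requesting a page and confirming both the status code and a piece of text that only a working application would produce. The Python sketch below uses the standard urllib.request module. The check_page function, its opener parameter (included so that you can substitute a fake for testing), and the example URL in the note that follows are all assumptions for illustration.

```python
import urllib.request
from typing import Callable


def check_page(
    url: str,
    expected_text: str,
    timeout: float = 10.0,
    opener: Callable = urllib.request.urlopen,
) -> bool:
    """Request url and report whether it returned 200 with expected_text."""
    try:
        with opener(url, timeout=timeout) as response:
            if response.status != 200:
                return False
            body = response.read().decode("utf-8", errors="replace")
    except OSError:
        # URLError, HTTPError, and socket timeouts all derive from OSError.
        return False
    return expected_text in body
```

A scheduler might call check_page("http://www.example.com/status.asp", "Catalog loaded") every few minutes and alert on a False result. Checking for text generated by the application, rather than accepting any successful response, guards against a server that is up but returning an error page.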
However, even health checks can miss problems. For example, a simple ping every 5 minutes can tell you that the OS and network stack are up and running, but that's no indication that the actual application is healthy. I've seen hung servers respond to pings. Similarly, simply requesting an HTML page from a server doesn't positively prove that the associated e-commerce Active Server Pages (ASP) application is running.
Therefore, you should try to make health checks as functional as possible. If your health-check application or service supports it, you might create a test account on an e-commerce application and use the account to test the action of adding an item to the shopping cart.
Another caution: Keep the health-check application separate from the production environment you're monitoring. If you host the health-check application on the same server as the one you're monitoring, you won't know, for example, when the server or its Internet connection is down, because the application won't be able to get the message out. However, if you run the health-check application from a separate server on a different network, one that can reach the production server across the Internet, the only way the critical application could be unavailable without your knowledge is if both the production and monitoring environments are down simultaneously.
What Do You Need?
So, how do you monitor all these devices, servers, logs, SNMP traps, and Syslog events? Obviously, you need a tool or two that fit your budget and support all the various elements you need to monitor. The higher-end monitoring products on the market—such as Argent Guardian and Microsoft Operations Manager (MOM)—let you monitor all performance objects, the Windows event logs, SNMP traps, and Syslog event streams, and can even perform a variety of health checks. Some of the smaller, less expensive packages—such as Engagent's Sentry II, Prism Microsystems' EventTracker, and Dorian's Event Log Management suite—cover a subset of telemetry sources and some limited performance-object monitoring.
If you're ready to buy a tool, be sure to identify all the elements you need to monitor and find a tool that covers them. If you end up with a tool that doesn't cover a key area—for example, SNMP monitoring—you might be able to find a freeware or inexpensive shareware utility to help fill in the gap. In Part 2 of this series, I'll talk about some of these tools, which you can combine into an effective network-monitoring toolkit.