Inside Event Tracing for Windows

The Event Tracing for Windows (ETW) subsystem is a powerful tool for your performance-testing and diagnostics arsenal. Microsoft created ETW to give developers a mechanism with which to determine their applications' performance effects on Windows Server 2003, Windows XP, and Windows 2000 platforms. However, administrators can use ETW to find out what's happening in their internal Windows systems, Microsoft applications (e.g., Microsoft IIS), and third-party applications and troubleshoot any problems they might find. ETW can also help administrators with capacity planning by letting them monitor a system under real workloads to see how it performs for a given set of transactions. Before I show you the ETW tools, techniques, and an example of real-world usage, let's look at the ETW architecture.

The ETW Architecture
You might be asking why Microsoft would develop yet another way to obtain performance data when you already have the Performance Monitor tool. The answer is that Microsoft wanted to provide a low-overhead facility for application developers and administrators to monitor applications and processes in production environments without causing the system to slow to a crawl. Running Performance Monitor against production servers 24 x 7 and not expecting it to significantly affect the performance of those servers isn't realistic. However, Microsoft designed ETW to do just that. ETW accomplishes this feat by using an architecture that combines both kernel-mode and user-mode components optimized to quickly write events to a log or consumer application.

As Figure 1 shows, the ETW architecture consists of several components: an event tracer, a log or consumer application, providers, and a controller application. The event tracer creates kernel-mode or user-mode logging sessions that perform the actual trace. Windows 2003 and XP can support as many as 62 logging sessions, whereas Win2K supports as many as 31 sessions. Each session has a dedicated pool of buffers to which the event tracer writes event data. Windows creates buffers on a per-processor basis, so if an application raises an event on processor 0, the event tracer writes the event to that processor's trace buffer. You can specify the minimum and maximum number of buffers that Windows should create for a given logging session. By default, Windows creates the buffers in the nonpaged pool memory area of the system, which means that Windows won't page the event data in the buffers to disk if the system is under memory pressure. This default is a performance advantage for the logging session because the buffers will always be in memory when they're needed. However, when you set up the logging session, you can specify that the buffers use paged-pool memory instead.

The event tracer puts a timestamp on each event. The timestamp contains the following information:

time of the event
process ID under which the event occurs
thread ID under which the event occurs
user-mode CPU time
kernel-mode CPU time

When you set up a tracing session, you can specify the timestamp's resolution. You can specify from 100 millisecond (ms) resolution to 100 nanosecond resolution, depending on the mechanism the event tracer uses to get the timestamp. The lower the resolution, the less the impact on overall system performance during a tracing session. However, you'll see less resolution in your logged data.

When the buffers are full, the event tracer writes the event data to the log or the consumer application. If a consumer application requires real-time logging, the event tracer writes the event data to the consumer application in near real time. On a typical system, ETW's overhead is about 5 percent of CPU to log 20,000 events per second. The event tracer logs events in special binary files with an .etl extension. These binary files aren't human-readable, but you can use the tools I describe later to convert them into text, HTML, or comma-separated value (CSV) files.

The heart of the ETW subsystem is the provider. Providers can be user-mode applications or kernel-mode drivers. Providers are the source of event data—they're the applications that send events to the event tracer. So, if a developer is creating an application that performs transactions on a database, he or she would write a provider that posts events to the event tracer each time a transaction starts and stops. Providers are active only when a session has been created within the event tracer that calls that provider.

In Windows 2003, XP, and Win2K, Microsoft supplies out-of-the-box providers for various system services, including Active Directory (AD), Lightweight Directory Access Protocol (LDAP), Internet Information Services (IIS) 6.0, ASP.NET, Netlogon, and Local Security Authority (LSA). Microsoft also supplies an in-the-box Windows kernel provider for system-level operations, such as process creation and deletion, thread creation and deletion, disk I/O, file I/O, TCP and UDP traffic, page faults, registry I/O, executable image loads, and context switches. I concentrate mainly on the in-the-box providers in this article. You can also write your own providers. If you're interested in doing so, check out the Microsoft Developer Network's (MSDN's) event-tracing documentation at http://msdn.microsoft.com/library/en-us/perfmon/base/event_tracing.asp.

The controller application creates logging sessions, sets their starting parameters (e.g., buffer sizes, timestamp resolution), and starts and stops logging sessions at a scheduled time. Microsoft provides several out-of-the-box controller applications, which I discuss in the "ETW Tools" section.

The key characteristic about the ETW architecture is that each component (e.g., event tracer, controller application) works independently. The provider doesn't need to know about the event tracer, the controller application doesn't need to know about the provider, and so on. This architecture lets you easily use ETW-generated data for a variety of purposes. ETW also provides application developers with a great framework for testing their applications' impact on Windows systems without having to build a logging function themselves.

ETW Tools
In Windows 2003, XP, and Win2K, you can create, manage, and report on ETW sessions two ways: You can use the Microsoft Management Console (MMC) Performance Logs and Alerts snap-in, or you can use command-line utilities. In Windows 2003 and XP, the OS includes the command-line utilities. In Win2K, the command-line utilities are part of the Microsoft Windows 2000 Server Resource Kit and differ slightly from those in Windows 2003 and XP. In the context of the ETW architecture, the Performance Logs and Alerts snap-in and some of the command-line utilities are controller applications. Other command-line utilities provide the consumer application function.

XP's tools are basically the same as those in Windows 2003, except XP includes fewer out-of-the-box providers. So, although the following discussion mentions only the Windows 2003 tools, the discussion applies to XP as well.

Performance Logs and Alerts snap-in. If you start the Performance Logs and Alerts snap-in and navigate to the Trace Logs node, you can create a new trace session by right-clicking the node and selecting New Log Settings. After you give the new logging session a name, the session's Properties dialog box appears. On the General tab, you can choose the providers you want to use. Unless you're ready to sift through a huge amount of data, I recommend that you limit the number of providers you choose per session. For example, if you need to log kernel events and AD events, create two separate logging sessions. Trying to make sense of large amounts of trace data can be daunting. Later, if necessary, you can combine reports from two provider sessions with tools that Microsoft provides.

On the General tab, you can enable tracing on the metrics provided by the Windows kernel provider (i.e., the system provider) or other nonsystem providers that are registered on the system. For example, you can use the Windows kernel provider to enable network TCP/IP tracing, as Figure 2 shows. If you want to use nonsystem providers to enable tracing, you select the Nonsystem providers option, then click Add to select the providers. By clicking Provider Status, you can learn which providers are registered on the current system. Finally, both Windows 2003 and XP provide a Run As option that lets you run the trace from a user ID different from the one you're currently logged on as.

On the Log Files tab, you can change the naming scheme of the log and how the event tracer writes trace data to the trace log. You have two types of logging from which to choose: sequential or circular. With sequential trace logging, the event tracer writes data to the log until the log reaches the maximum size you specify. If you don't specify a maximum size, the event tracer will continue logging until the disk fills up or you stop the logging session. With circular trace logging, the event tracer overwrites data when the log reaches the maximum size you specify.

The Schedule tab lets you set a start time and stop time for the logging session. You can also manually start and stop a logging session from the Performance Logs and Alerts snap-in's UI. Regardless of whether you manually start a logging session or schedule it, the session uses the Performance Logs and Alerts Windows service to perform the trace logging. This service needs an Administrator or equivalent account to run, so you must be logged on using such an account. (In Windows 2003, you can also be a member of the Performance Logs built-in group.) Otherwise, the tracing session won't run.

On the Schedule tab, you can also choose to execute a script after a logging session has stopped. For example, you might want to run a script that moves the log to a different folder so that you can process the log data at a later time.

The Advanced tab lets you change the defaults for allocating event tracer buffers. You can specify a different buffer size as well as the minimum and maximum number of buffers to allocate. Adjusting these values is usually a trade-off between capturing events without dropping them and consuming more memory on the system. For high-volume captures such as file I/O or registry I/O, a 128KB buffer size with a minimum of 50 buffers and maximum of 100 is typical, but you should experiment.

Command-line utilities. If you want to start a trace session from the command line, you have several options, which unfortunately differ depending on the OS version you're running. The Win2K resource kit includes the tracelog.exe utility, which lets you create and manage tracing sessions for the Windows kernel provider. (The resource kit doesn't have a command-line utility that lets you create and manage tracing sessions for nonsystem providers.)

Tracelog.exe also lets you set up a temporary trace logging session. If you use tracelog.exe to set up a temporary trace logging session, you won't see that session in the Performance Logs and Alerts snap-in's UI.

The following command uses tracelog.exe to enable TCP/IP tracing on a Win2K box:

tracelog -start -noprocess –nothread
-nodisk -b 128 -min 50 -max 100
-f c:\perflogs\netfile.etl

(Although this command appears on several lines here, you would enter it all on one line when you type it at the command line. The same holds true for the other multiline commands in this article.) The Tracelog -start command creates a temporary logging session. When tracelog.exe creates a trace session, it enables, by default, four types of tracing: process creation and deletion, thread creation and deletion, network TCP/IP, and disk I/O. Because you want only TCP/IP tracing, the command includes the -noprocess, -nothread, and -nodisk parameters to disable process creation and deletion tracing, thread creation and deletion tracing, and disk I/O tracing, respectively. The -b parameter sets the trace buffer size to 128KB, and the -min and -max parameters specify the minimum and maximum number of buffers to allocate. The -f parameter tells tracelog.exe to send its binary output to the netfile.etl file in the perflogs folder. Note that the Tracelog -start command creates and starts the logging session in one step. If you want to stop the logging, you simply type the command

tracelog -stop

In Windows 2003, Microsoft provides a better command-line utility called logman.exe that you can use to manage trace sessions. Unlike tracelog.exe, logman.exe lets you use nonsystem providers. (Unfortunately, logman.exe won't work on a Win2K machine.) Logman.exe has several useful features, including the ability to list the currently registered providers on a box. To use this feature, you type the command

logman query providers

You can use the information in the resulting list to enable a provider for a particular logging session. For example, suppose you want to use logman.exe to enable AD tracing on a Windows 2003 machine. After you use the Logman query command to obtain the name of the provider you want to use (in this case, Active Directory: Core), you need to create a logging session. Logman.exe provides a huge array of command-line parameters. If you want to keep the command simple and accept all the defaults, you can use the following command to create an AD tracing session:

logman create trace AD_trace
-o c:\perflogs 
-p "Active Directory: Core"

This command tells logman.exe to create a new trace logging session called AD_trace. The trace keyword preceding the session's name is necessary because you can also use logman.exe to create Performance Monitor sessions, which produce a different kind of log. The -o parameter tells logman.exe to place the binary output file in the perflogs folder. The -p parameter specifies the name of the provider to use. You can use either the provider's name or the provider's globally unique identifier (GUID).

To start the logging session, you type the command

logman start AD_trace

When you use logman.exe to create a trace logging session, the session appears in the Performance Logs and Alerts snap-in's UI. You can use the UI to modify, start, or stop the session.

Techniques for Sifting Through the Data
Now that you know how to create event trace data, let's look at how you can transform that data into meaningful information. Microsoft provides a couple of command-line utilities for converting binary .etl files into usable data. The Win2K resource kit contains a utility called tracedmp.exe that lets you convert one or more .etl files into a .csv file that you can open in Microsoft Excel or another spreadsheet application for further analysis. Windows 2003 and XP include a command-line utility called tracerpt.exe. Both tracedmp.exe and tracerpt.exe are consumer applications within the ETW architecture. Although tracerpt.exe is part of Windows 2003 and XP, you can use this utility to process Win2K logs. Because tracerpt.exe includes some features not in tracedmp.exe, I recommend that you use tracerpt.exe whenever you need to process a Win2K trace log. However, Microsoft doesn't offer tracerpt.exe as a separate download. The only way you can obtain this utility is to purchase Windows 2003 or XP.

Note that if you run tracerpt.exe from a Windows 2003 or XP workstation but you're running the utility against a log created on Win2K server, tracerpt.exe might not be able to identify the transaction GUIDs registered on the Win2K server. Tracerpt.exe looks in the Windows Management Instrumentation (WMI) namespace on the machine on which it's running to determine the events to which a particular transaction GUID corresponds. If the machine on which you're generating the report doesn't have the same event trace providers as the machine on which you're running the trace, tracerpt.exe won't be able to get information about the events required to create the report files. Thus, if you're running tracerpt.exe against a log file created on a Win2K server that's logging AD information, I recommend that you run tracerpt.exe from a Windows 2003 domain controller (DC) that also has the AD providers registered in WMI.

Both tracedmp.exe and tracerpt.exe convert the binary .etl file into a .csv file. Both utilities also provide a text file that contains summary information about the logging session (e.g., the time it ran, how long it ran) and the events captured (i.e., each event's name, type, and number of occurrences as well as the GUID of the provider that created the event). However, tracedmp.exe creates the summary report automatically, whereas tracerpt.exe gives you the option to manually create it. Tracerpt.exe also gives you the option of creating a workload report. The workload report summarizes the processes and transactions traced during the logging session. For example, if you were monitoring process activity, the workload report would contain summary data on all processes that were running, the CPU time of each process, and the number of threads that each process used.

To use tracerpt.exe to generate a .csv file, summary report, and workload report from a binary log, you can use the command

tracerpt myfirstlog.etl 
mysecondlog.etl 
-o -summary -report

where myfirstlog.etl and mysecondlog.etl are the names of the .etl files you want to convert. For the -o, -summary, and -report parameters, you can specify filenames after each parameter. By default, if you don't provide a filename, tracerpt.exe creates the dumpfile.csv file for the detailed .csv file (the -o parameter), the summary.txt file for the summary report (the -summary parameter), and the workload.txt file for the workload report (the -report parameter). If you provide multiple .etl files as input, as I did in the example just given, tracerpt.exe concatenates the data in the .csv file, summary, and workload reports. The ability to combine reports is useful if you want to correlate data from multiple trace sessions taken within the same time frames. The version of tracerpt.exe that comes with Windows 2003 includes the ability to modify the output format of the summary and workload reports. You can generate XML, text, or HTML files.

The .csv file is the core output from a logging session. It contains each event that occurred during a session, the time it occurred, and the amount of kernel and user CPU time the event consumed. In addition, the .csv file contains any data associated with the transaction. For example, a Windows kernel network trace will include the following associated data: the TCP or UDP Send or Receive event (which shows the source and destination IP address and port) and the size of the bytes exchanged. You can import .csv files into Excel to make them more readable, but the reality is that tracing sessions can generate a lot of data. Thus, you need a more focused approach to get value from them. Let's look at how you can use ETW in the real world.

How to Leverage Event Tracing in Real Life
The goal of event tracing is to obtain actionable data about your environment. For example, suppose you have to add a large number of users to AD and you're worried about how the current hardware supporting your Windows 2003 DC will handle the additional authentication traffic. You can use event tracing to get a good picture of the current authentication workload on the server and how many resources each logon event requires. You can then use that information to determine how much more load the additional users will create.

You can set up the trace a couple of ways. The first approach is to schedule an AD trace session to start during the peak logon time in a production environment and collect data long enough to determine each logon event's impact on the DC. The second approach is to enable tracing within a controlled lab environment to determine the average overhead per logon transaction, then extrapolate the average overhead to the number of users you expect to accommodate on the system. Running an AD trace session in a production environment will have a performance impact on the DC. Although the session will likely cause less than 5 percent CPU overhead, some administrators might not be able to run a trace session in their production environments. Thus, let's take a closer look at the lab approach instead.

To run the AD trace session in a lab environment, let's use the Performance Logs and Alerts snap-in to set up two logging sessions on a Windows 2003 DC. One logging session will use the nonsystem providers to capture events that represent AD workload on a DC during a logon. The second session will use the system provider to capture kernel statistics. This additional kernel data ensures that you'll generate an accurate and complete workload report.

To set up the first session, choose the Nonsystem providers option in the snap-in and add all the AD-related providers (i.e., Active Directory: Kerberos, Active Directory: Netlogon, Active Directory: SAM, Local Security Authority, NTLM Security Protocol, and Windows NT Active Directory Service). To set up the second session, choose the Events logged by system provider option, and select, at a minimum, the Process creations/deletions, Thread creations/deletions, Disk input/output, and Network TCP/IP check boxes.

After you set up the two trace logging sessions, manually start each session, then log on to AD on the Windows 2003 machine. After the logon process finishes, stop the two logging sessions. Use tracerpt.exe to combine the two resulting .etl files and generate the CSV, summary, and workload reports.

If you examine a typical .csv file, you'll notice a lot of different events. These events represent transactions that Microsoft has coded into its providers, and the events correspond to a set of actions that are being performed. As the sample .csv entry in Figure 3 shows, such events aren't easily understandable.

The first line in a .csv file contains header information that identifies each data field. For example, the sample entry in Figure 3 contains seven data fields:

Event Name—specifies the name of the event. In this case, the event's name is GetASTicket, which is the client's request for the Kerberos Authentication Service.
Type—specifies the type of event. Typically, you'll see Start or End in this field, which marks the beginning and the end of an event, respectively.
TID—lists the thread ID (in hexadecimal notation) under which the current event is performing.
Clock-Time—specifies the time that the event occurred. The time appears in Integer8 format, which is a 64-bit value that holds the number of 100-nanosecond intervals that have occurred since 12:00 a.m., January 1, 1601. Although the Integer8 format is obscure, Microsoft uses it frequently for services, such as WMI and AD.
Kernel(ms)—specifies the time, in clock ticks, used for this event, broken out by kernel-mode CPU time. Clock ticks translate into real time based on the clock cycle of a typical PC, which is 1 clock tick every 10ms. For example, 100 clock ticks would equal 1 second (100 * 10) in real time. Note that the CPU time represented here and for the user time is somewhat misleading because the value listed is the time from the point at which the current thread is started to the time when the event tracer fires the event. Thus, you'll have to rely on the workload report to generate CPU times.
User(ms)—specifies the time, in clock ticks, used for this event, broken out by user-mode CPU time.
User Data—contains any associated data that Microsoft has included in the transaction. In this case, Microsoft has provided four additional pieces of information. The first value, 0x00000025, is an instance ID flag that isn't relevant to this discussion. The second value, "administrator", specifies the user ID requesting the authentication service, whereas the third value, "krbtgt/DOMAINA", is the name of the Kerberos ticket-granting service running on the domain called DOMAINA. The final value, "DOMAINA", is the name of the destination domain.

The question you might be asking now is, "How do I make heads or tails of the event data?" Unfortunately, Microsoft hasn't documented the data that AD and related providers supply. Fortunately, all event-tracing providers are registered in the WMI namespace. Thus, you can use tools such as WMI CIM Studio to find out each provider's transactions and the data fields in those transactions. WMI CIM Studio is available for download from the Microsoft Download Center (http://www.microsoft.com/downloads/details.aspx?displaylang=en&familyid=6430f853-1120-48db-8cc5-f2abdc3ed314). You can find the registered providers in the root\WMI\EventTrace path.

Figure 4, page 36, shows the information that WMI CIM Studio provides for the GetASTicket transaction. As you can see, WMI CIM Studio refers to the transactions as classes and the data fields as properties. Although WMI's property names don't exactly match the heads for the data fields in the .csv file, the property names are similar enough to let you determine the data fields in the .csv file.

If you don't want to work with the raw data in the .csv file, you can use the workload report. Figure 5 shows the workload report that tracerpt.exe generated from the AD logon trace. To provide this report, tracerpt.exe transformed the .csv file's raw data into meaningful statistics, such as the total number of transactions. To calculate the total number of GetASTicket transactions, for example, tracerpt.exe paired up GetASTicket's Start and End Type events in the .csv raw data. Then, tracerpt.exe counted the number of pairs.

Another statistic in the workload report is Response Time(ms), which represents the total time spent in a transaction. A response time of 0 means that the report wasn't formatted to show the number of decimal places required to present a value that's quite small (e.g., 0.00023). Hence, tracerpt.exe rounds the number up to 0.

The workload report in Figure 5 shows that 14 different types of transactions occurred during a user logon. Each of those transactions takes some time to execute and takes up CPU time. In addition, some transactions execute more than once. The .csv file tells you how long each transaction takes between start and end and how many clock ticks each transaction takes. Thus, by using the information in both reports, you can total CPU times to determine the CPU load on the DC for one user logon session. For example, if you import the .csv file into Excel, you can sort by transaction name and add up the kernel and user CPU times for each start-end pair to obtain the total CPU time for each transaction type. This result would provide more granularity than the statistic in the workload report.

Keep in mind that you're looking only at AD and its related subsystems, such as Kerberos and the LSA process. This trace doesn't capture other logon-related events, such as downloading a user profile or grabbing Group Policy information. If necessary, you can use some of the other providers to capture data on these events. For example, you can use the Windows Kernel provider to capture file I/O metrics.

Also keep in mind that you shouldn't simply multiply the load generated by one user logging on to the DC by the number of users to determine overall load. Load is typically nonlinear because AD and other transactions benefit from such processes as memory caching. So, to get a true view of load as you scale up users, repeat the load-tracing test with an incremental increase in users to determine how number of users relates to overall CPU load.

The Power of Event Tracing
Event tracing provides many opportunities to get detailed information about what's happening on your Windows system, where time is being spent, and how many resources are being consumed for a given operation. Using the ETW facility in test and production environments can give you insight about the performance of your systems and applications—a level of insight that Performance Monitor just can't touch.

Comments

Plain text