TraceErrors Process

Downloads
102445_2.zip
102445sqlsig.zip

Editor note: Yaniv Etrogi has created a script to deploy the sqlsig CLR function. It is linked above as 102445sqlsig.zip.

I implemented the TraceErrors process to gain more control over the production servers my team is responsible for. Knowing about every error that occurs in SQL Server can make a difference. In a perfect world, all exceptions would be logged, reported, and emailed to the relevant development team, but we all know that things don’t work like this.

In reality, not all exceptions are caught or reported. One common scenario is exceptions being cached but not logged, known as a swallowed exception. When things go wrong, the DBA is expected to explain what went wrong, even though in many cases the problem is caused by the application being serviced by the database and not by the database itself.

With the TraceErrors process, I receive a daily email with an HTML report of all exceptions that occurred in the database engine in the last 24 hours. This lets my team be proactive in many cases, approaching the relevant person or group responsible for the module generating an error, saving time and speeding resolution.

In addition to the daily report mail, having the traces always up and running has been proved to be very useful, especially if you don't have a third-party analyzer product that tracks and saves all database activity so you can search it. I'm not suggesting that you don't need an analyzer, but in environments where there aren't any such tools, TraceErrors has greater benefits.

With the traces always running, you can immediately go to the last trace file and load it to a table if any problem occurs, and you can then query this table. You could also double-click the trace file to open SQL Server Profiler and automatically load the file to view the data in the Profiler UI and use Ctrl+F to search and investigate the data. The Microsoft article "Using SQL Server Profiler" has more information about analyzing trace data.

I limit my trace files to 50MB, a size that I've found convenient. The timeframe that a 50MB file can cover will depend on the workload your server handles, so you can increase the limit to fit your workload.

The load operation consumes I/O and CPU resources, so it's best to schedule it for off hours. If you don't have off hours, be sure to schedule at off peak hours or copy the trace files to another server and perform the load there. Just to give you an idea, it takes me 14 minutes to load and process 50 50MB files containing a total of 9 million events on a two-year-old server with four CPUs, 8GB RAM, and local SCSI 15,000 RPM disks configured in RAID-1. It takes only six minutes on a new server with four CPUs, 16GB RAM, a fast SAN to load and process trace files with the same size that contain 24 million events.

Regardless of the time that your trace data covers, you control the period of time covered by the report using the @Hours input parameter passed to the TraceErrorsLoad stored procedure, which processes the raw traced data. I use 24 hours for @Hours, because the report is generated once daily. Setting @Hours to 24 ensures that no data older than 24 hours is worked on.

You can download the code for TraceErrors by clicking the link above.. To implement TraceErrors, first create a database named PerfDB. Create the four stored procedures included in the .zip file in the PerfDB database. Then create the TraceErrorsLoad job, which executes the TraceErrorsLoad stored procedure, and the TraceErrorsStatus job, which executes the TraceErrorsStatus stored procedure.

Process Description

TraceErrors is made up of four stored procedures. Figure 1 gives a brief description of the procedures. TraceErrorsCreate creates a trace file and sets its status to started. The procedure uses SQL Trace’s stored procedures. First it uses sp_trace_create to create a trace definition in a stopped state. Next, it adds (subscribes to) events and columns using sp_trace_setevent. Following that, sp_trace_setfilter is used to filter out unwanted data. Finally, it calls sp_trace_setstatus to change the trace status to started.

Figure 1: Procedures used by TraceErrors

TraceErrorsCreate—Defines the trace and starts it.
TraceErrorsLoad—Loads the trace files into a table and manipulates the data, making it ready for retrieval.
TraceErrorsStatus—Verifies that the trace is running and active.
TraceErrorsReport—Retrieves the final results and returns a table that can be mailed by the calling application.

The stored procedure TraceErrorsLoad reads the captured data from the trace files, loads it to a database table, and manipulates the data, making it ready and available for retrieval. The stored procedure TraceErrorsReport simply retrieves the data and displays it to the calling application. The stored procedure TraceErrorsStatus assures that the trace is always up and running.

TraceErrorsCreate

The TraceErrorsCreate procedure creates the trace. The trace itself defines the events that you subscribe to and, for each event, what columns get populated with data. Because this process deals with errors and exceptions, it captures Attention, Exception, and User Error message events from the Errors category. It also captures the RPC:Starting and SQL:BatchStarting events. These two events cover all server activity—any command sent to the database engine has to come in the form of a T-SQL Batch or Remote Procedure Call (RPC). This set of events captures and collects all the data required to track down any error and the statement (batch or RPC) that's responsible for generating that error.

TraceErrorsCreate receives two input parameters, @FileCount and @MaxFileSize, which define the number of trace files in the trace and the size of each file. The system stored procedure sp_trace_create is used within TraceErrorsCreate. The second parameter, @options, passed to sp_trace_create turns on the TRACE_FILE_ROLLOVER option. TRACE_FILE_ROLLOVER manages the trace files in a FIFO algorithm so that every time a trace file reaches the maximum size defined by the @MaxFileSize parameter, that file gets closed and the oldest file gets deleted while a new file is created and becomes the current active file. The third parameter, @tracefile, has to be modified to your desired path location on the server being traced. Make sure that there's enough disk space, and it's recommended you not use the system drive.

To reduce the number of events captured and keep the trace files small, I filter out the Profiler and Replication Distribution Agents applications by setting a filter on the ProgramName column (ColumnId 10). I also eliminate events triggered by sp_reset_connection, which is executed by .NET applications, which I have no interest in. This is done by filtering on the TextData column (ColumnId 1). I filter on the HostName column (ColumnId 8) to exclude events generated on the specific server being traced. This eliminates a lot of data related to replication and some other data, such as data generated while using SQL Server Management Studio on the server being traced. Keep in mind that your main interest is tracking errors caused by applications connected to the server, not by maintenance activity done by a DBA.

TraceErrorsLoad

The TraceErrorsLoad stored procedure is executed by a SQL Server Agent job. The procedure uses the fn_trace_gettable system function to load the captured data from the trace files to a database table. I use a SELECT...INTO command that is minimally logged and performs better than an INSERT...SELECT command.

Now that all of the data is in a database table, a common table expression (CTE) aliased as CTE is used to find the rows the procedure needs. It finds the rows that contain the errors captured in the trace and stores the results in the TraceErrorsFiltered table. The data can then be retrieved from that table by the TraceErrorsReport procedure.

The CTE performs a self LEFT JOIN on an instance of the table TraceErrorsFull, which is referenced at the FROM clause as T1. This causes instance T1 to be preserved. The CTE joins another instance of the table, aliased as T2, based on the condition that T2's EventSequence equals T1's EventSequence. For every row found in T1 with an Exception event, you get the matching T2 User Error Message, because the User Error Message (EventClass 162) is always fired following an Exception event. The Exception event contains the error number and the User Error Message contains the error description, so it's convenient to view this data side by side. The WHERE clause narrows the search to rows generated by Exception and Attention events, and only for rows with a StartTime greater than defined by the @Hours input parameter.

The last section of the code puts the statement (TextData column) that caused the error into the report—or, at least, the last statement executed by the SPID that generated the error. This section references TraceErrorsFull, aliased as T3, and uses the CROSS APPLY operator to combine this set of data into the outer select along with the set of data returned from the CTE. This instance of the table is joined to the CTE on a matching SPID and works only on the rows generated by the RPC:Starting and SQL:BatchStarting events. It works only on the previous EventSequence of that SPID. It chooses these rows using the ORDER BY EventSequence in descending order with the condition that T3's EventSequence is smaller than the CTE EventSequence.

The clustered index on the StartTime column lets the SELECT statement use a Clustered Index Seek operator and directly access the range of data defined in the WHERE clause. The nonclustered index on the EventSequence is used to satisfy the JOIN condition. Note that the EventSequence column is guaranteed to be unique on a given server. Restarting the SQL Server service zeroes the EventSequence column's data.

EventClass 61 (OLE DB Errors) and EventClass 67 (Execution Warnings) aren't processed by the INSERT statement and won't appear in the report. I still capture this data, however, because it can sometimes be valuable. Remember that the raw data remains available in the TraceErrorsFull table until the next time the procedure runs.

TraceErrorsReport

In the TraceErrorsReport stored procedure I use sqlsig, a CLR user-defined function (UDF) created by Itzik Ben-Gan. With help from Ray Maor of DBSophic, I modified sqlsig to remove some unwanted characters, such as CR and LF (ASCII 13 and 10), that caused problems rendering the tabular result set into the HTML report used by the calling application.

The UDF works on the TextData column of the rows generated by the RPC:Starting and SQL:BatchStarting events. It replaces specific characters, such as parameters, with a hash sign, which lets it aggregate the data, preventing the same error from showing up too many times in the report.

The @Rows input parameter passed to the TOP clause of the SELECT statement allows you to control the number of rows returned. Note that the returned result set isn't sorted in any order. My first choice would be to sort the results by the StartTime column, but to do so I would have to include the StartTime column in the SELECT list. Including StartTime would make the DISTINCT operator totally worthless, because a SELECT statement that uses the DISTINCT operator requires the columns in the ORDER BY clause to be part of the SELECT list.

I didn't include important columns such as StartTime, EventSequence, and SPID for the sake of aggregation. However, this data is captured and available at in the TraceErrorsFull table, where it remains until the next time the procedure runs, so the data is available to you while you're investigating a specific case.

The MIN_StartTime and MAX_StartTime columns come from TraceErrorsFull, which holds the raw trace data. I included them in the report only so that you know the timeframe of events that the report worked on. You can omit these columns or modify the procedure to return an additional result set with just these two columns. The Hours column shows the difference in hours between these two columns.

TraceErrorsStatus

The TraceErrorsStatus stored procedure is aimed at making sure the trace is running and active. The procedure is invoked by a job that has two schedules. The first schedule executes the stored procedure when the SQL Server Agent service is started, which results in the trace being started. The second schedule runs every hour to make sure that the trace exists and is in an active state. Alternately, the procedure can be invoked by a startup stored procedure instead of the first schedule. This eliminates the SQL Server Agent service as a point of failure and starts the trace a little earlier.

Once your trace is created, you can view its state and additional information by querying the system views. Use the query in Listing 1 as a starting point.

Listing 1: Viewing Trace Metadata

-- Get all traces
SELECT
	 \[id\], \[status\], event_count, dropped_event_count, \[path\]
	,max_size, max_files, is_rollover, is_shutdown, buffer_count
	,buffer_size, file_position, start_time, last_event_time
FROM sys.traces;


-- Get Events
SELECT
	 e.name AS \[event\]
	,c.name AS \[column\]
FROM fn_trace_geteventinfo(2) ei --< edit TraceId
INNER JOIN sys.trace_events e ON ei.eventid = e.trace_event_id
INNER JOIN sys.trace_columns c ON ei.columnid = c.trace_column_id;


-- Get Filters
SELECT
	 columnid
	,c.name AS \[column\]
	,logical_operator
	,comparison_operator
	,\[value\]
FROM fn_trace_getfilterinfo(2) ei  --< edit TraceId
INNER JOIN sys.trace_columns c ON ei.columnid = c.trace_column_id;


-- Get SubClass values
SELECT
	 c.name AS \[column\]
	,e.name AS \[event\]
	,s.subclass_value
	,s.subclass_name
FROM sys.trace_columns c
INNER JOIN sys.trace_subclass_values s ON c.trace_column_id = s.trace_column_id
INNER JOIN sys.trace_events e ON e.trace_event_id = s.trace_event_id
WHERE e.name LIKE 'Audit Login';


-- Get Events / Columns combination
SELECT
	 e.name AS \[event\]
	,c.name AS \[column\]
FROM sys.trace_event_bindings b
INNER JOIN sys.trace_events e ON e.trace_event_id = b.trace_event_id
INNER JOIN sys.trace_columns c ON c.trace_column_id = b.trace_column_id
WHERE e.name LIKE 'Audit Login'
ORDER BY e.name;

Why Capture?

Capturing SQL exceptions at the very lowest level, the database engine, is an important aspect of monitoring production systems. Capturing the exceptions provides you with your own set of logs so that you don't need to rely only on logs generated by applications, which can sometimes be unreliable or, in some environments, inaccessible to DBAs. Capturing RPC:Starting and SQL:BatchStarting events is truly a necessity on production systems, where a DBA needs to find out quickly what happened when things go wrong. Knowing what happened is an important step in preventing the same problem from being repeated.

Related Reading:

Related Microsoft Articles:

Comments

Plain text