Anyone who has worked with SQL Server for any length of time knows the common problems that come up again and again. Before a new user even finishes their question, you often know how it's going to end. One of those questions is always about database log files that keep growing and growing until they fill all available disk space. Sound familiar? We've all heard that one, and most of us have a straightforward checklist of things to look for.
The first item on my checklist is to find out whether the log files are being managed at all. The number one cause of uncontrolled log growth is the lack of any ongoing maintenance of the log files, so before even connecting to the site, I had told the onsite DBA (let's call him Phil) to check that log file backups were occurring. Phil assured me that a reasonable maintenance solution was already in place and that log file backups were scheduled to run every two hours.
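For anyone running through the same checklist, a scheduled log backup is typically just a job executing something like the following sketch (the database name and backup path here are placeholders, not Phil's actual configuration):

```sql
-- Back up the transaction log so that its inactive portion can be truncated
-- and the space reused. Name and path are examples only.
BACKUP LOG ABCTest
TO DISK = N'B:\SQLBackups\ABCTest_log.trn'
WITH COMPRESSION, CHECKSUM;
```

Without something like this running regularly, a database in the Full recovery model will never truncate its log, no matter how often full backups are taken.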
Phil was working with a relatively small database of around 90GB. Previously his log file would stay at around 12GB in size. He had a 100MB filegrowth increment configured. The LUN that contained the log file was just under 40GB, so he thought that he had plenty of room. Unfortunately, the drive kept filling up.
The second item on my checklist is to check whether or not transactional replication is being used. If the replication log reader isn't reading the log and sending details to the distribution database, the log can't be truncated even if the log is being backed up.
A trickier problem with replication and log files is when the option sync with backup has been used on the publication database or distribution database. In that case, not only does log truncation wait for the replication log reader to copy the details to the distribution database, it also waits for the distribution database to be backed up. Either way, Phil assured me that transactional replication wasn't being used.
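If you need to check for the sync with backup option yourself, DATABASEPROPERTYEX exposes it; a quick sketch (the database name is an example):

```sql
-- Returns 1 when the "sync with backup" option is enabled, 0 when it is not.
SELECT DATABASEPROPERTYEX(N'ABCTest', 'IsSyncWithBackup') AS IsSyncWithBackup;
```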
The first step in regaining control over a log file is usually to get to a point where you have room to work. As I could not look at the database for a few hours, I suggested he switch it to the Simple recovery model and see whether he could shrink the file using DBCC SHRINKFILE. He used the following command:
DBCC SHRINKFILE('ABCTest_log', 12000);
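For completeness, switching the recovery model is a single command, along these lines:

```sql
-- In the Simple recovery model, the inactive portion of the log is
-- truncated automatically at each checkpoint, so the log can stop growing.
ALTER DATABASE ABCTest SET RECOVERY SIMPLE;
```

Note that this breaks the log backup chain, so it is only a temporary measure while you regain control of the disk.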
Phil told me that he needed to use this command a number of times but that eventually, the file did shrink to his target of about 12GB. However, a short time later, the log file was back at nearly 40GB and had again filled the disk.
It was time to connect and take a look.
The third item on my checklist is to look for long-running transactions. Regardless of which recovery model has been selected, the log file will continue to grow while there is a long-running active transaction. The quickest way to check this, and also to find out why a log file can't be truncated (and reused), is to query sys.databases:
SELECT log_reuse_wait_desc FROM sys.databases WHERE name = N'ABCTest';
This showed the following:

log_reuse_wait_desc
-------------------
ACTIVE_TRANSACTION

This confirmed that an active transaction was currently the cause. To determine whether the transaction really was long running, you can execute the following command:
USE ABCTest;
GO

DBCC OPENTRAN;
GO
This command returns information similar to the following:
Transaction information for database 'ABCTest'.

Oldest active transaction:
    SPID (server process ID): 61
    UID (user ID)           : -1
    Name                    : user_transaction
    LSN                     : (31:53:4)
    Start time              : Aug 13 2015  3:50:44:593PM
    SID                     : 0x01050000000000051500000045a0308d2069feee8da059d8bc1f0000
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
I prefer to see most transactions complete within a few seconds at most, at least in any online processing application. Here, the start time showed that the transaction had been running for more than an hour. This means that even though the database was in the Simple recovery model, the log file still had to retain every log record from the last hour. That was why it just continued to grow.
It was time to work out which command was causing the issue. I was able to do this by using the SPID value of 61 that was returned by DBCC OPENTRAN. The sys.dm_exec_* DMVs are useful for finding this information:
SELECT st.text AS SQLText,
       qp.query_plan AS QueryPlan
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(r.plan_handle) AS qp
WHERE r.session_id = 61;
Executing this command showed me two things. The first was that the problematic query was:
(@LogExpiryDate datetime, @LogExpiryTime decimal(6,0))
DELETE FROM ACTIVITY_LOG
WHERE LOG_DATE < @LogExpiryDate
OR (LOG_DATE = @LogExpiryDate AND LOG_TIME <= @LogExpiryTime)
The second was that the query plan was as follows:
(Note that I have changed the actual table name to protect the guilty application vendor.)
All this query is meant to do is delete older data that is no longer needed, and it is run periodically. I'm sure that when the developer tested it, performance was near instantaneous. The problems started once the volume of data increased. Notice that there is no suitable index to allow the DELETE statement to locate the rows to be deleted, so the query reads the entire table. And as soon as the statement acquired more than about 5,000 locks, lock escalation would kick in and replace the row locks with a table-level lock.
I have seen this type of query causing problems many times. In fact, some years back I wrote about the need to fix this same type of query in ASP.NET session state databases. The standard ASP.NET supplied procedure deletes expired sessions using this same simplistic method and can lead to major blocking problems.
What was even worse in this case is that the deletion code was set to execute every half hour. As soon as the query took longer than half an hour to run, another copy of the same query would start, only to be blocked by the first. This was easily seen in Activity Monitor:
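The same blocking chain can also be seen directly from the DMVs; a sketch along these lines shows which session each stalled request is waiting on:

```sql
-- Requests with a non-zero blocking_session_id are currently blocked.
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       st.text AS SQLText
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
WHERE r.blocking_session_id <> 0;
```

In a situation like this one, you would expect to see a chain of sessions all blocked, directly or indirectly, by the oldest copy of the DELETE.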
You can imagine that this leads to a downward spiral that will lead to significant problems.
The final issue was that the I/O subsystem was unable to keep up with all the activity either. A quick check of the latency showed numbers far in excess of the recommended 5 to 10 milliseconds for data files:
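A quick way to get those latency numbers is sys.dm_io_virtual_file_stats; here is a sketch for the current database that computes the average stall per read and per write for each file:

```sql
-- Average I/O stall (ms) per read and per write, per file,
-- for the current database.
SELECT DB_NAME(vfs.database_id) AS DatabaseName,
       mf.name AS FileName,
       vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads, 0) AS AvgReadStallMs,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS AvgWriteStallMs
FROM sys.dm_io_virtual_file_stats(DB_ID(), NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id = vfs.file_id;
```

Keep in mind these figures are cumulative since the last restart, so sustained problems show up more clearly than brief spikes.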
It was time to fix the issues. Apart from speaking to the SAN administrator about how slow the SAN had become under load, two things needed to be done to avoid the problem.
First, a suitable index needed to be created on the ACTIVITY_LOG table. In this case, a simple index on LOG_DATE and LOG_TIME worked fine:
CREATE INDEX IX_ACTIVITY_LOG_DeletionPerf_20150813
ON dbo.ACTIVITY_LOG (LOG_DATE, LOG_TIME);
Second, it was important to make sure that the older data was deleted in much smaller chunks. I have found very few uses for the TOP operator in INSERT, UPDATE, and DELETE statements, because those statements cannot also contain an ORDER BY clause, but the ability to perform deletions in smaller chunks is one place where it is very useful. The deletion code was rewritten as:
WHILE (1 = 1)
BEGIN
    DELETE TOP (100)
    FROM dbo.ACTIVITY_LOG
    WHERE LOG_DATE < @LogExpiryDate;

    IF @@ROWCOUNT < 100 BREAK;
END;

WHILE (1 = 1)
BEGIN
    DELETE TOP (100)
    FROM dbo.ACTIVITY_LOG
    WHERE LOG_DATE = @LogExpiryDate
    AND LOG_TIME <= @LogExpiryTime;

    IF @@ROWCOUNT < 100 BREAK;
END;
The first loop deletes rows prior to the target date, 100 rows at a time, until all target rows have been deleted. The second loop deletes rows on the target date prior to the target time. While it would have been possible to perform this action in a single loop, particularly had the system stored a single datetime column rather than separate date and time columns, it was much easier to optimize the query and indexing with these as separate queries.
The final steps were to reshrink the log file to a sensible size, change the database to Full recovery model, perform a full database backup, and to re-enable the two-hourly log backups.
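Those final steps amount to something like the following sketch (the sizes, paths, and names are illustrative, not the actual values used):

```sql
-- Shrink the log back to a sensible size (here, roughly the original 12GB).
DBCC SHRINKFILE('ABCTest_log', 12000);

-- Return to the Full recovery model.
ALTER DATABASE ABCTest SET RECOVERY FULL;

-- A full backup restarts the log backup chain,
-- so the scheduled log backups become meaningful again.
BACKUP DATABASE ABCTest
TO DISK = N'B:\SQLBackups\ABCTest_full.bak'
WITH COMPRESSION, CHECKSUM;
```

Until that full backup completes, the database behaves as though it were still in the Simple recovery model, which is why the backup must happen before the log backup jobs are re-enabled.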