If you've worked with databases of any size, chances are you've also had to wrestle with issues associated with deleting large volumes of non-partitioned data. By non-partitioned data, I mean data that isn't partitioned by something like a date field.
Partitioned data is stored as separate sets of data. A good example is monthly sales data. If the data is partitioned by month, you might have 13 separate partitions: one for each of the 12 prior calendar months and one for current data. When you have partitioned data, getting rid of millions or even billions of rows is a matter of switching and dropping partitions. No data is deleted from the table by DELETE statements; instead, the space allocated to the partition is deallocated, and the rows disappear as if you had issued a DELETE FROM SomeTable command. Partition management is essentially a matter of updating the data dictionary tables, so in terms of overall data deletion performance, dropping and switching partitions is about as fast as removing data can get.
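To make the partition-switch idea concrete, here's a minimal sketch. It assumes a hypothetical SalesData table partitioned by month, with partition 1 holding the oldest month, and an empty archive table SalesData_Archive that has an identical structure on the same filegroup (both names are illustrative, not from this article's sample script):

```sql
-- Switch the oldest month's rows out of the partitioned table.
-- This is a metadata-only operation, so it's nearly instantaneous.
ALTER TABLE SalesData
  SWITCH PARTITION 1 TO SalesData_Archive;

-- The rows now belong to SalesData_Archive. Truncating it deallocates
-- the pages instead of logging row-by-row deletes.
TRUNCATE TABLE SalesData_Archive;
```

Because the switch only updates metadata, it completes in about the same time whether the partition holds a thousand rows or a billion.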
Non-partitioned data lives in either a heap or a clustered index. There are no boundaries defined for splitting the data as there are in partitioned tables. Because of this, deletions become significantly more complicated, especially when you're dealing with millions of rows. Unless you can physically truncate a non-partitioned table, which is a data dictionary modification, deleting rows from it can generate an extremely large log file. The deletion process might also adversely affect database performance at several different levels. Truncating a table in SQL Server is a matter of deallocating used pages. Because these page deallocations are a logged action, it's possible to roll back a TRUNCATE operation. In other databases, such as Oracle, this isn't the case.
In an ideal world, all data sets would be partitioned and deleting huge amounts of data would be little more than a data dictionary update. Because we don't live in an ideal world, I'll discuss some alternative strategies for purging large amounts of non-partitioned data, particularly as they apply to mission-critical 24 × 7 × 365 environments.
Goals of Purging
When an organization operates around the clock, you're left with a very small window for database maintenance. The challenge is that the database tables must still be maintained for peak database performance. There are a number of goals to balance when designing a database purge process:
Minimize system impact. The point of purging is to delete unwanted data from the database, but a paramount design consideration is doing this in such a way that it doesn't impair your system while the process is running. Building a purge process that "first does no harm" isn't as easy as it would seem.
It's important to define what "minimizing system impact" means in the context of your design. It might be that your purge process can't consume excessive amounts of CPU, or perhaps it can't cause unnecessary database blocking. Start by defining the things your purge process should not impact, then determine what it will impact.
Delete data as quickly as possible. The counterpart to minimizing system impact is deleting data as quickly as possible. These goals aren't at odds with each other, but you do need to balance them in the design process.
Avoid blowing out the database transaction logs. In mission-critical systems, your databases should be running in the full recovery model, with regular transaction log backups. This has ramifications that become exceedingly important when you try to purge large volumes of data. You probably won't be able to issue a single DELETE FROM SomeTable command, which is a fully logged operation, because doing so can grow the transaction log to the point where it runs out of space. Truncating tables isn't always an option either, because there might be current data that you need to retain.
I've had extremely large delete operations blow out at very bad times, resulting in hours of rollback activity. Fortunately, none of those blowouts occurred in mission-critical environments, but it underscores an important issue that you need to keep in mind as you design your database purge process.
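One way to guard against blowing out the log is to check log usage between batches. The sketch below polls DBCC SQLPERF(LOGSPACE) and flags the condition when the current database's log passes a threshold (the 70 percent cutoff is an arbitrary example, not a recommendation from this article):

```sql
-- Capture DBCC SQLPERF(LOGSPACE) output, which reports log size and
-- percent used for every database on the instance.
CREATE TABLE #LogSpace (
  DatabaseName sysname,
  LogSizeMB float,
  LogSpaceUsedPct float,
  Status int
);

INSERT INTO #LogSpace
  EXEC ('DBCC SQLPERF(LOGSPACE)');

IF EXISTS (SELECT 1 FROM #LogSpace
           WHERE DatabaseName = DB_NAME()
             AND LogSpaceUsedPct > 70.0)
BEGIN
  -- Pause the purge or take a log backup before the next batch.
  PRINT 'Transaction log is filling up; throttling purge.';
END

DROP TABLE #LogSpace;
```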
Approaches to Purging
Given the purging goals, how do you go about designing the purge process? When I need to purge billions of rows that contain terabytes of data, I often use batched DELETE operations (aka batched deletes). Although there are several different ways to approach batched deletes, I'll discuss and provide code samples for four methods:
1. Truncation followed by a batched delete
2. Batched delete using a sliding window
3. Row-at-a-time batched delete
4. Batched delete using a sliding window with an inner join
The code samples use two tables—a parent table named Table1 and a child table named Table2—that contain a decent amount of test data. I wrote a script to create these tables. It cross-joins sys.tables against itself to seed Table1. After adding an identity column to Table1 and defining that column as the primary key, the script creates Table2, which is the child table for these demonstrations. The script then adds the foreign key constraints from Table2 to Table1. It also adds two indexes to Table1.
If you'd like to test the four methods, you can use this script to generate Table1 and Table2. In my test database, it generated 1.9 million rows. Your execution might generate more or fewer rows, depending on the number of tables in your database. BatchedDeleteExamples.sql contains the script as well as the code samples for the four methods. (The script creates backup tables so that you can restore Table1 and Table2 between running each method's code.)
When you need to physically remove all the data from a table, the TRUNCATE command is a very fast means to get rid of the data. As Microsoft mentions on its Data Definition Language (DDL) Statements (Transact-SQL) web page, the TRUNCATE command is classified as a DDL statement. In SQL Server, a TRUNCATE operation does minimal logging to record page deallocations, which is what allows a TRUNCATE operation to be rolled back.
Note, however, that if there are foreign key constraints, you can't truncate a parent table. You could drop the foreign key constraints, truncate the table, and add the constraints back, but that's not always possible in a production system, especially one that needs to run 24 × 7 × 365. (Most DBAs would be quite upset if you tried to perform this procedure in a running database.) Plus, dropping the foreign key constraints carries the risk that a dropped constraint might not be added back or properly revalidated if there's a failure during the purge process, both of which are generally unacceptable risks in production databases.
When you want the speed of truncation but your tables have foreign key constraints, you can use method 1. With this method, you use a TRUNCATE command to truncate the child table and a batched DELETE operation to delete rows from the parent table in large chunks instead of one row at a time. The net effect is that you quickly get rid of a significant portion of the data, then less quickly get rid of the related rows in the parent table. Listing 1 shows a simple example of how to implement method 1.
Method 2 uses a sliding window to batch the DELETE operations. This method gathers all the primary keys to be deleted by querying the source table in a single pass. Those keys are put into a temporary table, which drives the delete process. With this design, the source table is never touched again by the purge process, which can be beneficial when the number of rows to be purged is significantly smaller than the number of rows in the source table. By gathering this data in one pass instead of repeatedly coming back to the source table, you can potentially increase system concurrency.
The temporary table uses a monotonic primary key based on an identity column that is navigated using low and high boundary markers to move between sets of rows to be deleted. As each rowset is deleted from the target table, the window of rows to be deleted is advanced until there are no more rows marked for deletion. Because each DELETE operation is an implicit transaction by itself, there's no need to surround the DELETE command with BEGIN TRANSACTION and END TRANSACTION.
Listing 2 shows a basic implementation of method 2. When you have a very large number of rows to delete, method 2 can work well.
In method 3, which Listing 3 shows, the delete operations occur one row at a time. This method tends to offer the highest possible concurrency in a highly active system, at the cost of poor (sometimes very poor) overall performance.
Like method 2, method 3 gathers the primary keys of the rows to be deleted and puts them in a temporary table. However, instead of immediately deleting all the rows, the delete process is run continuously so that the deletes occur a row at a time in trickle fashion.
Because commit operations can be resource intensive, you can modify the code so that it commits the delete operations after n number of rows are deleted. This modification can also significantly speed up the process because an entire set of delete operations is committed at one time as opposed to committing one delete operation at a time.
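Here's one way that modification might look, reworking Listing 3's loop so that the row-at-a-time deletes are committed in groups. The @CommitSize value of 1,000 is an arbitrary starting point, and the code assumes the #RowsToDelete table from Listing 3 has already been created and populated:

```sql
DECLARE @RecID As Int, @PKValue As Int, @QuitLoop As Int,
        @CommitSize As Int, @Uncommitted As Int;
SELECT @RecID = 1, @QuitLoop = 0, @CommitSize = 1000, @Uncommitted = 0;

BEGIN TRANSACTION;
WHILE (@QuitLoop = 0)
BEGIN
  SELECT @PKValue = NULL;
  SELECT @PKValue = PKValue FROM #RowsToDelete WHERE RecID = @RecID;
  IF (@PKValue IS NULL)
    SELECT @QuitLoop = 1;
  ELSE
  BEGIN
    DELETE FROM Table2 WHERE RecID = @PKValue;
    SELECT @Uncommitted = @Uncommitted + 1;
    IF (@Uncommitted >= @CommitSize)
    BEGIN
      COMMIT TRANSACTION;  -- release the log records and locks for this group
      BEGIN TRANSACTION;
      SELECT @Uncommitted = 0;
    END
  END
  SELECT @RecID = @RecID + 1;
END
COMMIT TRANSACTION;  -- commit any remaining deletes
```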
In method 4, a sliding window with an inner join is used to batch the DELETE operations. This method is possible because of T-SQL's ability to join tables in a DELETE operation. This is a T-SQL extension, so method 4 won't work in non-SQL Server databases such as Oracle or DB2.
Like method 2, method 4 uses a temporary table of keys to create a sliding window that breaks the DELETE operations into batches. Once again, the purpose of the temporary table is to pre-aggregate the key values to be deleted and avoid repeatedly accessing the source table to gather this information. However, in method 4, the temporary table is joined to the table from which you're deleting rows. Here's what this join operation looks like:
DELETE FROM Table2
FROM #RowsToDelete
INNER JOIN Table2 ON #RowsToDelete.PKValue = Table2.RecID
As you can see, the code joins the temporary table with the key values using an inner join based on the key values in the temporary table and in the child table. To see what the rest of the code in the method looks like, see Listing 4.
Method Assessment and Fine-Tuning
The performance of these batched delete processes is easy to assess. I tested the methods on an entry-level, dual-processor server-class machine with a data set of 1,909,924 parent records and the same number of child records. SQL Server's memory was capped at 1GB, and the database was on a single partitioned 300GB hard drive. The methods' run times were:
- Method 1—2 minutes 22 seconds
- Method 2—1 minute 21 seconds
- Method 3—2 hours 14 minutes
- Method 4—1 minute 23 seconds
So, which method is best? It might seem that the method with the shortest runtime would be best, but the answer really depends on your goals for the purge process. Are you trying to achieve high performance, high availability, or both?
Like databases, batched deletes can be fine-tuned to find the "sweet spot," which is the maximum number of rows that can be deleted at one time with one DELETE command without seeing a drop in overall throughput of the purge process. To find the sweet spot for the method you selected, you can use techniques such as:
- Dropping the foreign key constraints. When you're working with parent-child relationships, the delete process involves both the parent and child records. By dropping the child foreign key constraints before running a batched delete, you might be able to realize a significant performance gain because the optimizer doesn't have to run any SEEK operations against the child tables. You can verify this by looking at the execution plans of DELETE operations both with and without foreign key constraints. When you're done, you need to put the constraints back in, check them, and revalidate them to make sure they're trusted constraints. Revalidating constraints can have some potentially serious performance ramifications, particularly with huge data sets that have foreign key constraints. So, you need to investigate those ramifications before you drop any constraints. As noted previously, this method is generally inadvisable in a production system, but it definitely has a place in a developer's toolkit.
- Adjusting the batch size. In any of the methods, you can use a variable batch size and adjust how many DELETE operations are in a batch. If you use a larger batch size, you might find that you can delete two or three times as much data (or even more) in a single pass. You should stop increasing the batch size when you start to see a performance decrease rather than increase. When performance decreases, you have left the DELETE operation's sweet spot.
- Including needed indexes. If a purge process is missing critical indexes, be sure to add them to the appropriate tables. One or more might prove to be crucial. For example, in one purge process I developed, a single missing index meant the difference between a 24-hour purge and a 2-hour purge.
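If you do experiment with dropping constraints in a development environment, re-adding them WITH CHECK is what keeps them trusted. Here's a sketch, using a hypothetical constraint name FK_Table2_Table1 on a hypothetical Table2.ParentID column:

```sql
-- Drop the child constraint so DELETEs against Table1 don't trigger
-- seeks into Table2.
ALTER TABLE Table2 DROP CONSTRAINT FK_Table2_Table1;

-- ... run the batched delete here ...

-- Re-add the constraint WITH CHECK so SQL Server revalidates the
-- existing rows and marks the constraint as trusted.
ALTER TABLE Table2 WITH CHECK
  ADD CONSTRAINT FK_Table2_Table1
    FOREIGN KEY (ParentID) REFERENCES Table1 (RecID);

-- Confirm the constraint is trusted (is_not_trusted should be 0).
SELECT name, is_not_trusted
FROM sys.foreign_keys
WHERE name = 'FK_Table2_Table1';
```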
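To hunt for the batch-size sweet spot empirically, you can instrument the loop from Listing 2 so that it reports the time taken per batch; comparing the output across runs with different @BatchSize values shows where throughput starts to fall off. This sketch assumes #RowsToDelete has already been created and populated as in Listing 2:

```sql
DECLARE @LoBound As Int, @HiBound As Int, @BatchSize As Int,
        @QuitLoop As Int, @Start As datetime, @Rows As Int;
SELECT @BatchSize = 5000, @LoBound = 1, @HiBound = @BatchSize, @QuitLoop = 0;

WHILE (@QuitLoop = 0)
BEGIN
  SELECT @Start = GETDATE();
  DELETE FROM Table2
  WHERE RecID IN (SELECT PKValue FROM #RowsToDelete
                  WHERE RecID BETWEEN @LoBound AND @HiBound);
  SELECT @Rows = @@ROWCOUNT;
  IF (@Rows = 0)
    SELECT @QuitLoop = 1;
  ELSE
    PRINT CAST(@Rows AS varchar(10)) + ' rows deleted in '
        + CAST(DATEDIFF(ms, @Start, GETDATE()) AS varchar(10)) + ' ms';
  SELECT @LoBound = @HiBound + 1, @HiBound = @HiBound + @BatchSize;
END
```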
Data Deletion in the Real World
I've introduced you to four methods for purging large amounts of non-partitioned data. These aren't the only methods you can use. As an exercise, see what creative approaches you can come up with. For example, you can see another one of my creative approaches in the web-exclusive sidebar "The Pseudo-Parallel Approach to Purging Non-Partitioned Data." No matter what method you're considering, be sure to test it to make sure that it will meet your goals for the purge process and fine-tune it so that you get the best results possible.
Listing 1: Sample Implementation of Method 1
DECLARE @QuitLoop As Bit;
SET @QuitLoop = 0;

TRUNCATE TABLE Table2;

WHILE (@QuitLoop = 0)
BEGIN
  DELETE TOP (10000) FROM Table1;
  IF (@@ROWCOUNT = 0)
    SELECT @QuitLoop = 1;
END
Listing 2: Sample Implementation of Method 2
DECLARE @LoBound As Int, @HiBound As Int, @BatchSize As Int, @QuitLoop As Int;
SELECT @LoBound = 1, @HiBound = 5000, @BatchSize = 5000, @QuitLoop = 0;

CREATE TABLE #RowsToDelete (
  RecID Int IDENTITY(1,1) PRIMARY KEY CLUSTERED,
  PKValue Int
);

INSERT INTO #RowsToDelete (PKValue)
  SELECT RecID FROM Table1;

WHILE (@QuitLoop = 0)
BEGIN
  DELETE FROM Table2
  WHERE RecID IN (SELECT PKValue FROM #RowsToDelete
                  WHERE RecID BETWEEN @LoBound AND @HiBound);
  IF (@@ROWCOUNT = 0)
    SELECT @QuitLoop = 1;
  SET @LoBound = @HiBound + 1;
  SET @HiBound = @HiBound + @BatchSize;
END

DROP TABLE #RowsToDelete;
Listing 3: Sample Implementation of Method 3
DECLARE @RecID As Int, @PKValue As Int, @QuitLoop As Int;
-- SET can't assign multiple variables in one statement; use SELECT.
SELECT @RecID = 1, @QuitLoop = 0;

CREATE TABLE #RowsToDelete (
  RecID Int IDENTITY(1,1) PRIMARY KEY CLUSTERED,
  PKValue Int
);

INSERT INTO #RowsToDelete (PKValue)
  SELECT RecID FROM Table1;

WHILE (@QuitLoop = 0)
BEGIN
  SELECT @PKValue = PKValue FROM #RowsToDelete WHERE RecID = @RecID;
  DELETE FROM Table2 WHERE RecID = @PKValue;
  IF (@@ROWCOUNT = 0)
    SELECT @QuitLoop = 1;
  SET @RecID = @RecID + 1;
END

DROP TABLE #RowsToDelete;
Listing 4: Sample Implementation of Method 4
DECLARE @LoBound As Int, @HiBound As Int, @BatchSize As Int, @QuitLoop As Int;
SELECT @LoBound = 1, @HiBound = 5000, @BatchSize = 5000, @QuitLoop = 0;

CREATE TABLE #RowsToDelete (
  RecID Int IDENTITY(1,1) PRIMARY KEY CLUSTERED,
  PKValue Int
);

INSERT INTO #RowsToDelete (PKValue)
  SELECT RecID FROM Table1;

WHILE (@QuitLoop = 0)
BEGIN
  DELETE FROM Table2
  FROM #RowsToDelete
  INNER JOIN Table2 ON #RowsToDelete.PKValue = Table2.RecID
  WHERE (#RowsToDelete.RecID BETWEEN @LoBound AND @HiBound);
  IF (@@ROWCOUNT = 0)
    SELECT @QuitLoop = 1;
  SET @LoBound = @HiBound + 1;
  SET @HiBound = @HiBound + @BatchSize;
END

DROP TABLE #RowsToDelete;