
Deleting Big Data: Which Option Works Best

Big data is critical to innovation, but what happens when you need to delete some of it? Here's what you need to know about large deletes.

In a previous article, I outlined how to gain hidden insights into Microsoft SQL Server transaction logs through the use of specific Dynamic Management Objects. I didn't expect to have the chance to apply that information so quickly to chart a course of action for one of the technology professionals I mentor.

The situation was this: Her client was logging activity used for billing into a table that had never been trimmed of unneeded, older data. The table was not used for auditing (the rules around long-term storage of audit data are quite strict compared to what we're discussing here); rather, it staged data for processing by the billing system. The table was approximately 20TB in size and held six years of data. Use of the system had grown steadily along with the success of her client's company, so earlier years held fewer records. She was tasked with deleting all but the last 13 months of data, which put roughly 75% of the records in scope, and with doing it with as little impact on the overall system as possible. When she asked me about different options for ordered deletes, my curiosity was piqued: this seemed like the perfect opportunity to put the Dynamic Management Objects I wrote about to good use and examine two key factors: impact on the transaction log and spills to tempdb during processing.
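For context, the baseline technique for a purge like this is simply to delete in small batches so that each transaction, and its log footprint, stays manageable. Here is a minimal sketch of that pattern; the table and column names (dbo.BillingStage, EventDate) are hypothetical stand-ins for the client's schema. Note that TOP in a DELETE without ORDER BY removes an arbitrary batch of qualifying rows, which is precisely why ordered alternatives are worth examining.

/* HYPOTHETICAL SKETCH: A BASIC BATCHED PURGE.
   dbo.BillingStage AND EventDate ARE ASSUMED NAMES. */
DECLARE @batch int = 50000;
WHILE 1 = 1
BEGIN
    /* TOP without ORDER BY deletes an arbitrary batch of qualifying rows */
    DELETE TOP (@batch) FROM dbo.BillingStage
    WHERE EventDate < DATEADD(MONTH, -13, GETDATE());

    IF @@ROWCOUNT = 0 BREAK; /* nothing older than 13 months remains */
END;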

Setting up the Test

I was interested in vetting the options for “ordered” deletes, looking both at ordered deletes issued directly against tables and at the same delete processes run through views built on top of those tables. To add a twist, I also wanted to compare an ordered delete based on the clustering key against one based on a column that is specifically not the clustering key.

The data for this test comes directly from the Wide World Importers sample database for Microsoft SQL Server. Using the code below, I set up the test environment by migrating a copy of the temps table into the new TBL_CI database. This database represents ordered deletes issued directly against a table, using the clustering key as the delete-ordering key. I then backed up and restored that database to create the three additional databases needed for the test, as follows:

  • TBL_CI: Table-Based Delete Ordered by Clustering Key
  • TBL_XCI: Table-Based Delete Ordered by Column Other Than Clustering Key
  • VW_CI: View-Based Delete Ordered by Clustering Key
  • VW_XCI: View-Based Delete Ordered by Column Other Than Clustering Key

I then added a clustered index on the [temperature] column of the temps table in both TBL_CI and VW_CI, and a clustered index on the [CollectionID] column of that table in both TBL_XCI and VW_XCI.

Finishing up the prep work, I created an ordered view on VW_CI.dbo.temps over the same column our delete will be ordered on: [temperature]. VW_XCI.dbo.temps gets a similar view, but ordered on the [CollectionID] column instead. Finally, I backed up all four databases in case I wanted to repeat the tests:

/* BUILD ENVIRONMENT */

/* 01_CREATE INITIAL DATABASE */
CREATE DATABASE [TBL_CI]
ON
    (
        NAME = N'WorldPeace',
        FILENAME = N'C:\SQLData\WorldPeace.mdf',
        SIZE = 512MB,
        MAXSIZE = 1024MB,
        FILEGROWTH = 256MB
    )
LOG ON
    (
        NAME = N'WorldPeace_log',
        FILENAME = N'C:\SQLData\WorldPeace_log.ldf',
        SIZE = 256MB,
        MAXSIZE = 1024MB,
        FILEGROWTH = 256MB
    )
GO

/* 02_SEED INITIAL DB WITH VALUES FROM SAMPLE DB */
SELECT * INTO TBL_CI.dbo.temps
FROM WideWorldImporters.dbo.temps
GO

/* 03_BACKUP THE INITIAL DB AND RESTORE 3 COPIES:
    :: TBL_CI: Table-Based Delete Ordered by Clustering Key
    :: TBL_XCI: Table-Based Delete Ordered by Column Other Than Clustering Key
    :: VW_CI: View-Based Delete Ordered by Clustering Key
    :: VW_XCI: View-Based Delete Ordered by Column Other Than Clustering Key
*/
BACKUP DATABASE TBL_CI
TO DISK = 'C:\Backup\WorldPeace_original.bak'
WITH COMPRESSION;
GO

RESTORE DATABASE [TBL_XCI]
FROM DISK = N'C:\Backup\WorldPeace_original.bak'
WITH FILE = 1,
    MOVE N'WorldPeace' TO N'C:\SQLData\TBL_XCI.mdf',
    MOVE N'WorldPeace_log' TO N'C:\SQLData\TBL_XCI_log.ldf',
    STATS = 100;
GO

RESTORE DATABASE [VW_CI]
FROM DISK = N'C:\Backup\WorldPeace_original.bak'
WITH FILE = 1,
    MOVE N'WorldPeace' TO N'C:\SQLData\VW_CI.mdf',
    MOVE N'WorldPeace_log' TO N'C:\SQLData\VW_CI_log.ldf',
    STATS = 100;
GO

RESTORE DATABASE [VW_XCI]
FROM DISK = N'C:\Backup\WorldPeace_original.bak'
WITH FILE = 1,
    MOVE N'WorldPeace' TO N'C:\SQLData\VW_XCI.mdf',
    MOVE N'WorldPeace_log' TO N'C:\SQLData\VW_XCI_log.ldf',
    STATS = 100;
GO

/* 04_TBL_CI AND VW_CI -->
    CLUSTERED INDEX ON SAME COLUMN AS TOP N ROW DELETE */
CREATE CLUSTERED INDEX [ixTemps] ON TBL_CI.[dbo].[Temps]
(
    [Temperature] ASC
)
WITH
    (DROP_EXISTING = OFF,
    FILLFACTOR = 99)
ON [PRIMARY]
GO

CREATE CLUSTERED INDEX [ixTemps] ON VW_CI.[dbo].[Temps]
(
    [Temperature] ASC
)
WITH
    (DROP_EXISTING = OFF,
    FILLFACTOR = 99)
ON [PRIMARY]
GO

/* 05_TBL_XCI AND VW_XCI -->
    CLUSTERED INDEX ON DIFFERENT COLUMN THAN TOP N ROW DELETE */
CREATE CLUSTERED INDEX [ixCollectionID] ON TBL_XCI.[dbo].[Temps]
(
    [CollectionID] ASC
)
WITH
    (DROP_EXISTING = OFF,
    FILLFACTOR = 99)
ON [PRIMARY]
GO

CREATE CLUSTERED INDEX [ixCollectionID] ON VW_XCI.[dbo].[Temps]
(
    [CollectionID] ASC
)
WITH
    (DROP_EXISTING = OFF,
    FILLFACTOR = 99)
ON [PRIMARY]
GO

/* 06_VW_CI -->
    CREATE VIEW ORDERED BY SAME COLUMN AS TOP N ROW DELETE */
USE VW_CI;
GO

CREATE VIEW dbo.view_temps AS
    (SELECT TOP 50000 * FROM dbo.temps ORDER BY temperature)
GO

/* 07_VW_XCI -->
    CREATE VIEW ORDERED BY DIFFERENT COLUMN THAN TOP N ROW DELETE,
    BUT WE NEED TO BRING IN ALL RECORDS FROM THE UNDERLYING TABLE, OR ELSE
    WE CAN'T COMPARE PROPERLY, SINCE THE RECORD SETS WILL NOT MATCH */
USE VW_XCI;
GO

CREATE VIEW dbo.view_temps AS
    (SELECT TOP 100 PERCENT /* TOP 50000 */ * FROM dbo.temps ORDER BY CollectionID)
GO

/* 08_BACKUP EVERYTHING (CAN RESTORE TO RE-RUN TESTS) */
BACKUP DATABASE [TBL_CI] TO DISK = N'C:\Backup\TBL_CI.bak'
WITH NAME = N'TBL_CI-Full Database Backup', COMPRESSION, STATS = 100
GO

BACKUP DATABASE [TBL_XCI] TO DISK = N'C:\Backup\TBL_XCI.bak'
WITH NAME = N'TBL_XCI-Full Database Backup', COMPRESSION, STATS = 100
GO

BACKUP DATABASE [VW_CI] TO DISK = N'C:\Backup\VW_CI.bak'
WITH NAME = N'VW_CI-Full Database Backup', COMPRESSION, STATS = 100
GO

BACKUP DATABASE [VW_XCI] TO DISK = N'C:\Backup\VW_XCI.bak'
WITH NAME = N'VW_XCI-Full Database Backup', COMPRESSION, STATS = 100
GO

 

Ordered Deletes and Impact on Logging and tempdb

Each of the following four scripts was run independently against its respective database, as shown. I didn't commit the transactions until I had run the DMO queries to measure the impact of each:

 

Ordered Deletes Directly Against a Table; Clustering and Delete-Ordering Keys Are Identical:

/* DELETE USING TABLE CLUSTER KEY = TOP N */
USE TBL_CI;
GO

BEGIN TRANSACTION TCI_Ordered

    DELETE FROM dbo.Temps
    WHERE Temperature IN
        (
            SELECT TOP 50000 Temperature
            FROM dbo.Temps
            ORDER BY Temperature
        );

--ROLLBACK TRANSACTION TCI_Ordered
--COMMIT TRANSACTION TCI_Ordered

 

Ordered Deletes Indirectly Against a Table, Using a View; Clustering and Delete-Ordering Keys Are Identical:

/* DELETE USING VIEW ORDERED BY COLUMN = TOP N */
USE VW_CI;
GO

BEGIN TRANSACTION VW_CI_Ordered

    DELETE FROM view_temps;

--ROLLBACK TRANSACTION VW_CI_Ordered
--COMMIT TRANSACTION VW_CI_Ordered


 

Ordered Deletes Directly Against a Table; Clustering and Delete-Ordering Keys Are Different:

/* DELETE USING TABLE CLUSTER KEY != TOP N */
USE TBL_XCI;
GO

BEGIN TRANSACTION TXCI_Ordered

    DELETE FROM dbo.Temps
    WHERE Temperature IN
        (
            SELECT TOP 50000 Temperature
            FROM dbo.Temps
            ORDER BY Temperature
        );

--ROLLBACK TRANSACTION TXCI_Ordered
--COMMIT TRANSACTION TXCI_Ordered


 

Ordered Deletes Indirectly Against a Table, Using a View; Clustering and Delete-Ordering Keys Are Different:

/* DELETE USING VIEW ORDERED BY COLUMN != TOP N */
USE VW_XCI;
GO

BEGIN TRANSACTION VW_XCI_Ordered

    DELETE FROM view_temps
    WHERE Temperature IN
        (
            SELECT TOP 50000 Temperature
            FROM dbo.Temps
            ORDER BY Temperature
        );

--ROLLBACK TRANSACTION VW_XCI_Ordered
--COMMIT TRANSACTION VW_XCI_Ordered

Before identifying the least-impactful option, I want to take a minute to talk about the two options that delete against a view instead of a table. The VW_CI view was created on the TOP 50,000 records in the underlying table, selected through the TOP/ORDER BY construct, so the view is defined on exactly the subset of rows we wish to delete in each of these tests. That means the DELETE against VW_CI needs no predicate of any kind: the predicate is baked into the view definition. The VW_XCI view, by contrast, had to be created over the entire table; had we created it on the TOP 50,000 rows by CollectionID, the views and tables would no longer hold identical record sets across the tests, and the comparisons would be invalid. (Note that SQL Server's optimizer ignores the ORDER BY in a TOP 100 PERCENT view definition, so the ordering there is nominal.) The other three tests share a similar DELETE command.

Surprising Results

Looking at each of the options side by side yields interesting conclusions. In terms of overall transaction log impact, we have a narrow winner: the view defined as exactly the DELETE we want to perform.

 

 

The analysis code queries the transaction log-related Dynamic Management Objects:

/* FOR IMPACT ON LOG SPACE */
SELECT TAT.[name] AS trx_name
    , SUM(TDT.database_transaction_log_record_count) AS trx_log_records
    , SUM(TDT.database_transaction_log_bytes_used) AS trx_log_bytes_used
    , SUM(TDT.database_transaction_log_bytes_reserved) AS trx_log_bytes_reserved
    , SUM(TDT.database_transaction_log_bytes_used_system) AS trx_log_bytes_used_system
    , SUM(TDT.database_transaction_log_bytes_reserved_system) AS trx_log_bytes_reserved_system
    , SUM(TDT.database_transaction_log_bytes_used +
        TDT.database_transaction_log_bytes_reserved +
        TDT.database_transaction_log_bytes_used_system +
        TDT.database_transaction_log_bytes_reserved_system) AS total_log_bytes
FROM sys.dm_tran_active_transactions AS TAT
    INNER JOIN sys.dm_tran_database_transactions AS TDT
        ON TAT.transaction_id = TDT.transaction_id
WHERE TAT.name LIKE '%ORDERED'
GROUP BY TAT.[name]
ORDER BY 7 DESC; /* sort by total_log_bytes */

When it comes to impact on tempdb, only one option avoids spilling, and it happens to be the same ordered-view option:

 

 

/* FOR SHOWING IF THERE ARE SPILLS INTO TEMPDB */
SELECT TAT.[name] AS trx_name
    , TAT.transaction_id
    , DB_NAME(TDT.database_id) AS the_db
    , CASE TAT.transaction_type
        WHEN 1 THEN 'R/W'
        WHEN 2 THEN 'R/O'
        WHEN 3 THEN 'SYSTEM'
        WHEN 4 THEN 'DISTRIB'
    END AS TAT_type
    , CASE TAT.transaction_state
        WHEN 0 THEN 'STILL INITIALIZING'
        WHEN 1 THEN 'INITIALIZED, NOT STARTED'
        WHEN 2 THEN 'ACTIVE'
        WHEN 3 THEN 'ENDED'
        WHEN 4 THEN 'DIST COMMIT IN PROGRESS'
        WHEN 5 THEN 'PREPARED, AWAITING RESOLUTION'
        WHEN 6 THEN 'COMMITTED'
        WHEN 7 THEN 'BEING ROLLED BACK'
        WHEN 8 THEN 'ROLLED BACK'
    END AS active_state
    , CASE TDT.database_transaction_state
        WHEN 1 THEN 'TRX NOT INITIALIZED'
        WHEN 3 THEN 'INITIALIZED, BUT NO LOG RECORDS YET'
        WHEN 4 THEN 'TRX HAS GENERATED LOG RECORDS'
        WHEN 5 THEN 'TRX PREPARED'
        WHEN 10 THEN 'TRX COMMITTED'
        WHEN 11 THEN 'TRX ROLLED BACK'
        WHEN 12 THEN 'STILL BEING COMMITTED-LOG RECORD BEING GENERATED, BUT NOT MATERIALIZED YET'
    END AS database_state
    , SUM(TDT.database_transaction_log_bytes_used +
        TDT.database_transaction_log_bytes_reserved +
        TDT.database_transaction_log_bytes_used_system +
        TDT.database_transaction_log_bytes_reserved_system) AS total_log_bytes
FROM sys.dm_tran_active_transactions AS TAT
    INNER JOIN sys.dm_tran_database_transactions AS TDT
        ON TAT.transaction_id = TDT.transaction_id
WHERE TAT.name LIKE '%ORDERED'
GROUP BY TAT.transaction_id
    , DB_NAME(TDT.database_id)
    , TAT.[name]
    , TAT.transaction_type
    , TAT.transaction_state
    , TDT.database_transaction_state
ORDER BY 1, 2, 3;

 

As a bonus, the defined, ordered view is also the clear winner for overall resource utilization and duration:

 

SELECT DB_NAME(S.database_id) AS [db_name], S.cpu_time,
    S.memory_usage, S.total_scheduled_time,
    S.total_elapsed_time, S.reads, S.writes, S.logical_reads
FROM sys.dm_exec_sessions AS S
    INNER JOIN sys.dm_exec_connections AS C
        ON S.session_id = C.most_recent_session_id
/* database_id values 24-27 are the four test databases on my instance; adjust for yours */
WHERE S.database_id >= 24 AND S.database_id <= 27
ORDER BY S.total_scheduled_time;

Conclusion

When it comes to deleting large record sets, it appears that using a view defined on the data you wish to delete tends to perform better than deleting those records directly from the table. Enforcing the definition of the delete within the schema has its apparent advantages. I suggest you consider this approach when you're faced with large deletes.
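As a parting illustration, here is a minimal sketch of the pattern applied to the billing-staging scenario from the introduction. It is an assumption-laden example, not the client's actual code: the database, table, and column names (BillingDB, dbo.BillingStage, EventDate) are hypothetical.

/* HYPOTHETICAL SKETCH: BATCHED, ORDERED DELETES THROUGH A VIEW.
   BillingDB, dbo.BillingStage, AND EventDate ARE ASSUMED NAMES. */
USE BillingDB;
GO

/* The view defines the delete: the oldest 50,000 rows beyond 13 months */
CREATE VIEW dbo.view_purge AS
    (SELECT TOP 50000 *
     FROM dbo.BillingStage
     WHERE EventDate < DATEADD(MONTH, -13, GETDATE())
     ORDER BY EventDate);
GO

/* Delete through the view in batches until no qualifying rows remain */
DECLARE @rows int = 1;
WHILE @rows > 0
BEGIN
    DELETE FROM dbo.view_purge;
    SET @rows = @@ROWCOUNT;
END;
GO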
