Suppose you have a table that contains some duplicate rows that you need to remove. As an example, run the following code to create a table called T1 and populate it with about one million rows with some duplicates:
SET NOCOUNT ON; USE tempdb; IF OBJECT_ID(N'dbo.T1', N'U') IS NOT NULL DROP TABLE dbo.T1; GO SELECT n AS col1 INTO dbo.T1 FROM dbo.GetNums(1, 1000000) AS Nums UNION ALL SELECT n AS col1 FROM dbo.GetNums(1, 3) AS Nums OPTION (MAXDOP 1);
In our simple example, there's only one column in the table; in reality, you'd probably have a few columns. A subset of the columns would typically make the logical key. Assuming there were no constraint to enforce the key, you'd end up with the duplicates.
The method I like to use to deduplicate data, especially when the number of rows that needs to be deleted is small, is the following:
WITH C AS ( SELECT *, ROW_NUMBER() OVER(PARTITION BY col1 ORDER BY (SELECT NULL)) AS rownum FROM dbo.T1 ) DELETE FROM C WHERE rownum > 1;
You write a query that computes a row number that's partitioned by the logical key (in our case col1) and ordered by (SELECT NULL), meaning that you don't care about order. The query will assign unique row numbers to different rows that have the same col1 value. You could think of the row number as marking a duplicate number. You then define a CTE based on that query, and you issue a DELETE statement against the CTE to delete all rows with a row number that's greater than 1. After the delete, you'll be left with only one row of each distinct col1 value.
When you're done, run the following code for cleanup:
IF OBJECT_ID(N'dbo.T1', N'U') IS NOT NULL DROP TABLE dbo.T1;