table data

Learn How to Deduplicate Data

Remove duplicate rows from a table

Download the Code iconSuppose you have a table that contains some duplicate rows that you need to remove. As an example, run the following code to create a table called T1 and populate it with about one million rows with some duplicates:

SET NOCOUNT ON;
USE tempdb;
IF OBJECT_ID(N'dbo.T1', N'U') IS NOT NULL DROP TABLE dbo.T1;
GO
SELECT n AS col1 INTO dbo.T1 FROM dbo.GetNums(1, 1000000) AS Nums
UNION ALL
SELECT n AS col1 FROM dbo.GetNums(1, 3) AS Nums
OPTION (MAXDOP 1);

In our simple example, there's only one column in the table; in reality, you'd probably have a few columns. A subset of the columns would typically make the logical key. Assuming there were no constraint to enforce the key, you'd end up with the duplicates.

The method I like to use to deduplicate data, especially when the number of rows that needs to be deleted is small, is the following:

WITH C AS
(
  SELECT *,
    ROW_NUMBER() OVER(PARTITION BY col1 ORDER BY (SELECT NULL)) AS rownum
  FROM dbo.T1
)
DELETE
FROM C
WHERE rownum > 1;

You write a query that computes a row number that's partitioned by the logical key (in our case col1) and ordered by (SELECT NULL), meaning that you don't care about order. The query will assign unique row numbers to different rows that have the same col1 value. You could think of the row number as marking a duplicate number. You then define a CTE based on that query, and you issue a DELETE statement against the CTE to delete all rows with a row number that's greater than 1. After the delete, you'll be left with only one row of each distinct col1 value.

When you're done, run the following code for cleanup:

IF OBJECT_ID(N'dbo.T1', N'U') IS NOT NULL DROP TABLE dbo.T1;
Hide comments

Comments

  • Allowed HTML tags: <em> <strong> <blockquote> <br> <p>

Plain text

  • No HTML tags allowed.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.
Publish