Dumping the Duplicates

Congratulations to Norman A. Armas, senior programmer analyst at Miami-based Apollo Ship Chandlers, and John McShane, DBA at Prophet 21 in Yardley, Pennsylvania. Norman won first prize of $100 for the best solution to the January Reader Challenge, "Dumping the Duplicates." John won second prize of $50. Here's a recap of the problem and its solution.

Problem

Jim wants to eliminate duplicate rows from a table's data, so he sets up a test table by writing the following query:

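-- the two CROSS JOINs repeat every author row once for each
-- pair of rows in authors, guaranteeing duplicate names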
SELECT IDENTITY( int ) AS IDCol,
       a1.au_id, a1.au_fname, a1.au_lname,
       a1.phone, a1.address, a1.city, a1.state, a1.zip
   INTO #a
   FROM pubs..authors AS a1
CROSS JOIN pubs..authors AS a2
CROSS JOIN pubs..authors AS a3

Then, he creates the required index and constraint on the table, as the following statements show:

CREATE CLUSTERED INDEX IX_a_name 
           ON #a( au_lname, au_fname )
ALTER TABLE #a ADD CONSTRAINT PK_a_IDCol 
           PRIMARY KEY NONCLUSTERED( IDCol )

Next, Jim wants to eliminate duplicate rows from this table for every combination of the au_lname and au_fname columns, keeping only one row from each group of duplicates. Help him write an efficient solution that works in all SQL Server releases.

Solution

Jim knows that he can use a GROUP BY query to identify the table's duplicate rows, so he writes the following query:

SELECT a.au_lname, a.au_fname
   FROM #a AS a
GROUP BY a.au_lname, a.au_fname
HAVING COUNT(*) > 1

Then, he can incorporate the above query into a DELETE statement. Jim can keep one row from each group of duplicates by retaining only the row with the highest IDCol value in every group. The following DELETE statement eliminates the table's duplicate rows:

DELETE #a
   WHERE IDCol NOT IN ( SELECT MAX( a.IDCol )
                   FROM #a AS a
                  GROUP BY a.au_lname, a.au_fname )
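
To verify the cleanup, Jim can rerun the duplicate-finding GROUP BY query; after the DELETE, it should return no rows:

SELECT a.au_lname, a.au_fname
   FROM #a AS a
GROUP BY a.au_lname, a.au_fname
HAVING COUNT(*) > 1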

Alternatively, Jim can rewrite the DELETE query by using an EXISTS subquery or a LEFT JOIN, as the sketches below show. The performance of each variation depends on the data, the indexes, and the resources available on the server. If the number of rows Jim needs to delete is fairly large, he can execute the DELETE statement in batches to reduce locking and logging. Similarly, if the majority of rows in a large table are duplicates, re-creating the table with a SELECT INTO statement would be more efficient than deleting in place.
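
For example, here's a sketch of the two rewrites; both keep the same logic as the NOT IN version, and the dup and survivors aliases are illustrative:

-- EXISTS variation: delete a row when another row with the same
-- name but a higher IDCol exists
DELETE #a
   WHERE EXISTS ( SELECT *
                    FROM #a AS dup
                   WHERE dup.au_lname = #a.au_lname
                     AND dup.au_fname = #a.au_fname
                     AND dup.IDCol > #a.IDCol )

-- LEFT JOIN variation: join each row to the group maximums and
-- delete the rows that don't match one
DELETE a
   FROM #a AS a
   LEFT JOIN ( SELECT MAX( IDCol ) AS MaxID
                 FROM #a
                GROUP BY au_lname, au_fname ) AS survivors
     ON a.IDCol = survivors.MaxID
   WHERE survivors.MaxID IS NULL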
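
To run the DELETE in batches, Jim can cap the number of rows each statement affects and loop until no duplicates remain. This sketch uses SET ROWCOUNT, which works in older releases (newer releases also offer DELETE TOP); the batch size of 10,000 is an arbitrary starting point:

SET ROWCOUNT 10000     -- limit each DELETE to 10,000 rows
WHILE 1 = 1
BEGIN
   DELETE #a
      WHERE IDCol NOT IN ( SELECT MAX( a.IDCol )
                             FROM #a AS a
                            GROUP BY a.au_lname, a.au_fname )
   IF @@ROWCOUNT = 0 BREAK   -- stop when nothing was deleted
END
SET ROWCOUNT 0         -- restore the default behavior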
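
And when most rows are duplicates, copying the survivors out is cheaper than deleting in place because SELECT INTO can be minimally logged. A sketch, assuming the table can be rebuilt offline; the #keep name is illustrative:

SELECT a.*
   INTO #keep
   FROM #a AS a
   WHERE a.IDCol IN ( SELECT MAX( IDCol )
                        FROM #a
                       GROUP BY au_lname, au_fname )

DROP TABLE #a

-- re-create the clustered index and primary key on #keep, as in
-- the original setup
CREATE CLUSTERED INDEX IX_keep_name
           ON #keep( au_lname, au_fname )
ALTER TABLE #keep ADD CONSTRAINT PK_keep_IDCol
           PRIMARY KEY NONCLUSTERED( IDCol )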
