Editor's Note: Send your SQL Server questions and comments to SQL Server MVP Brian Moran at sav[email protected]
Deleting Duplicate Records
I have a table containing a set of columns that are supposed to be unique but that have duplicate values, which Table 1 shows. I'd like to end up with a table that looks like this:
How can I delete the duplicate records?
You can usually solve a problem in several ways with T-SQL, and the problem of deleting duplicate records is no exception. Listing 1 shows one solution that you might find helpful. This reasonably complex T-SQL query uses a correlated subquery. Correlated subqueries are a little bit like nested loops. Physical execution plans might be different from the following logical description of the query, but you can interpret this query in the following way:
- For each row of the dups table, run an existence check.
- Select the FirstName, LastName, and NameId columns from the dups table in the outer query.
- The existence check is true if the correlated subquery (which is the inner query) returns any rows--in other words, if the row that the outer query is processing is duplicated in the table.
- Delete the current row from dups if the existence check was true.
When the code checks the first row in Table 1, which has NameId 106, the EXISTS check returns false because the inner query doesn't return any rows. Yes, rows match based on FirstName and LastName, but none of the matching rows have a NameId less than 106. So, the code doesn't delete this row. Now, assume the current row of the outer query is the row with NameId 123. The EXISTS check returns true for this row because the inner query returns a row that matches FirstName and LastName and also has a NameId that is less than 123. Which row causes this match? The row with NameId 106 has the same FirstName and LastName as the row with NameId 123. And DupsInner.NameId 106 (which is the alias I use for the table in the inner query in Listing 1) is less than the NameId value of 123 that the code passed in from the outer query.
Correlated subqueries and existence checks can be confusing to the T-SQL novice. But they can be powerful, and learning to use them is worth the effort.