Sunday, September 2, 2012

Delete duplicate rows using CTE, Surrogate Key, and SELECT DISTINCT

Once again, this is from one of the chapters from the book SQL Server 2008 Bible. I have used other sources like MSDN, Stackoverflow and Google in general in writing this post :)

  Removing Duplicates

 Lets create a table and populate it with some values

USE AdventureWorks;
GO 

SELECT TOP 100 * INTO dba_PersonCopy FROM Person.Person
GO

--Lets see the data

SELECT * FROM dba_PersonCopy
GO

The Original table has Primary Key on Business_Entity_Id, but the table we generated using SELECT INTO will not inherit primary keys, indexes etc. Since we would like to deal with duplicates on a table that doesn't have a primary key this suits our purpose.So lets go ahead and add insert top 50 records a couple of times so that we have 200 rows, 100 of them duplicate and waiting to be removed. I added 100 records 50 at a time by using GO [Count] syntax. Check it out, its cool.

INSERT INTO dba_PersonCopy 
 SELECT TOP 50 * FROM Person.Person
GO 2

Lets check if the table indeed has duplicates

SELECT BusinessEntityID,
 Persontype,
 NameStyle,
 Title,
 FirstName,
 MiddleName,
 LastName,
 Suffix,
 EmailPromotion,
 AdditionalContactInfo,
 Demographics,
 rowguid,
 ModifiedDate,
 COUNT (*) AS DuplicateCount
FROM dba_PersonCopy
 GROUP BY BusinessEntityID,
  Persontype,
  NameStyle,
  Title,
  FirstName,
  MiddleName,
  LastName,
  Suffix,
  EmailPromotion,
  AdditionalContactInfo,
  Demographics,
  rowguid,
  ModifiedDate
 HAVING COUNT(*) >1;

At this point I faced an error.

Error: Msg 305, Level 16, State 1, Line 25
The XML data type cannot be compared or sorted, except when using the IS NULL operator.

It seems the demographics column and AdditionalContactInfo are the offending ones. While I need to study more on this problem, since its out of topic, I will just get rid of the columns

ALTER TABLE dba_PersonCopy DROP COLUMN Demographics;
ALTER TABLE dba_PersonCopy DROP COLUMN AdditionalContactInfo;
GO

Modified query to check duplicates

SELECT BusinessEntityID,
 Persontype,
 NameStyle,
 Title,
 FirstName,
 MiddleName,
 LastName,
 Suffix,
 EmailPromotion,
 rowguid,
 ModifiedDate,
 COUNT (*) AS DuplicateCount
FROM dba_PersonCopy
 GROUP BY BusinessEntityID,
  Persontype,
  NameStyle,
  Title,
  FirstName,
  MiddleName,
  LastName,
  Suffix,
  EmailPromotion,
  rowguid,
  ModifiedDate
 HAVING COUNT(*) >1;

Take a look at the DuplicateCount column and you'll see duplicates. So how do we go about eliminating them?

 1. Removing duplicates by Over() clause with a Row_number() function and a partition. Details about ROW_NUMBER() Function here

SELECT BusinessEntityID,
 Persontype,
 NameStyle,
 Title,
 FirstName,
 MiddleName,
 LastName,
 Suffix,
 EmailPromotion,
 rowguid,
 ModifiedDate,
 ROW_NUMBER() 
OVER 
(
 PARTITION BY
 BusinessEntityID,
 Persontype,
 NameStyle,
 Title,
 FirstName,
 MiddleName,
 LastName,
 Suffix,
 EmailPromotion,
 rowguid,
 ModifiedDate
 ORDER BY BusinessEntityID
) AS RowNumber
FROM dba_PersonCopy;

When you execute this, you'll see duplicates will have the RowNumber column greater than 1. We now need a CTE (Common table expression) to delete the duplicate rows by using a WHERE clause

 
BEGIN TRAN

;WITH DupesCTE 
AS
(SELECT BusinessEntityID,
 Persontype,
 NameStyle,
 Title,
 FirstName,
 MiddleName,
 LastName,
 Suffix,
 EmailPromotion,
 rowguid,
 ModifiedDate,
 ROW_NUMBER() 
OVER 
(
 PARTITION BY
 BusinessEntityID,
 Persontype,
 NameStyle,
 Title,
 FirstName,
 MiddleName,
 LastName,
 Suffix,
 EmailPromotion,
 rowguid,
 ModifiedDate
 ORDER BY BusinessEntityID
) AS RowNumber
FROM dba_PersonCopy
)
DELETE DupesCTE WHERE RowNumber >1
--ROLLBACK
-- 100 rows affected, duplicates removed!


Btw, I struggled for about half an hour with syntax issues here. It said there is a syntax error at the keyword 'AS'. The problem is, I missed a ";" at the beginnig! Thanks to the answer on this thread or I'd be stuck forever.

 2. Next method is to delete duplicates by using a Surrogate Key. We'll create a Key column to uniquely identify each column since such a column doesn't exist on the table already

ALTER TABLE dba_PersonCopy 
 ADD TheKey INT IDENTITY NOT NULL
 CONSTRAINT PK_PersonCopyDuplicates PRIMARY KEY;
SELECT * FROM dba_PersonCopy 

Now we issue a DELETE command to get rid of the duplicates

BEGIN TRAN
DELETE dba_PersonCopy
 WHERE EXISTS (
  SELECT * FROM dba_PersonCopy AS T1
   WHERE T1.BusinessEntityID = [dba_PersonCopy].BusinessEntityID
   AND T1.PersonType =[dba_PersonCopy].Persontype
   AND T1.NameStyle = [dba_PersonCopy].NameStyle
   AND T1.Title = [dba_PersonCopy].NameStyle
   AND T1.FirstName = [dba_PersonCopy].FirstName
   AND T1.LastName = [dba_PersonCopy].LastName
   AND T1.MiddleName = [dba_PersonCopy].MiddleName
   AND T1.Suffix = [dba_PersonCopy].Suffix
   AND T1.EmailPromotion = [dba_PersonCopy].EmailPromotion
   AND T1.rowguid = [dba_PersonCopy].rowguid
   AND T1.ModifiedDate = [dba_PersonCopy].ModifiedDate
   AND T1.TheKey > [dba_PersonCopy].TheKey)
--ROLLBACK

I'll just Drop the primary key and the column to test next example

ALTER TABLE dba_PersonCopy DROP CONSTRAINT PK_PersonCopyDuplicates;
ALTER TABLE dba_PersonCopy DROP COLUMN TheKey;

3. Deleting using SELECT DISTANT

SELECT DISTINCT 
 BusinessEntityID,
 Persontype,
 NameStyle,
 Title,
 FirstName,
 MiddleName,
 LastName,
 Suffix,
 EmailPromotion,
 rowguid,
 ModifiedDate INTO dba_PersonCopy_Noduplicates 
FROM dba_PersonCopy;

SELECT COUNT(*) FROM dba_PersonCopy_NoDuplicates
-- 100 rows remaining

I like the CTE option the best of all these methods