Once again, this is from one of the chapters from the book SQL Server 2008 Bible. I have used other sources like MSDN, Stackoverflow and Google in general in writing this post :)
Removing Duplicates
Lets create a table and populate it with some values
USE AdventureWorks;
GO
SELECT TOP 100 * INTO dba_PersonCopy FROM Person.Person
GO
--Lets see the data
SELECT * FROM dba_PersonCopy
GO
The Original table has Primary Key on Business_Entity_Id, but the table we generated using SELECT INTO will not inherit primary keys, indexes etc. Since we would like to deal with duplicates on a table that doesn't have a primary key this suits our purpose.So lets go ahead and add insert top 50 records a couple of times so that we have 200 rows, 100 of them duplicate and waiting to be removed.
I added 100 records 50 at a time by using GO [Count] syntax. Check it out, its cool.
INSERT INTO dba_PersonCopy
SELECT TOP 50 * FROM Person.Person
GO 2
Lets check if the table indeed has duplicates
SELECT BusinessEntityID,
Persontype,
NameStyle,
Title,
FirstName,
MiddleName,
LastName,
Suffix,
EmailPromotion,
AdditionalContactInfo,
Demographics,
rowguid,
ModifiedDate,
COUNT (*) AS DuplicateCount
FROM dba_PersonCopy
GROUP BY BusinessEntityID,
Persontype,
NameStyle,
Title,
FirstName,
MiddleName,
LastName,
Suffix,
EmailPromotion,
AdditionalContactInfo,
Demographics,
rowguid,
ModifiedDate
HAVING COUNT(*) >1;
At this point I faced an error.
Error: Msg 305, Level 16, State 1, Line 25
The XML data type cannot be compared or sorted, except when using the IS NULL operator.
It seems the
demographics column and
AdditionalContactInfo are the offending ones. While I need to study more on this problem, since its out of topic, I will just get rid of the columns
ALTER TABLE dba_PersonCopy DROP COLUMN Demographics;
ALTER TABLE dba_PersonCopy DROP COLUMN AdditionalContactInfo;
GO
Modified query to check duplicates
SELECT BusinessEntityID,
Persontype,
NameStyle,
Title,
FirstName,
MiddleName,
LastName,
Suffix,
EmailPromotion,
rowguid,
ModifiedDate,
COUNT (*) AS DuplicateCount
FROM dba_PersonCopy
GROUP BY BusinessEntityID,
Persontype,
NameStyle,
Title,
FirstName,
MiddleName,
LastName,
Suffix,
EmailPromotion,
rowguid,
ModifiedDate
HAVING COUNT(*) >1;
Take a look at the DuplicateCount column and you'll see duplicates. So how do we go about eliminating them?
1. Removing duplicates by Over() clause with a Row_number() function and a partition.
Details about ROW_NUMBER() Function
here
SELECT BusinessEntityID,
Persontype,
NameStyle,
Title,
FirstName,
MiddleName,
LastName,
Suffix,
EmailPromotion,
rowguid,
ModifiedDate,
ROW_NUMBER()
OVER
(
PARTITION BY
BusinessEntityID,
Persontype,
NameStyle,
Title,
FirstName,
MiddleName,
LastName,
Suffix,
EmailPromotion,
rowguid,
ModifiedDate
ORDER BY BusinessEntityID
) AS RowNumber
FROM dba_PersonCopy;
When you execute this, you'll see duplicates will have the RowNumber column greater than 1. We now need a CTE (Common table expression) to delete the duplicate rows by using a WHERE clause
BEGIN TRAN
;WITH DupesCTE
AS
(SELECT BusinessEntityID,
Persontype,
NameStyle,
Title,
FirstName,
MiddleName,
LastName,
Suffix,
EmailPromotion,
rowguid,
ModifiedDate,
ROW_NUMBER()
OVER
(
PARTITION BY
BusinessEntityID,
Persontype,
NameStyle,
Title,
FirstName,
MiddleName,
LastName,
Suffix,
EmailPromotion,
rowguid,
ModifiedDate
ORDER BY BusinessEntityID
) AS RowNumber
FROM dba_PersonCopy
)
DELETE DupesCTE WHERE RowNumber >1
--ROLLBACK
-- 100 rows affected, duplicates removed!
Btw, I struggled for about half an hour with syntax issues here. It said there is a syntax error at the keyword 'AS'. The problem is, I missed a ";" at the beginnig!
Thanks to the
answer on this thread or I'd be stuck forever.
2.
Next method is to delete duplicates by using a Surrogate Key. We'll create a Key column to uniquely identify each column since such a column doesn't exist on the table already
ALTER TABLE dba_PersonCopy
ADD TheKey INT IDENTITY NOT NULL
CONSTRAINT PK_PersonCopyDuplicates PRIMARY KEY;
SELECT * FROM dba_PersonCopy
Now we issue a DELETE command to get rid of the duplicates
BEGIN TRAN
DELETE dba_PersonCopy
WHERE EXISTS (
SELECT * FROM dba_PersonCopy AS T1
WHERE T1.BusinessEntityID = [dba_PersonCopy].BusinessEntityID
AND T1.PersonType =[dba_PersonCopy].Persontype
AND T1.NameStyle = [dba_PersonCopy].NameStyle
AND T1.Title = [dba_PersonCopy].NameStyle
AND T1.FirstName = [dba_PersonCopy].FirstName
AND T1.LastName = [dba_PersonCopy].LastName
AND T1.MiddleName = [dba_PersonCopy].MiddleName
AND T1.Suffix = [dba_PersonCopy].Suffix
AND T1.EmailPromotion = [dba_PersonCopy].EmailPromotion
AND T1.rowguid = [dba_PersonCopy].rowguid
AND T1.ModifiedDate = [dba_PersonCopy].ModifiedDate
AND T1.TheKey > [dba_PersonCopy].TheKey)
--ROLLBACK
I'll just Drop the primary key and the column to test next example
ALTER TABLE dba_PersonCopy DROP CONSTRAINT PK_PersonCopyDuplicates;
ALTER TABLE dba_PersonCopy DROP COLUMN TheKey;
3. Deleting using SELECT DISTANT
SELECT DISTINCT
BusinessEntityID,
Persontype,
NameStyle,
Title,
FirstName,
MiddleName,
LastName,
Suffix,
EmailPromotion,
rowguid,
ModifiedDate INTO dba_PersonCopy_Noduplicates
FROM dba_PersonCopy;
SELECT COUNT(*) FROM dba_PersonCopy_NoDuplicates
-- 100 rows remaining
I like the CTE option the best of all these methods