Once again, this is from one of the chapters from the book SQL Server 2008 Bible. I have used other sources like MSDN, Stackoverflow and Google in general in writing this post :)
Removing Duplicates
Lets create a table and populate it with some values
The Original table has Primary Key on Business_Entity_Id, but the table we generated using SELECT INTO will not inherit primary keys, indexes etc. Since we would like to deal with duplicates on a table that doesn't have a primary key this suits our purpose.So lets go ahead and add insert top 50 records a couple of times so that we have 200 rows, 100 of them duplicate and waiting to be removed. I added 100 records 50 at a time by using GO [Count] syntax. Check it out, its cool.
Lets check if the table indeed has duplicates
At this point I faced an error.
It seems the demographics column and AdditionalContactInfo are the offending ones. While I need to study more on this problem, since its out of topic, I will just get rid of the columns
Modified query to check duplicates
1. Removing duplicates by Over() clause with a Row_number() function and a partition. Details about ROW_NUMBER() Function here
When you execute this, you'll see duplicates will have the RowNumber column greater than 1. We now need a CTE (Common table expression) to delete the duplicate rows by using a WHERE clause
2. Next method is to delete duplicates by using a Surrogate Key. We'll create a Key column to uniquely identify each column since such a column doesn't exist on the table already
Now we issue a DELETE command to get rid of the duplicates
3. Deleting using SELECT DISTANT
I like the CTE option the best of all these methods
Removing Duplicates
Lets create a table and populate it with some values
USE AdventureWorks; GO SELECT TOP 100 * INTO dba_PersonCopy FROM Person.Person GO --Lets see the data SELECT * FROM dba_PersonCopy GO
The Original table has Primary Key on Business_Entity_Id, but the table we generated using SELECT INTO will not inherit primary keys, indexes etc. Since we would like to deal with duplicates on a table that doesn't have a primary key this suits our purpose.So lets go ahead and add insert top 50 records a couple of times so that we have 200 rows, 100 of them duplicate and waiting to be removed. I added 100 records 50 at a time by using GO [Count] syntax. Check it out, its cool.
INSERT INTO dba_PersonCopy SELECT TOP 50 * FROM Person.Person GO 2
Lets check if the table indeed has duplicates
SELECT BusinessEntityID, Persontype, NameStyle, Title, FirstName, MiddleName, LastName, Suffix, EmailPromotion, AdditionalContactInfo, Demographics, rowguid, ModifiedDate, COUNT (*) AS DuplicateCount FROM dba_PersonCopy GROUP BY BusinessEntityID, Persontype, NameStyle, Title, FirstName, MiddleName, LastName, Suffix, EmailPromotion, AdditionalContactInfo, Demographics, rowguid, ModifiedDate HAVING COUNT(*) >1;
At this point I faced an error.
Error: Msg 305, Level 16, State 1, Line 25 The XML data type cannot be compared or sorted, except when using the IS NULL operator.
It seems the demographics column and AdditionalContactInfo are the offending ones. While I need to study more on this problem, since its out of topic, I will just get rid of the columns
ALTER TABLE dba_PersonCopy DROP COLUMN Demographics; ALTER TABLE dba_PersonCopy DROP COLUMN AdditionalContactInfo; GO
Modified query to check duplicates
SELECT BusinessEntityID, Persontype, NameStyle, Title, FirstName, MiddleName, LastName, Suffix, EmailPromotion, rowguid, ModifiedDate, COUNT (*) AS DuplicateCount FROM dba_PersonCopy GROUP BY BusinessEntityID, Persontype, NameStyle, Title, FirstName, MiddleName, LastName, Suffix, EmailPromotion, rowguid, ModifiedDate HAVING COUNT(*) >1;Take a look at the DuplicateCount column and you'll see duplicates. So how do we go about eliminating them?
1. Removing duplicates by Over() clause with a Row_number() function and a partition. Details about ROW_NUMBER() Function here
SELECT BusinessEntityID, Persontype, NameStyle, Title, FirstName, MiddleName, LastName, Suffix, EmailPromotion, rowguid, ModifiedDate, ROW_NUMBER() OVER ( PARTITION BY BusinessEntityID, Persontype, NameStyle, Title, FirstName, MiddleName, LastName, Suffix, EmailPromotion, rowguid, ModifiedDate ORDER BY BusinessEntityID ) AS RowNumber FROM dba_PersonCopy;
When you execute this, you'll see duplicates will have the RowNumber column greater than 1. We now need a CTE (Common table expression) to delete the duplicate rows by using a WHERE clause
BEGIN TRAN ;WITH DupesCTE AS (SELECT BusinessEntityID, Persontype, NameStyle, Title, FirstName, MiddleName, LastName, Suffix, EmailPromotion, rowguid, ModifiedDate, ROW_NUMBER() OVER ( PARTITION BY BusinessEntityID, Persontype, NameStyle, Title, FirstName, MiddleName, LastName, Suffix, EmailPromotion, rowguid, ModifiedDate ORDER BY BusinessEntityID ) AS RowNumber FROM dba_PersonCopy ) DELETE DupesCTE WHERE RowNumber >1 --ROLLBACK -- 100 rows affected, duplicates removed!Btw, I struggled for about half an hour with syntax issues here. It said there is a syntax error at the keyword 'AS'. The problem is, I missed a ";" at the beginnig! Thanks to the answer on this thread or I'd be stuck forever.
2. Next method is to delete duplicates by using a Surrogate Key. We'll create a Key column to uniquely identify each column since such a column doesn't exist on the table already
ALTER TABLE dba_PersonCopy ADD TheKey INT IDENTITY NOT NULL CONSTRAINT PK_PersonCopyDuplicates PRIMARY KEY; SELECT * FROM dba_PersonCopy
Now we issue a DELETE command to get rid of the duplicates
BEGIN TRAN DELETE dba_PersonCopy WHERE EXISTS ( SELECT * FROM dba_PersonCopy AS T1 WHERE T1.BusinessEntityID = [dba_PersonCopy].BusinessEntityID AND T1.PersonType =[dba_PersonCopy].Persontype AND T1.NameStyle = [dba_PersonCopy].NameStyle AND T1.Title = [dba_PersonCopy].NameStyle AND T1.FirstName = [dba_PersonCopy].FirstName AND T1.LastName = [dba_PersonCopy].LastName AND T1.MiddleName = [dba_PersonCopy].MiddleName AND T1.Suffix = [dba_PersonCopy].Suffix AND T1.EmailPromotion = [dba_PersonCopy].EmailPromotion AND T1.rowguid = [dba_PersonCopy].rowguid AND T1.ModifiedDate = [dba_PersonCopy].ModifiedDate AND T1.TheKey > [dba_PersonCopy].TheKey) --ROLLBACKI'll just Drop the primary key and the column to test next example
ALTER TABLE dba_PersonCopy DROP CONSTRAINT PK_PersonCopyDuplicates; ALTER TABLE dba_PersonCopy DROP COLUMN TheKey;
3. Deleting using SELECT DISTANT
SELECT DISTINCT BusinessEntityID, Persontype, NameStyle, Title, FirstName, MiddleName, LastName, Suffix, EmailPromotion, rowguid, ModifiedDate INTO dba_PersonCopy_Noduplicates FROM dba_PersonCopy; SELECT COUNT(*) FROM dba_PersonCopy_NoDuplicates -- 100 rows remaining
I like the CTE option the best of all these methods
Hiya! Let me first tell that you definitely managed to create a marvelous portal. Also I can't wait to ask you one thing that I am interested about. Do you participate in some blogging competitions?
ReplyDelete