I'm looking for a schema-independent query. That is, if I have a users
table or a purchases
table, the query should be equally capable of catching duplicate rows in either table without any modification (other than the from
clause, of course).
I'm using T-SQL, but I'm guessing there should be a general solution.
I believe that this should work for you. Keep in mind that CHECKSUM() isn't 100% perfect - it's theoretically possible to get a false positive here (I think), but otherwise you can just change the table name and this should work:
;WITH cte AS (
SELECT
*,
CHECKSUM(*) AS chksum,
ROW_NUMBER() OVER(ORDER BY GETDATE()) AS row_num
FROM
My_Table
)
SELECT
*
FROM
CTE T1
INNER JOIN CTE T2 ON
T2.chksum = T1.chksum AND
T2.row_num <> T1.row_num
The ROW_NUMBER()
is needed so that you have some way of distinguishing rows. It requires an ORDER BY
and that can't be a constant, so GETDATE()
was my workaround for that.
Simply change the table name in the CTE and it should work without spelling out the columns.
I'm still confused about what "detecting them might be" but I'll give it a shot.
Excluding them is easy
e.g.
SELECT DISTINCT * FROM USERS
However if you wanted to only include them and a duplicate is all the fields than you have to do
SELECT
[Each and every field]
FROM
USERS
GROUP BY
[Each and every field]
HAVING COUNT(*) > 1
You can't get away with just using (*) because you can't GROUP BY *
so this requirement from your comments is difficult
a schema-independent means I don't want to specify all of the columns
in the query
Unless that is you want to use dynamic SQL and read the columns from sys.columns
or information_schema.columns
For example
DECLARE @colunns nvarchar(max)
SET @colunns = ''
SELECT @colunns = @colunns + '[' + COLUMN_NAME +'], '
FROM INFORMATION_SCHEMA.columns
WHERE table_name = 'USERS'
SET @colunns = left(@colunns,len(@colunns ) - 1)
DECLARE @SQL nvarchar(max)
SET @SQL = 'SELECT ' + @colunns
+ 'FROM USERS' + 'GROUP BY '
+ @colunns
+ ' Having Count(*) > 1'
exec sp_executesql @SQL
Please note you should read this The Curse and Blessings of Dynamic SQL if you haven't already
I have done this using CTEs in SQL Server.
Here is a sample on how to delete dupes but you should be able to adapt it easily to find dupes:
WITH CTE (COl1, Col2, DuplicateCount)
AS
(
SELECT COl1,Col2,
ROW_NUMBER() OVER(PARTITION BY COl1,Col2 ORDER BY Col1) AS DuplicateCount
FROM DuplicateRcordTable
)
DELETE
FROM CTE
WHERE DuplicateCount > 1
GO
Here is a link to an article where I got the SQL:
http://blog.sqlauthority.com/2009/06/23/sql-server-2005-2008-delete-duplicate-rows/
I recently was looking into the same issue and noticed this question.
I managed to solve it using a stored procedure with some dynamic SQL. This way you only need to specify the table name. And it will get all the other relevant data from sys tables.
/*
This SP returns all duplicate rows (1 line for each duplicate) for any given table.
to use the SP:
exec [database].[dbo].[sp_duplicates]
@table = '[database].[schema].[table]'
*/
create proc dbo.sp_duplicates @table nvarchar(50) as
declare @query nvarchar(max)
declare @groupby nvarchar(max)
set @groupby = stuff((select ',' + [name]
FROM sys.columns
WHERE object_id = OBJECT_ID(@table)
FOR xml path('')), 1, 1, '')
set @query = 'select *, count(*)
from '+@table+'
group by '+@groupby+'
having count(*) > 1'
exec (@query)