I have the following structure for the table DataTable
: every column is of the datatype int, RowID
is an identity column and the primary key. LinkID
is a foreign key and links to rows of an other table.
RowID LinkID Order Data DataSpecifier
1 120 1 1 1
2 120 2 1 3
3 120 3 1 10
4 120 4 1 13
5 120 5 1 10
6 120 6 1 13
7 371 1 6 2
8 371 2 3 5
9 371 3 8 1
10 371 4 10 1
11 371 5 7 2
12 371 6 3 3
13 371 7 7 2
14 371 8 17 4
.................................
.................................
I'm trying to do a query which alters every LinkID
batch in the following way:
- Take every row with same
LinkID
(e.g. the first batch is the first 6 rows here)
- Order them by the
Order
column
- Look at
Data
and DataSpecifier
columns as one compare unit (They can be thought as one column, called dataunit
):
- Keep as many rows from
Order
1 onwards, until a duplicate dataunit
comes by
- Delete every row from that first duplicate onwards for that
LinkID
So for the LinkID
120
:
- Sort the batch (already sorted here, but should still do it)
- Start looking from the top (So
Order=1
here), go as long as you don't see a duplicate.
- Stop at the first duplicate
Order = 5
(dataunit
1 10
was already seen).
- Delete everything which has the
LinkID=120 AND Order>=5
After similar process for LinkID
371
(and every other LinkID
in the table), the processed table will look like this:
RowID LinkID Order Data DataSpecifier
1 120 1 1 1
2 120 2 1 3
3 120 3 1 10
4 120 4 1 13
7 371 1 6 2
8 371 2 3 5
9 371 3 8 1
10 371 4 10 1
11 371 5 7 2
12 371 6 3 3
.................................
.................................
I've done quite a lot of SQL queries, but never something this complicated. I know I need to use a query which is something like this:
DELETE FROM DataTable
WHERE RowID IN (SELECT RowID
FROM DataTable
WHERE -- ?
GROUP BY LinkID
HAVING COUNT(*) > 1 -- ?
ORDER BY [Order]);
But I just can't seem to wrap my head around this and get the query right. I would preferably do this in pure SQL, with one executable (and reusable) query.
You can use the ROW_NUMBER()
window function to identify any rows that come after the original. After that you can delete and rows with a matching LinkID
and a greater than or equal to any encountered Order
with a row number greater than one.
(I originally used a second CTE to get the MIN order
, but I realized that it wasn't necessary as long as the join to order
was greater than equal to any order
where there was a second instance of the DataUnitId. By removing the MIN
the query plan became quite simple and efficient.)
WITH DataUnitInstances AS (
SELECT *
, ROW_NUMBER() OVER
(PARTITION BY LinkID, [Data], [DataSpecifier] ORDER BY [Order]) DataUnitInstanceId
FROM DataTable
)
DELETE FROM DataTable
FROM DataTable dt
INNER JOIN DataUnitInstances dup ON dup.LinkID = dt.LinkID
AND dup.[Order] <= dt.[Order]
AND dup.DataUnitInstanceId > 1
Here is the output from your sample data which matches your desired result:
+-------+--------+-------+------+---------------+
| RowID | LinkID | Order | Data | DataSpecifier |
+-------+--------+-------+------+---------------+
| 1 | 120 | 1 | 1 | 1 |
| 2 | 120 | 2 | 1 | 3 |
| 3 | 120 | 3 | 1 | 10 |
| 4 | 120 | 4 | 1 | 13 |
| 7 | 371 | 1 | 6 | 2 |
| 8 | 371 | 2 | 3 | 5 |
| 9 | 371 | 3 | 8 | 1 |
| 10 | 371 | 4 | 10 | 1 |
| 11 | 371 | 5 | 7 | 2 |
| 12 | 371 | 6 | 3 | 3 |
+-------+--------+-------+------+---------------+
We can try using a CTE here to make things easier:
WITH cte AS (
SELECT *,
COUNT(*) OVER (PARTITION BY LinkID, Data, DataSpecifier ORDER BY [Order]) - 1 cnt
FROM DataTable
),
cte2 AS (
SELECT *,
SUM(cnt) OVER (PARTITION BY LinkID ORDER BY [Order]) num
FROM cte
)
DELETE
FROM cte
WHERE num > 0;
The logic here is to use COUNT
as an analytic function to identify the duplicate records. We use a partition of LinkID
along with Data
and DataSpecifier
. Any record with an Order
value greater than or equal to the first record with a non zero count is then targeted for deletion.
Here is a demo showing that the logic of the CTE is correct:
Demo
This solution uses an APPLY
to find the minimum order by each Link.
Set up:
IF OBJECT_ID('tempdb..#YourData') IS NOT NULL
DROP TABLE #YourData
CREATE TABLE #YourData (
RowID INT,
LinkID INT,
[Order] INT,
Data INT,
DataSpecifier INT)
INSERT INTO #YourData (
RowID,
LinkID,
[Order],
Data,
DataSpecifier)
VALUES
('1', ' 120', '1', '1', ' 1'),
('2', ' 120', '2', '1', ' 3'),
('3', ' 120', '3', '1', ' 10'),
('4', ' 120', '4', '1', ' 13'),
('5', ' 120', '5', '1', ' 10'),
('6', ' 120', '6', '1', ' 13'),
('7', ' 371', '1', '6', ' 2'),
('8', ' 371', '2', '3', ' 5'),
('9', ' 371', '3', '8', ' 1'),
('10', '371', '4', '10', '1'),
('11', '371', '5', '7', ' 2'),
('12', '371', '6', '3', ' 3'),
('13', '371', '7', '7', ' 2'),
('14', '371', '8', '17', '4')
Solution:
;WITH MinOrderToDeleteByLinkID AS
(
SELECT
T.LinkID,
MinOrder = MIN(C.[Order])
FROM
#YourData AS T
OUTER APPLY (
SELECT TOP 1
C.*
FROM
#YourData AS C
WHERE
C.LinkID = T.LinkID AND
C.Data = T.Data AND
C.DataSpecifier = T.DataSpecifier AND
C.[Order] > T.[Order]
ORDER BY
T.[Order]) AS C
GROUP BY
T.LinkID
)
DELETE Y FROM
-- SELECT Y.* FROM
#YourData AS Y
INNER JOIN MinOrderToDeleteByLinkID AS M ON
Y.LinkID = M.LinkID AND
Y.[Order] >= M.MinOrder
The rows to be deleted from this are the following:
RowID LinkID Order Data DataSpecifier
5 120 5 1 10
6 120 6 1 13
13 371 7 7 2
14 371 8 17 4
... which correspond to the point where the tuple Data
-DataSpecified
start to repeat for a particular LinkID
.