SQL grouping similar values together

2019-09-04 10:29发布

问题:

I am having a problem to group data. I have used the group by clause all the while, but this time I want to group similar valued data together. The need is I have some jobs which have sequence numbers associated. If I can group the jobs together which have more or less the same sequence numbers then I can run a lesser number of jobs.

The data I have is like this.

    JobID         Sequence
    A01            8               
    A01            6
    A01            10 
    A02            5
    A02            10
    A02            4
    A02            2
    A03            8
    A03            3
    A03            6
    A03            10
    A04            5
    A04            4
    A04            2
    A04            9
    A04            10

From the above data it can be seen that A02 and A04 have more in common and A01 and A03 have more in common.

What query should be done to get similar data group together as below to group A02 and A04 together and A01 and A03 together, the order on the sequence column being immaterial.

    JobID         Sequence
    A01            8               
    A01            6
    A01            10 
    A03            8
    A03            3
    A03            6
    A03            10
    A02            5
    A02            10
    A02            4
    A02            2
    A04            5
    A04            4
    A04            2
    A04            9
    A04            10

Thanks for the time

david

PS - More explanation.

In the above list JobID A01 contains sequence (8,6,10) A02 contains sequence (5,10,4,2) A03 contains sequence (8,3,6,10) A04 contains sequence (5,4,2,9,10)

So Job A01 and Job A03 have similar sequence numbers and Job A02 and Job A03 have similar sequence numbers. I want to group them based on similar sequence numbers There are many other jobs which have sequences which might match some other job's sequence numbers. I just included 4 jobs to keep the list small.

回答1:

This is a little more complex of a problem then I feel like thinking all the way through right now, but I'll give you an idea to start with and maybe someone else can help you complete it...

Join the table to itself like so:

Select  A.JobID, A.Sequence, Count(*) 

from    TheTable A join
        TheTable B on A.JobID <> B.JobID and A.Sequence = B.Sequence

group by A.JobID

I haven't tested that so there could be typos, but you get the idea hopefully. Notice you're joining where the job is not the same, but the sequence is.



回答2:

Just inferring from other answers... something that may help.

Here you have for every two jobids how simmilar are:

http://sqlfiddle.com/#!3/c28be/9

Create table Data(Job nvarchar(10), seq int);

insert into data
SELECT 'A01' ,8  UNION ALL
SELECT 'A01',6 UNION ALL
SELECT 'A01',10 UNION ALL
SELECT 'A02',5 UNION ALL
SELECT 'A02',10 UNION ALL
SELECT 'A02',4 UNION ALL
SELECT 'A02',2 UNION ALL
SELECT 'A03',8 UNION ALL
SELECT 'A03',3 UNION ALL
SELECT 'A03',6 UNION ALL
SELECT 'A03',10 UNION ALL
SELECT 'A04',5 UNION ALL
SELECT 'A04',4 UNION ALL
SELECT 'A04',2 UNION ALL
SELECT 'A04',9 UNION ALL
SELECT 'A04',10;


select 
  d1.job as j1, 
  d2.job as j2, 
  count(*) cnt 
from Data d1 inner join Data d2 on (d1.seq = d2.seq and d1.job < d2.job)
group by d1.job, d2.job
;


回答3:

Building on Brandon Moores answer:

Data setup:

DECLARE @Data TABLE (JobId nvarchar(10), Sequence int)
INSERT INTO @Data(JobId, Sequence)
SELECT 'A01',8 UNION ALL
SELECT 'A01',6 UNION ALL
SELECT 'A01',10 UNION ALL
SELECT 'A02',5 UNION ALL
SELECT 'A02',10 UNION ALL
SELECT 'A02',4 UNION ALL
SELECT 'A02',2 UNION ALL
SELECT 'A03',8 UNION ALL
SELECT 'A03',3 UNION ALL
SELECT 'A03',6 UNION ALL
SELECT 'A03',10 UNION ALL
SELECT 'A04',5 UNION ALL
SELECT 'A04',4 UNION ALL
SELECT 'A04',2 UNION ALL
SELECT 'A04',9 UNION ALL
SELECT 'A04',10 UNION ALL
SELECT 'A05',100 

Find totals of all sequences each JobID has in common, order those by most to least, output all the data from each JobId depending on that order:

;WITH cte AS (
    SELECT  A.JobID, A.Sequence, Count(*) AS [SequencesInCommon]
    FROM    @Data A 
    LEFT OUTER JOIN
        @Data B on A.JobID <> B.JobID and A.Sequence = B.Sequence
    GROUP BY A.JobID, A.Sequence
),    
cte2 AS (
    SELECT JobID, SUM(SequencesInCommon) AS Total 
    FROM cte
    GROUP BY JobID
)    
SELECT  d.JobId, d.Sequence
FROM    cte2 c
INNER JOIN @Data d on c.jobID = d.JobID
ORDER BY c.Total ASC, c.JobID ASC

Gives:

JobId      Sequence
---------- -----------
A05        100
A01        8
A01        6
A01        10
A03        8
A03        3
A03        6
A03        10
A02        5
A02        10
A02        4
A02        2
A04        5
A04        4
A04        2
A04        9
A04        10

(17 row(s) affected)

Should do it :)