How to synthesize attribute for joined tables

2019-04-18 10:42发布

问题:

I have a view defined like this:

CREATE VIEW [dbo].[PossiblyMatchingContracts] AS
SELECT 
    C.UniqueID,
    CC.UniqueID AS PossiblyMatchingContracts
FROM  [dbo].AllContracts AS C
    INNER JOIN [dbo].AllContracts AS CC
        ON C.SecondaryMatchCodeFB = CC.SecondaryMatchCodeFB
            OR C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeLB
            OR C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeBB
            OR C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeBB
            OR C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeLB
WHERE C.UniqueID NOT IN
    (
        SELECT UniqueID FROM [dbo].DefinitiveMatches
    )
    AND C.AssociatedUser IS NULL
    AND C.UniqueID <> CC.UniqueID

Which is basically finding contracts where f.e. the first name and the birthday are matching. This works great. Now I want to add a synthetic attribute to each row with the value from only one source row.

Let me give you an example to make it clearer. Suppose I have the following table:

UniqueID  | FirstName | LastName  | Birthday

1         | Peter     | Smith     | 1980-11-04
2         | Peter     | Gray      | 1980-11-04
3         | Peter     | Gray-Smith| 1980-11-04
4         | Frank     | May       | 1985-06-09
5         | Frank-Paul| May       | 1985-06-09
6         | Gina      | Ericson   | 1950-11-04

The resulting view should look like this:

UniqueID | PossiblyMatchingContracts | SyntheticID

1        | 2                         | PeterSmith1980-11-04
1        | 3                         | PeterSmith1980-11-04
2        | 1                         | PeterSmith1980-11-04
2        | 3                         | PeterSmith1980-11-04
3        | 1                         | PeterSmith1980-11-04
3        | 2                         | PeterSmith1980-11-04
4        | 5                         | FrankMay1985-06-09
5        | 4                         | FrankMay1985-06-09
6        | NULL                      | NULL [or] GinaEricson1950-11-04

Notice that the SyntheticID column uses ONLY values from one of the matching source rows. It doesn't matter which one. I am exporting this view to another application and need to be able to identify each "match group" afterwards.

Is it clear what I mean? Any ideas how this could be done in sql?

Maybe it helps to elaborate a bit on the actual use case:

I am importing contracts from different systems. To account for the possibility of typos or people that have married but the last name was only updated in one system, I need to find so called 'possible matches'. Two or more contracts are considered a possible match if they contain the same birthday plus the same first, last or birth name. That implies, that if contract A matches contract B, contract B also matches contract A.

The target system uses multivalue reference attributes to store these relationships. The ultimate goal is to create user objects for these contracts. The catch first is, that the shall only be one user object for multiple matching contracts. Thus I'm creating these matches in the view. The second catch is, that the creation of user objects happens by workflows, which run parallel for each contract. To avoid creating multiple user objects for matching contracts, each workflow needs to check, if there is already a matching user object or another workflow, which is about to create said user object. Because the workflow engine is extremely slow compared to sql, the workflows should not repeat the whole matching test. So the idea is, to let the workflow check only for the 'syntheticID'.

回答1:

I have solved it with a multi step approach:

  1. Create the list of possible 1st level matches for each contract
  2. Create the base groups list, assigning a different group for for each contract (as if they were not related to anybody)
  3. Iterate the matches list updating the group list when more contracts need to be added to a group
  4. Recursively build up the SyntheticID from final group list
  5. Output results

First of all, let me explain what I have understood, so you can tell if my approach is correct or not.

1) matching propagates in "cascade"

I mean, if "Peter Smith" is grouped up with "Peter Gray", it means that all Smith and all Gray are related (if they have the same birth date) so Luke Smith can be in the same group of John Gray

2) I have not understood what you mean with "Birth Name"

You say contracts matches on "first, last or birth name", sorry, I'm italian, I thought birth name and first were the same, also in your data there is not such column. Maybe it is related to that dash symbol between names?
When FirstName is Frank-Paul it means it should match both Frank and Paul?
When LastName is Gray-Smith it means it should match both Gray and Smith?

In following code I have simply ignored this problem, but it could be handled if needed (I already did a try, breaking names, unpivoting them and treating as double match).

Step Zero: some declaration and prepare base data

declare @cli as table (UniqueID int primary key, FirstName varchar(20), LastName varchar(20), Birthday varchar(20))
declare @comb as table (id1 int, id2 int, done bit)
declare @grp as table (ix int identity primary key, grp int, id int, unique (grp,ix))
declare @str_id as table (grp int primary key, SyntheticID varchar(1000))
declare @id1 as int, @g int

;with
t as (
    select *
    from (values
    (1         , 'Peter'     , 'Smith'     , '1980-11-04'),
    (2         , 'Peter'     , 'Gray'      , '1980-11-04'),
    (3         , 'Peter'     , 'Gray-Smith', '1980-11-04'),
    (4         , 'Frank'     , 'May'       , '1985-06-09'),
    (5         , 'Frank-Paul', 'May'       , '1985-06-09'),
    (6         , 'Gina'      , 'Ericson'   , '1950-11-04')
    ) x (UniqueID  , FirstName , LastName  , Birthday)
)
insert into @cli
select * from t

Step One: Create the list of possible 1st level matches for each contract

;with
p as(select UniqueID, Birthday, FirstName, LastName from @cli),
m as (
    select p.UniqueID UniqueID1, p.FirstName FirstName1, p.LastName LastName1, p.Birthday Birthday1, pp.UniqueID UniqueID2, pp.FirstName FirstName2, pp.LastName LastName2, pp.Birthday Birthday2
    from p
    join p pp on (pp.Birthday=p.Birthday) and (pp.FirstName = p.FirstName or pp.LastName = p.LastName)
    where p.UniqueID<=pp.UniqueID
)
insert into @comb
select UniqueID1,UniqueID2,0
from m

Step Two: Create the base groups list

insert into @grp
select ROW_NUMBER() over(order by id1), id1 from @comb where id1=id2

Step Three: Iterate the matches list updating the group list Only loop on contracts that have possible matches and updates only if needed

set @id1 = 0
while not(@id1 is null) begin
    set @id1 = (select top 1 id1 from @comb where id1<>id2 and done=0)

    if not(@id1 is null) begin

        set @g = (select grp from @grp where id=@id1)
        update g set grp= @g
        from @grp g
        inner join @comb c on g.id = c.id2
        where c.id2<>@id1 and c.id1=@id1
        and grp<>@g

        update @comb set done=1 where id1=@id1
    end
end

Step Four: Build up the SyntheticID Recursively add ALL (distinct) first and last names of group to SyntheticID.
I used '_' as separator for birth date, first names and last names, and ',' as separator for the list of names to avoid conflicts.

;with
c as(
    select c.*, g.grp
    from @cli c
    join @grp g on g.id = c.UniqueID
),
d as (
    select *, row_number() over (partition by g order by t,s) n1, row_number() over (partition by g order by t desc,s desc) n2
    from (
        select distinct c.grp g, 1 t, FirstName s from c
        union 
        select distinct c.grp, 2, LastName from c 
        ) l
),
r as (
    select d.*, cast(CONVERT(VARCHAR(10), t.Birthday, 112) + '_' + s as varchar(1000)) Names, cast(0 as bigint) i1, cast(0 as bigint) i2
    from d
    join @cli t on t.UniqueID=d.g
    where n1=1
    union all
    select d.*, cast(r.names + IIF(r.t<>d.t,'_',',') +  d.s as varchar(1000)), r.n1, r.n2
    from d
    join r on r.g = d.g and r.n1=d.n1-1 
)
insert into @str_id 
select g, Names
from r
where n2=1

Step Five: Output results

select c.UniqueID, case when id2=UniqueID then id1 else id2 end PossibleMatchingContract, s.SyntheticID
from @cli c
left join @comb cb on c.UniqueID in(id1,id2) and id1<>id2
left join @grp g on c.UniqueID = g.id
left join @str_id s on s.grp = g.grp

Here is the results

UniqueID    PossibleMatchingContract    SyntheticID
1           2                           1980-11-04_Peter_Gray,Gray-Smith,Smith
1           3                           1980-11-04_Peter_Gray,Gray-Smith,Smith
2           1                           1980-11-04_Peter_Gray,Gray-Smith,Smith
2           3                           1980-11-04_Peter_Gray,Gray-Smith,Smith
3           1                           1980-11-04_Peter_Gray,Gray-Smith,Smith
3           2                           1980-11-04_Peter_Gray,Gray-Smith,Smith
4           5                           1985-06-09_Frank,Frank-Paul_May
5           4                           1985-06-09_Frank,Frank-Paul_May
6           NULL                        1950-11-04_Gina_Ericson

I think that in this way the resulting SyntheticID should also be "unique" for each group



回答2:

This creates a synthetic value and is easy to change to suit your needs.

DECLARE @T TABLE (
    UniqueID INT
    ,FirstName VARCHAR(200)
    ,LastName  VARCHAR(200)
    ,Birthday DATE
)

INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 1,'Peter','Smith','1980-11-04'
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 2,'Peter','Gray','1980-11-04'
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 3,'Peter','Gray-Smith','1980-11-04'
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 4,'Frank','May','1985-06-09'
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 5,'Frank-Paul','May','1985-06-09'
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 6,'Gina','Ericson','1950-11-04'

DECLARE @PossibleMatches TABLE (UniqueID INT,[PossibleMatch] INT,SynKey VARCHAR(2000)
)

INSERT INTO @PossibleMatches
    SELECT t1.UniqueID [UniqueID],t2.UniqueID [Possible Matches],'Ln=' + t1.LastName + ' Fn=' +  + t1.FirstName + ' DoB=' + CONVERT(VARCHAR,t1.Birthday,102) [SynKey]
    FROM @T t1
    INNER JOIN @T t2 ON t1.Birthday=t2.Birthday
        AND t1.FirstName=t2.FirstName
        AND t1.LastName=t2.LastName
        AND t1.UniqueID<>t2.UniqueID

INSERT INTO @PossibleMatches
    SELECT t1.UniqueID [UniqueID],t2.UniqueID [Possible Matches],'Fn=' + t1.FirstName + ' DoB=' + CONVERT(VARCHAR,t1.Birthday,102) [SynKey]
    FROM @T t1
    INNER JOIN @T t2 ON t1.Birthday=t2.Birthday
        AND t1.FirstName=t2.FirstName
        AND t1.UniqueID<>t2.UniqueID

INSERT INTO @PossibleMatches
    SELECT t1.UniqueID,t2.UniqueID,'Ln=' + t1.LastName + ' DoB=' + CONVERT(VARCHAR,t1.Birthday,102) [SynKey]
    FROM @T t1
    INNER JOIN @T t2 ON t1.Birthday=t2.Birthday
        AND t1.LastName=t2.LastName
        AND t1.UniqueID<>t2.UniqueID

INSERT INTO @PossibleMatches
    SELECT t1.UniqueID,pm.UniqueID,'Ln=' + t1.LastName + ' Fn=' +  + t1.FirstName + ' DoB=' + CONVERT(VARCHAR,t1.Birthday,102) [SynKey]
    FROM @T t1
    LEFT JOIN @PossibleMatches pm on pm.UniqueID=t1.UniqueID
    WHERE pm.UniqueID IS NULL

SELECT *
FROM @PossibleMatches
ORDER BY UniqueID,[PossibleMatch]


回答3:

I think this will work for you

SELECT 
    C.UniqueID,
    CC.UniqueID AS PossiblyMatchingContracts,
    FIRST_VALUE(CC.FirstName+CC.LastName+CC.Birthday) 
          OVER (PARTITION BY C.UniqueID ORDER BY CC.UniqueID) as SyntheticID
FROM 
    [dbo].AllContracts AS C INNER JOIN
    [dbo].AllContracts AS CC ON
        C.SecondaryMatchCodeFB = CC.SecondaryMatchCodeFB OR
        C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeLB OR
        C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeBB OR
        C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeBB OR
        C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeLB
WHERE 
    C.UniqueID NOT IN(
    SELECT UniqueID FROM [dbo].DefinitiveMatches)
AND C.AssociatedUser IS NULL


回答4:

You can try this:

SELECT 
    C.UniqueID,
    CC.UniqueID AS PossiblyMatchingContracts,
    FIRST_VALUE(CC.FirstName+CC.LastName+CC.Birthday) 
          OVER (PARTITION BY C.UniqueID ORDER BY CC.UniqueID) as SyntheticID
FROM 
    [dbo].AllContracts AS C
INNER JOIN
    [dbo].AllContracts AS CC
ON
        C.SecondaryMatchCodeFB = CC.SecondaryMatchCodeFB
    OR
        C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeLB
    OR
        C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeBB
    OR
        C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeBB
    OR
        C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeLB
WHERE 
    C.UniqueID NOT IN
    (
        SELECT UniqueID FROM [dbo].DefinitiveMatches
    )
AND
    C.AssociatedUser IS NULL

This will generate one extra row (because we left out C.UniqueID <> CC.UniqueID) but will give you the good souluton.



回答5:

Following an example with some example data extracted from your original post. The idea: Generate all SyntheticID in a CTE, query all records with a "PossibleMatch" and Union it with all records which are not yet included:

DECLARE @t TABLE(
  UniqueID int
 ,FirstName nvarchar(20)
 ,LastName nvarchar(20)
 ,Birthday datetime
)

INSERT INTO @t VALUES (1, 'Peter', 'Smith', '1980-11-04');
INSERT INTO @t VALUES (2, 'Peter', 'Gray', '1980-11-04');
INSERT INTO @t VALUES (3, 'Peter', 'Gray-Smith', '1980-11-04');
INSERT INTO @t VALUES (4, 'Frank', 'May', '1985-06-09');
INSERT INTO @t VALUES (5, 'Frank-Paul', 'May', '1985-06-09');
INSERT INTO @t VALUES (6, 'Gina', 'Ericson', '1950-11-04');


WITH ctePrep AS(
SELECT UniqueID, FirstName, LastName, BirthDay,
       ROW_NUMBER() OVER (PARTITION BY FirstName, BirthDay ORDER BY FirstName, BirthDay) AS k,
       FirstName+LastName+CONVERT(nvarchar(10), Birthday, 126) AS SyntheticID
  FROM @t
),
cteKeys AS(
SELECT FirstName, BirthDay, SyntheticID
  FROM ctePrep
  WHERE k = 1
),
cteFiltered AS(
SELECT 
    C.UniqueID,
    CC.UniqueID AS PossiblyMatchingContracts,
    keys.SyntheticID
FROM @t AS C
JOIN @t AS CC ON C.FirstName = CC.FirstName
              AND C.Birthday = CC.Birthday
JOIN cteKeys AS keys ON keys.FirstName = c.FirstName
                  AND keys.Birthday = c.Birthday
WHERE C.UniqueID <> CC.UniqueID
)
SELECT UniqueID, PossiblyMatchingContracts, SyntheticID
  FROM cteFiltered
UNION ALL
SELECT UniqueID, NULL, FirstName+LastName+CONVERT(nvarchar(10), Birthday, 126) AS SyntheticID
  FROM @t
  WHERE UniqueID NOT IN (SELECT UniqueID FROM cteFiltered)

Hope this helps. The result looked OK to me:

UniqueID    PossiblyMatchingContracts   SyntheticID
---------------------------------------------------------------
2           1                           PeterSmith1980-11-04
3           1                           PeterSmith1980-11-04
1           2                           PeterSmith1980-11-04
3           2                           PeterSmith1980-11-04
1           3                           PeterSmith1980-11-04
2           3                           PeterSmith1980-11-04
4           NULL                        FrankMay1985-06-09
5           NULL                        Frank-PaulMay1985-06-09
6           NULL                        GinaEricson1950-11-04


回答6:

Tested in SSMS, it works perfect. :)

--create table structure
create table #temp
(
    uniqueID int,
    firstname varchar(15),
    lastname varchar(15),
    birthday date
)

--insert data into the table
insert #temp
select 1, 'peter','smith','1980-11-04'
union all
select 2, 'peter','gray','1980-11-04'
union all
select 3, 'peter','gray-smith','1980-11-04'
union all
select 4, 'frank','may','1985-06-09'
union all
select 5, 'frank-paul','may','1985-06-09'
union all
select 6, 'gina','ericson','1950-11-04'

select * from #temp

--solution is as below

select ab.uniqueID
, PossiblyMatchingContracts
, c.firstname+c.lastname+cast(c.birthday as varchar) as synID
from
(
    select a.uniqueID
            , case 
                when  a.uniqueID < min(b.uniqueID)over(partition by a.uniqueid)
                    then a.uniqueID
                else min(b.uniqueID)over(partition by a.uniqueid)
            end as SmallestID
            , b.uniqueID as PossiblyMatchingContracts
        from #temp a
        left join #temp b
        on (a.firstname = b.firstname OR a.lastname = b.lastname) AND a.birthday = b.birthday AND a.uniqueid <> b.uniqueID
) as ab
left join #temp c
on ab.SmallestID = c.uniqueID

Result capture is attached below:



回答7:

Say we have following table (a VIEW in your case):

UniqueID    PossiblyMatchingContracts   SyntheticID
1           2                           G1
1           3                           G2
2           1                           G3
2           3                           G4
3           1                           G4
3           4                           G6
4           5                           G7
5           4                           G8
6           NULL                        G9

In your case you can set initial SyntheticID as a string like PeterSmith1980-11-04 using UniqueID for each line. Here is a recursive CTE query it divides all lines to unconnected groups and select MAX(SyntheticId) in the current group as a new SyntheticID for all lines in this group.

WITH CTE AS 
(
  SELECT CAST(','+CAST(UniqueID AS Varchar(100)) +','+ CAST(PossiblyMatchingContracts as Varchar(100))+',' as Varchar(MAX)) as GroupCont,
        SyntheticID
  FROM PossiblyMatchingContracts
  UNION ALL
  SELECT CAST(GroupCont+CAST(UniqueID AS Varchar(100)) +','+ CAST(PossiblyMatchingContracts as Varchar(100))+',' AS Varchar(MAX)) as GroupCont,
         pm.SyntheticID 
  FROM CTE 
  JOIN PossiblyMatchingContracts as pm
  ON 
     (
       CTE.GroupCont LIKE '%,'+CAST(pm.UniqueID AS Varchar(100))+',%'
       OR 
       CTE.GroupCont LIKE '%,'+CAST(pm.PossiblyMatchingContracts AS Varchar(100))+',%'
     )
     AND NOT  
     (
       CTE.GroupCont LIKE '%,'+CAST(pm.UniqueID AS Varchar(100))+',%'
       AND 
       CTE.GroupCont LIKE '%,'+CAST(pm.PossiblyMatchingContracts AS Varchar(100))+',%'
     )
)
SELECT pm.UniqueID,
       pm.PossiblyMatchingContracts,  
       ISNULL(
        (SELECT MAX(SyntheticID) FROM CTE WHERE 
        (
           CTE.GroupCont LIKE '%,'+CAST(pm.UniqueID AS Varchar(100))+',%'
           OR 
           CTE.GroupCont LIKE '%,'+CAST(pm.PossiblyMatchingContracts AS Varchar(100))+',%'
        ))
        ,pm.SyntheticID) as SyntheticID
FROM PossiblyMatchingContracts pm