Finding each customer group's most recent acco

I have a table that contains customers information. Each customer is assigned a Customer ID (their SSN) that they retain as they open more accounts. Two customers may be on the same account, each with their own ID. The account numbers are not ordered by date.

I would like to find the most recent account of each customer or group of customers. If two customers have ever been on an account together, I want to return the most recent account either customer has been on.

Here is a sample table with some of the possible cases.

Example table ACCT:

acctnumber  date            Cust1ID     Cust2ID 
10000       '2016-02-01'    1110        NULL    --Case0-customer has only ever had
                                                --one account

10001       '2016-02-01'    1111        NULL    --Case1-one customer has multiple
10050       '2017-02-01'    1111        NULL    --accounts
400050      '2017-06-01'    1111        NULL
10089       '2017-12-08'    1111        NULL

10008       '2016-02-01'    1120        NULL    --Case2-customer has account(s) and later
10038       '2016-04-01'    1120        NULL
10058       '2017-02-03'    1120        1121    --gets account(s) with another customer

10002       '2016-02-01'    1112        NULL    --Case3-customer has account(s) and later
10052       '2017-02-02'    1113        1112    --becomes the second customer on another
10152       '2017-05-02'    1113        1112    --account(s)

10003       '2016-02-02'    1114        1115    --Case4-customer and second customer
7060        '2017-02-04'    1115        1114    --switch which is first and second

10004       '2016-02-02'    1116        1117    --Case5-second customer later gets
10067       '2017-02-05'    1117        NULL    --separate account(s)
10167       '2018-02-05'    1117        NULL

50013       '2016-01-01'    2008        NULL    --Case5b -customer has account(s) & later
50014       '2017-02-02'    2008        2009    --gets account(s) with second customer &
50015       '2017-04-04'    2008        NULL    --later still first customer gets
100015      '2018-05-05'    2008        NULL    --separate account(s)

30005       '2015-02-01'    1118        NULL    --Case6-customer has account(s) 
10005       '2016-02-01'    1118        NULL
10054       '2017-02-02'    1118        1119    --gets account(s) with another
40055       '2017-03-03'    1118        1119
10101       '2017-04-04'    1119        NULL    --who later gets separate account(s)
10201       '2017-05-05'    1119        NULL
30301       '2017-06-06'    1119        NULL
10322       '2018-01-01'    1119        NULL

10007       '2016-02-01'    1122        1123    --Case7-customers play musical chairs
10057       '2017-02-03'    1123        1124
10107       '2017-06-02'    1124        1125

50001       '2016-01-01'    2001        NULL    --Case8a-customers with account(s)
50002       '2017-02-02'    2001        2002    --together each later get separate
50003       '2017-03-03'    2001        NULL    --account(s)
50004       '2017-04-04'    2002        NULL

50005       '2016-01-01'    2003        NULL    --Case8b-customers with account(s)
50006       '2017-02-02'    2003        2004    --together each later get separate
50007       '2017-03-03'    2004        NULL    --account(s)
50008       '2017-04-04'    2003        NULL
50017       '2018-03-03'    2004        NULL
50018       '2018-04-04'    2003        NULL

50009       '2016-01-01'    2005        NULL    --Case9a-customer has account(s) & later
50010       '2017-02-02'    2005        2006    --gets account(s) with a second customer
50011       '2017-03-03'    2005        2007    --& later still gets account(s) with a
                                                --third customer

50109       '2016-01-01'    2015        NULL    --Case9b starts the same as Case9a, but
50110       '2017-02-02'    2015        2016    
50111       '2017-03-03'    2015        2017    
50112       '2017-04-04'    2015        NULL    --after all accounts with other customers
50122       '2017-05-05'    2015        NULL    --are complete, the original primary
                                                --customer begins opening individual
                                                --accounts again

Desired Results:

acctnumber  date            Cust1ID     Cust2ID 
10000       '2016-02-01'    1110        NULL    --Case0    
10089       '2017-12-08'    1111        NULL    --Case1
10058       '2017-02-03'    1120        1121    --Case2
10152       '2017-05-02'    1113        1112    --Case3
7060        '2017-02-04'    1115        1114    --Case4
10167       '2018-02-05'    1117        NULL    --Case5
100015      '2018-05-05'    2008        NULL    --Case5b
10322       '2018-01-01'    1119        NULL    --Case6
10107       '2017-06-02'    1124        1125    --Case7
50003       '2017-03-03'    2001        NULL    --Case8a result 1
50004       '2017-04-04'    2002        NULL    --Case8a result 2
50017       '2018-03-03'    2004        NULL    --Case8b result 1
50018       '2018-04-04'    2003        NULL    --Case8b result 2
50011       '2017-03-03'    2005        2007    --Case9a
50122       '2017-05-05'    2015        NULL    --Case9b

Alternatively, I would accept Case 7 outputting the two separate customer groups:

10007       '2016-02-01'    1122        1123    --Case7 result 1
10107       '2017-06-02'    1124        1125    --Case7 result 2

Because Cases 8a & 8b would represent the company acknowledging the customers are worthy of holding separate accounts, we would want to then consider their group as splitting, so it has separate sets of results.

Also, in most scenarios the customers have many accounts, and mix and matching the above cases overtime is common. For example, a single customer can have five accounts (Case 1), then later opens one or more accounts with another customer (Case 3) sometimes switching the primary account holder (Case 4) then afterwards the first customer begins opening individual accounts again (Case 5b).

I have attempted joining the table to a copy of itself whenever acctnumbers are unique and any of the Cust IDs match. However, this removes customers who have only had one account so I added a union of cust that have no matches on the custid or account number and groups by custid.

Unfortunately, the second piece does not only include custids from case 0 and there are some custids which are excluded all together that shouldn't be.

select
    max(date1) as date,
    cust1id1 as cust1id
from
(
select
    acctnumber as [acctnumber1],
    date as [date1],
    cust1id as [cust1id1],
    cust2id as [cust2id1]
from 
    acct
) t1
join
(
select
    acctnumber as [acctnumber2],
    date as [date2],
    cust1id as [cust1id2],
    cust2id as [cust2id2]
from 
    acct
) t2
on t1.date1 > t2.date2 and
(t1.cust1id1 = t2.cust1id2 or
t1.cust1id1 = t2.cust2id2 or
t1.cust2id1 = t2.cust2id2)
Group by
cust1id1
union
select
    max(date1) as date,
    cust1id1 as cust1id
from
(
select
    acctnumber as [acctnumber1],
    date as [date1],
    cust1id as [cust1id1],
    cust2id as [cust2id1]
from 
    acct
) t1
join
(
select
    acctnumber as [acctnumber2],
    date as [date2],
    cust1id as [cust1id2],
    cust2id as [cust2id2]
from 
    acct
) t2
on (t1.acctnumber1 != t2.acctnumber2 and
t1.cust1id1 != t2.cust1id2 and
t1.cust1id1 != t2.cust2id2 and
t1.cust2id1 != t2.cust2id2)
group by
cust1id1

Update

Thank you for all the great answers and comments so far. I have been trying out the queries and comparing results.

@VladimirBaranov has brought up a rare case that I had not previously considered in comments to other answers.

Similarly to case 7, it will be a bonus if Case8 is handled, but not expected.

Case 9 is important and the result for 9a and 9b should be handled.

Update 2

I noticed issues with my original set of 7 cases.

In more recent accounts, when a customer is no longer on the account, it was always the second borrower that remained. This was entirely unintentional, you can look at any of those examples and either customer can potentially be the remaining customer on the most recent account.

Also, each case had the minimum number of accounts to display exactly what the case was testing, but this is not common. Usually in each step of each case there can be 5, 10, 15 or more accounts before a customer switches to adding on a second customer, and those two can then have many accounts together.

Reviewing the answers I see many have index, create, update and other clauses specific to being able to edit the database. Unfortunately, I am on the consumer side of this database so I have read only access, and the program I can use to interact with the database automatically rejects them.

标签： sql sql-server tsql greatest-n-per-group

9条回答

甜甜的少女心

2楼-- · 2019-04-03 13:19

Could you just use a left-join to join accounts with other "linked" accounts with potentially later dates, and then just filter out records where the "Later Account" table is not null? Something like this:

select ThisAccount.* 
from Accounts ThisAccount
left join Accounts LaterAccount on
    LaterAccount.AcctNumber <> ThisAccount.AcctNumber
    and LaterAccount.dt > ThisAccount.dt
    and
    (   LaterAccount.Cust1ID = ThisAccount.Cust1ID
        or LaterAccount.Cust2ID = ThisAccount.Cust1ID
        or LaterAccount.Cust1ID = ThisAccount.Cust2ID
        or LaterAccount.Cust2ID = ThisAccount.Cust2ID
    )
where LaterAccount.AcctNumber is null
order by ThisAccount.AcctNumber

This should return the results as expected:

AcctNo  Dt          Cust1   Cust2
7060    2017-02-04  1115    1114
10000   2016-02-01  1110    NULL
10050   2017-02-01  1111    NULL
10052   2017-02-02  1113    1112
10058   2017-02-03  1120    1121
10067   2017-02-05  1117    NULL
10101   2017-06-02  1119    NULL
10107   2017-06-02  1124    1125
50003   2017-03-03  2001    NULL
50004   2017-04-04  2002    NULL
50007   2017-03-03  2004    NULL
50008   2017-04-04  2003    NULL

0人赞添加讨论(0) 举报

爷的心禁止访问

3楼-- · 2019-04-03 13:28

This is quite complex...

First you want to identify groups of customers. That is all customers who were directly or indirectly related. With customer pairs A/B, B/C, D/E, D/F, G/A, H/A, H/F you'd have just one single group for instance. In SQL this requires a recursive query.

SQL Server lacks a cycle detection in recursive queries. So from customers A/B you'd get to all pairs containing A or B, which is B/C, A/B G/A, H/A, and A/B itself for that matter. Even, if we detect this direct circle (same pair), we'd go on with B/C looking for all records that contain B or C. And one of these is A/B again and once more we are in a cycle. One way to deal with this is to build a string of yet visited customers and not visit them again.

Our result is all cutomers with all directly or indirectly connected other customers. Using aggregation, we can take the minimum partner per customer and use this as a group key. In above example all customers are related to A, so A is all their minimum partner, showing that all belong to the same group. If we add two records X/Y and Z/-, then we have two more groups: X and Y belonging to the X group, and Z belonging to the Z group.

These groups we use to look up our original records again. With ROW_NUMBER we number each group's last record with #1. Then we keep only those and we are done.

with all_cust(custid) as
(
  select cust1id from mytable
  union
  select cust2id from mytable where cust2id is not null
)
, cte(c1, c2, sofar) as
(
  select custid, custid, '<' + cast(custid as varchar(max)) + '>' from all_cust
  union all
  select cte.c1, case when cte.c2 = m.cust1id then m.cust2id else m.cust1id end,
     cte.sofar + '<' + cast(case when cte.c2 = m.cust1id then m.cust2id else m.cust1id end as varchar(max)) + '>'
  from mytable m
  join cte on cte.c2 in (m.cust1id, m.cust2id)
  and cte.sofar not like '%' + cast(case when cte.c2 = m.cust1id then m.cust2id else m.cust1id end as varchar(max)) + '%'
)
, groups(custid, grp) as
(
  select c1, min(c2) from cte group by c1
)
, ranked as
(
  select *, row_number() over (partition by g.grp order by date desc) as rn 
  from groups g
  join mytable m on g.custid in (m.cust1id, m.cust2id)
)
select acctnumber, date, cust1id, cust2id
from ranked
where rn = 1
order by cust1id;

Rextester demo: http://rextester.com/RWCQ83881

0人赞添加讨论(0) 举报

贪生不怕死

4楼-- · 2019-04-03 13:30

I'm leaving my original answer in place, because the approach might work for someone else searching for this down the line.

I can't figure out how to do this without a cursor. As such, any other answer that provides the right answer (that doesn't use a cursor) is going to outperform this one. I'm not smart enough to figure out what that looks like, but it would have to include a nasty recursive CTE.

The real trick is getting all accounts that were ever related to each other grouped together. That is done in the big cursored if/then/else chain at the top, which could be cleaned up a bit. I've left my debug print statements in place, they can obviously be removed.

You could also make the Associations table permanent, instead of using a table variable.

Again, performance-wise, this is going to be really, really bad, but it does work. I'm looking forward to seeing what others come up with. Thanks for the high-quality question, too, that made life a lot easier.

The code:

declare @Associations table (
  GroupID int,
  CustID int
);

declare @NextGroupID int = 0;
declare @FoundGroup1ID int;
declare @FoundGroup2ID int;
declare @Cust1 int;
declare @Cust2 int;

declare db_cursor cursor for
select Cust1ID, Cust2ID from @ACCT;

open db_cursor;
fetch next from db_cursor into @Cust1, @Cust2;

while @@fetch_status = 0
begin
  set @FoundGroup1ID = null;
  set @FoundGroup2ID = null;
  print '----------------------------'
  print 'Cust1 = ' + isnull(cast(@Cust1 as varchar(max)), 'NULL')
  print 'Cust2 = ' + isnull(cast(@Cust2 as varchar(max)), 'NULL')

  select @FoundGroup1ID = GroupID from @Associations where CustID = @Cust1
  print 'FoundGroup1ID = ' + isnull(cast(@FoundGroup1ID as varchar(max)), 'NULL')

  if @Cust2 is null
  begin
    if @FoundGroup1ID is null 
    begin
      set @NextGroupID = @NextGroupID +1
      print 'Adding Cust1 to new group ' + cast(@NextGroupID as varchar(max))
      insert into @Associations (GroupID, CustID) values (@NextGroupID, @Cust1)
    end
  end 
  else -- @Cust2 is not null
  begin

    print 'FoundGroup2ID = ' + isnull(cast(@FoundGroup2ID as varchar(max)), 'NULL')
    select @FoundGroup2ID = GroupID from @Associations where CustID = @Cust2

    if @FoundGroup1ID is null and @FoundGroup2ID is null
    begin
      set @NextGroupID = @NextGroupID +1
      print 'Adding both to new group ' + cast(@NextGroupID as varchar(max))
      insert into @Associations (GroupID, CustID) values (@NextGroupID, @Cust1)
      insert into @Associations (GroupID, CustID) values (@NextGroupID, @Cust2)
    end 
    else if @FoundGroup1ID is not null and @FoundGroup2ID is null
    begin
      print 'Adding Cust2 to existing group ' + cast(@FoundGroup1ID as varchar(max))
      insert into @Associations (GroupID, CustID) values (@FoundGroup1ID, @Cust2)
    end
    else if @FoundGroup1ID is null and @FoundGroup2ID is not null
    begin
      print 'Adding Cust1 to existing group ' + cast(@FoundGroup2ID as varchar(max))
      insert into @Associations (GroupID, CustID) values (@FoundGroup2ID, @Cust1)
    end
    else -- Neither is null
    begin
      print 'Switching all of GroupID ' + cast(@FoundGroup2ID as varchar(max)) + ' to GroupID ' + cast(@FoundGroup1ID as varchar(max))
      update @Associations set GroupID = @FoundGroup1ID where GroupID = @FoundGroup2ID
    end
  end
  fetch next from db_cursor into @Cust1, @Cust2;
end
close db_cursor;
deallocate db_cursor;

;with

AddedGroupID as (
select
  ACCT.acctnumber,
  ACCT.[date],
  ACCT.Cust1ID,
  ACCT.Cust2ID,
  Associations.GroupID,
  row_number() over (partition by Associations.GroupID order by ACCT.[date] desc) as RowID
from
  @ACCT ACCT
  inner join @Associations Associations on
    Associations.CustID in (ACCT.Cust1ID, ACCT.Cust2ID)
)

select 
  acctnumber, [date], Cust1ID, Cust2ID
from 
  AddedGroupID
where
  RowID = 1

The results:

acctnumber  date    Cust1ID Cust2ID
10000   2016-02-01  1110    NULL
10050   2017-02-01  1111    NULL
10058   2017-02-03  1120    1121
10052   2017-02-02  1113    1112
7060    2017-02-04  1115    1114
10067   2017-02-05  1117    NULL
10101   2017-06-02  1119    NULL
10107   2017-06-02  1124    1125

0人赞添加讨论(0) 举报

戒情不戒烟

5楼-- · 2019-04-03 13:32

To apply logic to each subset a good operator to use is the CROSS APPLY operator. This allows us to find the most recent account for each Customer Id.

Setup

DECLARE @Stage TABLE
(
    AcctNumber INT
    ,[Date] DATETIME
    ,Cust1Id INT
   ,Cust2Id INT
)

INSERT INTO @Stage (AcctNumber, [Date] ,Cust1Id ,Cust2Id)
VALUES
(10000,'2.1.16',1110,NULL)
,(10001,'2.1.16',1111,NULL)
,(10050,'2.1.17',1111,NULL)
,(10008,'2.1.16',1120,NULL)
,(10058,'2.3.17',1120,1121)
,(10002,'2.1.16',1112,NULL)
,(10052,'2.2.17',1113,1112)
,(10003,'2.2.16',1114,1115)
,(7060,'2.4.17',1115,1114)
,(10004,'2.2.16',1116,1117)
,(10067,'2.5.17',1117,NULL)
,(10005,'2.1.16',1118,NULL)
,(10054,'2.3.17',1118,1119)
,(10101,'6.2.17',1119,NULL)
,(10007,'2.1.16',1122,1123)
,(10057,'2.3.17',1123,1124)
,(10107,'6.2.17',1124,1125)

--Additional Cases to cover
,(50001, '2016-01-01', 2001, NULL)
,(50002, '2017-02-02', 2001, 2002)
,(50003, '2017-03-03', 2001, NULL)
,(50004, '2017-04-04', 2002, NULL)

,(50005, '2016-01-01', 2003, NULL)
,(50006, '2017-02-02', 2003, 2004)
,(50007, '2017-03-03', 2004, NULL)
,(50008, '2017-04-04', 2003, NULL)

Execution

Cross Apply

;WITH Results AS(
    SELECT DISTINCT S2.*
    FROM @Stage S1
    CROSS APPLY (
        SELECT TOP 1 S2.*
        FROM @Stage S2
        WHERE 
            (S1.Cust1Id = S2.Cust1Id
            OR S1.Cust1Id = S2.Cust2Id
            OR S1.Cust2Id = S2.Cust1Id
            OR S1.Cust2Id = S2.Cust2Id)
        ORDER BY S2.[Date] DESC
            ) S2
)
SELECT R1.*
FROM Results R1
    LEFT JOIN Results R2
        ON R1.Cust2Id = R2.Cust1Id
WHERE R1.[Date] > R2.[Date]
    OR R2.AcctNumber IS NULL

The CROSS APPLY operators walk the cases backward to apply the logic to each joint account case while ensuring the most recent account is carried over. This alone covers most of the cases. The only lingering cases are the ones with 3 accounts being shifted between 3 customers. The self join and WHERE clause in the final select cover these.

Results

+------------+------------+---------+---------+
| AcctNumber | Date       | Cust1Id | Cust2Id |
| 7060       | 2017-02-04 | 1115    | 1114    |
| 10000      | 2016-02-01 | 1110    | NULL    |
| 10050      | 2017-02-01 | 1111    | NULL    |
| 10052      | 2017-02-02 | 1113    | 1112    |
| 10058      | 2017-02-03 | 1120    | 1121    |
| 10067      | 2017-02-05 | 1117    | NULL    |
| 10101      | 2017-06-02 | 1119    | NULL    |
| 10107      | 2017-06-02 | 1124    | 1125    |
| 50003      | 2017-03-03 | 2001    | NULL    |
| 50004      | 2017-04-04 | 2002    | NULL    |
| 50007      | 2017-03-03 | 2004    | NULL    |
| 50008      | 2017-04-04 | 2003    | NULL    |
+------------+------------+---------+---------+

0人赞添加讨论(0) 举报

闹够了就滚

6楼-- · 2019-04-03 13:34

we should not worry about using EXISTS as it operate fast in such case and i suppose is simplest possible solution:

SELECT
    A.ACCTNUMBER, A.DT as "date", A.CUST1ID, A.CUST2ID
FROM
    ACCT A
WHERE
        NOT EXISTS
        (SELECT
            *
        FROM
            ACCT A2
        WHERE
            (A2.CUST1ID = A.CUST1ID
            OR A2.CUST2ID = A.CUST1ID
            OR (A.CUST2ID IS NOT NULL AND A2.CUST1ID = A.CUST2ID)
            OR (A.CUST2ID IS NOT NULL AND A2.CUST2ID = A.CUST2ID)
            )
            AND A2.DT>A.DT
        )

i have assumed that you have separate indexes on CUST1ID and another on CUST2ID. You can compare result without ascending index on DT ("date") field and with it. It can speed up your query or slow down - i do not know how your real data looks like

0人赞添加讨论(0) 举报

Viruses.

7楼-- · 2019-04-03 13:36

I'm sure there is a much easier approach, but this is what I've had in mind :

SELECT 
    a.acctnumber, 
    a.date, 
    a.Cust1ID, 
    a.Cust2ID 
FROM acct a
OUTER APPLY (
SELECT acctnumber
FROM (
SELECT *, 
    ROW_NUMBER() OVER(PARTITION BY acctnumber ORDER BY [date] DESC) AS ACC_RN,
    ROW_NUMBER() OVER(PARTITION BY CustomerID ORDER BY [date] DESC) AS RN
FROM (
SELECT 
     a1.acctnumber,
     a1.[date],
     a1.Cust1ID AS CustomerID
FROM acct a1
UNION 
SELECT 
     a2.acctnumber,
     a2.[date],
     a2.Cust2ID
FROM acct a2
) D
) C
WHERE 
    RN = 1
AND CustomerID IS NOT NULL
AND ACC_RN = 2
) acc
WHERE a.acctnumber IN(acc.acctnumber)

0人赞添加讨论(0) 举报

1 2 下一页