SQL: Select Most Recent Sequentially Distinct Valu

2019-08-17 14:12发布

问题:

I am having trouble writing a query that would select the last "new" sequentially distinct value (let's call this column Col A) grouped based on another column (Col B). Since this is a bit ambiguous/confusing, here is an example to explain (assume row number is indicative of sequence inside groups; in my issue the rows are ordered by date):

|--------|-------|-------|
| RowNum | Col A | Col B |
|--------|-------|-------|
| 1      | A     | A     |
| 2      | B     | A     |
| 3      | C     | A     |
| 4      | B     | B     |
| 5      | A     | B     |
| 6      | B     | B     |

Would select:

| 3      | C     | A     |
| 6      | B     | B     |

Note that although B also appears in row 4, the fact that row 5 contains A means that the B in row 6 is sequentially distinct. But if table looked like this:

|--------|-------|-------|
| RowNum | Col A | Col B |
|--------|-------|-------|
| 1      | A     | A     |
| 2      | B     | A     |
| 3      | C     | A     |
| 4      | B     | B     |
| 5      | A     | B     |
| 6      | A     | B     | <--

Then we would want to select:

| 3      | C     | A     |
| 5      | A     | B     |

I think that this would be an easier problem if I wasn't concerned with values being distinct but not sequential. I'm not really sure how to even consider sequence when making a query.

I have attempted to solve this by calculating the min/max row numbers where each value of Col A appears. That calculation (using the second sample table) would produce a result like this:

|--------|--------|--------|--------|
| ColA   | ColB   | MinRow | MaxRow |
|--------|--------|--------|--------|
| A      | A      | 1      | 1      |
| B      | A      | 2      | 2      |
| C      | A      | 3      | 3      | 
| A      | B      | 5      | 6      |
| B      | B      | 4      | 4      | 

A solution raised in a related post (SQL: Select Row with Last New Sequentially Distinct Value) went on a similar path, essentially taking the most recent RowNum which differs from the last ColA and then picks the next row. However, in that question I failed to address the need for the query to work for multiple groups, hence the new post.

Any help with this problem, if it is at all possible to do in SQL, would be greatly appreciated. I am running SQL 2008 SP4.

回答1:

Hmmm . . . One method is to get the last value. Then choose all the last rows with that value and aggregate:

select min(rownum), colA, colB
from (select t.*,
             first_value(colA) over (partition by colB order by rownum desc) as last_colA
      from t
     ) t
where rownum > all (select t2.rownum
                    from t t2
                    where t2.colB = t.colB and t2.colA <> t.last_colA
                   )
group by colA, colB;

Or, without the aggregation:

select t.*
from (select t.*,
             first_value(colA) over (partition by colB order by rownum desc) as last_colA,
             lag(colA) over (partition by colB order by rownum) as prev_clA
      from t
     ) t
where rownum > all (select t2.rownum
                    from t t2
                    where t2.colB = t.colB and t2.colA <> t.last_colA
                   ) and
      (prev_colA is null or prev_colA <> colA);

But in SQL Server 2008, let's treat this as a gaps-and-islands problem:

select t.*
from (select t.*,
             min(rownum) over (partition by colB, colA, (seqnum_b - seqnum_ab) ) as min_rownum_group,
             max(rownum) over (partition by colB, colA, (seqnum_b - seqnum_ab) ) as max_rownum_group
      from (select t.*,
                   row_number() over (partition by colB order by rownum) as seqnum_b,
                   row_number() over (partition by colB, colA order by rownum) as seqnum_ab,
                   max(rownum) over (partition by colB order by rownum) as max_rownum
            from t
           ) t
     ) t
where rownum = min_rownum_group and  -- first row in the group defined by adjacent colA, colB
      max_rownum_group = max_rownum  -- last group for each colB;

This identifies each of the groups using a difference of row numbers. It calculates the maximum rownum for the group and overall in the data. These are the same for the last group.