SQL: Count of rows since certain value first occur

2019-08-04 02:25发布

问题:

This is a similar scenario to SQL: Count of rows since certain value first occurred

In SQL Server, I'm trying to calculate the count of days since the same weather as today (let's assume today is 6th August 2018) was observed first in the past 5 days. Per town.

Here's the data:

+---------+---------+--------+--------+--------+
| Date    | Toronto | Cairo  | Zagreb | Ankara |
+---------+---------+--------+--------+--------+
| 1.08.18 | Rain    | Sun    | Clouds | Sun    |
| 2.08.18 | Sun     | Sun    | Clouds | Sun    |
| 3.08.18 | Rain    | Sun    | Clouds | Rain   |
| 4.08.18 | Clouds  | Sun    | Clouds | Clouds |
| 5.08.18 | Rain    | Clouds | Rain   | Rain   |
| 6.08.18 | Rain    | Sun    | Sun    | Sun    |
+---------+---------+--------+--------+--------+

This needs to perform well but all I came up with so far is single queries for each town (and there are going to be dozens of towns, not just the four). This works but is not going to scale.

Here's the one for Toronto...

SELECT 
    DATEDIFF(DAY, MIN([Date]), GETDATE()) + 1 
FROM
    (SELECT TOP 5 * 
     FROM Weather 
     WHERE [Date] <= GETDATE()
     ORDER BY [Date] DESC) a
WHERE 
    Toronto = (SELECT TOP 1 Toronto 
               FROM Weather
               WHERE DataDate = GETDATE())

...which correctly returns 4 since today there is rain and the first occurrence of rain within the past 5 days was 3rd August.

But what I want returned is a table like this:

+---------+-------+--------+--------+
| Toronto | Cairo | Zagreb | Ankara |
+---------+-------+--------+--------+
| 4       | 5     | 1      | 5      |
+---------+-------+--------+--------+

Slightly modified from the accepted answer by @Used_By_Already is this code:

CREATE TABLE mytable(
   Date    date  NOT NULL
  ,Toronto VARCHAR(9) NOT NULL
  ,Cairo   VARCHAR(9) NOT NULL
  ,Zagreb  VARCHAR(9) NOT NULL
  ,Ankara  VARCHAR(9) NOT NULL
);

INSERT INTO mytable(Date,Toronto,Cairo,Zagreb,Ankara) VALUES ('20180801','Rain','Sun','Clouds','Sun');
INSERT INTO mytable(Date,Toronto,Cairo,Zagreb,Ankara) VALUES ('20180802','Sun','Sun','Clouds','Sun');
INSERT INTO mytable(Date,Toronto,Cairo,Zagreb,Ankara) VALUES ('20180803','Rain','Sun','Clouds','Rain');
INSERT INTO mytable(Date,Toronto,Cairo,Zagreb,Ankara) VALUES ('20180804','Clouds','Sun','Clouds','Clouds');
INSERT INTO mytable(Date,Toronto,Cairo,Zagreb,Ankara) VALUES ('20180805','Rain','Clouds','Rain','Rain');
INSERT INTO mytable(Date,Toronto,Cairo,Zagreb,Ankara) VALUES ('20180806','Rain','Sun','Sun','Sun');

with cte as (
        select
              date, city, weather
        FROM (
              SELECT * from mytable
             ) AS cp
        UNPIVOT (
                  Weather FOR City IN (Toronto, Cairo, Zagreb, Ankara)
            ) AS up
    )

select 
        date, city, weather, datediff(day,ca.prior,cte.date)+1 as daysPresent
from cte
cross apply (
    select min(prev.date) as prior
    from cte as prev 
    where prev.city = cte.city
    and prev.date between dateadd(day,-4,cte.date) and dateadd(day,0,cte.date)
    and prev.weather = cte.weather
    ) ca

order by city,date

Output:

However, what I'm trying now is to keep counting "daysPresent" up even after those five past days in question. Meaning that the last marked row in the output sample should show 6. The logic being to increase the previous number by the count of days between them if there is less than 5 days of a gap between them. If there has not been the same weather in the past 5 days, go back to 1.

I experimented with LEAD and LAG but cannot get it to work. Is it even the right way to add another layer to it or would the query need to look different entirely?

I'm a but puzzled.

回答1:

You have a major problem with your data structure. The values should be in rows, not columns. So, start with:

select d.dte, v.*from data d cross apply
     (values ('Toronto', Toronto), ('Cairo', Cairo), . . .
     ) v(city, val)
where d.date >= dateadd(day, -5, getdate());

From there, we can use the window function first_value() (or last_value()) to get the most recent reading. The rest is just aggregation by city:

with d as (
      select d.dte, v.*,
             first_value(v.val) over (partition by v.city order by d.dte desc) as last_val
      from data d cross apply
           (values ('Toronto', Toronto), ('Cairo', Cairo), . . .
           ) v(city, val)
      where d.date >= dateadd(day, -5, getdate())
     )
select city, datediff(day, min(dte), getdate()) + 1
from d
where val = last_val
group by city;

This gives you the information you want, in rows rather than columns. You can re-pivot if you really want. But I advise you to keep the data with city data in different rows.