I have a table that looks like this:
+-----+-------+------------+
| id  | value | date       |
+-----+-------+------------+
| id1 |  1499 | 2012-05-10 |
| id1 |  1509 | 2012-05-11 |
| id1 |  1511 | 2012-05-12 |
| id1 |  1515 | 2012-05-13 |
| id1 |  1522 | 2012-05-14 |
| id1 |  1525 | 2012-05-15 |
| id2 |  2222 | 2012-05-10 |
| id2 |  2223 | 2012-05-11 |
| id2 |  2238 | 2012-05-13 |
| id2 |  2330 | 2012-05-14 |
| id2 |  2340 | 2012-05-15 |
| id3 |  1001 | 2012-05-10 |
| id3 |  1020 | 2012-05-11 |
| id3 |  1089 | 2012-05-12 |
| id3 |  1107 | 2012-05-13 |
| id3 |  1234 | 2012-05-14 |
| id3 |  1556 | 2012-05-15 |
| ... |   ... | ...        |
+-----+-------+------------+
What I want is to produce, per date, the total sum of the value column
over all the data in this table. There is one entry for each id per day,
but some ids don't have a value for every day; e.g. id2 has no value for
the date 2012-05-12. When there is no value for a specific id on a given
date, I want the value from the closest previous date for that id to be
counted in the sum instead.
For example, suppose we have only the data shown above. We can take the
sum of all values for a specific date with this query:
SELECT SUM(value) FROM mytable WHERE date='2012-05-12';
The result will be: 1511 + 1089 = 2600.
But what I want is a query that does this calculation:
1511 + 2223 + 1089 = 4823
so that the 2223 of id2 from 2012-05-11 is added in place of the missing value:
| id2 | 2223 | 2012-05-11 |
Do you know how I can do this with an SQL query, or with a script (e.g. in Python)?
I have thousands of ids per date, so I'd like the query to be reasonably fast if possible.
It's not pretty, as it has to join four copies of your table to itself, which could hit all sorts of performance pain (I strongly advise you to have indexes on id and date)... but this will do the trick:
SELECT y.report_date, SUM(x.value)
FROM mytable AS x
NATURAL JOIN (
    SELECT a.id, b.date AS report_date, MAX(c.date) AS date
    FROM (SELECT DISTINCT id FROM mytable) AS a
    JOIN (SELECT DISTINCT date FROM mytable) AS b
    JOIN mytable AS c ON (c.id = a.id AND c.date <= b.date)
    GROUP BY a.id, b.date
) AS y
GROUP BY y.report_date
See it on sqlfiddle.
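To sanity-check what that query computes, here is a small self-contained Python sketch (using a hypothetical in-memory subset of the sample rows from the question, no database needed). For each id it takes the value at the latest date on or before the report date, then sums those values:

```python
# Sanity check of the "carry the last known value forward" sum.
rows = [  # (id, value, date) -- subset of the sample data around 2012-05-12
    ("id1", 1511, "2012-05-12"),
    ("id2", 2223, "2012-05-11"),  # id2 has no row on 2012-05-12
    ("id3", 1089, "2012-05-12"),
]

def filled_sum(rows, report_date):
    # For each id, keep the value at the latest date <= report_date.
    latest = {}
    for id_, value, date in rows:
        if date <= report_date:  # ISO date strings compare chronologically
            if id_ not in latest or date > latest[id_][0]:
                latest[id_] = (date, value)
    return sum(v for _, v in latest.values())

print(filled_sum(rows, "2012-05-12"))  # 1511 + 2223 + 1089 = 4823
```

This mirrors the `MAX(c.date) ... WHERE c.date <= b.date` step of the SQL above in plain Python.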
The SQL solution that I can think of for this is not very pretty (a sub-select inside a CASE statement on the value column, with a right join to a date-sequence table... it's pretty ugly), so I'll go with the Python version:
import datetime

import pyodbc

# connect to localhost
conn = pyodbc.connect('Driver={MySQL ODBC 5.1 Driver};Server=127.0.0.1;Port=3306;Database=information_schema;User=root;Password=root;Option=3;')
cursor = conn.cursor()

sums = {}  # { id : { 'dates': [], 'values': [], 'sum': 0 } }  -- 'sum' is optional; you can always sum() the values list

query = """SELECT id, value, date
           FROM mytable
           ORDER BY date ASC, id ASC;"""

## note that I use fetchall() here because in my experience the memory
## required to hold the result set is available. If this is not the case
## for you, see below for a row-by-row streaming loop.
for row in cursor.execute(query).fetchall():
    entry = sums.get(row.id, {'dates': [], 'values': [], 'sum': 0})
    if len(entry['dates']) > 0:  # previous records exist for this id
        # number of days since the last recorded date for this id
        days = (row.date - entry['dates'][-1]).days
        ## if days <= 1 there is no gap, and the loop body never runs
        for d in range(1, days):
            entry['dates'].append(entry['dates'][-1] + datetime.timedelta(days=1))  # fill the gap in 1-day increments from the last date point
            entry['values'].append(entry['values'][-1])  # repeat the value of the last date point
            entry['sum'] += entry['values'][-1]  # add to sum
        ## finally add the actual data point
        entry['dates'].append(row.date)
        entry['values'].append(row.value)
        entry['sum'] += row.value
    else:  # this is the first record for this id
        sums[row.id] = {'dates': [row.date], 'values': [row.value], 'sum': row.value}
Alternative row-by-row streaming loop:
cursor.execute(query)
while True:
    row = cursor.fetchone()
    if not row:
        break
    entry = sums.get(row.id, {'dates': [], 'values': [], 'sum': 0})
    if len(entry['dates']) > 0:  # previous records exist for this id
        # number of days since the last recorded date for this id
        days = (row.date - entry['dates'][-1]).days
        ## if days <= 1 there is no gap, and the loop body never runs
        for d in range(1, days):
            entry['dates'].append(entry['dates'][-1] + datetime.timedelta(days=1))  # fill the gap in 1-day increments from the last date point
            entry['values'].append(entry['values'][-1])  # repeat the value of the last date point
            entry['sum'] += entry['values'][-1]  # add to sum
        ## finally add the actual data point
        entry['dates'].append(row.date)
        entry['values'].append(row.value)
        entry['sum'] += row.value
    else:  # this is the first record for this id
        sums[row.id] = {'dates': [row.date], 'values': [row.value], 'sum': row.value}
Don't forget to close the connection when you're done!
conn.close()
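Note that the loop builds per-id forward-filled series in `sums`, but the per-date totals the question asks for still need one more pass. A minimal sketch, assuming `sums` has the `{ id: {'dates': [...], 'values': [...]} }` shape used above (ISO date strings stand in here for the date objects the real code stores):

```python
from collections import defaultdict

# Hypothetical forward-filled sums dict, as built by the loop above.
sums = {
    "id1": {"dates": ["2012-05-11", "2012-05-12"], "values": [1509, 1511]},
    "id2": {"dates": ["2012-05-11", "2012-05-12"], "values": [2223, 2223]},  # gap filled
}

totals = defaultdict(int)
for series in sums.values():
    # every date appears at most once per id, so a plain += is enough
    for date, value in zip(series["dates"], series["values"]):
        totals[date] += value

print(dict(totals))  # {'2012-05-11': 3732, '2012-05-12': 3734}
```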
You might want to think about the semantics of your date column a bit more. Perhaps you should add a column and make your date a range, instead.
Anything you do that does not involve data from the record is likely to be slow. A literal interpretation of your request would potentially require a date traversal for each value to sum.
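One way to read that range suggestion: treat each stored value as valid from its date until the id's next entry, so a lookup is a binary search over sorted start dates rather than a day-by-day walk backwards. A hypothetical sketch with the standard `bisect` module (the `history` data and `value_on` helper are illustrative, not from the question's schema):

```python
import bisect

# Hypothetical per-id history, sorted by date: each value is valid from
# its date until the next entry's date.
history = {
    "id2": (["2012-05-10", "2012-05-11", "2012-05-13"], [2222, 2223, 2238]),
}

def value_on(id_, date):
    dates, values = history[id_]
    i = bisect.bisect_right(dates, date) - 1  # latest start date <= date
    return values[i] if i >= 0 else None  # None: no record yet on that date

print(value_on("id2", "2012-05-12"))  # 2223: the 2012-05-11 value still applies
```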