I have a table that looks like this:
+-----+-------+------------+
| id  | value | date       |
+-----+-------+------------+
| id1 |  1499 | 2012-05-10 |
| id1 |  1509 | 2012-05-11 |
| id1 |  1511 | 2012-05-12 |
| id1 |  1515 | 2012-05-13 |
| id1 |  1522 | 2012-05-14 |
| id1 |  1525 | 2012-05-15 |
| id2 |  2222 | 2012-05-10 |
| id2 |  2223 | 2012-05-11 |
| id2 |  2238 | 2012-05-13 |
| id2 |  2330 | 2012-05-14 |
| id2 |  2340 | 2012-05-15 |
| id3 |  1001 | 2012-05-10 |
| id3 |  1020 | 2012-05-11 |
| id3 |  1089 | 2012-05-12 |
| id3 |  1107 | 2012-05-13 |
| id3 |  1234 | 2012-05-14 |
| id3 |  1556 | 2012-05-15 |
| ... |   ... | ...        |
+-----+-------+------------+
What I want is to produce, per date, the total sum of the value column
over all the data in this table. There is one entry for each id per day,
but some ids don't have a value for every day; e.g. id2 has no value for
the date 2012-05-12. When there is no value for a specific id on a given
date, I want the value from the closest previous date for that id to be
counted in the sum instead.
For example, suppose we have only the data shown above. We can take the
sum of all values for a specific date with this query:
SELECT SUM(value) FROM mytable WHERE date='2012-05-12';
The result will be: 1511 + 1089 = 2600.
But what I want is a query that does this calculation:
1511 + 2223 + 1089 = 4823
so that the 2223 of id2 from 2012-05-11 is added in place of the missing value:
| id2 | 2223 | 2012-05-11 |
Do you know how I can do this with an SQL query, or with a script (e.g. in Python)?
I have thousands of ids per date, so I'd like the query to be reasonably fast if possible.
It's not pretty, as it has to join four copies of your table to itself, which could hit all sorts of performance pain (I strongly advise you to have indexes on id and date)... but this will do the trick:
SELECT y.report_date, SUM(x.value)
FROM mytable AS x
NATURAL JOIN (
    SELECT a.id, b.date AS report_date, MAX(c.date) AS date
    FROM (SELECT DISTINCT id FROM mytable) AS a
    JOIN (SELECT DISTINCT date FROM mytable) AS b
    JOIN mytable AS c ON (c.id = a.id AND c.date <= b.date)
    GROUP BY a.id, b.date
) AS y
GROUP BY y.report_date
See it on sqlfiddle.
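To sanity-check what that query computes, here is a small self-contained Python sketch (using a hypothetical in-memory subset of the sample rows from the question, no database needed). For each id it takes the value at the latest date on or before the report date, then sums those values:

```python
# Sanity check of the "carry the last known value forward" sum.
rows = [  # (id, value, date) -- subset of the sample data around 2012-05-12
    ("id1", 1511, "2012-05-12"),
    ("id2", 2223, "2012-05-11"),  # id2 has no row on 2012-05-12
    ("id3", 1089, "2012-05-12"),
]

def filled_sum(rows, report_date):
    # For each id, keep the value at the latest date <= report_date.
    latest = {}
    for id_, value, date in rows:
        if date <= report_date:  # ISO date strings compare chronologically
            if id_ not in latest or date > latest[id_][0]:
                latest[id_] = (date, value)
    return sum(v for _, v in latest.values())

print(filled_sum(rows, "2012-05-12"))  # 1511 + 2223 + 1089 = 4823
```

This mirrors the `MAX(c.date) ... WHERE c.date <= b.date` step of the SQL above in plain Python.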
The SQL solution that I can think of for this is not very pretty (a sub-select inside a CASE statement on the value column, with a right join to a date-sequence table... it's pretty ugly), so I'll go with the Python version:
import datetime

import pyodbc

# connect to localhost
conn = pyodbc.connect('Driver={MySQL ODBC 5.1 Driver};Server=127.0.0.1;Port=3306;Database=information_schema;User=root;Password=root;Option=3;')
cursor = conn.cursor()

sums = {}  # { id : { 'dates': [], 'values': [], 'sum': 0 } }  -- 'sum' is optional; you can always sum() the values list

query = """SELECT id, value, date
           FROM mytable
           ORDER BY date ASC, id ASC;"""

## note that I use fetchall() here because in my experience the memory
## required to hold the result set is available. If this is not the case
## for you, see below for a row-by-row streaming loop.
for row in cursor.execute(query).fetchall():
    entry = sums.get(row.id, {'dates': [], 'values': [], 'sum': 0})
    if len(entry['dates']) > 0:  # previous records exist for this id
        # number of days since the last recorded date for this id
        days = (row.date - entry['dates'][-1]).days
        ## if days <= 1 there is no gap, and the loop body never runs
        for d in range(1, days):
            entry['dates'].append(entry['dates'][-1] + datetime.timedelta(days=1))  # fill the gap in 1-day increments from the last date point
            entry['values'].append(entry['values'][-1])  # repeat the value of the last date point
            entry['sum'] += entry['values'][-1]  # add to sum
        ## finally add the actual data point
        entry['dates'].append(row.date)
        entry['values'].append(row.value)
        entry['sum'] += row.value
    else:  # this is the first record for this id
        sums[row.id] = {'dates': [row.date], 'values': [row.value], 'sum': row.value}
Alternative row-by-row streaming loop:
cursor.execute(query)
while True:
    row = cursor.fetchone()
    if not row:
        break
    entry = sums.get(row.id, {'dates': [], 'values': [], 'sum': 0})
    if len(entry['dates']) > 0:  # previous records exist for this id
        # number of days since the last recorded date for this id
        days = (row.date - entry['dates'][-1]).days
        ## if days <= 1 there is no gap, and the loop body never runs
        for d in range(1, days):
            entry['dates'].append(entry['dates'][-1] + datetime.timedelta(days=1))  # fill the gap in 1-day increments from the last date point
            entry['values'].append(entry['values'][-1])  # repeat the value of the last date point
            entry['sum'] += entry['values'][-1]  # add to sum
        ## finally add the actual data point
        entry['dates'].append(row.date)
        entry['values'].append(row.value)
        entry['sum'] += row.value
    else:  # this is the first record for this id
        sums[row.id] = {'dates': [row.date], 'values': [row.value], 'sum': row.value}
Don't forget to close the connection when you're done!
conn.close()
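Note that the loop builds per-id forward-filled series in `sums`, but the per-date totals the question asks for still need one more pass. A minimal sketch, assuming `sums` has the `{ id: {'dates': [...], 'values': [...]} }` shape used above (ISO date strings stand in here for the date objects the real code stores):

```python
from collections import defaultdict

# Hypothetical forward-filled sums dict, as built by the loop above.
sums = {
    "id1": {"dates": ["2012-05-11", "2012-05-12"], "values": [1509, 1511]},
    "id2": {"dates": ["2012-05-11", "2012-05-12"], "values": [2223, 2223]},  # gap filled
}

totals = defaultdict(int)
for series in sums.values():
    # every date appears at most once per id, so a plain += is enough
    for date, value in zip(series["dates"], series["values"]):
        totals[date] += value

print(dict(totals))  # {'2012-05-11': 3732, '2012-05-12': 3734}
```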
You might want to think about the semantics of your date column a bit more. Perhaps you should add a column and make your date a range, instead.
Anything you do that does not involve data from the record is likely to be slow. A literal interpretation of your request would potentially require a date traversal for each value to sum.
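One way to read that range suggestion: treat each stored value as valid from its date until the id's next entry, so a lookup is a binary search over sorted start dates rather than a day-by-day walk backwards. A hypothetical sketch with the standard `bisect` module (the `history` data and `value_on` helper are illustrative, not from the question's schema):

```python
import bisect

# Hypothetical per-id history, sorted by date: each value is valid from
# its date until the next entry's date.
history = {
    "id2": (["2012-05-10", "2012-05-11", "2012-05-13"], [2222, 2223, 2238]),
}

def value_on(id_, date):
    dates, values = history[id_]
    i = bisect.bisect_right(dates, date) - 1  # latest start date <= date
    return values[i] if i >= 0 else None  # None: no record yet on that date

print(value_on("id2", "2012-05-12"))  # 2223: the 2012-05-11 value still applies
```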