I have data on spells (hospital stays), each with a start and end date, but I want to count the number of days spent in hospital for calendar months. Of course, this number can be zero for months not appearing in a spell. But I cannot just attribute the length of each spell to the starting month, as longer spells run over to the following month (or more).
Basically, it would suffice for me if I could cut spells at turn-of-month datetimes, getting from the data in the first example to the data in the second:
id start end
1 2011-01-01 10:00:00 2011-01-08 16:03:00
2 2011-01-28 03:45:00 2011-02-04 15:22:00
3 2011-03-02 11:04:00 2011-03-05 05:24:00
id start end month stay
1 2011-01-01 10:00:00 2011-01-08 16:03:00 2011-01 7
2 2011-01-28 03:45:00 2011-01-31 23:59:59 2011-01 4
2 2011-02-01 00:00:00 2011-02-04 15:22:00 2011-02 4
3 2011-03-02 11:04:00 2011-03-05 05:24:00 2011-03 3
I read up on the Time Series / Date functionality of pandas, but I do not see a straightforward solution to this. How can one accomplish the slicing?
It's simpler than you think: just subtract the dates. The result is a time span. See Add column with number of days between dates in DataFrame pandas
You even get to do this for the entire frame at once: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.subtract.html
Update, now that I understand the problem better. Add a new column: take the spell's end date; if the start date is in a different month, then set this new date's day to 01 and the time to 00:00.
This is the cut DateTime you can use to compute the portion of the stay attributable to each month. cut - start is the first month; end - cut is the second.