In this example we have two days of data sampled at a resolution of 1min, giving us 2880 measurements. The measurements are collected across multiple timezones sequentially: the first 240 minutes in Europe/London and the remaining 2640 measurements in 'America/Los_Angeles'.
import pandas as pd
import numpy as np
df=pd.DataFrame(index=pd.DatetimeIndex(pd.date_range('2015-03-29 00:00','2015-03-30 23:59',freq='1min',tz='UTC')))
df.loc['2015-03-29 00:00':'2015-03-29 04:00','timezone']='Europe/London'
df.loc['2015-03-29 04:00':'2015-03-30 23:59','timezone']='America/Los_Angeles'
df['sales1']=np.random.random_integers(100,size=len(df))
df['sales2']=np.random.random_integers(10,size=len(df))
To calculate mean sales per minute (as per UTC time) in a 24-hour cycle across multiple days the following approach works nicely:
utc_sales=df.groupby([df.index.hour,df.index.minute]).mean()
utc_sales.set_index(pd.date_range("00:00","23:59", freq="1min").time,inplace=True)
This groupby approach can also be applied to calculate mean sales based on one of the other two timezones, say 'Europe/London'.
df['London']=df.index.tz_convert('Europe/London')
london_sales=df.groupby([df['London'].dt.hour,df['London'].dt.minute]).mean()
london_sales.set_index(pd.date_range("00:00","23:59", freq="1min").time,inplace=True)
However I'm struggling to come up with an efficient way of calculating mean sales per minute -as per localtime- in a 24-hour cycle. I attempted the same approach from above, however when multiple timezones are present in the same series, groupby reverts back to the index which is in utc.
def calculate_localtime(x):
return pd.to_datetime(x.name,unit='s').tz_convert(x['timezone'])
df['localtime']=df.apply(calculate_localtime,axis=1)
local_sales=df.groupby([df['localtime'].dt.hour,df['localtime'].dt.minute]).mean()
local_sales.set_index(pd.date_range("00:00","23:59",freq="1min").time,inplace=True)
We can verify that local_sales are identical to utc_sales, and therefore this approach does not work.
In [8]: np.unique(local_sales == utc_sales)
Out[8]: array([ True], dtype=bool)
Could anyone recommend an approach suitable for large datatsets and multiple timezones?
Here is an approach to get what I think you want. This requires pandas 0.17.0
Create the data as you have aboe
Pivot according to the timezone; this creates a multi-index with the timezone separated
Create the groupers that we want to use, namely hours and minutes in the local zone. We are going to populate them according to the mask, IOW. where both sales1/sales2 are notnull, we will use the hours/minutes for that (local) zone
After the above. (Note this could be a bit simplified, meaning that we don't need to actually record the local timezone, just compute hours/minutes).
This uses the new representation for timezones (in 0.17.0).
Results