Pandas timeseries resampling and interpolating tog

2019-02-15 17:33发布

问题:

I have timestamped sensor data. Because of technical details, I get data from the sensors at approximately one minute intervals. The data may look like this:

   tstamp               val
0  2016-09-01 00:00:00  57
1  2016-09-01 00:01:00  57
2  2016-09-01 00:02:23  57
3  2016-09-01 00:03:04  57
4  2016-09-01 00:03:58  58
5  2016-09-01 00:05:00  60

Now, essentially, I would be extremely happy if I got all data at the exact minute, but I don't. The only way to conserve the distribution and have data at each minute is to interpolate. For example, between row indexes 1 and 2 there are 83 seconds, and the natural way to get a value at the exact minute is to interpolate between the two rows of data (in this case, it is 57, but that is not the case everywhere).

Right now, my approach is to do the following:

date = pd.to_datetime(df['measurement_tstamp'].iloc[0].date())
ts_d = df['measurement_tstamp'].dt.hour * 60 * 60 +\
       df['measurement_tstamp'].dt.minute * 60 +\
       df['measurement_tstamp'].dt.second
ts_r = np.arange(0, 24*60*60, 60)
data = scipy.interpolate.interp1d(x=ts_d, y=df['speed'].values)(ts_r)
req = pd.Series(data, index=pd.to_timedelta(ts_r, unit='s'))
req.index = date + req.index

But this feels rather drawn out and long to me. There are excellent pandas methods that do resampling, rounding, etc. I have been reading them all day, but it turns out that nothing does interpolation just the way I want it. resample works like a groupby and averages time points that fall together. fillna does interpolation, but not after resample has already altered the data by averaging.

Am I missing something, or is my approach the best there is?


For simplicity, assume that I group the data by day, and by sensor, so only a 24 hour period from one sensor is interpolated at a time.

回答1:

d = df.set_index('tstamp')
t = d.index
r = pd.date_range(t.min().date(), periods=24*60, freq='T')

d.reindex(t.union(r)).interpolate('index').ix[r]


Note, periods=24*60 works on daily data, not on the sample provided in the question. For that sample, periods=6 will work.