I have timestamped sensor data. Because of technical details, I get data from the sensors at approximately one minute intervals. The data may look like this:
tstamp val
0 2016-09-01 00:00:00 57
1 2016-09-01 00:01:00 57
2 2016-09-01 00:02:23 57
3 2016-09-01 00:03:04 57
4 2016-09-01 00:03:58 58
5 2016-09-01 00:05:00 60
Now, essentially, I would be extremely happy if I got all data at the exact minute, but I don't. The only way to conserve the distribution and have data at each minute is to interpolate. For example, between row indexes 1 and 2 there are 83 seconds, and the natural way to get a value at the exact minute is to interpolate between the two rows of data (in this case, it is 57, but that is not the case everywhere).
Right now, my approach is to do the following:
date = pd.to_datetime(df['measurement_tstamp'].iloc[0].date())
ts_d = df['measurement_tstamp'].dt.hour * 60 * 60 +\
df['measurement_tstamp'].dt.minute * 60 +\
df['measurement_tstamp'].dt.second
ts_r = np.arange(0, 24*60*60, 60)
data = scipy.interpolate.interp1d(x=ts_d, y=df['speed'].values)(ts_r)
req = pd.Series(data, index=pd.to_timedelta(ts_r, unit='s'))
req.index = date + req.index
But this feels rather drawn out and long to me. There are excellent pandas methods that do resampling, rounding, etc. I have been reading them all day, but it turns out that nothing does interpolation just the way I want it. resample
works like a groupby
and averages time points that fall together. fillna
does interpolation, but not after resample
has already altered the data by averaging.
Am I missing something, or is my approach the best there is?
For simplicity, assume that I group the data by day, and by sensor, so only a 24 hour period from one sensor is interpolated at a time.