I am using Python Pandas for the first time. I have 5-min lag traffic data in csv format:
...
2015-01-04 08:29:05,271238
2015-01-04 08:34:05,329285
2015-01-04 08:39:05,-1
2015-01-04 08:44:05,260260
2015-01-04 08:49:05,263711
...
There are several issues:
- for some timestamps there's missing data (-1)
- missing entries (also 2/3 consecutive hours)
- the frequency of the observations is not exactly 5 minutes, but actually loses some seconds once in a while
I would like to obtain a regular time series, so with entries every (exactly) 5 minutes (and no missing valus). I have successfully interpolated the time series with the following code to approximate the -1 values with this code:
ts = pd.TimeSeries(values, index=timestamps)
ts.interpolate(method='cubic', downcast='infer')
How can I both interpolate and regularize the frequency of the observations? Thank you all for the help.
Change the
-1
s to NaNs:Then resample the data to have a 5 minute frequency.
Note that, by default, if two measurements fall within the same 5 minute period,
resample
averages the values together.Finally, you could linearly interpolate the time series according to the time:
Since it looks like your data already has roughly a 5-minute frequency, you might need to resample at a shorter frequency so cubic or spline interpolation can smooth out the curve: