I have a pandas DataFrame with time as the index (1-minute frequency) and several columns of data. Sometimes the data contains NaN. If so, I want to interpolate only if the gap is not longer than 5 minutes, which in this case means a maximum of 5 consecutive NaNs. The data may look like this (several test cases that show the problems):
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
start = datetime(2014, 2, 21, 14, 50)
data = pd.DataFrame(index=[start + timedelta(minutes=1*x) for x in range(0, 8)],
                    data={'a': [123.5, np.nan, 136.3, 164.3, 213.0, 164.3, 213.0, 221.1],
                          'b': [433.5, 523.2, 536.3, 464.3, 413.0, 164.3, 213.0, 221.1],
                          'c': [123.5, 132.3, 136.3, 164.3] + [np.nan]*4,
                          'd': [np.nan]*8,
                          'e': [np.nan]*7 + [2330.3],
                          'f': [np.nan]*4 + [2763.0, 2142.3, 2127.3, 2330.3],
                          'g': [2330.3] + [np.nan]*7,
                          'h': [2330.3] + [np.nan]*6 + [2777.7]})
It reads like this:
In [147]: data
Out[147]:
                         a      b      c    d       e       f       g       h
2014-02-21 14:50:00  123.5  433.5  123.5  NaN     NaN     NaN  2330.3  2330.3
2014-02-21 14:51:00    NaN  523.2  132.3  NaN     NaN     NaN     NaN     NaN
2014-02-21 14:52:00  136.3  536.3  136.3  NaN     NaN     NaN     NaN     NaN
2014-02-21 14:53:00  164.3  464.3  164.3  NaN     NaN     NaN     NaN     NaN
2014-02-21 14:54:00  213.0  413.0    NaN  NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00  164.3  164.3    NaN  NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00  213.0  213.0    NaN  NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00  221.1  221.1    NaN  NaN  2330.3  2330.3     NaN  2777.7
I am aware of data.interpolate(), but it has several flaws. It produces the result below, which is fine for columns a-e but fails for columns f-h, for different reasons:
                         a      b      c    d       e       f       g  \
2014-02-21 14:50:00  123.5  433.5  123.5  NaN     NaN     NaN  2330.3
2014-02-21 14:51:00  129.9  523.2  132.3  NaN     NaN     NaN  2330.3
2014-02-21 14:52:00  136.3  536.3  136.3  NaN     NaN     NaN  2330.3
2014-02-21 14:53:00  164.3  464.3  164.3  NaN     NaN     NaN  2330.3
2014-02-21 14:54:00  213.0  413.0  164.3  NaN     NaN  2763.0  2330.3
2014-02-21 14:55:00  164.3  164.3  164.3  NaN     NaN  2142.3  2330.3
2014-02-21 14:56:00  213.0  213.0  164.3  NaN     NaN  2127.3  2330.3
2014-02-21 14:57:00  221.1  221.1  164.3  NaN  2330.3  2330.3  2330.3
                               h
2014-02-21 14:50:00  2330.300000
2014-02-21 14:51:00  2394.214286
2014-02-21 14:52:00  2458.128571
2014-02-21 14:53:00  2522.042857
2014-02-21 14:54:00  2585.957143
2014-02-21 14:55:00  2649.871429
2014-02-21 14:56:00  2713.785714
2014-02-21 14:57:00  2777.700000
f) The gap consists of 4 minutes' worth of NaNs at the beginning. They should be replaced by the first valid value, 2763.0 (i.e. extrapolated backwards in time), but they are left as NaN.
g) The gap is longer than 5 minutes, but it still gets extrapolated.
h) The gap is longer than 5 minutes, but it still gets interpolated.
I understand the reasons: I never specified that gaps longer than 5 minutes should not be interpolated, and interpolate only extrapolates forward in time, whereas I want it to extrapolate backward in time as well. Are there any known methods I can use for my problem, without reinventing the wheel?
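For the backward-extrapolation part on its own, a minimal sketch follows. It assumes a pandas version in which interpolate accepts the limit_direction parameter (older releases do not have it), and the toy series `s` is made up for illustration. It does nothing about the 5-minute rule.

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical values): two leading NaNs, then a one-NaN gap.
s = pd.Series([np.nan, np.nan, 5.0, np.nan, 7.0])

# The default direction is forward, so the leading NaNs are left alone.
forward_only = s.interpolate()

# With limit_direction="both" the leading NaNs are filled as well,
# using the first valid value.
both_ways = s.interpolate(limit_direction="both")

print(forward_only.tolist())  # [nan, nan, 5.0, 6.0, 7.0]
print(both_ways.tolist())     # [5.0, 5.0, 5.0, 6.0, 7.0]
```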
Edit:
The method data.interpolate accepts the parameter limit, which defines the maximum number of consecutive NaNs to be filled by interpolation. But this still interpolates up to the limit, whereas in that case I want all the NaNs of the gap to be left as they are.
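To make the limit behaviour concrete, here is a small sketch on a toy series (the values and the name `s` are invented for illustration): a gap of 7 NaNs with limit=5 has its first 5 entries filled and the last 2 left as NaN, rather than the whole gap being skipped.

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical values): one gap of 7 NaNs between two readings.
s = pd.Series([1.0] + [np.nan] * 7 + [9.0])

# limit=5 caps how many consecutive NaNs get filled, but it does not
# skip the gap: the first 5 NaNs are interpolated, the last 2 remain.
partly_filled = s.interpolate(limit=5)

print(partly_filled.tolist())
# [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, nan, nan, 9.0]
```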
So here is a mask that ought to solve the problem. Just interpolate and then apply the mask to reset the appropriate values to NaN. Honestly, this was a bit more work than I expected, because I had to loop through each column, and groupby didn't quite work without me providing a dummy column like 'ones'.
Anyway, I can explain if anything is unclear, but really only a couple of the lines are somewhat hard to understand. See here for a little more explanation of the trick on the df['new'] line, or just print out individual lines to better see what is going on.
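mask = data.copy()
for i in list('abcdefgh'):
    df = pd.DataFrame(data[i])
    # give each run of consecutive NaN / non-NaN values its own label
    df['new'] = ((df.notnull() != df.shift().notnull()).cumsum())
    # dummy column, so that 'count' also counts the rows of all-NaN runs
    df['ones'] = 1
    # keep valid values and NaN runs shorter than 5 rows
    # (use <= 5 if a gap of exactly 5 NaNs should be filled too)
    mask[i] = (df.groupby('new')['ones'].transform('count') < 5) | data[i].notnull()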
In [7]: data
Out[7]:
                         a      b      c    d       e       f       g       h
2014-02-21 14:50:00  123.5  433.5  123.5  NaN     NaN     NaN  2330.3  2330.3
2014-02-21 14:51:00    NaN  523.2  132.3  NaN     NaN     NaN     NaN     NaN
2014-02-21 14:52:00  136.3  536.3  136.3  NaN     NaN     NaN     NaN     NaN
2014-02-21 14:53:00  164.3  464.3  164.3  NaN     NaN     NaN     NaN     NaN
2014-02-21 14:54:00  213.0  413.0    NaN  NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00  164.3  164.3    NaN  NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00  213.0  213.0    NaN  NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00  221.1  221.1    NaN  NaN  2330.3  2330.3     NaN  2777.7
In [8]: mask
Out[8]:
                        a     b     c      d      e     f      g      h
2014-02-21 14:50:00  True  True  True  False  False  True   True   True
2014-02-21 14:51:00  True  True  True  False  False  True  False  False
2014-02-21 14:52:00  True  True  True  False  False  True  False  False
2014-02-21 14:53:00  True  True  True  False  False  True  False  False
2014-02-21 14:54:00  True  True  True  False  False  True  False  False
2014-02-21 14:55:00  True  True  True  False  False  True  False  False
2014-02-21 14:56:00  True  True  True  False  False  True  False  False
2014-02-21 14:57:00  True  True  True  False   True  True  False   True
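If the run-labelling trick is hard to follow, it may help to trace it on a single toy column. The sketch below uses made-up values and the hypothetical names `run_id` and `run_len`. It groups the labels by themselves instead of adding a 'ones' column, which should give the same counts because the labels never contain NaN.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values): a 2-NaN gap and a 6-NaN gap.
s = pd.Series([1.0, np.nan, np.nan, 4.0] + [np.nan] * 6 + [11.0])

# A new run starts wherever "is this value present?" flips between
# neighbouring rows, so the cumulative sum of those flips gives every
# run of consecutive valid or missing values its own integer label.
run_id = (s.notnull() != s.shift().notnull()).cumsum()

# Length of the run that each row belongs to.
run_len = run_id.groupby(run_id).transform("count")

print(run_id.tolist())   # [1, 2, 2, 3, 4, 4, 4, 4, 4, 4, 5]
print(run_len.tolist())  # [1, 2, 2, 1, 6, 6, 6, 6, 6, 6, 1]
```

Comparing `run_len` with the threshold and OR-ing with `s.notnull()` then gives one column of the mask.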
It's easy from there if you don't do anything fancier with respect to extrapolation:
In [9]: data.interpolate().bfill()[mask]
Out[9]:
                         a      b      c    d       e       f       g       h
2014-02-21 14:50:00  123.5  433.5  123.5  NaN     NaN  2763.0  2330.3  2330.3
2014-02-21 14:51:00  129.9  523.2  132.3  NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:52:00  136.3  536.3  136.3  NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:53:00  164.3  464.3  164.3  NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:54:00  213.0  413.0  164.3  NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00  164.3  164.3  164.3  NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00  213.0  213.0  164.3  NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00  221.1  221.1  164.3  NaN  2330.3  2330.3     NaN  2777.7
Edit to add: Here's a faster (about 2x on this sample data) and slightly simpler way, by moving some stuff outside of the loop:
mask = data.copy()
# run labels for all columns at once
grp = ((mask.notnull() != mask.shift().notnull()).cumsum())
grp['ones'] = 1
for i in list('abcdefgh'):
    mask[i] = (grp.groupby(i)['ones'].transform('count') < 5) | data[i].notnull()
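For reuse, the same idea could be wrapped in a pair of helpers. This is only a sketch under stated assumptions: the names `short_gap_mask` and `interpolate_short_gaps` and the `max_gap` parameter are hypothetical, and it treats a gap of exactly `max_gap` rows as short (`<=`), unlike the `< 5` used above.

```python
import numpy as np
import pandas as pd


def short_gap_mask(df, max_gap):
    """True where a cell is valid or lies in a NaN run of at most max_gap rows."""
    valid = df.notnull()
    # Label runs of consecutive valid / missing cells, column by column.
    run_id = (valid != df.shift().notnull()).cumsum()
    # Run length per cell; the labels contain no NaN, so 'count' counts every row.
    run_len = run_id.apply(lambda col: col.groupby(col).transform("count"))
    return valid | (run_len <= max_gap)


def interpolate_short_gaps(df, max_gap):
    """Interpolate, back-fill leading NaNs, then blank out the long gaps again."""
    return df.interpolate().bfill().where(short_gap_mask(df, max_gap))


# Toy frame (hypothetical values): 'x' has a short leading gap, 'y' a long gap.
toy = pd.DataFrame({"x": [np.nan, np.nan, 3.0, np.nan, 5.0],
                    "y": [1.0, np.nan, np.nan, np.nan, 5.0]})
result = interpolate_short_gaps(toy, max_gap=2)
print(result)
```

With `max_gap=2`, column x is filled completely (the leading gap is back-filled with 3.0), while the 3-row gap in y stays NaN. Short trailing gaps are filled with the last valid value, because that is what the plain `interpolate()` call does with them.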
I had to solve a similar problem and came up with a numpy-based solution before I found the answer above. Since my code is approximately ten times faster, I provide it here in case it is useful to somebody in the future. It handles NaNs at the end of a series differently from JohnE's solution above: if a series ends with NaNs, it flags this last gap as invalid.
Here is the code:
def bfill_nan(arr):
    """ Backward-fill NaNs """
    mask = np.isnan(arr)
    # valid entries point at themselves, NaNs at the last index
    idx = np.where(~mask, np.arange(mask.shape[0]), mask.shape[0]-1)
    # running minimum from the right: index of the next valid entry
    idx = np.minimum.accumulate(idx[::-1], axis=0)[::-1]
    out = arr[idx]
    return out

def calc_mask(arr, maxgap):
    """ Mask NaN gaps longer than `maxgap` """
    isnan = np.isnan(arr)
    cumsum = np.cumsum(isnan).astype('float')
    diff = np.zeros_like(arr)
    # at each valid entry: number of NaNs since the previous valid entry
    diff[~isnan] = np.diff(cumsum[~isnan], prepend=0)
    diff[isnan] = np.nan
    # spread each gap length backwards over the NaNs of that gap
    diff = bfill_nan(diff)
    # True for valid entries and for NaNs in gaps shorter than maxgap
    return (diff < maxgap) | ~isnan

mask = data.copy()
for column_name in data:
    x = data[column_name].values
    mask[column_name] = calc_mask(x, 5)

print('data:')
print(data)
print('\nmask:')
print(mask)
Output:
data:
                         a      b      c    d       e       f       g       h
2014-02-21 14:50:00  123.5  433.5  123.5  NaN     NaN     NaN  2330.3  2330.3
2014-02-21 14:51:00    NaN  523.2  132.3  NaN     NaN     NaN     NaN     NaN
2014-02-21 14:52:00  136.3  536.3  136.3  NaN     NaN     NaN     NaN     NaN
2014-02-21 14:53:00  164.3  464.3  164.3  NaN     NaN     NaN     NaN     NaN
2014-02-21 14:54:00  213.0  413.0    NaN  NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00  164.3  164.3    NaN  NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00  213.0  213.0    NaN  NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00  221.1  221.1    NaN  NaN  2330.3  2330.3     NaN  2777.7
mask:
                        a     b      c      d      e     f      g      h
2014-02-21 14:50:00  True  True   True  False  False  True   True   True
2014-02-21 14:51:00  True  True   True  False  False  True  False  False
2014-02-21 14:52:00  True  True   True  False  False  True  False  False
2014-02-21 14:53:00  True  True   True  False  False  True  False  False
2014-02-21 14:54:00  True  True  False  False  False  True  False  False
2014-02-21 14:55:00  True  True  False  False  False  True  False  False
2014-02-21 14:56:00  True  True  False  False  False  True  False  False
2014-02-21 14:57:00  True  True  False  False   True  True  False   True
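Column c shows the different treatment of trailing gaps: its 4 trailing NaNs are flagged False here, although the gap is shorter than 5. The reason can be seen by tracing the index trick from bfill_nan step by step on a toy array (values invented for illustration): a NaN with no valid value after it has nothing to be back-filled from, so it stays NaN and the later `diff < maxgap` comparison is False.

```python
import numpy as np

# Toy array (hypothetical values) with a leading, an inner and a trailing NaN run.
arr = np.array([np.nan, 2.0, np.nan, np.nan, 5.0, np.nan])
n = arr.shape[0]
is_nan = np.isnan(arr)

# Valid positions point at themselves, NaN positions at the last index.
idx = np.where(~is_nan, np.arange(n), n - 1)
print(idx.tolist())     # [5, 1, 5, 5, 4, 5]

# Running minimum taken from the right: every position now holds the
# index of the next valid value at or after it (or n - 1 if there is none).
idx = np.minimum.accumulate(idx[::-1])[::-1]
print(idx.tolist())     # [1, 1, 4, 4, 4, 5]

# Fancy indexing performs the backward fill; the trailing NaN survives.
filled = arr[idx]
print(filled.tolist())  # [2.0, 2.0, 5.0, 5.0, 5.0, nan]
```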