It's very easy to interpolate NaN cells in a Pandas DataFrame:
In [98]: df
Out[98]:
neg neu pos avg
250 0.508475 0.527027 0.641292 0.558931
500 NaN NaN NaN NaN
1000 0.650000 0.571429 0.653983 0.625137
2000 NaN NaN NaN NaN
3000 0.619718 0.663158 0.665468 0.649448
4000 NaN NaN NaN NaN
6000 NaN NaN NaN NaN
8000 NaN NaN NaN NaN
10000 NaN NaN NaN NaN
20000 NaN NaN NaN NaN
30000 NaN NaN NaN NaN
50000 NaN NaN NaN NaN
[12 rows x 4 columns]
In [99]: df.interpolate(method='nearest', axis=0)
Out[99]:
neg neu pos avg
250 0.508475 0.527027 0.641292 0.558931
500 0.508475 0.527027 0.641292 0.558931
1000 0.650000 0.571429 0.653983 0.625137
2000 0.650000 0.571429 0.653983 0.625137
3000 0.619718 0.663158 0.665468 0.649448
4000 NaN NaN NaN NaN
6000 NaN NaN NaN NaN
8000 NaN NaN NaN NaN
10000 NaN NaN NaN NaN
20000 NaN NaN NaN NaN
30000 NaN NaN NaN NaN
50000 NaN NaN NaN NaN
[12 rows x 4 columns]
I would also like it to extrapolate the NaN values that fall outside the interpolation range, using the given method. How could I best do this?
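For `method='nearest'`, extrapolating past the ends amounts to repeating the first and last known rows, so one possible approach is to chain `ffill()` and `bfill()` after the interpolation. The following is a minimal sketch under stated assumptions, not a definitive listing: the values come from the question, the rows with index 0 and 999 are hypothetical additions for illustration, and `method='nearest'` needs SciPy installed.

```python
import numpy as np
import pandas as pd

# Sample frame: known rows are taken from the question; the rows with
# index 0 and 999 are hypothetical additions for illustration.
index = [0, 250, 500, 999, 1000, 2000, 3000, 4000, 6000]
df = pd.DataFrame(np.nan, index=index, columns=["neg", "neu", "pos", "avg"])
df.loc[250] = [0.508475, 0.527027, 0.641292, 0.558931]
df.loc[1000] = [0.650000, 0.571429, 0.653983, 0.625137]
df.loc[3000] = [0.619718, 0.663158, 0.665468, 0.649448]

# Interpolate interior gaps by nearest index value (requires SciPy),
# then repeat the edge rows outward: ffill() covers trailing NaNs,
# bfill() covers leading NaNs.
filled = df.interpolate(method="nearest", axis=0).ffill().bfill()
print(filled)
```

Row 999 takes its values from row 1000 (its nearest neighbour), whereas a plain forward fill would have copied row 250 instead.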
Note: I changed your df a little to show how interpolating with nearest is different from doing a df.fillna. (See the row with index 999.) I also added a row of NaNs with index 0 to show that bfill() may also be necessary.
Extrapolating Pandas DataFrames
DataFrames may be extrapolated; however, pandas has no simple method call for this, so it requires another library (e.g. scipy.optimize).
Extrapolating
Extrapolating, in general, requires one to make certain assumptions about the data being extrapolated. One way is to curve fit some general parameterized equation to the data to find the parameter values that best describe the existing data; the fitted equation is then used to calculate values beyond the range of this data. The difficult and limiting issue with this approach is that some assumption about the trend must be made when the parameterized equation is selected. A suitable equation can be found through trial and error with different equations until one gives the desired result, or it can sometimes be inferred from the source of the data. The data provided in the question is really not a large enough dataset to obtain a well-fit curve; however, it is good enough for illustration.
The following is an example of extrapolating the DataFrame with a 3rd order polynomial.
This generic function (func()) is curve fit onto each column to obtain unique column-specific parameters (i.e. a, b, c, d). Then these parameterized equations are used to extrapolate the data in each column for all the indexes with NaNs.
Extrapolating Results
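The curve-fitting procedure described above can be sketched roughly as follows. This is a reconstruction under assumptions rather than the original listing: `func()` is assumed to be a general cubic with parameters a, b, c, d, the five known rows are the interpolated values shown in the question, and dividing the index by 1000 is my own choice to keep the fit numerically well conditioned.

```python
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit


def func(x, a, b, c, d):
    """Assumed model: a general 3rd order polynomial."""
    return a * x**3 + b * x**2 + c * x + d


index = [250, 500, 1000, 2000, 3000, 4000, 6000, 8000,
         10000, 20000, 30000, 50000]
df = pd.DataFrame(np.nan, index=index, columns=["neg", "neu", "pos", "avg"])
# Known rows: the interpolated output from the question.
df.loc[250] = [0.508475, 0.527027, 0.641292, 0.558931]
df.loc[500] = [0.508475, 0.527027, 0.641292, 0.558931]
df.loc[1000] = [0.650000, 0.571429, 0.653983, 0.625137]
df.loc[2000] = [0.650000, 0.571429, 0.653983, 0.625137]
df.loc[3000] = [0.619718, 0.663158, 0.665468, 0.649448]

SCALE = 1000.0  # assumption: rescale x so the cubic terms stay moderate

known = df.dropna()
x_known = known.index.to_numpy(dtype=float) / SCALE

extrapolated = df.copy()
for col in df.columns:
    # Fit one parameter set (a, b, c, d) per column.
    params, _ = curve_fit(func, x_known, known[col].to_numpy())
    # Evaluate the fitted curve only where the column is NaN.
    missing = df[col].isna().to_numpy()
    x_missing = df.index[missing].to_numpy(dtype=float) / SCALE
    extrapolated.loc[missing, col] = func(x_missing, *params)

print(extrapolated)
```

A cubic fitted to five points in the range 250 to 3000 can swing far outside a plausible range by index 50000, so the printed numbers illustrate the mechanics rather than a trustworthy prediction.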
Plot for avg column
Without a larger dataset or knowing the source of the data, this result may be completely wrong, but it should exemplify the process of extrapolating a DataFrame. The assumed equation in func() would probably need to be played with to get the correct extrapolation. Also, no attempt was made to make the code efficient.
Update:
If your index is non-numeric, like a DatetimeIndex, see this answer for how to extrapolate them.
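The linked answer is not reproduced here. As a rough sketch of one way a `DatetimeIndex` could be handled, the timestamps can be converted to elapsed days and a curve fitted against that numeric axis. The series below is hypothetical, and a straight line fitted with `np.polyfit` stands in for whatever model suits the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series whose last three values are unknown.
idx = pd.date_range("2020-01-01", periods=6, freq="D")
s = pd.Series([1.0, 2.0, 3.0, np.nan, np.nan, np.nan], index=idx)

# Turn the timestamps into a numeric axis: days elapsed since the first one.
x = (s.index - s.index[0]).total_seconds().to_numpy() / 86400.0

# Fit a straight line to the known points (assumed model).
known = s.notna().to_numpy()
slope, intercept = np.polyfit(x[known], s.to_numpy()[known], deg=1)

# Evaluate the fitted line only at the timestamps that were NaN.
filled = s.copy()
filled.loc[~known] = slope * x[~known] + intercept
print(filled)
```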