In Pandas, when I join
, the joined data is misaligned with respect to the original DataFrame:
import os
import pandas as pd
import statsmodels.formula.api as sm
import numpy as np
import matplotlib.pyplot as plt
flu_train = pd.read_csv('FluTrain.csv')
# From: https://courses.edx.org/c4x/MITx/15.071x/asset/FluTrain.csv
cols = ['Ystart', 'Mstart', 'Dstart', 'Yend', 'Mend', 'Dend']
flu_train = flu_train.join(pd.DataFrame(flu_train.Week.str.findall('\d+').tolist(), dtype=np.int64, columns=cols))
flu_trend_1 = sm.ols('log(ILI) ~ Queries', flu_train).fit()
flu_test = pd.read_csv('FluTest.csv')
# From: https://courses.edx.org/c4x/MITx/15.071x/asset/FluTest.csv
flu_test = flu_test.join(pd.DataFrame(flu_test.Week.str.findall('\d+').tolist(), dtype=np.int64, columns=cols))
flu_test = flu_test.join(pd.DataFrame(exp(flu_trend_1.predict(flu_test)), columns=['ILIPred1'] ))
flu_train['ILIShift2'] = flu_train.ILI.shift(2)
flu_trend_2 = sm.ols('log(ILI) ~ Queries + log(ILIShift2)', flu_train).fit()
flu_test['ILIShift2'] = flu_test.ILI.shift(2)
# Note that this does not work in a simplified example
# See -- http://stackoverflow.com/q/22457880/
flu_test[:2].ILIShift2 = list(flu_train[-2:].ILI)
# This SHIFTS the joined column "up" two rows, loosing the first two values of ILIPred2 and making the last 2 'NaN'
flu_test = flu_test.join(pd.DataFrame(exp(flu_trend_2.predict(flu_test)), columns=['ILIPred2']))
The final statement shifts the joined column "up" two rows, loosing the first two values of ILIPred2 and making the last 2 'NaN'. I expected the joined column to align with all of the others.
Why is this happening and how do I prevent it?
This joined dataframe (
pd.DataFrame(np.exp(flu_trend_2.predict(flu_test)), columns=['ILIPred2'])
) has an index from 0 to 49.You're joining it to
flu_test
, which has an index of 0 to 51.So where those indices don't match up (50, and 51), you get
NaN
as I would hope.If you want to force the joined column to sit at the bottom of your main dataframe, you can do the following (note the use of
iloc
and therow_shift
variable):Which gives me: