In Pandas, when I join
, the joined data is misaligned with respect to the original DataFrame:
import os
import pandas as pd
import statsmodels.formula.api as sm
import numpy as np
import matplotlib.pyplot as plt
flu_train = pd.read_csv('FluTrain.csv')
# From: https://courses.edx.org/c4x/MITx/15.071x/asset/FluTrain.csv
cols = ['Ystart', 'Mstart', 'Dstart', 'Yend', 'Mend', 'Dend']
flu_train = flu_train.join(pd.DataFrame(flu_train.Week.str.findall('\d+').tolist(), dtype=np.int64, columns=cols))
flu_trend_1 = sm.ols('log(ILI) ~ Queries', flu_train).fit()
flu_test = pd.read_csv('FluTest.csv')
# From: https://courses.edx.org/c4x/MITx/15.071x/asset/FluTest.csv
flu_test = flu_test.join(pd.DataFrame(flu_test.Week.str.findall('\d+').tolist(), dtype=np.int64, columns=cols))
flu_test = flu_test.join(pd.DataFrame(exp(flu_trend_1.predict(flu_test)), columns=['ILIPred1'] ))
flu_train['ILIShift2'] = flu_train.ILI.shift(2)
flu_trend_2 = sm.ols('log(ILI) ~ Queries + log(ILIShift2)', flu_train).fit()
flu_test['ILIShift2'] = flu_test.ILI.shift(2)
# Note that this does not work in a simplified example
# See -- http://stackoverflow.com/q/22457880/
flu_test[:2].ILIShift2 = list(flu_train[-2:].ILI)
# This SHIFTS the joined column "up" two rows, loosing the first two values of ILIPred2 and making the last 2 'NaN'
flu_test = flu_test.join(pd.DataFrame(exp(flu_trend_2.predict(flu_test)), columns=['ILIPred2']))
The final statement shifts the joined column "up" two rows, loosing the first two values of ILIPred2 and making the last 2 'NaN'. I expected the joined column to align with all of the others.
Why is this happening and how do I prevent it?
This joined dataframe (pd.DataFrame(np.exp(flu_trend_2.predict(flu_test)), columns=['ILIPred2'])
) has an index from 0 to 49.
You're joining it to flu_test
, which has an index of 0 to 51.
So where those indices don't match up (50, and 51), you get NaN
as I would hope.
If you want to force the joined column to sit at the bottom of your main dataframe, you can do the following (note the use of iloc
and the row_shift
variable):
import os
import pandas as pd
import statsmodels.formula.api as sm
import numpy as np
import matplotlib.pyplot as plt
row_shift = 2
flu_train = pd.read_csv('https://courses.edx.org/c4x/MITx/15.071x/asset/FluTrain.csv')
cols = ['Ystart', 'Mstart', 'Dstart', 'Yend', 'Mend', 'Dend']
flu_train = flu_train.join(pd.DataFrame(flu_train.Week.str.findall('\d+').tolist(), dtype=np.int64, columns=cols))
flu_trend_1 = sm.ols('np.log(ILI) ~ Queries', flu_train).fit()
flu_test = pd.read_csv('https://courses.edx.org/c4x/MITx/15.071x/asset/FluTest.csv')
flu_test = flu_test.join(pd.DataFrame(flu_test.Week.str.findall('\d+').tolist(), dtype=np.int64, columns=cols))
flu_test = flu_test.join(pd.DataFrame(np.exp(flu_trend_1.predict(flu_test)), columns=['ILIPred1'] ))
flu_train['ILIShift2'] = flu_train.ILI.shift(row_shift)
flu_trend_2 = sm.ols('np.log(ILI) ~ Queries + np.log(ILIShift2)', flu_train).fit()
flu_test['ILIShift2'] = flu_test.ILI.shift(row_shift)
# Note that this does not work in a simplified example
# See -- http://stackoverflow.com/q/22457880/
flu_test.iloc[:2].ILIShift2 = list(flu_train.iloc[-row_shift:].ILI)
joiner = pd.DataFrame(np.exp(flu_trend_2.predict(flu_test)), columns=['ILIPred2'], index=flu_test.index[row_shift:])
flu_test.join(joiner)
Which gives me:
Week ILI Queries Ystart Mstart Dstart Yend Mend Dend ILIPred1 ILIShift2 ILIPred2
0 2012-01-01 - 2012-01-07 1.766707 0.593625 2012 1 1 2012 1 7 3.520332 NaN NaN
1 2012-01-08 - 2012-01-14 1.543401 0.499336 2012 1 8 2012 1 14 2.662689 NaN NaN
2 2012-01-15 - 2012-01-21 1.647615 0.500664 2012 1 15 2012 1 21 2.673181 1.766707 2.140941
3 2012-01-22 - 2012-01-28 1.684297 0.479416 2012 1 22 2012 1 28 2.510160 1.543401 1.907817
4 2012-01-29 - 2012-02-04 1.863542 0.471448 2012 1 29 2012 2 4 2.451624 1.647615 1.971504
5 2012-02-05 - 2012-02-11 1.864079 0.503320 2012 2 5 2012 2 11
...
46 2012-11-18 - 2012-11-24 2.304625 0.511288 2012 11 18 2012 11 24 2.758619 1.610915 2.042260
47 2012-11-25 - 2012-12-01 2.225997 0.609562 2012 11 25 2012 12 1 3.690445 1.733293 2.424141
48 2012-12-02 - 2012-12-08 2.978047 0.671979 2012 12 2 2012 12 8 4.439679 2.304625 3.160283
49 2012-12-09 - 2012-12-15 3.600230 0.705179 2012 12 9 2012 12 15 4.898351 2.225997 3.220680
50 2012-12-16 - 2012-12-22 4.547268 0.787517 2012 12 16 2012 12 22 6.250888 2.978047 4.322513
51 2012-12-23 - 2012-12-29 6.033614 0.805421 2012 12 23 2012 12 29 6.591252 3.600230 5.006438