Why does my Pandas join shift rows of the joined d

2019-09-18 09:28发布

问题:

In Pandas, when I join, the joined data is misaligned with respect to the original DataFrame:

import os
import pandas as pd
import statsmodels.formula.api as sm
import numpy as np
import matplotlib.pyplot as plt

flu_train = pd.read_csv('FluTrain.csv')
# From: https://courses.edx.org/c4x/MITx/15.071x/asset/FluTrain.csv
cols = ['Ystart', 'Mstart', 'Dstart', 'Yend', 'Mend', 'Dend']
flu_train = flu_train.join(pd.DataFrame(flu_train.Week.str.findall('\d+').tolist(), dtype=np.int64, columns=cols))

flu_trend_1 = sm.ols('log(ILI) ~ Queries', flu_train).fit()

flu_test = pd.read_csv('FluTest.csv')
# From: https://courses.edx.org/c4x/MITx/15.071x/asset/FluTest.csv
flu_test = flu_test.join(pd.DataFrame(flu_test.Week.str.findall('\d+').tolist(), dtype=np.int64, columns=cols))

flu_test = flu_test.join(pd.DataFrame(exp(flu_trend_1.predict(flu_test)), columns=['ILIPred1'] ))
flu_train['ILIShift2'] = flu_train.ILI.shift(2)

flu_trend_2 = sm.ols('log(ILI) ~ Queries + log(ILIShift2)', flu_train).fit()
flu_test['ILIShift2'] = flu_test.ILI.shift(2)

# Note that this does not work in a simplified example
# See -- http://stackoverflow.com/q/22457880/
flu_test[:2].ILIShift2 = list(flu_train[-2:].ILI)

# This SHIFTS the joined column "up" two rows, loosing the first two values of ILIPred2 and making the last 2 'NaN'
flu_test = flu_test.join(pd.DataFrame(exp(flu_trend_2.predict(flu_test)), columns=['ILIPred2']))

The final statement shifts the joined column "up" two rows, loosing the first two values of ILIPred2 and making the last 2 'NaN'. I expected the joined column to align with all of the others.

Why is this happening and how do I prevent it?

回答1:

This joined dataframe (pd.DataFrame(np.exp(flu_trend_2.predict(flu_test)), columns=['ILIPred2'])) has an index from 0 to 49.

You're joining it to flu_test, which has an index of 0 to 51.

So where those indices don't match up (50, and 51), you get NaN as I would hope.

If you want to force the joined column to sit at the bottom of your main dataframe, you can do the following (note the use of iloc and the row_shift variable):

import os
import pandas as pd
import statsmodels.formula.api as sm
import numpy as np
import matplotlib.pyplot as plt

row_shift = 2

flu_train = pd.read_csv('https://courses.edx.org/c4x/MITx/15.071x/asset/FluTrain.csv')
cols = ['Ystart', 'Mstart', 'Dstart', 'Yend', 'Mend', 'Dend']
flu_train = flu_train.join(pd.DataFrame(flu_train.Week.str.findall('\d+').tolist(), dtype=np.int64, columns=cols))

flu_trend_1 = sm.ols('np.log(ILI) ~ Queries', flu_train).fit()

flu_test = pd.read_csv('https://courses.edx.org/c4x/MITx/15.071x/asset/FluTest.csv')
flu_test = flu_test.join(pd.DataFrame(flu_test.Week.str.findall('\d+').tolist(), dtype=np.int64, columns=cols))

flu_test = flu_test.join(pd.DataFrame(np.exp(flu_trend_1.predict(flu_test)), columns=['ILIPred1'] ))
flu_train['ILIShift2'] = flu_train.ILI.shift(row_shift)

flu_trend_2 = sm.ols('np.log(ILI) ~ Queries + np.log(ILIShift2)', flu_train).fit()
flu_test['ILIShift2'] = flu_test.ILI.shift(row_shift)

# Note that this does not work in a simplified example
# See -- http://stackoverflow.com/q/22457880/
flu_test.iloc[:2].ILIShift2 = list(flu_train.iloc[-row_shift:].ILI)

joiner = pd.DataFrame(np.exp(flu_trend_2.predict(flu_test)), columns=['ILIPred2'], index=flu_test.index[row_shift:])
flu_test.join(joiner)

Which gives me:

                       Week       ILI   Queries  Ystart  Mstart  Dstart  Yend  Mend  Dend  ILIPred1  ILIShift2  ILIPred2
0   2012-01-01 - 2012-01-07  1.766707  0.593625    2012       1       1  2012     1     7  3.520332        NaN       NaN
1   2012-01-08 - 2012-01-14  1.543401  0.499336    2012       1       8  2012     1    14  2.662689        NaN       NaN
2   2012-01-15 - 2012-01-21  1.647615  0.500664    2012       1      15  2012     1    21  2.673181   1.766707  2.140941
3   2012-01-22 - 2012-01-28  1.684297  0.479416    2012       1      22  2012     1    28  2.510160   1.543401  1.907817
4   2012-01-29 - 2012-02-04  1.863542  0.471448    2012       1      29  2012     2     4  2.451624   1.647615  1.971504
5   2012-02-05 - 2012-02-11  1.864079  0.503320    2012       2       5  2012     2    11  
...
46  2012-11-18 - 2012-11-24  2.304625  0.511288    2012      11      18  2012    11    24  2.758619   1.610915  2.042260
47  2012-11-25 - 2012-12-01  2.225997  0.609562    2012      11      25  2012    12     1  3.690445   1.733293  2.424141
48  2012-12-02 - 2012-12-08  2.978047  0.671979    2012      12       2  2012    12     8  4.439679   2.304625  3.160283
49  2012-12-09 - 2012-12-15  3.600230  0.705179    2012      12       9  2012    12    15  4.898351   2.225997  3.220680
50  2012-12-16 - 2012-12-22  4.547268  0.787517    2012      12      16  2012    12    22  6.250888   2.978047  4.322513
51  2012-12-23 - 2012-12-29  6.033614  0.805421    2012      12      23  2012    12    29  6.591252   3.600230  5.006438