Python Pandas Running Totals with Resets

2019-08-01 15:04发布

问题:

I would like to perform the following task. Given a 2 columns (good and bad) I would like to replace any rows for the two columns with a running total. Here is an example of the current dataframe along with the desired data frame.

EDIT: I should have added what my intentions are. I am trying to create equally binned (in this case 20) variable using a continuous variable as the input. I know the pandas cut and qcut functions are available, however the returned results will have zeros for the good/bad rate (needed to compute the weight of evidence and information value). Zeros in either the numerator or denominator will not allow the mathematical calculations to work.

   d={'AAA':range(0,20),
      'good':[3,3,13,20,28,32,59,72,64,52,38,24,17,19,12,5,7,6,2,0],
      'bad':[0,0,1,1,1,0,6,8,10,6,6,10,5,8,2,2,1,3,1,1]}
   df=pd.DataFrame(data=d)
   print(df)

Here is an explanation of what I need to do to the above dataframe.

Roughly speaking, anytime I encounter a zero for either column, I need to use a running total for the column which is not zero to the next row which has a non-zero value for the column that contained zeros.

Here is the desired output:

dd={'AAA':range(0,16),
    'good':[19,20,60,59,72,64,52,38,24,17,19,12,5,7,6,2],
    'bad':[1,1,1,6,8,10,6,6,10,5,8,2,2,1,3,2]}

desired_df=pd.DataFrame(data=dd)    
print(desired_df) 

回答1:

The basic idea of my solution is to create a column from a cumsum over non-zero values in order to get the zero values with the next non zero value into one group. Then you can use groupby + sum to get your the desired values.

two_good = df.groupby((df['bad']!=0).cumsum().shift(1).fillna(0))['good'].sum()
two_bad = df.groupby((df['good']!=0).cumsum().shift(1).fillna(0))['bad'].sum()

two_good = two_good.loc[two_good!=0].reset_index(drop=True)
two_bad = two_bad.loc[two_bad!=0].reset_index(drop=True)

new_df = pd.concat([two_bad, two_good], axis=1).dropna()
print(new_df)
    bad  good
0     1  19.0
1     1  20.0
2     1  28.0
3     6  91.0
4     8  72.0
5    10  64.0
6     6  52.0
7     6  38.0
8    10  24.0
9     5  17.0
10    8  19.0
11    2  12.0
12    2   5.0
13    1   7.0
14    3   6.0
15    1   2.0

This code treats your etch case of trailing zeros different from your desired output, it simple cuts it off. You'd have to add some extra code to catch that one with a different logic.



回答2:

P.Tillmann. I appreciate your assistance with this. For the more advanced readers I would assume you to find this code appalling, as I do. I would be more than happy to take any recommendation which makes this more streamlined.

d={'AAA':range(0,20),
  'good':[3,3,13,20,28,32,59,72,64,52,38,24,17,19,12,5,7,6,2,0],
  'bad':[0,0,1,1,1,0,6,8,10,6,6,10,5,8,2,2,1,3,1,1]}
df=pd.DataFrame(data=d)
print(df)

row_good=0
row_bad=0
row_bad_zero_count=0
row_good_zero_count=0
row_out='NO'
crappy_fix=pd.DataFrame()
for index,row in df.iterrows():
    if row['good']==0 or row['bad']==0:
        row_bad += row['bad']
        row_good += row['good']
        row_bad_zero_count += 1
        row_good_zero_count += 1
        output_ind='1'
        row_out='NO'
    elif index+1 < len(df) and (df.loc[index+1,'good']==0 or    df.loc[index+1,'bad']==0):
        row_bad=row['bad']
        row_good=row['good'] 
        output_ind='2'
        row_out='NO'    
    elif (row_bad_zero_count > 1 or row_good_zero_count > 1) and row['good']!=0 and row['bad']!=0:
        row_bad += row['bad']
        row_good += row['good']
        row_bad_zero_count=0
        row_good_zero_count=0    
        row_out='YES'
        output_ind='3'
    else:
        row_bad=row['bad']
        row_good=row['good']
        row_bad_zero_count=0
        row_good_zero_count=0
        row_out='YES'
        output_ind='4'

    if ((row['good']==0 or row['bad']==0) 
        and (index > 0 and (df.loc[index-1,'good']!=0 or df.loc[index-1,'bad']!=0))
        and row_good != 0 and row_bad != 0):
        row_out='YES'

    if row_out=='YES':
        temp_dict={'AAA':row['AAA'],
                   'good':row_good,
                   'bad':row_bad}
        crappy_fix=crappy_fix.append([temp_dict],ignore_index=True)
        print(str(row['AAA']),'-',
              str(row['good']),'-',
              str(row['bad']),'-',
              str(row_good),'-',
              str(row_bad),'-',
              str(row_good_zero_count),'-',
              str(row_bad_zero_count),'-',
              row_out,'-',
              output_ind) 

print(crappy_fix)