I have a large Dataframe that looks similar to this:
ID_Code Status1 Status2
0 A Done Not
1 A Done Done
2 B Not Not
3 B Not Done
4 C Not Not
5 C Not Not
6 C Done Done
What I want to do is calculate is for each of the set of duplicate ID codes, find out the percentage of Not-Not entries are present.
(i.e. [# of Not-Not/# of total entries] * 100)
I'm struggling to do so using groupby and can't seem to get the right syntax to perform this.
I may have misunderstood the question, but you appear to be referring to when values of Status1
and Status2
are both Not
, correct? If that's the case, you can do something like:
df.groupby('ID_Code').apply(lambda x: (x[['Status1','Status2']] == 'Not').all(1).sum()/len(x)*100)
ID_Code
A 0.000000
B 50.000000
C 66.666667
dtype: float64
IIUC using crosstab
pd.crosstab(df['ID_Code'],(df['Status1'].eq('Not'))&(df['Status2'].eq('Not')),normalize ='index')
Out[713]:
col_0 False True
ID_Code
A 1.000000 0.000000
B 0.500000 0.500000
C 0.333333 0.666667
#pd.crosstab(df['ID_Code'],(df['Status1'].eq('Not'))&(df['Status2'].eq('Not')),normalize ='index')[True]
Using sum
and a boolean
mask:
df.filter(like='Status').eq('Not').all(1).groupby(df.ID_Code).mean().mul(100)
ID_Code
A 0.000000
B 50.000000
C 66.666667
Name: flag, dtype: float64