I want to get the percentage of each value in a df column. Say I have a df with columns (col1, col2, col3, gender), where the gender column has values of M or F. I want to get the percentage of M and F values in the df.
I have tried this, which gives me the number of M and F instances, but I want these as a percentage of the total number of values in the df.
df.groupby('gender').size()
Can someone help?
Use value_counts with normalize=True:
df['gender'].value_counts(normalize=True) * 100
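For anyone who wants to try this quickly, here is a minimal self-contained sketch. The sample DataFrame is made up for illustration; only the gender column name comes from the question.

```python
import pandas as pd

# Hypothetical sample data; only the 'gender' column matters here.
df = pd.DataFrame({"gender": ["M", "M", "F", "F", "F"]})

# normalize=True returns fractions that sum to 1;
# multiplying by 100 turns them into percentages.
pct = df["gender"].value_counts(normalize=True) * 100

# Round for display only; pct keeps the unrounded values.
print(pct.round(1))
```

Note that value_counts drops missing values by default, so the percentages are relative to the non-null entries of the column.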
If you do not need to look for M and F values in columns other than gender, you could try using value_counts() and count() as follows:
import pandas as pd

df = pd.DataFrame({'gender': ['M', 'M', 'F', 'F', 'F']})
# Percentage calculation: count per value divided by the number of non-null entries
(df['gender'].value_counts() / df['gender'].count()) * 100
Result:
F 60.0
M 40.0
Name: gender, dtype: float64
Or, using groupby:
(df.groupby('gender').size()/df['gender'].count())*100
Let's say there are 200 values, of which 120 are categorized as M and 80 as F:
1)
df['gender'].value_counts()
output:
M=120
F=80
2)
df['gender'].value_counts(normalize=True)
output:
M=0.60
F=0.40
3)
df['gender'].value_counts(normalize=True) * 100  # will convert output to percentages
output:
M=60
F=40
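The three steps above can be reproduced with a small runnable sketch. The Series below is constructed only to match the 120/80 split in this example; the variable names are illustrative.

```python
import pandas as pd

# Build 200 values: 120 "M" and 80 "F", matching the example above.
gender = pd.Series(["M"] * 120 + ["F"] * 80, name="gender")

# 1) Raw counts per value.
counts = gender.value_counts()

# 2) Fractions of the total. The keyword is lowercase: normalize.
fractions = gender.value_counts(normalize=True)

# 3) Fractions scaled to percentages.
percentages = fractions * 100

print(counts, fractions, percentages, sep="\n\n")
```

The keyword must be spelled normalize in lowercase; passing Normalize=True raises a TypeError because value_counts has no such argument.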
Finding the percentage of each target class, to check whether the data is imbalanced or not:
# data is the DataFrame and Target_col_Y the name of its target column
g = data[Target_col_Y]
df = pd.concat([g.value_counts(),
                g.value_counts(normalize=True).mul(100)],
               axis=1, keys=('counts', 'percentage'))
print(df)
   counts  percentage
0   36548   88.734583
1    4640   11.265417
Taking the difference between consecutive rows and then the maximum of the percentage column, to check how much imbalance there is:
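# Row-wise difference between consecutive classes (the first row becomes NaN)
df1 = df.diff(periods=1, axis=0)
# Maximum of the last column ('percentage'); negative when the smaller class comes second
difvalue = df1[[list(df1.columns)[-1]]].max()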
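As an alternative sketch (not the method used above), the imbalance could also be summarized as the gap in percentage points between the largest and smallest class, plus the majority-to-minority ratio. The Series y below is hypothetical, rebuilt from the counts printed above, and the names summary, gap and ratio are only illustrative.

```python
import pandas as pd

# Hypothetical target column rebuilt from the counts shown above.
y = pd.Series([0] * 36548 + [1] * 4640, name="target")

# Counts and percentages side by side, as in the answer above.
summary = pd.concat(
    [y.value_counts(), y.value_counts(normalize=True).mul(100)],
    axis=1,
    keys=("counts", "percentage"),
)

# Gap in percentage points between the most and least frequent class.
gap = summary["percentage"].max() - summary["percentage"].min()

# How many majority-class rows there are per minority-class row.
ratio = summary["counts"].max() / summary["counts"].min()

print(summary)
print(f"gap: {gap:.2f} percentage points, ratio: {ratio:.2f}")
```

Using max minus min avoids depending on the row order, so the result is always non-negative regardless of which class appears first.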