COUNTIF in pandas python over multiple columns wit

2020-02-10 06:33发布

问题:

I have a dataset wherein I am trying to determine the number of risk factors per person. So I have the following data:

Person_ID  Age  Smoker  Diabetes
      001   30       Y         N
      002   45       N         N
      003   27       N         Y
      004   18       Y         Y
      005   55       Y         Y

Each attribute (Age, Smoker, Diabetes) has its own condition to determine whether it is a risk factor. So if Age >= 45, it's a risk factor. Smoker and Diabetes are risk factors if they are "Y". What I would like is to add a column that adds up the number of risk factors for each person based on those conditions. So the data would look like this:

Person_ID  Age  Smoker  Diabetes  Risk_Factors
      001   30       Y         N             1
      002   25       N         N             0
      003   27       N         Y             1
      004   18       Y         Y             2
      005   55       Y         Y             3

I have a sample dataset that I was fooling around with in Excel, and the way I did it there was to use the COUNTIF formula like so:

=COUNTIF(B2,">45") + COUNTIF(C2,"=Y") + COUNTIF(D2,"=Y")

However, the actual dataset that I will be using is way too large for Excel, so I'm learning pandas for python. I wish I could provide examples of what I've already tried, but frankly I don't even know where to start. I looked at this question, but it doesn't really address what to do about applying it to an entire new column using different conditions from multiple columns. Any suggestions?

回答1:

If you want to stick with pandas. You can use the following...

Solution

isY = lambda x:int(x=='Y')
countRiskFactors = lambda row: isY(row['Smoker']) + isY(row['Diabetes']) + int(row["Age"]>45)

df['Risk_Factors'] = df.apply(countRiskFactors,axis=1)

How it works

isY - is a stored lambda function that checks if the value of a cell is Y returns 1 if it is otherwise 0 countRiskFactors - adds up the risk factors

the final line uses the apply method, with the paramater key set to 1, which applies the method -first parameter - row wise along the DataFrame and Returns a Series which is appended to the DataFrame.

output of print df

   Person_ID  Age Smoker Diabetes  Risk_Factors
0          1   30      Y        N             1
1          2   45      N        N             0
2          3   27      N        Y             1
3          4   18      Y        Y             2
4          5   55      Y        Y             3


回答2:

I would do this the following way.

  1. For each column, create a new boolean series using the column's condition
  2. Add those series row-wise

(Note that this is simpler if your Smoker and Diabetes column is already boolean (True/False) instead of in strings.)

It might look like this:

df = pd.DataFrame({'Age': [30,45,27,18,55],
                   'Smoker':['Y','N','N','Y','Y'],
                   'Diabetes': ['N','N','Y','Y','Y']})

   Age Diabetes Smoker
0   30        N      Y
1   45        N      N
2   27        Y      N
3   18        Y      Y
4   55        Y      Y

#Step 1
risk1 = df.Age > 45
risk2 = df.Smoker == "Y"
risk3 = df.Diabetes == "Y"
risk_df = pd.concat([risk1,risk2,risk3],axis=1)

     Age Smoker Diabetes
0  False   True    False
1  False  False    False
2  False  False     True
3  False   True     True
4   True   True     True

df['Risk_Factors'] = risk_df.sum(axis=1)

   Age Diabetes Smoker  Risk_Factors
0   30        N      Y             1
1   45        N      N             0
2   27        Y      N             1
3   18        Y      Y             2
4   55        Y      Y             3


回答3:

If you are starting from excel and want to go to the next evolution then I would recommend MS access. It will be a lot easier then learning Panda for python. You should just replace the CountIf() with:

Risk Factor: IIF(Age>45, 1, 0) + IIF(Smoker="Y", 1, 0) + IIF(Diabetes="Y", 1, 0)