How to check if pandas dataframe rows have certain

Posted 2019-07-31 16:36

Question:

I have implemented the CN2 classification algorithm, which induces rules of the following form to classify data:

IF Attribute1 = a AND Attribute4 = b THEN class = class 1

My current implementation loops through a pandas DataFrame containing the training data using iterrows() and returns True or False for each row depending on whether it satisfies the rule. I am aware, however, that this is a highly inefficient solution. I would like to vectorise the code; my current attempt is as follows:

DataFrame df:
    age  prescription  astigmatism  tear rate  
    1      1              2            1         
    2      2              1            1         
    2      1              1            2         

rule = {'age':[1],'prescription':[1],'astigmatism':[1,2],'tear rate':[1,2]}
df.isin(rule)

This produces:

age  prescription  astigmatism  tear rate  
True   True             True       True  
False  False            True       True  
False  True             True       True  

I have coded the rule as a dictionary that contains a single value for each target attribute and the set of all possible values for each non-target attribute.
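As a rough sketch of how such a dictionary could be built, the snippet below expands a compact rule (target attributes only) into the full form shown above. The helper name `expand_rule` is hypothetical and not part of pandas or of the original implementation; it assumes the possible values of a non-target attribute are simply the values observed in the training data.

```python
import pandas as pd

# Training data from the example above.
df = pd.DataFrame({
    "age": [1, 2, 2],
    "prescription": [1, 2, 1],
    "astigmatism": [2, 1, 1],
    "tear rate": [1, 1, 2],
})


def expand_rule(conditions, frame):
    """Hypothetical helper: keep the given values for target attributes and
    fill every other column with all values observed in `frame`."""
    return {
        col: list(conditions[col]) if col in conditions
        else sorted(frame[col].unique().tolist())
        for col in frame.columns
    }


# IF age = 1 AND prescription = 1 THEN ...
rule = expand_rule({"age": [1], "prescription": [1]}, df)
print(rule)
```

Because the non-target columns list every observed value, those columns always compare as True in `df.isin(rule)`; only the target attributes can make a row fail.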

The result I would like is a single True or False for each row, indicating whether the conditions of the rule are met, together with the index of the rows that evaluate to all True. Currently I can only get a DataFrame with a True/False for each value. To be concrete, in the example I have shown, I want the result to be the index of the first row, which is the only row that satisfies the rule.

Answer 1:

I think you need to check whether at least one value per row is True, using DataFrame.any:

mask = df.isin(rule).any(axis=1)

print (mask)
0    True
1    True
2    True
dtype: bool

Or, to check whether all values in a row are True (which is what the rule requires), use DataFrame.all:

mask = df.isin(rule).all(axis=1)

print (mask)
0     True
1    False
2    False
dtype: bool

To filter the rows, use boolean indexing:

df = df[mask]
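
The question also asks for the index of the matching rows. A minimal end-to-end sketch, rebuilding the example data from the question, might look like this; the variable names `matching_rows` and `matching_index` are illustrative only.

```python
import pandas as pd

# Example data and rule from the question.
df = pd.DataFrame({
    "age": [1, 2, 2],
    "prescription": [1, 2, 1],
    "astigmatism": [2, 1, 1],
    "tear rate": [1, 1, 2],
})
rule = {"age": [1], "prescription": [1], "astigmatism": [1, 2], "tear rate": [1, 2]}

# One boolean per row: True only if every column matches the rule.
mask = df.isin(rule).all(axis=1)

# Keep the matching rows without overwriting df, then read off their index.
matching_rows = df.loc[mask]
matching_index = matching_rows.index.tolist()

print(mask.tolist())
print(matching_index)
```

Assigning the filtered frame to a new name rather than back to `df` keeps the full training data available for evaluating the next rule.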