I used to do the following when altering a dataframe column based on a condition (in this case, every woman gets a wage of 200).
import pandas as pd
df = pd.DataFrame([[False,100],[True,100],[True,100]],columns=['female','wage'])
df.loc[df['female'] == True,'wage'] = 200
The PEP 8 Style convention checker (in Spyder) recommends in line 3:
comparison to True should be 'if cond is True:' or 'if cond:'
Changing the last line to
df.loc[df['female'] is True,'wage'] = 200
yields
KeyError: 'cannot use a single bool to index into setitem'
because now the statement is evaluated to a single boolean value and not to a Series.
Is this a case where one has to deviate from styling conventions?
You should use df['female'] with no comparison, rather than comparing to True with any operator. The column df['female'] is already the mask you need.
Comparison to True with == is almost always a bad idea, even in NumPy or Pandas.
Just do
df.loc[df['female'], 'wage'] = 200
In fact, df['female'] as a Boolean Series has exactly the same values as the Boolean Series returned by evaluating df['female'] == True. (A Series is the Pandas term for a single labeled column of data, such as one column of a dataframe.)
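As a quick check of the claim above, here is a minimal sketch using the dataframe from the question. It assumes the column has Boolean dtype with no missing values; the names mask_plain, mask_compared, same_values and wages are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([[False, 100], [True, 100], [True, 100]],
                  columns=['female', 'wage'])

mask_plain = df['female']              # already a Boolean Series
mask_compared = df['female'] == True   # element-wise comparison, also a Boolean Series

# Both masks hold the same values, so the comparison adds nothing.
same_values = bool((mask_plain == mask_compared).all())

# Use the column itself as the mask.
df.loc[mask_plain, 'wage'] = 200
wages = df['wage'].tolist()

print(same_values)
print(wages)
```

If the column were not a clean Boolean column (for example an object column containing missing values), the two masks could differ, so it is worth checking the dtype first.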
By the way, the last statement is precisely why df['female'] is True should never work. In Python, the is operator is reserved for object identity, not for comparing values for equality. df['female'] will always be a Series (if df is a Pandas dataframe), and a Series will never be the same object as the single Boolean object True.
To understand this better, think of the difference, in English, between 'equal' and 'same'. In German, this is the difference between 'selbe' (identity) and 'gleiche' (equality). In other languages, this distinction is not as explicit.
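The difference between 'equal' and 'same' can be seen with plain Python lists. This is only an illustrative sketch; the variable names are made up for the example.

```python
a = [1, 2, 3]
b = [1, 2, 3]   # a distinct list with equal contents
c = a           # a second name for the very same object

equal_ab = (a == b)   # equality: the contents match
same_ab = (a is b)    # identity: two separate objects
same_ac = (a is c)    # identity: one object, two names

# A container of Booleans is never the object True itself.
flags = [True, True]
flags_is_true = (flags is True)

print(equal_ab, same_ab, same_ac, flags_is_true)
```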
Thus, in Python, you can compare a (reference to an) object to (the special object) None with is: if obj is None: ... or even check that two variables ('names' in Python terminology) point to the exact same object with if a is b. But this condition holding is a much stronger assertion than just comparing for equality with a == b. In fact, the result of evaluating the expression a == b might be anything, not just a single Boolean value. It all depends on what class a belongs to, that is, what its type is. In your context, a == b actually yields a Boolean Series, provided both a and b are Pandas Series.
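To illustrate how the result of == depends on the class of the left operand, here is a toy sketch. The Column class is hypothetical and is not how Pandas implements Series; it only mimics the element-wise behaviour.

```python
class Column:
    """Hypothetical toy container, not a Pandas class."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        # Return one Boolean per element instead of a single Boolean.
        if isinstance(other, Column):
            return [x == y for x, y in zip(self.values, other.values)]
        return [x == other for x in self.values]


col = Column([False, True, True])

elementwise = (col == True)   # calls Column.__eq__, gives a list of Booleans
identity = (col is True)      # cannot be overridden, always a single Boolean

print(elementwise)
print(identity)
```

The is operator cannot be customized by a class, which is why it always collapses to one Boolean, while == returns whatever the class's __eq__ method decides.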
By the way, if you want to check that all values agree between two Series a and b, then you should evaluate (a == b).all(), which reduces the whole Series to a single Boolean value that is True if and only if a[i] == b[i] for every value of i.
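A small sketch of that reduction, using made-up Series with the same default index:

```python
import pandas as pd

a = pd.Series([1, 2, 3])
b = pd.Series([1, 2, 3])
c = pd.Series([1, 2, 4])

elementwise = a == c                   # Boolean Series, one entry per position
all_equal_ab = bool((a == b).all())    # every position agrees
all_equal_ac = bool((a == c).all())    # the last position differs

print(elementwise.tolist())
print(all_equal_ab, all_equal_ac)
```

Note that == between two Series expects them to carry identical index labels, and that missing values never compare equal to each other, so data with NaN entries may need separate handling.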