   id              name  gender
0  13        John Smith       0
1  46      Jim Jeffries       2
2  75  Jennifer Johnson       0
3  37         Sam Adams       0
4  24       John Cleese       0
5  17     Taika Waititi       0
I have a lot of people's names and genders in a df, taken from a film actors' db. Genders were assigned a 1 (female), 2 (male), or 0 (not listed). I'd like to comb through and callously assume genders by name. Names would be stored in lists that I fill out manually. If I spot somebody with a gender-nonspecific name and find out myself whether they are male or female, I'd like to inject that by ID as well:
m_names = ['John', ...]
f_names = ['Jennifer', ...]
m_ids = ['37', ...]
f_ids = ['', ...]
I've got fine control of for loops and np.where, but I can't figure out how to get through this df, row by row.
If what's above were to be used, what I want to return would look like:
for index, row in df.iterrows():
    if row['gender'] == 0:
        if row['name'].str.contains(' |'.join(f_names)) or row['id'].str.contains('|'.join(f_ids)):
            return 1
        elif row['name'].str.contains(' |'.join(m_names)) or row['id'].str.contains('|'.join(m_ids)):
            return 2
print(df)
   id              name  gender
0  13        John Smith       2
1  46      Jim Jeffries       2
2  75  Jennifer Johnson       1
3  37         Sam Adams       2
4  24       John Cleese       2
5  17     Taika Waititi       0
Note the space before '|' in the condition for names, to avoid grabbing any parts of last names.
At this point, I'm running into a wall with how I've formatted my if statements. Python doesn't like my formatting, and says my 'return's are 'outside function'. If I change these to
row['gender'] = #
I run into issues with unicode and my usage of 'str' and 'contains'.
Seems like you need np.select and no for loops.

You can first construct and combine Boolean masks. For example:
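A minimal sketch of such masks, rebuilt from the sample data in the question. It assumes the ids are stored as integers, and the list contents and the helper `first_name_pattern` are hypothetical. The pattern is anchored at the start of the name and requires a trailing space, so only whole first names match:

```python
import re
import pandas as pd

df = pd.DataFrame({
    'id': [13, 46, 75, 37, 24, 17],
    'name': ['John Smith', 'Jim Jeffries', 'Jennifer Johnson',
             'Sam Adams', 'John Cleese', 'Taika Waititi'],
    'gender': [0, 2, 0, 0, 0, 0],
})

# Hypothetical hand-filled lists; ids are assumed to be ints here.
m_names = ['John']
f_names = ['Jennifer']
m_ids = [37]
f_ids = []

def first_name_pattern(names):
    # Match any of the names at the start of the string, followed by a space.
    return '^(?:' + '|'.join(map(re.escape, names)) + ') '

# Only rows whose gender is not listed (0) are candidates for an update.
unknown = df['gender'].eq(0)

# Element-wise | and & combine the conditions; Python's `or`/`and` would not work on Series.
m_mask = unknown & (df['name'].str.contains(first_name_pattern(m_names)) | df['id'].isin(m_ids))
f_mask = unknown & (df['name'].str.contains(first_name_pattern(f_names)) | df['id'].isin(f_ids))
```

Each mask is a Boolean Series aligned with df, so it can be passed directly to the indexing and selection calls that follow.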
Then apply logic via pd.DataFrame.loc:
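As a sketch, assuming Boolean masks named `m_mask` and `f_mask` (hypothetical names, built here for the sample data with integer ids), the assignment might look like this:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [13, 46, 75, 37, 24, 17],
    'name': ['John Smith', 'Jim Jeffries', 'Jennifer Johnson',
             'Sam Adams', 'John Cleese', 'Taika Waititi'],
    'gender': [0, 2, 0, 0, 0, 0],
})

# Hypothetical masks for the sample data, restricted to rows with gender 0.
unknown = df['gender'].eq(0)
m_mask = unknown & (df['name'].str.contains('^(?:John) ') | df['id'].isin([37]))
f_mask = unknown & (df['name'].str.contains('^(?:Jennifer) ') | df['id'].isin([]))

# Assign in place: only rows where the mask is True are touched.
df.loc[m_mask, 'gender'] = 2
df.loc[f_mask, 'gender'] = 1
```

Rows matched by neither mask keep whatever value they already had.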
Or use nested numpy.where:
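A sketch of the nested form, again with hypothetical masks `m_mask` and `f_mask` for the sample data. The innermost fallback is the existing column, so rows matched by neither mask are left unchanged:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [13, 46, 75, 37, 24, 17],
    'name': ['John Smith', 'Jim Jeffries', 'Jennifer Johnson',
             'Sam Adams', 'John Cleese', 'Taika Waititi'],
    'gender': [0, 2, 0, 0, 0, 0],
})

# Hypothetical masks for the sample data, restricted to rows with gender 0.
unknown = df['gender'].eq(0)
m_mask = unknown & (df['name'].str.contains('^(?:John) ') | df['id'].isin([37]))
f_mask = unknown & (df['name'].str.contains('^(?:Jennifer) ') | df['id'].isin([]))

# Read as: 1 if f_mask, else 2 if m_mask, else keep the current value.
df['gender'] = np.where(f_mask, 1, np.where(m_mask, 2, df['gender']))
```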
Or, if you wish to supply a scalar default value, use numpy.select:
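One way this might look, with hypothetical masks `m_mask` and `f_mask` for the sample data. Because a scalar default replaces every row that matches no condition, this sketch also folds the already-known genders into the conditions so they are not reset to 0:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [13, 46, 75, 37, 24, 17],
    'name': ['John Smith', 'Jim Jeffries', 'Jennifer Johnson',
             'Sam Adams', 'John Cleese', 'Taika Waititi'],
    'gender': [0, 2, 0, 0, 0, 0],
})

# Hypothetical masks for the sample data, restricted to rows with gender 0.
unknown = df['gender'].eq(0)
m_mask = unknown & (df['name'].str.contains('^(?:John) ') | df['id'].isin([37]))
f_mask = unknown & (df['name'].str.contains('^(?:Jennifer) ') | df['id'].isin([]))

# Conditions are checked in order; the first True one picks the choice.
conditions = [df['gender'].eq(1) | f_mask, df['gender'].eq(2) | m_mask]
choices = [1, 2]

# Rows matching no condition get the scalar default 0 ("not listed").
df['gender'] = np.select(conditions, choices, default=0)
```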
You could use the Pandas function isin:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isin.html
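A sketch of how isin could cover both the names and the ids, assuming integer ids and that the first whitespace-separated token of each name is the first name (the list contents are hypothetical). Exact membership tests also avoid a pitfall of joined regex patterns: an empty string in a list, as in f_ids = [''], yields an empty pattern that matches every row.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [13, 46, 75, 37, 24, 17],
    'name': ['John Smith', 'Jim Jeffries', 'Jennifer Johnson',
             'Sam Adams', 'John Cleese', 'Taika Waititi'],
    'gender': [0, 2, 0, 0, 0, 0],
})

# Hypothetical hand-filled lists; ids are assumed to be ints here.
m_names = ['John']
f_names = ['Jennifer']
m_ids = [37]
f_ids = []

# First token of each name, compared for exact membership (no regex involved).
first = df['name'].str.split().str[0]
unknown = df['gender'].eq(0)

is_m = first.isin(m_names) | df['id'].isin(m_ids)
is_f = first.isin(f_names) | df['id'].isin(f_ids)

df.loc[unknown & is_m, 'gender'] = 2
df.loc[unknown & is_f, 'gender'] = 1
```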