Update a value in one column if a string appears in another column

Posted 2019-07-30 10:44

Question:

  id name             gender
0 13 John Smith       0
1 46 Jim Jeffries     2
2 75 Jennifer Johnson 0
3 37 Sam Adams        0
4 24 John Cleese      0
5 17 Taika Waititi    0

I have many people's names and genders in a DataFrame, taken from a film actors' database. Each gender is coded as 1 (female), 2 (male), or 0 (not listed). I'd like to comb through the rows and callously assume genders by first name. The names would be stored in lists that I fill out manually. If I spot somebody with a gender-nonspecific name and find out myself whether they are male or female, I'd like to inject that as well, by ID:

m_names = ['John', ...]
f_names = ['Jennifer', ...]
m_ids   = ['37', ...]
f_ids   = ['', ...]

I've got fine control of for loops and np.where, but I can't figure out how to get through this df, row by row.

Using the lists above, what I want would look something like this:

for index, row in df.iterrows():
  if row['gender'] == 0:
    if   row['name'].str.contains(' |'.join(f_names)) or row['id'].str.contains('|'.join(f_ids)):
      return 1
    elif row['name'].str.contains(' |'.join(m_names)) or row['id'].str.contains('|'.join(m_ids)):
      return 2
print(df)

  id name             gender
0 13 John Smith       2
1 46 Jim Jeffries     2
2 75 Jennifer Johnson 1
3 37 Sam Adams        2
4 24 John Cleese      2
5 17 Taika Waititi    0

Note the space before '|' in the condition for names, to avoid grabbing any parts of last names.

At this point, I'm running into a wall with how I've formatted my if statements. Python doesn't like my formatting, and says my 'return's are 'outside function'. If I change these to

row['gender'] = #

I run into issues with unicode and my usage of 'str' and 'contains'.
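
A note on the two errors described above: inside iterrows(), row['name'] is a plain Python string rather than a Series, so it has no .str accessor, and assigning to row['gender'] changes a copy of the row, not df. The following is only a minimal row-by-row sketch. It assumes the ids are stored as strings, that only the first word of each name should be compared, and that the short example lists stand in for the real ones.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['13', '46', '75', '37', '24', '17'],
    'name': ['John Smith', 'Jim Jeffries', 'Jennifer Johnson',
             'Sam Adams', 'John Cleese', 'Taika Waititi'],
    'gender': [0, 2, 0, 0, 0, 0],
})

# Example lists only (assumed); the real ones would be filled in by hand.
m_names, f_names = ['John'], ['Jennifer']
m_ids, f_ids = ['37'], []

for index, row in df.iterrows():
    if row['gender'] != 0:
        continue  # keep genders that are already listed
    first_name = row['name'].split()[0]  # plain str method, no .str accessor
    if first_name in f_names or row['id'] in f_ids:
        df.at[index, 'gender'] = 1  # write to df itself, not to the row copy
    elif first_name in m_names or row['id'] in m_ids:
        df.at[index, 'gender'] = 2

print(df)
```

With this example data the gender column ends up as 2, 2, 1, 2, 2, 0, which matches the desired output shown in the question. The loop is mainly useful for understanding the error; the vectorised answers below avoid it.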

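Answer 1:

It seems you need np.select rather than a for loop. Note that this version rewrites every row, including genders that were already listed, and gives 3 to any name that matches neither list: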

df['gender'] = np.select([df.name.str.contains(" |".join(m_names)),
                          df.name.str.contains(" |".join(f_names))],
                         [2, 1], 
                         default=3)
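
If the genders that are already listed should be kept, one possible variation is to compute the guesses with np.select and then fill in only the rows that are still 0. This is a sketch, not the answerer's code: the example data, the shortened name lists, and the anchored patterns m_pat and f_pat are all assumptions made for illustration.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['John Smith', 'Jim Jeffries', 'Jennifer Johnson',
             'Sam Adams', 'John Cleese', 'Taika Waititi'],
    'gender': [0, 2, 0, 0, 0, 0],
})
m_names, f_names = ['John'], ['Jennifer']  # example lists (assumed)

# Anchor each pattern at the start of the string and require whitespace
# after the first name, so 'John' cannot match inside 'Johnson'.
m_pat = r'^(?:' + '|'.join(map(re.escape, m_names)) + r')\s'
f_pat = r'^(?:' + '|'.join(map(re.escape, f_names)) + r')\s'

guessed = np.select(
    [df['name'].str.contains(m_pat), df['name'].str.contains(f_pat)],
    [2, 1],
    default=0,  # no guess: stays "not listed"
)

# Only fill rows whose gender is not listed yet.
df['gender'] = np.where(df['gender'].eq(0), guessed, df['gender'])
print(df)
```

Sam Adams stays at 0 here because this sketch matches on names only and does not use the id lists.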


Answer 2:

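You could use the pandas method Series.isin:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isin.html

isin tests whole values for membership, so compare the first word of each name rather than the full name:

df.loc[df['name'].str.split().str[0].isin(m_names), 'gender'] = 2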
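
Because isin compares whole values, a full name such as 'John Smith' is not "in" ['John']. The small sketch below shows the difference between testing the full name and testing only its first word; the three names and the one-element list are example data assumed for illustration.

```python
import pandas as pd

names = pd.Series(['John Smith', 'Jennifer Johnson', 'John Cleese'])
m_names = ['John']  # example list (assumed)

whole = names.isin(m_names)                     # compares the full string
first = names.str.split().str[0].isin(m_names)  # compares the first word only

print(whole.tolist())  # [False, False, False]
print(first.tolist())  # [True, False, True]
```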


Answer 3:

You can first construct and combine Boolean masks. For example:

m_zero = df['gender'].eq(0)

m_name_female = df['name'].str.contains(' |'.join(f_names))
m_name_male = df['name'].str.contains(' |'.join(m_names))

m_id_female = df['id'].str.contains('|'.join(f_ids))
m_id_male = df['id'].str.contains('|'.join(m_ids))

female_mask = m_zero & (m_name_female | m_id_female)
male_mask = m_zero & (m_name_male | m_id_male)

Then apply logic via pd.DataFrame.loc:

df.loc[female_mask, 'gender'] = 1
df.loc[male_mask, 'gender'] = 2

Or use nested numpy.where:

df['gender'] = np.where(female_mask, 1, np.where(male_mask, 2, df['gender']))

Or, if you wish to supply a scalar default value, use numpy.select:

df['gender'] = np.select([female_mask, male_mask], [1, 2], 3)
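
Two caveats about the patterns may be worth checking. ' |'.join(...) leaves the last name in each list without a trailing space, so if 'John' is the last entry of m_names it also matches 'Jennifer Johnson', and the second .loc assignment would then overwrite her 1 with 2. Similarly, a list of ids that contains '' produces a pattern that matches every id. The sketch below shows one way to build stricter masks. It assumes string ids, example lists, and exact matching on the first word of the name; none of this comes from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['13', '46', '75', '37', '24', '17'],
    'name': ['John Smith', 'Jim Jeffries', 'Jennifer Johnson',
             'Sam Adams', 'John Cleese', 'Taika Waititi'],
    'gender': [0, 2, 0, 0, 0, 0],
})

# Example lists only (assumed); an empty list simply matches nothing.
m_names, f_names = ['John'], ['Jennifer']
m_ids, f_ids = ['37'], []

first_name = df['name'].str.split().str[0]  # first word of each name
m_zero = df['gender'].eq(0)                 # rows that are not listed yet

# isin does exact matching, so '37' cannot match an id such as '137'.
female_mask = m_zero & (first_name.isin(f_names) | df['id'].isin(f_ids))
# Exclude rows already claimed by female_mask so the masks never overlap.
male_mask = (m_zero & ~female_mask
             & (first_name.isin(m_names) | df['id'].isin(m_ids)))

df.loc[female_mask, 'gender'] = 1
df.loc[male_mask, 'gender'] = 2
print(df)
```

With this example data the result is 2, 2, 1, 2, 2, 0, the same as the desired output in the question. Making the masks mutually exclusive means the order of the two .loc assignments no longer matters.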