How do you search a column and fill another column

2020-07-10 11:11发布

问题:

I have a big pandas Dataframe with fictional persondata. The below is a small example - each person is defined by a number.

import pandas as pd
import numpy as np
df = pd.DataFrame({ 'Number':["5569", "3385", "9832", "6457", "5346", "5462", "9873", "2366"] , 'Gender': ['Male', 'Male', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female'], 'Children': [np.nan, "5569 6457", "5569", np.nan, "6457", "2366", "2366", np.nan]})

df
   Number  Gender   Children
0    5569    Male        NaN
1    3385    Male  5569 6457
2    9832  Female       5569
3    6457    Male        NaN
4    5346  Female       6457
5    5462  Female       2366
6    9873    Male       2366
7    2366  Female        NaN

Some of the people are the children of some of the others. Now I want to make two columns "Mother" and "Father" and fill them with the relevant numbers. I would get those by looking at the "Children" column and then adding someone as the father if they are a male and has the number of the child in "Children" and the same for females as mothers. However, some of the values are NaN and some people have multiple children (they can have more than 4 in the actual dataset).

I've been trying with .isin and similar but I simply can't get it to work.

They expected output for this example would look like this:

df = pd.DataFrame({ 'Number':["5569", "3385", "9832", "6457", "5346", "5462", "9873", "2366"] , 'Gender': ['Male', 'Male', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female'], 'Children': [np.nan, "5569 6457", "5569", np.nan, "6457", "2366", "2366", np.nan], 'Mother':[9832, np.nan, np.nan,"5346", np.nan, np.nan, np.nan, "5462"], 'Father':["3385", np.nan, np.nan, "3385", np.nan, np.nan, np.nan, "9873"]})

df
  Number  Gender   Children Mother Father
0   5569    Male        NaN   9832   3385
1   3385    Male  5569 6457    NaN    NaN
2   9832  Female       5569    NaN    NaN
3   6457    Male        NaN   5346   3385
4   5346  Female       6457    NaN    NaN
5   5462  Female       2366    NaN    NaN
6   9873    Male       2366    NaN    NaN
7   2366  Female        NaN   5462   9873

回答1:

Use

df = df.join(df.assign(Children=df['Children'].str.split(' '))
               .explode('Children')
               .assign(Children = lambda x: pd.to_numeric(x['Children'],
                                                          errors = 'coerce'))
               .pivot_table(columns='Gender',
                            index ='Children',
                            values = 'Number',
                            fill_value=0)
               .rename(columns = {'Female':'Mother','Male':'Father'}),
              on = 'Number')
print(df)
   Number  Gender   Children  Mother  Father
0    5569    Male        NaN  9832.0  3385.0
1    3385    Male  5569 6457     NaN     NaN
2    9832  Female       5569     NaN     NaN
3    6457    Male        NaN  5346.0  3385.0
4    5346  Female       6457     NaN     NaN
5    5462  Female       2366     NaN     NaN
6    9873    Male       2366     NaN     NaN
7    2366  Female        NaN  5462.0  9873.0

note that here the number of spaces between the values ​​in each cell of the children column is very important due to the use of Series.str.split



回答2:

This looks good for me (Only 2 lines :D )

Note: the string with the space => I ignored the space and made a large number

df['MotherNumber'] =  np.where(pd.notna(df['Children'].str.strip()) & (df['Gender'] == 'Female'),  float('nan'), df['Mother'])
df['FatherNumber'] =  np.where(pd.notna(df['Children'].str.strip()) & (df['Gender'] == 'Male'),  float('nan'), df['Father'])


print(df)
  Number  Gender   Children Mother Father MotherNumber FatherNumber
0   5569    Male        NaN   9832   3385         9832         3385
1   3385    Male  5569 6457    NaN    NaN          NaN          NaN
2   9832  Female       5569    NaN    NaN          NaN          NaN
3   6457    Male        NaN   5346   3385         5346         3385
4   5346  Female       6457    NaN    NaN          NaN          NaN
5   5462  Female       2366    NaN    NaN          NaN          NaN
6   9873    Male       2366    NaN    NaN          NaN          NaN
7   2366  Female        NaN   5462   9873         5462         9873


回答3:

I really like ansev's slick answer if you have access to DataFrame.explode in >0.25.0, but if you were looking for something more basic, heres a solution using Series apply. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html

def find_mother(number):
    temp = df[df['Children'].str.contains(number) & (df['Gender'] == 'Female')]
    if len(temp):
        return temp['Number'].iloc[0]
    return np.nan
def find_father(number):
    temp = df[df['Children'].str.contains(number) & (df['Gender'] == 'Male')]
    if len(temp):
        return temp['Number'].iloc[0]
    return np.nan

df['Mother'] = df['Number'].apply(find_mother)
df['Father'] = df['Number'].apply(find_father)

df
Output:
  Number  Gender   Children Mother Father
0   5569    Male        NaN   9832   3385
1   3385    Male  5569 6457    NaN    NaN
2   9832  Female       5569    NaN    NaN
3   6457    Male        NaN   5346   3385
4   5346  Female       6457    NaN    NaN
5   5462  Female       2366    NaN    NaN
6   9873    Male       2366    NaN    NaN
7   2366  Female        NaN   5462   9873