Pandas create Foreign ID column based on Name column

Posted 2019-07-21 03:04

Question:

I have a simple dataframe like this for example:

df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe', 'Jane Smith','Jack Dawson','John Doe']})
df:
        Name
    0   John Doe
    1   Jane Smith
    2   John Doe
    3   Jane Smith
    4   Jack Dawson
    5   John Doe

I want to add a column ['foreign_key'] that assigns a unique ID to each unique name (rows with the same name should share the same 'foreign_key'). The final output should look like:

df:
            Name        Foreign_Key
        0   John Doe    foreignkey1
        1   Jane Smith  foreignkey2
        2   John Doe    foreignkey1
        3   Jane Smith  foreignkey2
        4   Jack Dawson foreignkey3
        5   John Doe    foreignkey1

I'm trying to use groupby and then apply a custom function. So my first step is:

name_groupby = df.groupby('Name')

So that's the splitting, and next comes the apply and combine. There doesn't appear to be anything in the docs like this example and I'm unsure where to go from here.

The custom function I started to apply looks like this:

def make_foreign_key(groupby_df):
    return groupby_df['Foreign_Key'] = 'foreign_key' + num

Any help is greatly appreciated!

Answer 1:

You can build a lookup table of the unique names, use its index as the key, and merge it back onto df:

pd.merge(
    df,
    pd.DataFrame(df.Name.unique(), columns=['Name']).reset_index().rename(columns={'index': 'Foreign_Key'}),
    on='Name'
)

         Name  Foreign_Key
0    John Doe            0
1    John Doe            0
2  Jane Smith            1
3  Jane Smith            1
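
The output shown above covers only four rows. As a rough sketch of the same merge idea on the six-row frame from the question, something like the following should work. The name `lookup` and the `how='left'` argument are assumptions added here (a left join keeps the rows in the original order of df); they are not part of the answer's code.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe',
                            'Jane Smith', 'Jack Dawson', 'John Doe']})

# Lookup table: one row per unique name, numbered in order of first appearance.
# `lookup` is a hypothetical name used only for this illustration.
lookup = (
    pd.DataFrame(df['Name'].unique(), columns=['Name'])
    .reset_index()
    .rename(columns={'index': 'Foreign_Key'})
)

# how='left' is an assumption added here so the row order of df is preserved.
merged = pd.merge(df, lookup, on='Name', how='left')
print(merged)
```

Here every 'John Doe' row should get key 0, 'Jane Smith' key 1, and 'Jack Dawson' key 2.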


Answer 2:

You can make Name into a Categorical with much the same effect:

In [21]: df["Name"].astype('category')
Out[21]:
0       John Doe
1     Jane Smith
2       John Doe
3     Jane Smith
4    Jack Dawson
5       John Doe
Name: Name, dtype: category
Categories (3, object): [Jack Dawson, Jane Smith, John Doe]

See the categorical section of the docs.

That may suffice, or you can pull out the codes and use them as a foreign key.

In [22]: df["Name"] = df["Name"].astype('category')

In [23]: df["Name"].cat.codes
Out[23]:
0    2
1    1
2    2
3    1
4    0
5    2
dtype: int8

In [24]: df["Foreign_Key"] = df["Name"].cat.codes

In [25]: df
Out[25]:
          Name  Foreign_Key
0     John Doe            2
1   Jane Smith            1
2     John Doe            2
3   Jane Smith            1
4  Jack Dawson            0
5     John Doe            2
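
Note that the category codes follow the sorted category order, so 'John Doe' ends up with code 2 rather than the first key. If the keys should instead follow order of first appearance and carry a text prefix, as in the question's expected output, one possible sketch uses `pd.factorize`. The 'foreignkey' prefix and the 1-based numbering are taken from the question; using `pd.factorize` here is a swapped-in alternative, not part of this answer.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe',
                            'Jane Smith', 'Jack Dawson', 'John Doe']})

# factorize returns integer codes numbered in order of first appearance,
# plus the array of unique values those codes refer to.
codes, uniques = pd.factorize(df['Name'])

# Build 1-based string labels such as 'foreignkey1' (format assumed from the question).
df['Foreign_Key'] = ['foreignkey' + str(code + 1) for code in codes]
print(df)
```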


Answer 3:

I faced the same problem a short time ago, and my solution looked like this:

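import pandas as pd
import numpy as np

# Unique names, in order of first appearance
values = df['Name'].unique()
# Lookup Series: index is the name, value is its integer ID
values = pd.Series(np.arange(len(values)), values)
# Map each row's name to its ID
df['new_column'] = df['Name'].map(values)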

The output:

          Name  new_column
0     John Doe           0
1   Jane Smith           1
2     John Doe           0
3   Jane Smith           1
4  Jack Dawson           2
5     John Doe           0
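
Since the question started from `df.groupby('Name')`, it may be worth noting that the same numbering can probably be obtained from the groupby object directly. This is a minimal sketch, assuming a pandas version that provides `GroupBy.ngroup`; the column name 'Foreign_Key' is taken from the question.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe',
                            'Jane Smith', 'Jack Dawson', 'John Doe']})

# ngroup() labels each row with the number of the group it belongs to.
# sort=False numbers the groups in order of first appearance
# instead of sorted key order.
df['Foreign_Key'] = df.groupby('Name', sort=False).ngroup()
print(df)
```

This avoids writing a custom function for the apply step, since the group number itself serves as the key.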