I'd like to take a dataset with a bunch of different unique individuals, each with multiple entries, and assign each individual a unique id for all of their entries. Here's an example of the df:
  FirstName  LastName  id
0       Tom     Jones   1
1       Tom     Jones   1
2     David     Smith   1
3      Alex  Thompson   1
4      Alex  Thompson   1
So, basically I want all entries for Tom Jones to have id=1, all entries for David Smith to have id=2, all entries for Alex Thompson to have id=3, and so on.
I already have one solution, which is a dead simple Python loop that tracks two values (one for the id, one for the index) and assigns each individual an id based on whether they match the previous individual:
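x = 1  # id to assign
i = 1  # row index; row 0 keeps its existing id of 1
while i < len(df_test):
    # Same person as the previous row: reuse the current id
    if ((df_test.LastName[i] == df_test.LastName[i-1]) and
            (df_test.FirstName[i] == df_test.FirstName[i-1])):
        df_test.loc[i, 'id'] = x
        i = i+1
    else:
        # New person: move on to the next id
        x = x+1
        df_test.loc[i, 'id'] = x
        i = i+1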
The problem I'm running into is that the dataframe has about 9 million entries, so that loop would take a huge amount of time to run. Can anyone think of a more efficient way to do this? I've been looking at groupby and multiindexing as potential solutions, but haven't quite found the right one yet. Thanks!
This approach uses .groupby() and .ngroup() (new in Pandas 0.20.2) to create the id column.
I checked timings and, for the small dataset in this example, Alexander's answer is faster.
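As a minimal sketch of the .groupby()/.ngroup() approach described above, applied to the example frame from the question: the `sort=False` argument and the `+ 1` offset are assumptions added here so that ids start at 1 and follow order of first appearance, as the question asks.

```python
import pandas as pd

# Example frame from the question, before ids are assigned.
df = pd.DataFrame({
    "FirstName": ["Tom", "Tom", "David", "Alex", "Alex"],
    "LastName": ["Jones", "Jones", "Smith", "Thompson", "Thompson"],
})

# ngroup() numbers each (LastName, FirstName) group starting at 0.
# sort=False (an assumption here) numbers groups in order of first
# appearance; + 1 makes the ids 1-based.
df["id"] = df.groupby(["LastName", "FirstName"], sort=False).ngroup() + 1

print(df)
```

Without `sort=False`, the groups are numbered in sorted key order instead, which still gives one id per person but not necessarily in row order.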
However, for larger dataframes, the groupby() approach appears to be faster. To create a large, representative data set, I used faker to create a dataframe of 5000 names and then concatenated the first 2000 names to this dataframe to make a dataframe with 7000 names, 2000 of which were duplicates. Running the timing on this larger data set is what showed the groupby() approach to be faster.
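A rough sketch of how such a comparison could be set up. It swaps faker for random strings from the standard library (an assumption, so the snippet has no extra dependency), and the helper names `random_name`, `with_ngroup` and `with_category_codes` are hypothetical. Absolute timings depend on the machine and the pandas version, so none are shown.

```python
import random
import timeit

import pandas as pd

random.seed(0)

def random_name(length=8):
    """Synthetic stand-in for a faker-generated name."""
    return "".join(random.choices("abcdefghijklmnopqrstuvwxyz", k=length))

# 5000 random first/last name pairs.
base = pd.DataFrame({
    "FirstName": [random_name() for _ in range(5000)],
    "LastName": [random_name() for _ in range(5000)],
})

# Append the first 2000 rows again: 7000 rows, 2000 of them duplicates.
df_big = pd.concat([base, base.iloc[:2000]], ignore_index=True)

def with_ngroup(df):
    # groupby + ngroup approach
    return df.groupby(["LastName", "FirstName"]).ngroup()

def with_category_codes(df):
    # joined-name category codes approach
    return (df["LastName"] + "_" + df["FirstName"]).astype("category").cat.codes

# Time each approach over 20 calls and report the mean per call.
for fn in (with_ngroup, with_category_codes):
    seconds = timeit.timeit(lambda: fn(df_big), number=20)
    print(f"{fn.__name__}: {seconds / 20 * 1000:.2f} ms per call")
```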
You may want to test both approaches on your data set to determine which one works best given the size of your data.
You could join the last name and first name, convert it to a category, and then get the codes.
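A minimal sketch of that idea on the example frame from the question. The `"_"` separator is an assumption; any separator that cannot occur in the names keeps different first/last combinations from collapsing into the same joined string.

```python
import pandas as pd

df = pd.DataFrame({
    "FirstName": ["Tom", "Tom", "David", "Alex", "Alex"],
    "LastName": ["Jones", "Jones", "Smith", "Thompson", "Thompson"],
})

# Join last and first name into one key, convert it to a categorical,
# and use the integer category codes as ids.
full_name = df["LastName"] + "_" + df["FirstName"]
df = df.assign(id=full_name.astype("category").cat.codes)

print(df)
```

The codes start at 0 and follow the sorted order of the joined names, not the order in which people first appear; add 1 if the ids should start at 1.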
Of course, multiple people with the same name would have the same id.
This method allows the 'id' column name to be defined with a variable. Plus, I find it a little easier to read compared to the assign or groupby methods.
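One way a variable column name could look, as a sketch only: it assumes the same joined-name category codes as the underlying technique, and `add_person_id` and `id_col` are hypothetical names introduced for illustration.

```python
import pandas as pd

def add_person_id(df, id_col):
    """Return a copy of df with a 1-based per-person id stored in column `id_col`."""
    out = df.copy()
    key = out["LastName"] + "_" + out["FirstName"]
    # Bracket assignment accepts the column name as an ordinary string variable.
    out[id_col] = key.astype("category").cat.codes + 1
    return out

df = pd.DataFrame({
    "FirstName": ["Tom", "Tom", "David", "Alex", "Alex"],
    "LastName": ["Jones", "Jones", "Smith", "Thompson", "Thompson"],
})

result = add_person_id(df, id_col="id")
print(result)
```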