Create Pandas DataFrames from Unique Values in one Column

I have a Pandas dataframe with thousands of rows. It has a names column that holds the customer names alongside their records. I want to create an individual dataframe for each customer, based on the unique names. I got the unique names into a list:

customerNames = DataFrame['customer name'].unique().tolist()

This gives the following list:

['Name1', 'Name2', 'Name3', 'Name4']

I tried looping over the unique names in the list above, creating a dataframe for each name and assigning it to a variable named after the customer. For example, when I write Name3, I should get Name3's data as a separate dataframe.

for x in customerNames:
    x = DataFrame.loc[DataFrame['customer name'] == x]
x

The lines above returned only the dataframe for Name4 and skipped the rest.

How can I solve this problem?

Tags: python pandas

3 Answers

戒情不戒烟

#2 · 2020-02-09 23:30

To create a dataframe for all the unique values in a column, create a dict of dataframes, as follows.

- This creates a dict in which each key is a unique value from the chosen column and each value is a dataframe.
- Access each dataframe as you would any dict entry (e.g. df_names['Name1']).
- .groupby() returns an iterable of (key, group) pairs, which can be unpacked.
- k is a unique value in the column, and v is the dataframe of rows associated with that k.

With a `for-loop` and `.groupby`:

df_names = dict()
for k, v in df.groupby('customer name'):
    df_names[k] = v

With a Python Dictionary Comprehension

PEP 274 -- Dict Comprehensions

Using `.groupby`

df_names = {k: v for (k, v) in df.groupby('customer name')}
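
As a minimal, self-contained sketch of the dict-of-dataframes idea, the snippet below runs the comprehension on a small made-up frame. The sample values and the `amount` column are assumptions for illustration only, not the asker's data.

```python
import pandas as pd

# Hypothetical sample data; only the 'customer name' column mirrors the question.
df = pd.DataFrame({
    'customer name': ['Name1', 'Name2', 'Name1', 'Name3', 'Name2'],
    'amount': [10, 20, 30, 40, 50],
})

# One dataframe per unique customer name, keyed by that name.
df_names = {k: v for k, v in df.groupby('customer name')}

# Each value is an ordinary dataframe that keeps the original row index.
name1_df = df_names['Name1']
print(sorted(df_names))
print(name1_df)
```

If a fresh 0-based index is wanted for each customer, calling `.reset_index(drop=True)` on each value should do it.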

This comes from a conversation with rafaelc, who pointed out that using .groupby is faster than .unique.
- With 6 unique values in the column, .groupby is faster, at 104 ms compared to 392 ms
- With 26 unique values in the column, .groupby is faster, at 147 ms compared to 1.53 s.
Using a for-loop is slightly faster than a comprehension, particularly for more unique column values or lots of rows (e.g. 10M).
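
A rough way to check such timings on your own machine is sketched below with `timeit`. The row count (100,000 instead of 1,000,000), the helper names `with_groupby` and `with_unique`, and the repeat count are assumptions chosen so the sketch runs quickly. Absolute numbers will differ by machine and pandas version.

```python
import random
import timeit

import pandas as pd

random.seed(365)

# Smaller than the 1,000,000 rows used for the timings quoted above.
n_rows = 100_000
labels = ['1-5', '6-25', '26-100', '100-500', '500-1000', '>1000']
df = pd.DataFrame({
    'class': [random.choice(labels) for _ in range(n_rows)],
    'treatment': [random.choice(['Yes', 'No']) for _ in range(n_rows)],
})


def with_groupby(frame):
    # groupby works out the group membership of all rows in one go.
    return {k: v for k, v in frame.groupby('class')}


def with_unique(frame):
    # One Boolean comparison over the whole column per unique value.
    return {name: frame[frame['class'] == name] for name in frame['class'].unique()}


t_groupby = timeit.timeit(lambda: with_groupby(df), number=5)
t_unique = timeit.timeit(lambda: with_unique(df), number=5)
print(f'groupby: {t_groupby:.3f} s, unique: {t_unique:.3f} s (5 runs each)')
```

Both helpers should return the same dataframes under the same keys, so the comparison is only about speed.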

Using `.unique`:

Use Boolean indexing to match the unique values in the column of choice.

df_names = {name: df[df['customer name'] == name] for name in df['customer name'].unique()}

Testing

The following data was used for testing

import pandas as pd
import string
import random

random.seed(365)

# 6 unique values
data = {'class': [random.choice(['1-5', '6-25', '26-100', '100-500', '500-1000', '>1000']) for _ in range(1000000)],
        'treatment': [random.choice(['Yes', 'No']) for _ in range(1000000)]}

# 26 unique values
data = {'class': [random.choice( list(string.ascii_lowercase)) for _ in range(1000000)],
        'treatment': [random.choice(['Yes', 'No']) for _ in range(1000000)]}

df = pd.DataFrame(data)


太酷不给撩

#3 · 2020-02-09 23:37

Maybe I misunderstood you, but when

for x in customerNames:
    x = DataFrame.loc[DataFrame['customer name'] == x]
x

gives you the right output for only the last list entry, it's because your output statement sits outside the indented body of the loop.

import pandas as pd

# DataFrame.from_items was removed in pandas 1.0; from_dict builds the same frame
customer_df = pd.DataFrame.from_dict({'A': ['Jean', 'France'], 'B': ['James', 'USA']},
                        orient='index', columns=['customer', 'country'])

customer_list = ['James', 'Jean']

for x in customer_list:
    x = customer_df.loc[customer_df['customer'] == x]
    print(x)
    print('now I could append the data to something new')

you get the output:

  customer country
B    James     USA
now I could append the data to something new
  customer country
A     Jean  France
now I could append the data to something new

Or, if you don't like loops, you could go with

import pandas as pd

# DataFrame.from_items was removed in pandas 1.0; from_dict builds the same frame
customer_df = pd.DataFrame.from_dict({'A': ['Jean', 'France'], 'B': ['James', 'USA'], 'C': ['Hans', 'Germany']},
                        orient='index', columns=['customer', 'country'])

customer_list = ['James', 'Jean']


print(customer_df[customer_df['customer'].isin(customer_list)])

Output:

  customer country
A     Jean  France
B    James     USA

df.isin is explained in more detail under: How to implement 'in' and 'not in' for Pandas dataframe
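
As a small sketch of both the 'in' and 'not in' cases, the same Boolean mask can be used directly and negated with `~`. The frame below is hypothetical sample data in the same shape as the example above.

```python
import pandas as pd

# Hypothetical sample data in the same shape as the example above.
customer_df = pd.DataFrame(
    {'customer': ['Jean', 'James', 'Hans'],
     'country': ['France', 'USA', 'Germany']},
    index=['A', 'B', 'C'],
)
customer_list = ['James', 'Jean']

# True for rows whose customer appears in customer_list.
mask = customer_df['customer'].isin(customer_list)

in_list = customer_df[mask]        # 'in': customers that are in the list
not_in_list = customer_df[~mask]   # 'not in': everyone else

print(in_list)
print(not_in_list)
```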


再贱就再见

#4 · 2020-02-09 23:46

Your current iteration overwrites x twice every time it runs: the for loop assigns a customer name to x, and then you assign a dataframe to it.

To be able to call each dataframe later by name, try storing them in a dictionary:

df_dict = {name: df.loc[df['customer name'] == name] for name in customerNames}

df_dict['Name3']
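
To see both behaviours side by side, here is a minimal sketch on made-up data. The `value` column and the `last_only` variable are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical sample data; 'value' is a made-up column.
df = pd.DataFrame({
    'customer name': ['Name1', 'Name2', 'Name3', 'Name4', 'Name3'],
    'value': [1, 2, 3, 4, 5],
})
customerNames = df['customer name'].unique().tolist()

# The original pattern: x is rebound on every pass, so only the last frame survives.
for x in customerNames:
    x = df.loc[df['customer name'] == x]
last_only = x

# Storing each frame under its name keeps all of them.
df_dict = {name: df.loc[df['customer name'] == name] for name in customerNames}

print(last_only['customer name'].unique().tolist())
print(df_dict['Name3'])
```

After the loop, `last_only` holds only the rows for the final name in the list, while `df_dict` still gives access to every customer by name.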
