Several dictionary with duplicate key but differen

2020-07-20 04:56发布

Here dataset with unlimited key in dictionary. The detail column in row may have different information products depending on customer.

ID  Name    Detail
1   Sara    [{"Personal":{"ID":"001","Name":"Sara","Type":"01","TypeName":"Book"},"Order":[{"ID":"0001","Date":"20200222","ProductID":"C0123","ProductName":"ABC", "Price":"4"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0004","Date":"20200222","ProductID":"D0123","ProductName":"Small beef", "Price":"15"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0005","Date":"20200222","ProductID":"D0200","ProductName":"Shrimp", "Price":"28"}]}]
2   Frank   [{"Personal":{"ID":"002","Name":"Frank","Type":"02","TypeName":"Food"},"Order":[{"ID":"0008","Date":"20200228","ProductID":"D0288","ProductName":"Salmon", "Price":"24"}]}]

My expected output is

ID Name Personal_ID Personal_Name Personal_Type Personal_TypeName Personal_Order_ID Personal_Order_Date Personal_Order_ProductID Personal_Order_ProductName Personal_Order_Price    
1  Sara 001         Sara          01            Book              0001              20200222            C0123                    ABC                          4    
2  Sara 001         Sara          02            Food              0004              20200222            D0123                    Small beef                   15
3  Sara 001         Sara          02            Food              0005              20200222            D0200                    Shrimp                       28
4  Frank 002        Frank         02            Food              0008              20200228            D0288                    Salmon                       24

标签: python pandas
3条回答
时光不老,我们不散
2楼-- · 2020-07-20 05:17

You can use explode to get all elements of lists in Details separatly, and then you can use Shubham Sharma's answer,

import io
import pandas as pd


#Creating dataframe:
s_e='''
ID    Name
1   Sara    
2   Frank    
'''

df = pd.read_csv(io.StringIO(s_e), sep='\s\s+', engine='python')
df['Detail']=[[{"Personal":{"ID":"001","Name":"Sara","Type":"01","TypeName":"Book"},"Order":[{"ID":"0001","Date":"20200222","ProductID":"C0123","ProductName":"ABC", "Price":"4"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0004","Date":"20200222","ProductID":"D0123","ProductName":"Small beef", "Price":"15"}]},{"Personal":{"ID":"001","Name":"Sara","Type":"02","TypeName":"Food"},"Order":[{"ID":"0005","Date":"20200222","ProductID":"D0200","ProductName":"Shrimp", "Price":"28"}]}],[{"Personal":{"ID":"002","Name":"Frank","Type":"02","TypeName":"Food"},"Order":[{"ID":"0008","Date":"20200228","ProductID":"D0288","ProductName":"Salmon", "Price":"24"}]}]]

#using explode
df = df.explode('Detail').reset_index()
df['Detail']=df['Detail'].apply(lambda x: [x])
print('using explode:', df)

#retrieved from @Shubham Sharma's answer:
personal = df['Detail'].str[0].str.get('Personal').apply(pd.Series).add_prefix('Personal_')

order = df['Detail'].str[0].str.get('Order').str[0].apply(pd.Series).add_prefix('Personal_Order_')

result = pd.concat([df[['ID', "Name"]], personal, order], axis=1)

#reset ID
result['ID']=[i+1 for i in range(len(result.index))]
print(result)

Output:

#Using explode:
    index  ID   Name                                                                                               Detail
0      0   1   Sara  [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '01', 'TypeName': 'Book'}, 'Order': [{'ID': '0001', 'Date': '20200222', 'ProductID': 'C0123', 'ProductName': 'ABC', 'Price': '4'}]}]
1      0   1   Sara  [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0004', 'Date': '20200222', 'ProductID': 'D0123', 'ProductName': 'Small beef', 'Price': '15'}]}]
2      0   1   Sara  [{'Personal': {'ID': '001', 'Name': 'Sara', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0005', 'Date': '20200222', 'ProductID': 'D0200', 'ProductName': 'Shrimp', 'Price': '28'}]}]
3      1   2  Frank  [{'Personal': {'ID': '002', 'Name': 'Frank', 'Type': '02', 'TypeName': 'Food'}, 'Order': [{'ID': '0008', 'Date': '20200228', 'ProductID': 'D0288', 'ProductName': 'Salmon', 'Price': '24'}]}]




#result:
   ID Name Personal_ID Personal_Name Personal_Type Personal_TypeName Personal_Order_ID Personal_Order_Date Personal_Order_ProductID Personal_Order_ProductName Personal_Order_Price    
0   1  Sara 001         Sara          01            Book              0001              20200222            C0123                    ABC                          4    
1   2  Sara 001         Sara          02            Food              0004              20200222            D0123                    Small beef                   15
2   3  Sara 001         Sara          02            Food              0005              20200222            D0200                    Shrimp                       28
3   4  Frank 002        Frank         02            Food              0008              20200228            D0288                    Salmon                       24
查看更多
神经病院院长
3楼-- · 2020-07-20 05:22

First you need to create a function that processes the list of dicts in each row of Detail column. Briefly, pandas can process a list of dicts as a dataframe. So all I am doing here is processing the list of dicts in each row of Personal and Detail column, to get mapped dataframes which can be merged for each entry. This function when applied :

def processdicts(x):
    personal=pd.DataFrame.from_dict(list(pd.DataFrame.from_dict(x)['Personal']))
    personal=personal.rename(columns={"ID": "Personal_ID"})
    personal['Personal_Name']=personal['Name']
    orders=pd.DataFrame(list(pd.DataFrame.from_dict(list(pd.DataFrame.from_dict(x)['Order']))[0]))
    orders=orders.rename(columns={"ID": "Order_ID"})

    personDf=orders.merge(personal, left_index=True, right_index=True)
    return personDf

Create an empty dataframe that will contain the compiled data

    outcome=pd.DataFrame(columns=[],index=[])

Now process the data for each row of the DataFrame using the function we created above. Using a simple for loop here to show the process. 'apply' function can also be called for greater efficiency but with slight modification of the concat process. With an empty dataframe at hand where we will concat the data from each row, for loop is as simple as 2 lines below:

for details in yourdataframe['Detail']:
    outcome=pd.concat([outcome,processdicts(details)])

Finally reset index:

outcome=outcome.reset_index(drop=True)

You may rename columns according to your requirement in the final dataframe. For example:

outcome=outcome.rename(columns={"TypeName": "Personal_TypeName","ProductName":"Personal_Order_ProductName","ProductID":"Personal_Order_ProductID","Price":"Personal_Order_Price","Date":"Personal_Order_Date","Order_ID":"Personal_Order_ID","Type":"Personal_Type"})

Order (or skip) the columns according to your requirement using:

outcome=outcome[['Name','Personal_ID','Personal_Name','Personal_Type','Personal_TypeName','Personal_Order_ID','Personal_Order_Date','Personal_Order_ProductID','Personal_Order_ProductName','Personal_Order_Price']]

Assign a name to the index of the dataframe:

outcome.index.name='ID'

This should help.

查看更多
男人必须洒脱
4楼-- · 2020-07-20 05:25

So basically you have a nested JSON in your detail column that you need to break out into a df then merge with your original.

import pandas as pd
import json
from pandas import json_normalize

#create empty df to hold the detail information
detailDf = pd.DataFrame()
#We will need to loop over each row to read each JSON
for ind, row in df.iterrows():
    #Read the json, make it a DF, then append the information to the empty DF
    detailDf = detailDf.append(json_normalize(json.loads(row['Detail']), record_path = ('Order'), meta = [['Personal','ID'], ['Personal','Name'], ['Personal','Type'],['Personal','TypeName']]))

# Personally, you don't really need the rest of the code, as the columns Personal.Name
# and Personal.ID is the same information, but none the less.

# You will have to merge on name and ID
df = df.merge(detailDf, how = 'right', left_on = [df['Name'], df['ID']], right_on = [detailDf['Personal.Name'], detailDf['Personal.ID'].astype(int)])

#Clean up
df.rename(columns = {'ID_x':'ID', 'ID_y':'Personal_Order_ID'}, inplace = True)
df.drop(columns = {'Detail', 'key_1', 'key_0'}, inplace = True)

If you look through my comments, I recommend using detailDf as your final df as the merge really isnt necessary and that information is already in the Detail JSON.

查看更多
登录 后发表回答