I am trying to merge >=2
files with the same schema.
The files will contain duplicate entries but rows won't be identical, for example:
file1:
store_id,address,phone
9191,9827 Park st,999999999
8181,543 Hello st,1111111111
file2:
store_id,address,phone
9191,9827 Park st Apt82,999999999
7171,912 John st,87282728282
Expected output:
9191,9827 Park st Apt82,999999999
8181,543 Hello st,1111111111
7171,912 John st,87282728282
If you noticed :
9191,9827 Park st,999999999 and 9191,9827 Park st Apt82,999999999
are similar based on store_id and phone but I picked it up from file2 since the address was more descriptive.
store_id+phone_number
was my composite primary key to lookup a location and find duplicates (store_id is enough to find it in the above example but I need a key based on multiple column values)
Question:
- I need to merge multiple CSV files with same schema but with duplicate rows.
- Where the row level merge should have the logic to pick a specific value of a row based on its value. Like phone picked up from file1 and address pickedup from file2.
- A combination of 1 or many column values will define if rows are duplicate or not.
Can this be achieved using pandas?
One way to smash them together is to use merge (on store_id and number, if these are the index then this would be a join rather than a merge):
In [11]: res = df1.merge(df2, on=['store_id', 'phone'], how='outer')
In [12]: res
Out[12]:
store_id address_x phone address_y
0 9191 9827 Park st 999999999 9827 Park st Apt82
1 8181 543 Hello st 1111111111 NaN
2 7171 NaN 87282728282 912 John st
You can then use where
to select address_y
if it exists, otherwise address_x
:
In [13]: res['address'] = res.address_y.where(res.address_y, res.address_x)
In [14]: del res['address_x'], res['address_y']
In [15]: res
Out[15]:
store_id phone address
0 9191 999999999 9827 Park st Apt82
1 8181 1111111111 543 Hello st
2 7171 87282728282 912 John st
How about use concat
, groupby
, agg
, then you can write a agg function to choose the right value:
import pandas as pd
import io
t1 = """store_id,address,phone
9191,9827 Park st,999999999
8181,543 Hello st,1111111111"""
t2 = """store_id,address,phone
9191,9827 Park st Apt82,999999999
7171,912 John st,87282728282"""
df1 = pd.read_csv(io.BytesIO(t1))
df2 = pd.read_csv(io.BytesIO(t2))
df = pd.concat([df1, df2]).reset_index(drop=True)
def f(s):
loc = s.str.len().idxmax()
return s[loc]
df.groupby(["store_id", "phone"]).agg(f)