I have written some code that replaces values in a DataFrame with values from another frame using a dictionary, and it is working, but i am using this on some large files, where the dictionary can get very long. A few thousand pairs. When I then uses this code it runs very slow, and it have also been going out of memory on a few ocations.
I am somewhat convinced that my method of doing this is far from optimal, and that there must be some faster ways to do this. I have created a simple example that does what I want, but that is slow for large amounts of data. Hope someone have a simpler way to do this.
import pandas as pd
#Frame with data where I want to replace the 'id' with the name from df2
df1 = pd.DataFrame({'id' : [1, 2, 3, 4, 5, 3, 5, 9], 'values' : [12, 32, 42, 51, 23, 14, 111, 134]})
#Frame containing names linked to ids
df2 = pd.DataFrame({'id' : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'name' : ['id1', 'id2', 'id3', 'id4', 'id5', 'id6', 'id7', 'id8', 'id9', 'id10']})
#My current "slow" way of doing this.
#Starts by creating a dictionary from df2
#Need to create dictionaries from the domain and banners tables to link ids
df2_dict = dict(zip(df2['id'], df2['name']))
#and then uses the dict to replace the ids with name in df1
df1.replace({'id' : df2_dict}, inplace=True)
I think you can use
map
withSeries
convertedto_dict
- getNaN
if not exist value indf2
:Or
replace
, if dont exist value indf2
let original values fromdf1
:Sample: