Is there a memory-efficient way to replace a list

Published 2019-07-20 20:25

Question:

I am trying to replace every unique string in a large pandas dataframe (1.5 million rows, about 15 columns) with an integer index. My problem is that the dataframe is about 2 GB, and the list of unique strings ends up with around eighty thousand entries or more.

To produce my list of unique strings I use:

unique_string_list = pd.unique(df.values.ravel()).tolist()

Then, if I try to use df.replace() with either a pair of lists or a dictionary, the memory overhead is too much for my 8 GB of RAM. The problem is the size of the replacement mapping, so even if I process the dataframe in chunks of only a few thousand rows, it still eats all the RAM:

# Map each unique string to its integer index
mapdict = dict(zip(unique_string_list, range(len(unique_string_list))))
# Apply the same mapping to every column
replacedict = dict(zip(df.columns.values, [mapdict for column in df.columns.values]))
df = df.replace(replacedict)
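For reference, here is a minimal, self-contained sketch of what I mean on toy data (the dataframe contents and column names here are made up just for illustration; my real frame is all strings as described above):

```python
import pandas as pd

# Toy stand-in for the large all-string dataframe described above
df = pd.DataFrame({"a": ["x", "y", "x"], "b": ["z", "x", "y"]})

# Collect the unique strings across every column
unique_string_list = pd.unique(df.values.ravel()).tolist()

# Map each unique string to its integer index
mapdict = dict(zip(unique_string_list, range(len(unique_string_list))))

# df.replace also accepts a single flat mapping applied to all columns,
# so the per-column replacedict is not strictly required
encoded = df.replace(mapdict)
```

On this toy frame the unique strings come out in row-major first-occurrence order ("x", "z", "y"), so encoded column "a" becomes [0, 2, 0] and "b" becomes [1, 0, 2]. It works at this scale; it is the 80k-entry mapping on the full frame that blows up memory.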

I have also tried looping over the string list instead. That reduced the memory overhead, but it is very inefficient and takes far too long to run (longer than overnight).
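One alternative I am wondering about (just a sketch on toy data, not something I have verified at full scale) is looping over columns rather than over strings, using Series.map with the same dictionary, in the hope that mapping one column at a time keeps the peak memory lower than a single df.replace over the whole frame:

```python
import pandas as pd

# Same toy stand-in dataframe as above
df = pd.DataFrame({"a": ["x", "y", "x"], "b": ["z", "x", "y"]})

unique_string_list = pd.unique(df.values.ravel()).tolist()
mapdict = dict(zip(unique_string_list, range(len(unique_string_list))))

# Map each column independently; only one column's worth of intermediate
# data exists at a time
for col in df.columns:
    df[col] = df[col].map(mapdict)
```

Would something along these lines be expected to help, or is there a more idiomatic pandas tool for this kind of string-to-integer encoding?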

Any help here would be very much appreciated.