I have a CSV file that is too big to load into memory. I need to drop the duplicated rows from the file, so I tried the following:
chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'], encoding='utf-8', chunksize=10000000)
for chunk in chunker:
    chunk = chunk.drop_duplicates(subset=['Author ID'])
But if duplicated rows are distributed across different chunks, it seems the script above cannot produce the expected result.
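For example, with a small chunk size the same Author ID can land in two different chunks, and each per-chunk drop_duplicates keeps its own copy. A toy illustration (the data below is made up just for this example):

import pandas as pd
from io import StringIO

# Author ID 1 appears in what will become two different chunks of size 2.
data = StringIO("1\tAlice\n2\tBob\n3\tCarol\n1\tAlice\n")

chunks = pd.read_table(data, names=['Author ID', 'Author name'], chunksize=2)
deduped = [chunk.drop_duplicates(subset=['Author ID']) for chunk in chunks]

# Both chunks still contain Author ID 1, so the duplicate survives.
print(pd.concat(deduped))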
Is there any better way?
You could try something like this.
First, create your chunker.
chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'], encoding='utf-8', chunksize=10000000)
Now create a set of ids:
ids = set()
Now iterate over the chunks:
for chunk in chunker:
    chunk = chunk.drop_duplicates(subset=['Author ID'])
Now, within the body of the loop, also drop the rows whose ids are already in the set of known ids:
    chunk = chunk[~chunk['Author ID'].isin(ids)]
Finally, still within the body of the loop, add the new ids to the set:
    ids.update(chunk['Author ID'].values)
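Putting the steps together, a minimal end-to-end sketch could look like this; the file paths and the tab-separated output format are assumptions, since the question does not say where the deduplicated rows should go:

import pandas as pd

AUTHORS_PATH = 'authors.tsv'      # AUTHORS_PATH as in the question; the value is a placeholder
OUT_PATH = 'authors_deduped.tsv'  # assumed output path for this sketch

chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'],
                        encoding='utf-8', chunksize=10000000)

ids = set()          # Author IDs seen so far, across all chunks
first_chunk = True   # overwrite the output on the first chunk, append afterwards

for chunk in chunker:
    # Drop duplicates inside the chunk, then drop rows whose ID was
    # already seen in an earlier chunk.
    chunk = chunk.drop_duplicates(subset=['Author ID'])
    chunk = chunk[~chunk['Author ID'].isin(ids)]
    ids.update(chunk['Author ID'].values)

    # Write the surviving rows; no header, mirroring the headerless input.
    chunk.to_csv(OUT_PATH, sep='\t', index=False, header=False,
                 mode='w' if first_chunk else 'a')
    first_chunk = False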
If ids is too large to fit into main memory, you might need to use a disk-based database.
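For instance, one possible disk-based approach is to let SQLite enforce uniqueness with a primary key; the table name, column names, and file paths below are assumptions made for this sketch:

import sqlite3
import pandas as pd

AUTHORS_PATH = 'authors.tsv'   # placeholder path, as above
DB_PATH = 'authors.db'         # on-disk SQLite database

conn = sqlite3.connect(DB_PATH)
conn.execute('CREATE TABLE IF NOT EXISTS authors '
             '(author_id TEXT PRIMARY KEY, author_name TEXT)')

# Read everything as strings so the values bind cleanly as SQLite parameters.
chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'],
                        encoding='utf-8', dtype=str, chunksize=10000000)

for chunk in chunker:
    # INSERT OR IGNORE silently skips rows whose author_id already exists,
    # so duplicates are dropped across all chunks without keeping IDs in RAM.
    conn.executemany('INSERT OR IGNORE INTO authors VALUES (?, ?)',
                     chunk.itertuples(index=False, name=None))
    conn.commit()

conn.close()

Once all chunks have been inserted, the deduplicated rows can be exported from the database again, e.g. in chunks.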