Memory leak in Pandas.groupby.apply()?

Posted 2020-06-04 02:23

Question:

I'm currently using Pandas for a project with CSV source files of around 600 MB. During the analysis I read the CSV into a dataframe, group on some column, and apply a simple function to the grouped dataframe (a rough sketch of that pipeline is below). I noticed that I was going into swap memory during this process, so I carried out a basic test.
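The pipeline looks roughly like this (the file name, column name and per-group function here are placeholders, not my real ones):

import pandas as pd

df = pd.read_csv('source.csv')            # ~600 MB csv

def my_func(group):
    # the real analysis does something per group; this is just a stand-in
    return group

result = df.groupby('some_column').apply(my_func)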

I first created a fairly large dataframe in the shell:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3000000, 3), index=range(3000000), columns=['a', 'b', 'c'])

I defined a pointless function called do_nothing():

def do_nothing(group):
    return group

And ran the following command:

df = df.groupby('a').apply(do_nothing)

My system has 16 GB of RAM and is running Debian (Mint). After creating the dataframe I was using ~600 MB of RAM. As soon as the apply method began to execute, that value started to soar. It steadily climbed to around 7 GB(!) before the command finished, then settled back down to 5.4 GB (while the shell was still active). The problem is that my real work requires doing more than do_nothing, so while executing the actual program I hit my 16 GB of RAM and start swapping, which makes the program unusable. Is this intended? I can't see why Pandas should need 7 GB of RAM to effectively 'do nothing', even if it has to store the grouped object.

Any ideas on what's causing this/how to fix it?

Cheers,

.P

Answer 1:

Using 0.14.1, I don't think there is a memory leak (tested below on a frame about 1/30 the size of yours).

In [79]: df = DataFrame(np.random.randn(100000,3))

In [77]: %memit -r 3 df.groupby(df.index).apply(lambda x: x)
maximum of 3: 1365.652344 MB per loop

In [78]: %memit -r 10 df.groupby(df.index).apply(lambda x: x)
maximum of 10: 1365.683594 MB per loop
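(Note: %memit is not part of IPython itself; it comes from the memory_profiler package, loaded as an extension, roughly like this:)

# pip install memory_profiler, then in the IPython shell:
%load_ext memory_profiler

%memit df.groupby(df.index).apply(lambda x: x)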

Two general comments on how to approach a problem like this:

1) Use the cython-level functions if at all possible; they will be MUCH faster and will use much less memory. In other words, it is almost always worth it to decouple a groupby expression and avoid passing a Python function (some things are just too complicated, but that's the point: you want to break things down). e.g.

Instead of:

df.groupby(...).apply(lambda x: x.sum() / x.mean())

It is MUCH better to do:

g = df.groupby(...)
g.sum() / g.mean()
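As a quick sanity check (a toy frame, not from the question; the column names 'key', 'x' and 'y' are made up for illustration), the two forms give the same numbers, but the decoupled one avoids calling Python code once per group:

import numpy as np
import pandas as pd

toy = pd.DataFrame({'key': np.random.randint(0, 10, 100000),
                    'x': np.random.randn(100000),
                    'y': np.random.randn(100000)})

g = toy.groupby('key')

via_apply = toy.groupby('key')[['x', 'y']].apply(lambda grp: grp.sum() / grp.mean())
vectorized = g[['x', 'y']].sum() / g[['x', 'y']].mean()

print(np.allclose(via_apply, vectorized))   # True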

2) You can easily 'control' the groupby by doing your aggregation manually (additionally, this allows periodic progress output and garbage collection if needed):

import gc

results = []
for i, (g, grp) in enumerate(df.groupby(....)):

    if i % 500 == 0:
        print("checkpoint: %s" % i)
        gc.collect()

    results.append(func(g, grp))

# final result
pd.concat(results)
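Here func stands in for whatever per-group computation you actually need (your real version of do_nothing), and the placeholder in groupby(....) is your grouping column; the final pd.concat(results) stitches the per-group results back into a single frame, while the periodic gc.collect() keeps intermediate garbage from piling up.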