Memory leak in Pandas.groupby.apply()?

I'm currently using Pandas for a project with csv source files of around 600mb. During the analysis I am reading in the csv to a dataframe, grouping on some column and applying a simple function to the grouped dataframe. I noticed that I was going into Swap Memory during this process and so carried out a basic test:

I first created a fairly large dataframe in the shell:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3000000, 3),index=range(3000000),columns=['a', 'b', 'c'])

I defined a pointless function called do_nothing():

def do_nothing(group):
    return group

And ran the following command:

df = df.groupby('a').apply(do_nothing)

My system has 16gb of RAM and is running Debian (Mint). After creating the dataframe I was using ~600mb of RAM. As soon as the apply method began to execute, that value started to soar. It steadily climbed up to around 7gb(!) before finishing the command and settling back down to 5.4gb (while the shell was still active). The problem is, my work requires doing more than the 'do_nothing' method and as such while executing the real program, I cap my 16gb of RAM and start swapping, making the program unusable. Is this intended? I can't see why Pandas should need 7gb of RAM to effectively 'do_nothing', even if it has to store the grouped object.

Any ideas on what's causing this/how to fix it?

Cheers,

标签： python memory-leaks pandas

1条回答

够拽才男人

2楼-- · 2020-06-04 03:09

Using 0.14.1, I don't think their is a memory leak (1/3 size of your frame).

In [79]: df = DataFrame(np.random.randn(100000,3))

In [77]: %memit -r 3 df.groupby(df.index).apply(lambda x: x)
maximum of 3: 1365.652344 MB per loop

In [78]: %memit -r 10 df.groupby(df.index).apply(lambda x: x)
maximum of 10: 1365.683594 MB per loop

Two general comments on how to approach a problem like this:

1) use the cython level function if at all possible, will be MUCH faster, and will use much less memory. IOW, it almost always worth it to decouple a groupby expression and void using function (if possible, somethings are just too complicated, but that's the point, you want to break things down). e.g.

Instead of:

df.groupby(...).apply(lambda x: x.sum() / x.mean())

It is MUCH better to do:

g = df.groupby(...)
g.sum() / g.mean()

2) You can easily 'control' the groupby by doing your aggregation manually (additionally this will allow periodic output and garbage collection if needed).

results = []
for i, (g, grp) in enumerate(df.groupby(....)):

    if i % 500 == 0:
        print "checkpoint: %s" % i
        gc.collect()


    results.append(func(g,grp))

# final result
pd.concate(results)

0人赞添加讨论(0) 举报

Memory leak in Pandas.groupby.apply()?

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间