We are running a script using the multiprocessing library (Python 3.6), where big pd.DataFrames are passed to the pool processes:
from multiprocessing import Pool
import time

def f(x):
    # do something time consuming
    time.sleep(50)

if __name__ == '__main__':
    with Pool(10) as p:
        res = {}
        output = {}
        for id, big_df in some_dict_of_big_dfs.items():
            res[id] = p.apply_async(f, (big_df,))
        output = {id: res[id].get() for id in id_list}
The problem is that we are getting an error from the pickle
library.
Reason: 'OverflowError('cannot serialize a bytes objects larger than 4GiB',)'
We are aware that pickle protocol 4 can serialize larger objects (related question, link), but we don't know how to modify the protocol that multiprocessing is using.
Does anybody know what to do? Thanks!
Apparently there is an open issue about this topic (Issue), and there are a few related initiatives described in this particular answer (link). I found a way to change the default pickle protocol used by the multiprocessing library, based on this answer (link). As was pointed out in the comments, this solution only works with the Linux and OS X multiprocessing lib. You first create a new, separate module:
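A minimal sketch of that module, based on the linked answer; the file name pickle4reducer.py and the names ForkingPickler4 / Pickle4Reducer are illustrative, not anything the standard library requires:

# pickle4reducer.py
from multiprocessing.reduction import ForkingPickler, AbstractReducer

class ForkingPickler4(ForkingPickler):
    # Force pickle protocol 4, which supports objects larger than 4 GiB
    def __init__(self, *args):
        if len(args) > 1:
            args = (args[0], 4) + args[2:]
        else:
            args = args + (4,)
        super().__init__(*args)

    @classmethod
    def dumps(cls, obj, protocol=4):
        return super().dumps(obj, protocol)


def dump(obj, file, protocol=4):
    ForkingPickler4(file, protocol).dump(obj)


class Pickle4Reducer(AbstractReducer):
    # Drop-in replacement for the default reducer used by multiprocessing
    ForkingPickler = ForkingPickler4
    register = ForkingPickler4.register
    dump = dump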
And then, in your main script you need to add the following:
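Something along these lines (again a sketch, assuming the module above is importable as pickle4reducer):

import multiprocessing as mp
import pickle4reducer

# Swap the reducer of the default context *before* creating the Pool
ctx = mp.get_context()
ctx.reducer = pickle4reducer.Pickle4Reducer()

if __name__ == '__main__':
    with mp.Pool(10) as p:
        # ... same apply_async / get logic as in the question ...
        pass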
That will probably solve the overflow problem... But, warning, you might consider reading this before doing anything, or you might run into the same error I did (the reason for that error is well explained in the link above). In short, multiprocessing sends data to and from all its processes using the pickle protocol; if you are already hitting the 4 GiB limit, that probably means you should consider redefining your functions as "void" methods rather than input/output methods. All this inbound/outbound data increases RAM usage, is probably inefficient by construction (my case), and it might be better to point all the processes at the same object rather than create a new copy for each call (see the sketch below). Hope this helps.
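For reference, a sketch of that "void"-style approach under the fork start method (the default on Linux): the parent fills a module-level dict before creating the pool, the children inherit it, and only a small result travels back through pickle. The name SHARED_DFS is illustrative, and some_dict_of_big_dfs is the dict from the question:

import time
from multiprocessing import Pool

# Filled once in the parent; with the fork start method each child process
# inherits this dict, so no DataFrame is pickled per task.
SHARED_DFS = {}

def f(df_id):
    big_df = SHARED_DFS[df_id]   # look the frame up locally instead of receiving it
    time.sleep(50)               # do something time consuming with big_df
    return big_df.shape          # send back only a small result

if __name__ == '__main__':
    SHARED_DFS.update(some_dict_of_big_dfs)  # the dict from the question
    with Pool(10) as p:
        res = {df_id: p.apply_async(f, (df_id,)) for df_id in SHARED_DFS}
        output = {df_id: r.get() for df_id, r in res.items()}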