I often need to apply a function to the groups of a very large DataFrame
(of mixed data types) and would like to take advantage of multiple cores.
I can create an iterator from the groups and use the multiprocessing module, but it is not efficient because every group and the results of the function must be pickled for messaging between processes.
Is there any way to avoid the pickling or even avoid the copying of the DataFrame
completely? It looks like the shared-memory functions of the multiprocessing module are limited to numpy
arrays. Are there any other options?
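To make the overhead concrete, here's a minimal sketch (column names are illustrative, not from the original) showing that with the multiprocessing approach each group really is serialized on its way to a worker process:

```python
import pickle
import pandas as pd

# a toy stand-in for the "very large DataFrame" (column names are illustrative)
df = pd.DataFrame({'key': [1, 2, 1, 2], 'val': [1.0, 2.0, 3.0, 4.0]})

# with multiprocessing, every group crosses a process boundary roughly like this:
payloads = [pickle.dumps(group) for _, group in df.groupby('key')]
print(len(payloads), sum(len(p) for p in payloads))
```

Each payload carries the group's index, column metadata, and data, and the function's results must be pickled again on the way back.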
From the comments above, it seems that this is planned for `pandas` some time (there's also an interesting-looking `rosetta` project which I just noticed).

However, until this parallel functionality is incorporated into `pandas`, I noticed that it's very easy to write efficient, non-memory-copying parallel augmentations to `pandas` directly using `cython` + OpenMP and C++.

Here's a short example of writing a parallel groupby-sum, whose use is something like this:
and output is:
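The original usage and output snippets did not survive in this copy. As a hedged stand-in with the same call shape, here's a pure-Python placeholder for the compiled routine (the name `parallel_sum` and the column names are assumptions, not the author's actual code):

```python
import pandas as pd
from collections import defaultdict

def parallel_sum(keys, vals):
    # pure-Python placeholder for the compiled cython routine (name assumed)
    acc = defaultdict(float)
    for k, v in zip(keys, vals):
        acc[k] += v
    return pd.DataFrame({'key': list(acc), 'sum': list(acc.values())})

df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': [1.0, 2.0, 3.0, 4.0]})
res = parallel_sum(df.a.values, df.b.values)  # sums: key 1 -> 4.0, key 2 -> 6.0
```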
Note: doubtless, this simple example's functionality will eventually be part of `pandas`. Some things, however, will be more natural to parallelize in C++ for some time, and it's important to be aware of how easy it is to combine this with `pandas`.

To do this, I wrote a simple single-source-file extension whose code follows.
It starts with some imports and type definitions:
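The original snippet was stripped from this copy; a hedged reconstruction of what such imports and type definitions might look like (all names here are assumptions, not the author's actual code):

```cython
# distutils: language = c++
# Sketch only; type and variable names are assumptions.
from cython.parallel cimport prange
cimport cython
from libcpp.unordered_map cimport unordered_map
from libcpp.vector cimport vector
from libcpp.pair cimport pair

import numpy as np
cimport numpy as np
import pandas as pd

ctypedef np.int64_t key_t      # dtype of the groupby key column
ctypedef np.float64_t val_t    # dtype of the summed column
ctypedef unordered_map[key_t, val_t] map_t
```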
The C++ `unordered_map` type is for summing by a single thread, and the `vector` is for summing by all threads.

Now to the function `sum`. It starts off with typed memory views for fast access:

The function continues by dividing the rows semi-equally between the threads (here hardcoded to 4), and having each thread sum the entries in its range:
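Both of these steps might look like the following hedged sketch, continuing the reconstruction above (the signature and names are assumptions):

```cython
# Sketch, continued: typed memory views, then thread-local accumulation.
@cython.boundscheck(False)
@cython.wraparound(False)
def sum(np.ndarray keys_arr, np.ndarray vals_arr):
    cdef key_t[:] keys = keys_arr   # typed memory views: raw C-speed access
    cdef val_t[:] vals = vals_arr
    cdef int n = keys.shape[0]
    cdef int num_threads = 4        # hardcoded, as in the text
    # one unordered_map per thread, so no locking is needed while summing
    cdef vector[map_t] partial = vector[map_t](num_threads)
    cdef int t, i

    with nogil:
        for t in prange(num_threads, num_threads=num_threads):
            # thread t sums the semi-equal slice [t*n/nt, (t+1)*n/nt)
            for i in range((t * n) // num_threads,
                           ((t + 1) * n) // num_threads):
                partial[t][keys[i]] = partial[t][keys[i]] + vals[i]
```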
When the threads have completed, the function merges all the results (from the different ranges) into a single `unordered_map`:

All that's left is to create a `DataFrame` and return the results:
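These two final steps, continuing the sketch above, might look like this (again a hedged reconstruction; names are assumptions):

```cython
    # Sketch, continued: merge the per-thread maps into one unordered_map...
    cdef map_t totals
    cdef pair[key_t, val_t] kv
    for t in range(num_threads):
        for kv in partial[t]:
            totals[kv.first] = totals[kv.first] + kv.second

    # ...then copy the merged map into a DataFrame and return it
    keys_out = np.empty(totals.size(), dtype=np.int64)
    sums_out = np.empty(totals.size(), dtype=np.float64)
    cdef int j = 0
    for kv in totals:
        keys_out[j] = kv.first
        sums_out[j] = kv.second
        j += 1
    return pd.DataFrame({'key': keys_out, 'sum': sums_out})
```

Keeping one map per thread during the parallel section and merging serially afterwards is what makes the hot loop lock-free.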