I have a large dataframe (several million rows).
I want to be able to do a groupby operation on it, but just grouping by arbitrary consecutive (preferably equal-sized) subsets of rows, rather than using any particular property of the individual rows to decide which group they go to.
The use case: I want to apply a function to each row via a parallel map in IPython. It doesn't matter which rows go to which back-end engine, as the function calculates a result based on one row at a time. (Conceptually at least; in reality it's vectorized.)
I've come up with something like this:
import numpy as np

# Generate a number from 0-9 for each row, indicating which tenth of the DF it belongs to
max_idx = dataframe.index.max()
tenths = ((10 * dataframe.index) / (1 + max_idx)).astype(np.uint32)
# Use this value to perform a groupby, yielding 10 consecutive chunks
groups = [g[1] for g in dataframe.groupby(tenths)]
# Process chunks in parallel
results = dview.map_sync(my_function, groups)
But this seems very long-winded, and it doesn't guarantee equal-sized chunks, especially if the index is sparse or non-integer.
Any suggestions for a better way?
Thanks!
A sign of a good environment is many choices, so I'll add this approach from Anaconda Blaze, which really uses Odo.
Use numpy: it has this built in as np.array_split().
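For example, a quick sketch (splitting an 11-row frame into 4 pieces; unlike np.split, array_split doesn't need the length to divide evenly, and it accepted a DataFrame directly with the numpy/pandas versions current at the time):

import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.rand(11, 3))
# Returns a list of consecutive sub-frames whose lengths differ by at most one row.
pieces = np.array_split(data, 4)
print([len(p) for p in pieces])   # [3, 3, 3, 2]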
I'm not sure if this is exactly what you want, but I found these grouper functions in another SO thread fairly useful for feeding a multiprocessing pool.
Here's a short example from that thread, which might do something like what you want:
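A minimal version of that recipe (the toy frame and the chunk size of 3 are just for illustration):

import pandas as pd

def chunker(seq, size):
    # Yield consecutive slices of seq, each size rows long;
    # the final slice is shorter if the length doesn't divide evenly.
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

df = pd.DataFrame({'a': range(7)})
for chunk in chunker(df, 3):
    print(chunk)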
Which gives you something like this:
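(With the toy frame above, the printed chunks are:)

   a
0  0
1  1
2  2
   a
3  3
4  4
5  5
   a
6  6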
I hope that helps.
EDIT
In this case, I used this function with a pool of processes in (approximately) this manner:
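A sketch of that pattern (my_function is the per-chunk function from the question; the chunk size and worker count are arbitrary):

from multiprocessing import Pool

# Farm consecutive chunks out to worker processes.
# my_function must be defined at module level so it can be pickled.
if __name__ == '__main__':
    with Pool(processes=4) as pool:
        # df here stands for your full frame; pick a chunk size to taste.
        results = pool.map(my_function, chunker(df, 1000))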
I assume this should be very similar to using the IPython distributed machinery, but I haven't tried it.
In practice, you can't guarantee equal-sized chunks: the number of rows might be prime, after all, in which case your only chunking options would be chunks of size 1 or one big chunk. I tend to pass an array to groupby. Starting from:
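(any small frame will do; here, 15 rows of random values with the index forced to 0)

import numpy as np
import pandas as pd

# Every row gets the same index label, 0, so the index carries no information.
df = pd.DataFrame(np.random.rand(15, 5), index=[0] * 15)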
where I've deliberately made the index uninformative by setting it to 0, we simply decide on our size (here 10) and integer-divide an array by it:
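(np.arange(len(df)) gives row positions, so the grouping ignores the index values entirely.)

# Rows 0-9 land in group 0, rows 10-14 in group 1.
for k, g in df.groupby(np.arange(len(df)) // 10):
    print(k, g.shape)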
Methods based on slicing the DataFrame can fail when the index isn't compatible with that, although you can always use .iloc[a:b] to ignore the index values and access data by position.
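For instance, a purely positional chunker (the size of 10 is arbitrary):

size = 10
# .iloc slices by position, so this works regardless of what the index contains.
chunks = [df.iloc[pos:pos + size] for pos in range(0, len(df), size)]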