Say I have N generators that produce a stream of items: `gs = [..]  # list of generators`. I can easily zip them together to get a generator of tuples, one item from each respective generator in `gs`: `tuple_gen = zip(*gs)`. This calls `next(g)` on each `g` in `gs` in sequence and gathers the results in a tuple. But if each item is costly to produce, we may want to parallelize the work of `next(g)` across multiple threads. How can I implement a `pzip(..)` that does this?
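For concreteness, here is the sequential baseline (the `squares`/`cubes` generators are toy examples, not part of the real workload):

```python
# Sequential version: zip(*gs) pulls one item from each generator
# in turn, all on the calling thread.
def squares(n):
    for i in range(n):
        yield i * i

def cubes(n):
    for i in range(n):
        yield i ** 3

gs = [squares(3), cubes(3)]   # list of generators
tuple_gen = zip(*gs)          # generator of tuples
print(list(tuple_gen))        # [(0, 0), (1, 1), (4, 8)]
```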
What you asked for can be achieved by creating a generator which yields the results of `apply_async` calls on a `multiprocessing.pool.ThreadPool`.

FYI, I benchmarked this approach with the `pandas.read_csv` iterators you get when specifying the `chunksize` parameter. I created eight copies of a CSV file with 1M rows and specified `chunksize=100_000`. Four of the files were read with the sequential method you provided, four with the `mt_gen` function below, using a pool of four threads. That doesn't mean it will improve results for every hardware and data setup, though.