I want to create a number of instances of a class based on values in a pandas.DataFrame. This I've got down.
import itertools
import multiprocessing as mp
import pandas as pd

class Toy:
    # Class-level counter that hands out sequential ids
    id_iter = itertools.count(1)

    def __init__(self, row):
        # next(...) works on both Python 2.6+ and Python 3
        self.id = next(self.id_iter)
        self.type = row['type']

if __name__ == "__main__":
    table = pd.DataFrame({
        'type': ['a', 'b', 'c'],
        'number': [5000, 4000, 30000]
    })
    for index, row in table.iterrows():
        [Toy(row) for _ in range(row['number'])]
Multiprocessing Attempts
I've been able to parallelize this (sort of) by adding the following:
pool = mp.Pool(processes=mp.cpu_count())
m = mp.Manager()
q = m.Queue()
for index, row in table.iterrows():
    pool.apply_async([Toy(row) for _ in range(row['number'])])
It seems that this would be faster if the numbers in row['number'] were substantially larger than the length of table. But in my actual case, table is thousands of rows long, and each row['number'] is relatively small.
It seems smarter to break table up into cpu_count() chunks and iterate within each chunk. But now we're at the edge of my Python skills.
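To make the chunking idea concrete, here is a minimal sketch of one way it might look, not a definitive implementation. It assumes the rows are available as plain dicts (with pandas, `table.to_dict('records')` should give that shape), and it assumes a changed `Toy` that takes an explicit id, because a class-level `itertools.count` is not shared between worker processes. The helper names `assign_first_ids`, `split_into_chunks`, `create_toys_chunk` and `create_toys_parallel` are hypothetical.

```python
import multiprocessing as mp


class Toy:
    """Stand-in for the Toy class above. Taking an explicit id is an
    assumption: each worker process would otherwise restart its own counter."""

    def __init__(self, toy_id, toy_type):
        self.id = toy_id
        self.type = toy_type


def assign_first_ids(rows):
    """Pair each row with the id of its first toy so ids stay unique."""
    tasks, next_id = [], 1
    for row in rows:
        tasks.append((next_id, row))
        next_id += row['number']
    return tasks


def split_into_chunks(tasks, n_chunks):
    """Split tasks into at most n_chunks contiguous lists of similar size."""
    size = max(1, -(-len(tasks) // n_chunks))  # ceiling division
    return [tasks[i:i + size] for i in range(0, len(tasks), size)]


def create_toys_chunk(chunk):
    """Worker: build and return every toy for the rows in one chunk."""
    toys = []
    for first_id, row in chunk:
        toys.extend(Toy(first_id + k, row['type'])
                    for k in range(row['number']))
    return toys


def create_toys_parallel(rows, processes=None):
    """Create all toys using one pool task per chunk instead of one per row."""
    processes = processes or mp.cpu_count()
    chunks = split_into_chunks(assign_first_ids(rows), processes)
    with mp.Pool(processes=processes) as pool:
        results = pool.map(create_toys_chunk, chunks)
    # Flatten the per-chunk lists back into a single list.
    return [toy for chunk_toys in results for toy in chunk_toys]


if __name__ == "__main__":
    # Small hypothetical rows standing in for table.to_dict('records').
    rows = [
        {'type': 'a', 'number': 5},
        {'type': 'b', 'number': 4},
        {'type': 'c', 'number': 3},
    ]
    toys = create_toys_parallel(rows, processes=2)
    print(len(toys))
```

Every `Toy` has to be pickled and sent back to the parent process, so for cheap constructors the transfer may cost more than the work saved; timing it against the plain loop is the only reliable check.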
I've tried things that the python interpreter screams at me for, like:
pool.apply_async(
    for index, row in table.iterrows():
        [Toy(row) for _ in range(row['number'])]
)
Also things that "can't be pickled":
Parallel(n_jobs=4)(
    delayed(Toy)([row for _ in range(row['number'])]) \
    for index, row in table.iterrows()
)
Edit
This may have gotten me a little bit closer, but I'm still not there. I create the class instances in a separate function,
def create_toys(row):
    [Toy(row) for _ in range(row['number'])]
....
Parallel(n_jobs=4, backend="threading")(
    (create_toys)(row) for i, row in table.iterrows()
)
but I'm told 'NoneType' object is not iterable.
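One likely reading of that error, sketched with the standard library only: `(create_toys)(row)` calls the function immediately, and since `create_toys` has no `return` statement the generator yields `None` where `Parallel` expects a deferred call it can unpack. The `delayed_sketch` helper below is a hypothetical, simplified imitation of `joblib.delayed`, and the dict `row` stands in for a DataFrame row.

```python
def create_toys(row):
    # Stand-in for the function above: the list is built, then discarded.
    [row['type'] for _ in range(row['number'])]


def delayed_sketch(func):
    """Hypothetical, simplified imitation of what joblib.delayed does."""
    def capture(*args, **kwargs):
        # Record the call instead of running it.
        return func, args, kwargs
    return capture


row = {'type': 'a', 'number': 3}

called_now = (create_toys)(row)              # runs right away, gives None
deferred = delayed_sketch(create_toys)(row)  # a (func, args, kwargs) tuple

func, args, kwargs = deferred                # a deferred task can be unpacked
result = func(*args, **kwargs)               # still None: nothing is returned
```

If this reading is right, wrapping the call as `delayed(create_toys)(row)` would address the error, and adding a `return` to `create_toys` would be needed to get the instances back.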