I have a file with about 10,000 rows; each row holds the parameters for a download job. I have 5 custom downloaders. Each job can take anywhere from 5 seconds to 2 minutes. How would I create something that iterates through the 10,000 rows, assigning each job to a downloader that isn't currently working?
EDIT:
The difficult part for me is that each `Downloader` is an instance of a class, and the differences between the instances are the port_numbers I specify when I instantiate each of the 5 `Downloader` objects. So I have `a = Downloader(port_number=7751)` ... `e = Downloader(port_number=7755)`. Then, if I were to use a `Downloader`, I would do `a.run(row)`.

How do I define the workers as these `a, b, c, d, e` rather than a `downloader function`?
There are many ways to do it. The simplest would be to just use `multiprocessing.Pool` and let it organize the workers for you. 10k rows is not all that much: even if an average URL is a full kilobyte long, it will still take only 10MB of memory, and memory is cheap.

So, just read the file into memory and map it to `multiprocessing.Pool` to do your bidding.

You can also use `threading` instead of `multiprocessing` (or `multiprocessing.pool.ThreadPool` as a drop-in replacement for this) to do everything within a single process if you need shared memory. Downloading is mostly I/O-bound, so threads in a single process are more than enough unless you're doing additional processing.

UPDATE
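For reference, the basic `multiprocessing.Pool` approach described above could look roughly like the sketch below. The body of the `downloader` function, the helpers `read_rows` and `run_jobs`, and the `pool_cls` parameter are hypothetical names used only for illustration. Passing `chunksize=1` is an added assumption, so that rows are handed out one at a time, which suits jobs of very uneven length.

```python
import time
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool  # thread-based drop-in replacement


def downloader(row):
    """Hypothetical stand-in for one download job; replace with real logic."""
    time.sleep(0.01)  # pretend to download something
    return "done: " + row


def read_rows(path):
    """Read the whole job file into memory, one non-empty row per line."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def run_jobs(rows, workers=5, pool_cls=Pool):
    """Map the rows onto a pool of workers and return results in row order."""
    with pool_cls(workers) as pool:
        # chunksize=1: a worker picks up one new row whenever it becomes idle.
        return pool.map(downloader, rows, chunksize=1)


if __name__ == "__main__":
    rows = ["job-%d" % i for i in range(20)]  # or: read_rows("jobs.txt")
    results = run_jobs(rows)
    print(len(results), "jobs finished")
```

To stay within a single process, the same call should work with threads by passing `pool_cls=ThreadPool`.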
If you want your downloaders to run as class instances, you can transform the `downloader` function into a factory for your `Downloader` instances, and then just pass what you need to instantiate those instances alongside your URLs. A simple approach is round-robin: assign the ports to the rows in rotation.

Keep in mind that this is not the most balanced solution, as it can happen that two `Downloader` instances with the same port run at the same time, but it will average out over large enough data.

If you want to make sure that you don't have two `Downloader` instances running off of the same port, you'll either need to build your own pool, or you'll need to create a central process that will issue ports to your `Downloader` instances when they need them.

Read in your 10000 rows into a list of strings.
Assuming that the data does not include a port number (the edited question mentions 5 ports), you should add one to each row, so that the data becomes a list of tuples, each pairing a row with a port.
Write a function that takes one of those tuples as an argument, splits it, creates a Downloader object and runs it.
Use the `imap_unordered` method of a `multiprocessing.Pool`, giving it the function you've defined and the list of tuples as arguments.

The iterator returned by `imap_unordered` will start yielding results as soon as they become available. You could print them to show the progress.

Edit
P.S.: if the only method of your `Downloader` object you'll ever use is `run()`, it should not be an object. It is a function in disguise! Look up the "stop writing classes" video on YouTube and watch it.
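The steps listed in this answer could be sketched roughly as follows. This is only a minimal sketch under assumptions: the `Downloader` class here is a hypothetical stand-in for the real one, the five ports are assumed to be consecutive from 7751 to 7755, and `run_job`, `make_jobs` and `run_all` are names made up for illustration.

```python
from itertools import cycle
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool  # thread-based alternative

# Assumption: the question only shows 7751 and 7755; consecutive ports assumed.
PORTS = [7751, 7752, 7753, 7754, 7755]


class Downloader:
    """Hypothetical stand-in for the question's Downloader class."""

    def __init__(self, port_number):
        self.port_number = port_number

    def run(self, row):
        # A real implementation would perform the download here.
        return (self.port_number, row)


def make_jobs(rows, ports=PORTS):
    """Pair each row with a port, cycling through the ports in rotation."""
    return list(zip(rows, cycle(ports)))


def run_job(job):
    """Unpack one (row, port) tuple, create a Downloader and run it."""
    row, port = job
    return Downloader(port_number=port).run(row)


def run_all(rows, pool_cls=Pool):
    """Yield each job's result as soon as that job has finished."""
    jobs = make_jobs(rows)
    with pool_cls(len(PORTS)) as pool:
        for result in pool.imap_unordered(run_job, jobs):
            yield result


if __name__ == "__main__":
    rows = ["row-%d" % i for i in range(10)]  # stand-in for the 10000 rows
    for port, row in run_all(rows):
        print("finished", row, "on port", port)
```

Note that pairing ports with rows in advance does not stop two jobs with the same port from running at the same time if jobs finish out of order.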