I have a big list of remote file locations and local paths where I would like them to end up. Each file is small, but there are very many of them. I am generating this list within Python.
I would like to download all of these files as quickly as possible (in parallel) prior to unpacking and processing them. What is the best library or Linux command-line utility for me to use? I attempted to implement this using multiprocessing.pool, but that did not work with the FTP library.
I looked into pycurl, and that seemed to be what I wanted, but I could not get it to run on Windows 7 x64.
I normally use pscp to do things like this, and then call it using subprocess.Popen, for example:
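A minimal sketch of the pscp plus subprocess.Popen approach, assuming pscp is on the PATH and key-based authentication is already set up. The user, host, and paths are hypothetical placeholders, and build_pscp_command and download_all are illustrative helper names, not part of any library:

```python
import subprocess


def build_pscp_command(remote, local, user="user", host="example.com"):
    """Build a pscp argument list that copies one remote file to a local path.

    The user and host defaults are placeholders, not real endpoints.
    """
    # -batch disables interactive prompts, so a transfer that needs input
    # fails instead of hanging; -q suppresses the progress meter.
    return ["pscp", "-batch", "-q", f"{user}@{host}:{remote}", local]


def download_all(jobs, max_parallel=8):
    """Run pscp for every (remote, local) pair, at most max_parallel at once.

    Returns the list of remote paths whose transfer exited with an error.
    """
    pending = list(jobs)
    running = []
    failed = []
    while pending or running:
        # Start new transfers until the concurrency limit is reached.
        while pending and len(running) < max_parallel:
            remote, local = pending.pop()
            proc = subprocess.Popen(build_pscp_command(remote, local))
            running.append((proc, remote))
        # Wait for the oldest running transfer before starting more.
        proc, remote = running.pop(0)
        if proc.wait() != 0:
            failed.append(remote)
    return failed
```

Calling download_all with the generated list of (remote, local) pairs starts one pscp process per file, so for very many tiny files the process start-up cost may dominate.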
Of course, I'm assuming Linux --> Windows.
Try wget, a command-line utility installed on most Linux distros and also available via Cygwin on Windows.
You may also have a look at Scrapy, which is a library/framework written in Python.
If you use a Pool object from the multiprocessing module, urllib2 should handle FTP.
Of course, spawning processes will have some serious overhead. Non-blocking requests will almost certainly be faster if you can use a third-party module like twisted.
Whether the overhead is a serious problem will depend on the relative magnitude of download times per file and network latency.
You can try implementing it using Python threads rather than processes, but it gets a bit trickier. See the answer to this question to use urllib2 safely with threads. You would also need to use multiprocessing.pool.ThreadPool instead of the regular Pool.
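As a rough sketch of the thread-based variant, here is what it could look like with ThreadPool. It assumes Python 3, where urllib2's functionality lives in urllib.request (which also accepts ftp:// URLs). The names fetch and fetch_all, the worker count, and the timeout are illustrative choices, not something from the original answer:

```python
import os
import shutil
import urllib.request
from multiprocessing.pool import ThreadPool


def fetch(job):
    """Download one (url, local_path) pair; return (url, error_or_None)."""
    url, local_path = job
    try:
        # Create the destination directory if the path has one.
        os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
        with urllib.request.urlopen(url, timeout=30) as response, \
                open(local_path, "wb") as out:
            shutil.copyfileobj(response, out)
        return url, None
    except Exception as exc:
        # Report the failure instead of aborting the whole batch.
        return url, exc


def fetch_all(jobs, workers=16):
    """Download all (url, local_path) pairs in parallel; return failures."""
    with ThreadPool(workers) as pool:
        results = pool.imap_unordered(fetch, jobs)
        return [(url, err) for url, err in results if err is not None]
```

Each worker opens its own connection and shares no state with the others, which sidesteps most thread-safety concerns. Whether 16 workers is a sensible number depends on the server and the network.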
I know it's an old post, but there is a perfect Linux utility for this. If you are transferring files from a remote host, lftp is great! I mainly use it to quickly push stuff to my FTP server, but it works great for pulling stuff off as well using the mirror command. It also has an option to copy a user-defined number of files in parallel, like you wanted. If you wanted to copy some files from a remote path to a local path, your command line would look something like this:
Be very careful with this command though. Just like other mirror commands, if you screw it up, you WILL DELETE FILES.
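Since the file list is generated in Python, the lftp mirror command described above could be launched from there. This is only a sketch under assumptions: the host and directories are hypothetical placeholders, the paths are assumed to contain no spaces, and the helper names are made up for illustration. It uses lftp's -c option to run a command string and mirror's --parallel option, and deliberately leaves out --delete:

```python
import subprocess


def build_lftp_mirror_command(host, remote_dir, local_dir, parallel=10):
    """Build an lftp command that mirrors remote_dir into local_dir.

    Paths are assumed to contain no spaces or shell-special characters.
    """
    # "open" connects to the host; "mirror" pulls remote_dir down to
    # local_dir, transferring up to `parallel` files at the same time.
    script = f"open {host}; mirror --parallel={parallel} {remote_dir} {local_dir}"
    return ["lftp", "-c", script]


def run_mirror(host, remote_dir, local_dir, parallel=10):
    """Run the mirror and raise CalledProcessError if lftp exits non-zero."""
    cmd = build_lftp_mirror_command(host, remote_dir, local_dir, parallel)
    subprocess.run(cmd, check=True)
```

For example, run_mirror("ftp://example.com", "/pub/data", "./data", parallel=4) would mirror a whole directory. It copies everything under the remote path, so it fits best when the wanted files sit together in one directory tree.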
For more options or documentation for lftp, I've visited this site: http://lftp.yar.ru/lftp-man.html