I have a program which copies large numbers of files from one location to another - I'm talking 100,000+ files (I'm copying 314g in image sequences at this moment). They're both on huge, VERY fast network storage RAID'd in the extreme. I'm using shutil to copy the files over sequentially and it is taking some time, so I'm trying to find the best way to opimize this. I've noticed some software I use effectively multi-threads reading files off of the network with huge gains in load times so I'd like to try doing this in python.
I have no experience with programming multithreading/multiprocessesing - does this seem like the right area to proceed? If so what's the best way to do this? I've looked around a few other SO posts regarding threading file copying in python and they all seemed to say that you get no speed gain, but I do not think this will be the case considering my hardware. I'm nowhere near my IO cap at the moment and resources are sitting around 1% (I have 40 cores and 64g of RAM locally).
- Spencer
UPDATE:
I never did get Gevent working (first answer) because I couldn't install the module without an internet connection, which I don't have on my workstation. However I was able to decrease file copy times by 8 just using the built in threading with python (which I have since learned how to use) and I wanted to post it up as an additional answer for anyone interested! Here's my code below, and it is probably important to note that my 8x copy time will most likely differ from environment to environment due to your hardware/network set-up.
This can be parallelized by using gevent in Python.
I would recommend the following logic to achieve speeding up 100k+ file copying:
Put names of all the 100K+ files, which need to be copied in a csv file, for eg: 'input.csv'.
Then create chunks from that csv file. The number of chunks should be decided based on no.of processors/cores in your machine.
Pass each of those chunks to separate threads.
Each thread sequentially reads filename in that chunk and copies it from one location to another.
Here goes the python code snippet:
Save the file as file_copier.py. Open terminal and run: