I have a lot of zip archives on a remote FTP server, and their sizes go up to 20 TB. I just need the file names inside those zip archives, so that I can plug them into my Python scripts.
Is there any way to just get the file names without actually downloading files and extracting them on my local machine? If so, can someone direct me to the right library/package?
You can implement a file-like object that reads data from FTP instead of from a local file, and pass it to the `ZipFile` constructor instead of a (local) file name.
A trivial implementation can look like this:
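A minimal sketch of such a class, assuming Python 3, `ftplib` and a plain (non-TLS) connection. Only the `FtpFile` name comes from this answer; the method bodies are illustrative, and how a server reacts to an aborted transfer may vary:

```python
import ftplib


class FtpFile:
    """Read-only, seekable file-like view of a remote file over FTP."""

    def __init__(self, ftp, name):
        self.ftp = ftp
        self.name = name
        ftp.voidcmd("TYPE I")        # many servers answer SIZE only in binary mode
        self.size = ftp.size(name)
        self.pos = 0

    def seekable(self):
        return True

    def seek(self, offset, whence=0):
        if whence == 0:              # absolute position
            new_pos = offset
        elif whence == 1:            # relative to the current position
            new_pos = self.pos + offset
        elif whence == 2:            # relative to the end of the file
            new_pos = self.size + offset
        else:
            raise ValueError("invalid whence: {!r}".format(whence))
        if new_pos < 0:
            raise OSError("negative seek position")  # a local file fails here too
        self.pos = new_pos
        return self.pos

    def tell(self):
        return self.pos

    def read(self, size=-1):
        remaining = max(self.size - self.pos, 0)
        if size is None or size < 0 or size > remaining:
            size = remaining
        if size == 0:
            return b""
        chunks = []
        received = 0
        # Like FTP.retrbinary, but starts at self.pos (REST) and stops
        # as soon as `size` bytes have arrived.
        self.ftp.voidcmd("TYPE I")
        conn = self.ftp.transfercmd("RETR " + self.name, rest=self.pos)
        try:
            while received < size:
                buf = conn.recv(min(size - received, 8192))
                if not buf:
                    break
                chunks.append(buf)
                received += len(buf)
        finally:
            conn.close()
        try:
            self.ftp.voidresp()      # consume the server's final reply
        except ftplib.all_errors:
            pass                     # an aborted transfer is reported as an error
        self.pos += received
        return b"".join(chunks)
```

Every `read` call opens a new data connection, which is what makes this version slow.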
And then you can use it like:
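For example, assuming an `FtpFile` object as described above. The helper name `list_zip_names`, the host, the credentials and the paths are placeholders:

```python
import zipfile


def list_zip_names(fileobj):
    """Return the member names of a ZIP read through a seekable file-like object."""
    with zipfile.ZipFile(fileobj, "r") as archive:
        return archive.namelist()


# Hypothetical usage against a real server (all values are placeholders):
#
#   from ftplib import FTP
#   ftp = FTP("ftp.example.com", "user", "password")
#   ftp.cwd("/path/to/archives")
#   print(list_zip_names(FtpFile(ftp, "archive.zip")))
```

Only the end-of-archive records and the central directory are read; the file contents themselves are never transferred.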
The above implementation is rather trivial and inefficient. It starts numerous (three at minimum) downloads of small chunks of data to retrieve the list of contained files. It can be optimized by reading and caching larger chunks. But it should give you the idea.
In particular, you can make use of the fact that you are only going to read the listing. The listing (the central directory) is located at the end of a ZIP archive. So you can download just the last 10 KB or so of data up front, and you will be able to fulfill all `read` calls out of that cache.
Knowing that, you can actually do a small hack. As the listing is at the end of the archive, you can download only the end of the archive. While the downloaded ZIP will be broken, it can still be listed. This way, you won't need the `FtpFile` class. You can even download the listing to memory (`StringIO`; in Python 3, `io.BytesIO`).
If you get a `BadZipfile` exception (`BadZipFile` in Python 3) because 10 KB is too small to contain the whole listing, you can retry the code with a larger chunk.
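A rough sketch of that hack with the retry built in, assuming Python 3 and an already connected `ftplib.FTP` object. The function name `list_names_from_tail` and the growth factor of 4 are arbitrary choices for illustration. Depending on the Python version, a tail that is too short may surface as `ValueError` (a negative seek) rather than `BadZipFile`, so both are caught:

```python
import io
import zipfile


def list_names_from_tail(ftp, name, tail_size=10 * 1024):
    """List ZIP member names by downloading only the end of the archive."""
    ftp.voidcmd("TYPE I")            # SIZE is only reliable in binary mode
    size = ftp.size(name)
    while True:
        tail_size = min(tail_size, size)
        buffer = io.BytesIO()
        # REST makes the server start sending at the given offset.
        ftp.retrbinary("RETR " + name, buffer.write, rest=size - tail_size)
        try:
            with zipfile.ZipFile(buffer) as archive:
                return archive.namelist()
        except (zipfile.BadZipFile, ValueError):
            if tail_size >= size:
                raise                # the whole file was read; it really is broken
            tail_size *= 4           # listing did not fit, retry with a larger tail
```

For archives with very many entries, the listing itself can run to many megabytes, so the loop may need a few rounds before it succeeds.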