I am trying to read a file from an FTP server. The file is a .gz file. I would like to know if I can perform actions on this file while the socket is open. I tried to follow what was mentioned in two StackOverflow questions, one on reading files without writing to disk and one on reading files from FTP without downloading, but was not successful.
I know how to extract data/work on the downloaded file but I'm not sure if I can do it on the fly. Is there a way to connect to the site, get data in a buffer, possibly do some data extraction and exit?
When trying StringIO I got the error:
>>> from ftplib import FTP
>>> from StringIO import StringIO
>>> ftp = FTP('ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz')
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    ftp = FTP('ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz')
  File "C:\Python27\lib\ftplib.py", line 117, in __init__
    self.connect(host)
  File "C:\Python27\lib\ftplib.py", line 132, in connect
    self.sock = socket.create_connection((self.host, self.port), self.timeout)
  File "C:\Python27\lib\socket.py", line 553, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
gaierror: [Errno 11004] getaddrinfo failed
I just need to know how I can get the data into some variable and loop over it until the whole file has been read from FTP.
I appreciate your time and help. Thanks!
Make sure to log in to the FTP server first. After this, use retrbinary, which pulls the file in binary mode. It calls a callback on each chunk of the file. You can use this to load it into a string.
Bonus points: how about we decompress the string while we're at it?
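The answer's original snippet did not survive, so the following is only a minimal sketch of the retrbinary step, written for Python 3 (the question used Python 2.7). The helper name `fetch_bytes` is hypothetical, and the host and path in the usage comment are simply the two halves of the URL from the question.

```python
from ftplib import FTP


def fetch_bytes(ftp, remote_path, blocksize=8192):
    """Return the whole remote file as one bytes object.

    `ftp` is expected to be a connected, logged-in ftplib.FTP instance.
    """
    chunks = []
    # retrbinary calls the callback once for every chunk it receives.
    ftp.retrbinary('RETR ' + remote_path, chunks.append, blocksize)
    return b''.join(chunks)


# Usage against the server from the question (needs network access):
#   ftp = FTP('ftp.ncbi.nlm.nih.gov')   # host name only, not an ftp:// URL
#   ftp.login()                          # no arguments: anonymous login
#   data = fetch_bytes(ftp, 'pub/pmc/PMC-ids.csv.gz')
#   ftp.quit()
```

Note that `FTP()` expects a bare host name. Passing the full `ftp://...` URL makes it try to resolve the entire string as a host, which is what produces the `getaddrinfo failed` error in the question.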
Easy mode, using data string above
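The snippet that belonged here is missing. One plausible sketch, assuming `data` holds the complete compressed bytes collected by the retrbinary callback, wraps them in an in-memory file object and hands that to `gzip.GzipFile`. The function name `gunzip_bytes` is hypothetical; in Python 2.7, `StringIO` would play the role that `io.BytesIO` plays here.

```python
import gzip
import io


def gunzip_bytes(data):
    """Decompress a complete gzip payload that is already in memory."""
    # Wrap the bytes in a file-like object so GzipFile can read from it.
    with gzip.GzipFile(fileobj=io.BytesIO(data), mode='rb') as gz:
        return gz.read()
```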
Little bit better, full solution:
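The full solution itself was lost, so this is a hedged reconstruction in Python 3 rather than the answerer's code. It connects, logs in anonymously, collects the chunks, and decompresses the result in one call. The name `fetch_gzipped` and the `ftp_factory` parameter are assumptions added so the function can be exercised without a real server.

```python
import gzip
from ftplib import FTP


def fetch_gzipped(host, remote_path, ftp_factory=FTP):
    """Download a .gz file over FTP into memory and return it decompressed."""
    ftp = ftp_factory(host)   # host name only, e.g. 'ftp.ncbi.nlm.nih.gov'
    try:
        ftp.login()           # anonymous login; pass user/passwd if required
        chunks = []
        ftp.retrbinary('RETR ' + remote_path, chunks.append)
    finally:
        ftp.close()           # always release the connection
    return gzip.decompress(b''.join(chunks))


# Usage (needs network access):
#   text = fetch_gzipped('ftp.ncbi.nlm.nih.gov', 'pub/pmc/PMC-ids.csv.gz')
```

The whole file is held in memory twice for a moment (compressed and decompressed), which is fine for small files but may matter for large ones.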
In reality, it would be much better to decompress on the fly but I don't see a way to do that with the built in libraries (at least not easily).
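For what it is worth, one possible way to decompress on the fly with only the standard library is zlib's incremental decompressor, fed from the retrbinary callback. This is a sketch under assumptions, not something the answer proposed: `stream_gunzip` and `on_data` are hypothetical names, and the code assumes a single-member gzip file.

```python
import zlib


def stream_gunzip(ftp, remote_path, on_data):
    """Decompress a gzip file chunk by chunk while it downloads.

    `on_data` is called with each piece of decompressed bytes.
    """
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)

    def handle_chunk(chunk):
        out = decomp.decompress(chunk)
        if out:
            on_data(out)

    ftp.retrbinary('RETR ' + remote_path, handle_chunk)
    tail = decomp.flush()  # emit anything still buffered
    if tail:
        on_data(tail)
```

The pieces passed to `on_data` do not line up with line boundaries, so a caller that wants lines would still need to buffer and split them.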
That is not possible. To process data on the server, you need to have some sort of execution permissions, be it for a shell script you would send or SQL access.
FTP is pure file transfer; no execution is allowed. You will need to either enable SSH access, load the data into a database and access it with queries, or download the file with urllib and then process it locally.
In particular, I think the third one is the only zero-effort solution.
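The example that accompanied this answer is missing. A minimal Python 3 sketch of the third option (download with urllib, then process locally) might look like the following. The function names are hypothetical, UTF-8 is an assumed encoding, and no assumption is made about the columns in the CSV.

```python
import csv
import gzip
import io
from urllib.request import urlopen


def download(url):
    """Fetch the whole file at `url` into memory (urlopen accepts ftp:// URLs)."""
    with urlopen(url) as response:
        return response.read()


def iter_csv_rows(gz_bytes):
    """Yield rows from a gzip-compressed CSV that is already in memory."""
    raw = gzip.GzipFile(fileobj=io.BytesIO(gz_bytes), mode='rb')
    # newline='' lets the csv module handle line endings itself.
    with io.TextIOWrapper(raw, encoding='utf-8', newline='') as text:
        for row in csv.reader(text):
            yield row


# Usage (needs network access):
#   blob = download('ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz')
#   for row in iter_csv_rows(blob):
#       print(row)
```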
There are two easy ways I can think of to download a file using FTP and store it locally:
Using ftplib:
Using urllib:
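Both snippets were lost, so here is one hedged Python 3 sketch of each approach. The function names are hypothetical. In Python 2.7 the urllib variant would call `urllib.urlretrieve` instead of `urllib.request.urlretrieve`.

```python
from urllib.request import urlretrieve


def download_with_ftplib(ftp, remote_path, local_path):
    """Save a remote file to disk using a logged-in ftplib.FTP object."""
    with open(local_path, 'wb') as out:
        # Each received chunk is written straight to the local file.
        ftp.retrbinary('RETR ' + remote_path, out.write)


def download_with_urllib(url, local_path):
    """Save the file at `url` to disk and return the local path."""
    path, _headers = urlretrieve(url, local_path)
    return path


# Usage (needs network access):
#   from ftplib import FTP
#   ftp = FTP('ftp.ncbi.nlm.nih.gov')
#   ftp.login()
#   download_with_ftplib(ftp, 'pub/pmc/PMC-ids.csv.gz', 'PMC-ids.csv.gz')
#   ftp.quit()
#
#   download_with_urllib(
#       'ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz', 'PMC-ids.csv.gz')
```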
If you don't want to download and store it to a file, but want to process it gradually as it arrives, I suggest using urllib2 to read the file and print it line by line.
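The original snippet is not available. As a rough Python 3 equivalent (where `urllib2.urlopen` became `urllib.request.urlopen`), the sketch below reads the response as a stream and, since the file in the question is gzip-compressed, decompresses it while iterating. The function names and the UTF-8 default are assumptions.

```python
import gzip
import io
from urllib.request import urlopen


def iter_gzip_lines(fileobj, encoding='utf-8'):
    """Yield decoded lines from a binary file-like object holding gzip data."""
    raw = gzip.GzipFile(fileobj=fileobj, mode='rb')
    with io.TextIOWrapper(raw, encoding=encoding) as text:
        for line in text:
            yield line.rstrip('\n')


def print_remote_gzip_lines(url):
    """Print a remote .gz text file line by line as it is received."""
    with urlopen(url) as response:
        for line in iter_gzip_lines(response):
            print(line)


# Usage (needs network access):
#   print_remote_gzip_lines('ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz')
```

Because `iter_gzip_lines` only needs an object with a `read` method, it can also be pointed at a local file opened in binary mode.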