I am new to python. I can't figure out what I am doing wrong when trying to read the contents of .tar.gz file into python. The tarfile I would like to read is hosted at the following web address:
ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz
more info on file at this site (just so you can trust contents) http://www.pubmedcentral.nih.gov/utils/oa/oa.fcgi?id=PMC13901
The tarfile contains .pdf and .nxml copies of the journal article. And also a couple of image files.
If I open the file in my browser by copying and pasting. I can save to a location on my PC and import the tarfile fine using the following commands (note: winzip changes the file from .tar.gz to simply .tar when I save to location):
import tarfile
thetarfile = "C:/Users/dfcm/Documents/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar"
tfile = tarfile.open(thetarfile)
tfile
However, if I try to access the file directly using similar commands:
thetarfile = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
bbb = tarfile.open(thetarfile)
That results in the following error:
Traceback (most recent call last):
File "<pyshell#137>", line 1, in <module>
bbb = tarfile.open(thetarfile)
File "C:\Python30\lib\tarfile.py", line 1625, in open
return func(name, "r", fileobj, **kwargs)
File "C:\Python30\lib\tarfile.py", line 1687, in gzopen
fileobj = bltn_open(name, mode + "b")
File "C:\Python30\lib\io.py", line 278, in __new__
return open(*args, **kwargs)
File "C:\Python30\lib\io.py", line 222, in open
closefd)
File "C:\Python30\lib\io.py", line 615, in __init__
_fileio._FileIO.__init__(self, name, mode, closefd)
IOError: [Errno 22] Invalid argument: 'ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar'
Can anyone explain what I am doing wrong when trying to read the .tar.gz file directly from the web address? Thanks in advance. Chris
Unfortunately you cannot just open files from the network. Things are a bit more complex here. You have to instruct the interpreter to create a network request and create an object representing the request state. This can be done using the
urllib
module.The
ftpstream
object is a file-like that represents the connection to the ftp server. Then the tarfile module can access this stream. Since we do not pass the filename, we have to specify the compression in themode
parameter.