I am new to python. I am having trouble reading the contents of a tarfile into python.
The data are the contents of a journal article (hosted at pubmed central). See info below. And link to tarfile which I want to read into Python.
http://www.pubmedcentral.nih.gov/utils/oa/oa.fcgi?id=PMC13901 ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz
I have a list of similar .tar.gz file I will eventually want to read in as well. I think (know) all of the tarfiles have a .nxml file associated with them. It is the content of the .nxml files I am actually interested in extracting/reading. Open to any suggestions on the best way to do this...
Here is what I have if I save the tarfile to my PC. All runs as expected.
tarfile_name = "F:/PMC_OA_TextMining/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
tfile = tarfile.open(tarfile_name)
tfile_members = tfile.getmembers()
tfile_members1 = []
for i in range(len(tfile_members)):
tfile_members_name = tfile_members[i].name
tfile_members1.append(tfile_members_name)
tfile_members2 = []
for i in range(len(tfile_members1)):
if tfile_members1[i].endswith('.nxml'):
tfile_members2.append(tfile_members1[i])
tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()
I learned today that to in order to access the tarfile directly from the pubmed centrals FTP site I have to set up a network request using urllib
. Below is the revised code (and link to stackoverflow answer I received):
Read contents of .tar.gz file from website into a python 3.x object
tarfile_name = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
ftpstream = urllib.request.urlopen(tarfile_name)
tfile = tarfile.open(fileobj=ftpstream, mode="r|gz")
However, when I run the remaining piece of the code (below) I get an error message ("seeking backwards is not allowed"). How come?
tfile_members = tfile.getmembers()
tfile_members1 = []
for i in range(len(tfile_members)):
tfile_members_name = tfile_members[i].name
tfile_members1.append(tfile_members_name)
tfile_members2 = []
for i in range(len(tfile_members1)):
if tfile_members1[i].endswith('.nxml'):
tfile_members2.append(tfile_members1[i])
tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()
The code fails on the last line, where I try to read the .nxml content associated with my tarfile. Below is the actual error message I receive. What does it mean? What is my best workaround for reading/accessing the content of these .nxml files which are all embedded in tarfiles?
Traceback (most recent call last):
File "F:\PMC_OA_TextMining\test2.py", line 135, in <module>
tfile_extract1_text = tfile_extract1.read()
File "C:\Python30\lib\tarfile.py", line 804, in read
buf += self.fileobj.read()
File "C:\Python30\lib\tarfile.py", line 715, in read
return self.readnormal(size)
File "C:\Python30\lib\tarfile.py", line 722, in readnormal
self.fileobj.seek(self.offset + self.position)
File "C:\Python30\lib\tarfile.py", line 531, in seek
raise StreamError("seeking backwards is not allowed")
tarfile.StreamError: seeking backwards is not allowed
Thanks in advance for your help. Chris