Read Contents Tarfile into Python - "seeking backwards is not allowed"

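I am new to Python. I am having trouble reading the contents of a tarfile into Python.

The data is the content of a journal article (hosted at PubMed Central). See the information below, along with the link to the tarfile that I want to read into Python.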

http://www.pubmedcentral.nih.gov/utils/oa/oa.fcgi?id=PMC13901
ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz

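I also have a list of similar .tar.gz files that I will eventually want to read as well. I think (know) that every tarfile has a .nxml file associated with it. It is the content of the .nxml files that I am really interested in extracting and reading. I am open to any suggestions on the best way to do this.

Here is what I have when I save the tarfile to my PC. Everything runs as expected.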

import tarfile

tarfile_name = "F:/PMC_OA_TextMining/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
tfile = tarfile.open(tarfile_name)

# TarInfo objects for every member of the archive
tfile_members = tfile.getmembers()

# Names of all members
tfile_members1 = []
for i in range(len(tfile_members)):
    tfile_members_name = tfile_members[i].name
    tfile_members1.append(tfile_members_name)

# Names of the .nxml members only
tfile_members2 = []
for i in range(len(tfile_members1)):
    if tfile_members1[i].endswith('.nxml'):
        tfile_members2.append(tfile_members1[i])

# Read the first .nxml member
tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()

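I learned today that, in order to access the tarfile directly from PubMed Central's FTP site, I have to set up a network request using urllib. Below is the revised code (and a link to the StackOverflow answer I received):

Read contents of .tar.gz file from website into a python 3.x object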

import tarfile
import urllib.request

tarfile_name = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
ftpstream = urllib.request.urlopen(tarfile_name)
tfile = tarfile.open(fileobj=ftpstream, mode="r|gz")

However, when I run the remaining code (below) I get an error message ("seeking backwards is not allowed"). Why?

tfile_members = tfile.getmembers()

tfile_members1 = []
for i in range(len(tfile_members)):
    tfile_members_name = tfile_members[i].name
    tfile_members1.append(tfile_members_name)

tfile_members2 = []
for i in range(len(tfile_members1)):
    if tfile_members1[i].endswith('.nxml'):
        tfile_members2.append(tfile_members1[i])

tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()

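The code fails on the last line, where I try to read the .nxml content associated with my tarfile. Below is the actual error message I receive. What does it mean? What is the best workaround for reading the contents of these .nxml files, all of which are embedded in tarfiles?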

Traceback (most recent call last):
  File "F:\PMC_OA_TextMining\test2.py", line 135, in <module>
    tfile_extract1_text = tfile_extract1.read()
  File "C:\Python30\lib\tarfile.py", line 804, in read
    buf += self.fileobj.read()
  File "C:\Python30\lib\tarfile.py", line 715, in read
    return self.readnormal(size)
  File "C:\Python30\lib\tarfile.py", line 722, in readnormal
    self.fileobj.seek(self.offset + self.position)
  File "C:\Python30\lib\tarfile.py", line 531, in seek
    raise StreamError("seeking backwards is not allowed")
tarfile.StreamError: seeking backwards is not allowed

Thanks in advance for your help. Chris

Answer 1:

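What's going wrong: tar files are stored interleaved. They come in the order header, data, header, data, header, data, and so on. When you enumerated the files with getmembers(), you had already read through the entire file to get the headers. Then, when you asked the tarfile object to read the data, it tried to seek backward from the last header to the first data block. But you can't seek backward in a network stream without closing and reopening the urllib request.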
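One consequence of the interleaved layout is that reading each member's data right after its header is a forward-only operation, which stream mode (`r|gz`) does permit. The following is a minimal sketch of that idea, not a replacement for the fix described in this answer. The helper names `build_sample_archive`, `ForwardOnlyReader`, and `read_nxml_from_stream`, as well as the member names, are made up for illustration, and a small in-memory archive stands in for the FTP download so the snippet runs offline.

```python
import io
import tarfile


def build_sample_archive():
    """Return the bytes of a small .tar.gz with two made-up members."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, payload in [
            ("article/figure.txt", b"figure placeholder"),
            ("article/article.nxml", b"<article>demo</article>"),
        ]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


class ForwardOnlyReader:
    """Stand-in for a network stream: it offers read() but no seek()."""

    def __init__(self, data):
        self._source = io.BytesIO(data)

    def read(self, size=-1):
        return self._source.read(size)


def read_nxml_from_stream(fileobj):
    """Read every .nxml member in a single forward pass over a stream."""
    contents = {}
    with tarfile.open(fileobj=fileobj, mode="r|gz") as archive:
        for member in archive:  # headers arrive in archive order
            if member.isfile() and member.name.endswith(".nxml"):
                # Read the data now, while the stream is positioned on it.
                contents[member.name] = archive.extractfile(member).read()
    return contents


nxml_files = read_nxml_from_stream(ForwardOnlyReader(build_sample_archive()))
print(nxml_files)
```

The trade-off is that nothing can be revisited: once the loop has moved past a member, its data is gone unless it was saved at that moment.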

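How to work around it: download the file, save a temporary copy to disk or to an in-memory buffer (BytesIO in Python 3, since the archive is binary data), enumerate the files in this temporary copy, and then extract the files you want.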

#!/usr/bin/env python3
from io import BytesIO
import urllib.request
import tarfile

tarfile_url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
ftpstream = urllib.request.urlopen(tarfile_url)

# BytesIO creates an in-memory temporary file.
# See the Python manual: http://docs.python.org/3/library/io.html
tmpfile = BytesIO()
while True:
    # Download a piece of the file from the connection
    s = ftpstream.read(16384)

    # Once the entire file has been downloaded, read() returns b''
    # (the empty bytes) which is a falsey value
    if not s:
        break

    # Otherwise, write the piece of the file to the temporary file.
    tmpfile.write(s)
ftpstream.close()

# Now that the FTP stream has been downloaded to the temporary file,
# we can ditch the FTP stream and have the tarfile module work with
# the temporary file.  Begin by seeking back to the beginning of the
# temporary file.
tmpfile.seek(0)

# Now tell the tarfile module that you're using a file object
# that supports seeking backward.
# r|gz forbids seeking backward; r:gz allows seeking backward
tfile = tarfile.open(fileobj=tmpfile, mode="r:gz")

# You want to limit it to the .nxml files
tfile_members2 = [filename
                  for filename in tfile.getnames()
                  if filename.endswith('.nxml')]

tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()

# And when you're done extracting members:
tfile.close()
tmpfile.close()
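For the longer list of archives mentioned in the question, the same download-then-open idea could be wrapped in a function, with the temporary copy kept on disk instead of in memory. This is only a sketch under those assumptions: `read_nxml_members` and `build_sample_archive` are hypothetical names, and an in-memory stream stands in for the FTP response so the snippet runs without a network connection.

```python
import io
import shutil
import tarfile
import tempfile


def read_nxml_members(stream):
    """Copy a forward-only stream to a temp file on disk, then read .nxml members."""
    with tempfile.TemporaryFile() as tmpfile:
        shutil.copyfileobj(stream, tmpfile)  # copies the stream chunk by chunk
        tmpfile.seek(0)                      # rewind before handing it to tarfile
        # "r:gz" (with a colon) is the mode for a seekable file object
        with tarfile.open(fileobj=tmpfile, mode="r:gz") as archive:
            return {
                member.name: archive.extractfile(member).read()
                for member in archive.getmembers()
                if member.isfile() and member.name.endswith(".nxml")
            }


def build_sample_archive():
    """Return the bytes of a small .tar.gz with two made-up members."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, payload in [
            ("article/notes.txt", b"not an article"),
            ("article/article.nxml", b"<article>demo</article>"),
        ]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


# The BytesIO object plays the role of the response from urlopen().
sample_stream = io.BytesIO(build_sample_archive())
articles = read_nxml_members(sample_stream)
print(sorted(articles))
```

With a real download, the stream passed in would be the object returned by urllib.request.urlopen(url), so looping over a list of URLs can reuse the function unchanged.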

Answer 2:

I got the same error when fetching the file with requests.get, so I extracted everything to a tmp directory instead of using BytesIO or extractfile(member):

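import os
import tarfile
from lzma import LZMAFile

# `stream` is the file-like response body obtained via requests.get
# (assumption: e.g. the .raw attribute of a response requested with stream=True)
inputs = [tarfile.open(fileobj=LZMAFile(stream), mode='r|')]
t = "/tmp"
for tarfileobj in inputs:
    # Stream mode can extract everything in one forward pass
    tarfileobj.extractall(path=t, members=None)
for fn in os.listdir(t):
    with open(os.path.join(t, fn)) as payload:
        print(payload.read())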
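One caveat with extracting straight into /tmp is that os.listdir then returns every entry in that directory, not only the files from the archive. A variation, sketched here under assumptions, extracts into a private temporary directory and reads back only the .nxml files. The name `extract_and_read_nxml` and the sample archive are hypothetical, gzip compression is assumed to match the archives in the question, and the `filter` argument is passed only when the running Python version supports extraction filters.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def extract_and_read_nxml(stream):
    """Extract a streamed .tar.gz into a private temp dir and read its .nxml files."""
    contents = {}
    with tempfile.TemporaryDirectory() as target_dir:
        with tarfile.open(fileobj=stream, mode="r|gz") as archive:
            # Use the "data" extraction filter where it is available.
            kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
            archive.extractall(path=target_dir, **kwargs)
        # Only files that came out of this archive live under target_dir.
        for path in sorted(Path(target_dir).rglob("*.nxml")):
            relative_name = path.relative_to(target_dir).as_posix()
            contents[relative_name] = path.read_bytes()
    return contents


def build_sample_archive():
    """Return the bytes of a small .tar.gz with two made-up members."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, payload in [
            ("article/readme.txt", b"not an article"),
            ("article/article.nxml", b"<article>demo</article>"),
        ]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


extracted = extract_and_read_nxml(io.BytesIO(build_sample_archive()))
print(extracted)
```

The temporary directory is removed when the `with` block ends, so the contents are read into memory before leaving it.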

Source: Read Contents Tarfile into Python - "seeking backwards is not allowed"