Reading a memory-mapped bzip2 compressed file

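So I'm playing with the Wikipedia dump file. It's an XML file that has been bzip2-compressed. I can write all the files to directories, but then when I want to do analysis, I have to reread all the files on disk. That gives me random access, but it's slow. I have enough RAM to hold the entire bzipped file in memory.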

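I can load the dump file just fine and read all the lines, but I can't seek in it because it's gigantic. From what I can tell, the bz2 library has to read up to an offset before it can take me there (decompressing everything along the way, since the offset is in decompressed bytes).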

Anyway, I'm trying to mmap the dump file (~9.5 GB) and load it into bzip. Obviously I want to test this on a small bzip file first.

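I want to map the mmap'd file to a BZ2File so I can seek through it (to get to a specific, uncompressed byte offset), but from what I can tell, this is impossible without decompressing the entire mmap'd file (which would be well over 30 gigabytes).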

Do I have any options?

Here's some code I wrote to test.

import bz2
import mmap

# bz2.compress() needs bytes on Python 3, so the sample text is a bytes literal
lines = b'''This is my first line
This is the second
And the third
'''

with open("bz2TestFile", "wb") as f:
    f.write(bz2.compress(lines))

with open("bz2TestFile", "rb") as f:
    # access=ACCESS_READ is the portable spelling of a read-only mapping
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    print("Part of MMAPPED")
    # This does not work until I hit a minimum length
    # due to (I believe) the checksums in the bz2 algorithm
    #
    for x in range(len(mapped) + 2):
        line = mapped[0:x]
        try:
            print(x)
            print(bz2.decompress(line))
        except (OSError, ValueError):
            # Truncated or invalid stream: this prefix cannot be decompressed
            pass

# I can decompress the entire mmapped file
print(":entire mmap file:")
print(bz2.decompress(mapped))

# I can create a bz2File object from the file path
# Is there a way to map the mmap object to this function?
print(":BZ2 File readline:")
bzF = bz2.BZ2File("bz2TestFile")

# Seek to specific offset (in uncompressed bytes)
bzF.seek(22)
# Read the data
print(bzF.readline())
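
One possible answer to the question in the last comment, assuming Python 3.3 or later: there, `bz2.BZ2File` accepts an open binary file object as well as a path, and an `mmap` object has the `read` method it needs. The sketch below is only an illustration; the file name `bz2MmapDemo` is made up, and it skips forward with `read` instead of `seek` because I am not sure `BZ2File.seek` accepts an mmap on every Python version. Either way, everything before the target offset still gets decompressed.

```python
import bz2
import mmap

text = b"This is my first line\nThis is the second\nAnd the third\n"

# Hypothetical file name, used only for this demonstration.
with open("bz2MmapDemo", "wb") as f:
    f.write(bz2.compress(text))

with open("bz2MmapDemo", "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        # BZ2File pulls the compressed bytes straight from the mapping.
        with bz2.BZ2File(mapped) as bzf:
            # Skip 22 uncompressed bytes (the first line) by reading them.
            bzf.read(22)
            second_line = bzf.readline()
    finally:
        mapped.close()

print(second_line)
```

This keeps the compressed data in the page cache rather than in a Python bytes object, but it does not make access random: the cost of reaching an offset is still proportional to that offset.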

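This all makes me wonder, though: what is special about the BZ2File object that lets it read a line after seeking? Does it have to read every line before it so that the algorithm's checksums work out correctly?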
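
As far as I can tell, nothing special is going on: a seek in a plain bz2 stream can be emulated by decompressing from the start and discarding output until the target offset is reached. The sketch below shows that idea with `bz2.BZ2Decompressor`. The helper `read_at` and its `chunk_size` parameter are hypothetical names of my own, and it assumes the input is a single bz2 stream held in a bytes-like object (a bytes value or an mmap).

```python
import bz2


def read_at(compressed, offset, size, chunk_size=64 * 1024):
    """Return `size` uncompressed bytes starting at uncompressed `offset`.

    Hypothetical helper: decompresses a single bz2 stream from the
    beginning and throws away everything before `offset`.
    """
    decomp = bz2.BZ2Decompressor()
    skipped = 0           # uncompressed bytes discarded so far
    wanted = bytearray()  # uncompressed bytes kept so far
    for start in range(0, len(compressed), chunk_size):
        out = decomp.decompress(compressed[start:start + chunk_size])
        if skipped < offset:
            drop = min(offset - skipped, len(out))
            skipped += drop
            out = out[drop:]
        wanted += out
        if len(wanted) >= size or decomp.eof:
            break
    return bytes(wanted[:size])


# Sample data: fixed-width 12-byte lines, so line i starts at byte 12 * i.
data = b"".join(b"line %06d\n" % i for i in range(100000))
compressed = bz2.compress(data)
print(read_at(compressed, 1200, 12))
```

Memory use stays small because only the requested bytes are kept, but the time to reach an offset still grows with the offset, which is why seeking deep into a 30 GB stream is slow.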

I found an answer! James Taylor wrote a couple of scripts for seeking in BZ2 files, and his scripts are in the bx-python module.

https://bitbucket.org/james_taylor/bx-python/overview

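These work pretty well, although they don't allow seeking to an arbitrary byte offset in the BZ2 file. His scripts read out the BZ2 data blocks and allow seeking based on blocks.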

In particular, see bx-python/wiki/IO/SeekingInBzip2Files
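
To illustrate the general idea of block-based seeking, here is a rough sketch using only the standard library. It is not bx-python's API or method: instead of indexing the blocks inside an existing file, it recompresses the data as independent bz2 streams and records where each one starts. The names `compress_chunked` and `read_range` are hypothetical.

```python
import bz2
from bisect import bisect_right


def compress_chunked(data, chunk_size):
    """Compress `data` as independent bz2 streams and build an offset index."""
    parts, starts, comp_offsets = [], [], []
    comp_pos = 0
    for start in range(0, len(data), chunk_size):
        part = bz2.compress(data[start:start + chunk_size])
        starts.append(start)           # uncompressed offset of this chunk
        comp_offsets.append(comp_pos)  # compressed offset of this chunk
        parts.append(part)
        comp_pos += len(part)
    comp_offsets.append(comp_pos)      # sentinel: end of the last chunk
    return b"".join(parts), starts, comp_offsets


def read_range(blob, starts, comp_offsets, offset, size):
    """Read `size` uncompressed bytes at `offset`, decompressing only the chunks needed."""
    if not starts:
        return b""
    i = bisect_right(starts, offset) - 1  # chunk that contains `offset`
    skip = offset - starts[i]
    out = bytearray()
    while i < len(starts) and len(out) < skip + size:
        out += bz2.decompress(blob[comp_offsets[i]:comp_offsets[i + 1]])
        i += 1
    return bytes(out[skip:skip + size])


data = b"".join(b"line %06d\n" % i for i in range(100000))
blob, starts, comp_offsets = compress_chunked(data, 64 * 1024)
print(read_range(blob, starts, comp_offsets, 1200, 12))
```

The trade-off is a slightly worse compression ratio in exchange for reads that touch only one or two chunks. The concatenated `blob` is still a multi-stream bz2 file, which Python 3's `bz2.decompress` reads as a whole.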