Parsing a large .bz2 file (40 GB) with lxm

2019-02-13 11:12发布

问题:

I am trying to parse OpenStreetMap's planet.osm, compressed in bz2 format. Because it is already 41G, I don't want to decompress the file completely.

So I figured out how to parse portions of the planet.osm file using bz2 and lxml, using the following code

from lxml import etree as et
from bz2 import BZ2File

path = "where/my/fileis.osm.bz2"
with BZ2File(path) as xml_file:
    parser = et.iterparse(xml_file, events=('end',))
    for events, elem in parser:

        if elem.tag == "tag":
            continue
        if elem.tag == "node":
            (do something)


    ## Do some cleaning
    # Get rid of that element
    elem.clear()

    # Also eliminate now-empty references from the root node to node        
    while elem.getprevious() is not None:
        del elem.getparent()[0]

which works perfectly with the Geofabrick extracts. However, when I try to parse the planet-latest.osm.bz2 with the same script I get the error:

xml.etree.XMLSyntaxError: Specification mandate value for attribute num_change, line 3684, column 60

Here are the things I tried:

  • Check the planet-latest.osm.bz2 md5sum
  • Check the planet-latest.osm where the script with bz2 stops. There is no apparent error, and the attribute is called "num_changes", not "num_change" as indicated in the error
  • Also I did something stupid, but the error puzzled me: I opened the planet-latest.osm.bz2 in mode 'rb' [c = BZ2File('file.osm.bz2', 'rb')] and then passed c.read() to iterparse(), which returned me an error saying (very long string) cannot be opened. Strange thing, (very long string) ends right where the "Specification mandate value" error refers to...

Then I tried to decompress first the planet.osm.gz2 usin a simple

bzcat planet.osm.gz2 > planet.osm

And ran the parser directly on planet.osm. And... it worked! I am very puzzled by this, and could not find any pointer to why this may happen and how to solve this. My guess would be there is something going on between the decompression and the parsing, but I am not sure. Please help me understand!

回答1:

It turns out that the problem is with the compressed planet.osm file.

As indicated on the OSM Wiki, the planet file is compressed as a multistream file, and the bz2 python module cannot read multistream files. However, the bz2 documentation indicates an alternative module that can read such files, bz2file. I used it and it works perfectly!

So the code should read:

from lxml import etree as et
from bz2file import BZ2File

path = "where/my/fileis.osm.bz2"
with BZ2File(path) as xml_file:
    parser = et.iterparse(xml_file, events=('end',))
    for events, elem in parser:

        if elem.tag == "tag":
            continue
        if elem.tag == "node":
            (do something)


    ## Do some cleaning
    # Get rid of that element
    elem.clear()

    # Also eliminate now-empty references from the root node to node        
    while elem.getprevious() is not None:
        del elem.getparent()[0]

Also, doing some research on using the PBF format (as advised in the comments), I stumbled upon imposm.parser, a python module that implements a generic parser for OSM data (in pbf or xml format). You may want to have a look at this!



回答2:

As an alternative you can use the output of bzcat command (which can handle multistream files too):

p = subprocess.Popen(["bzcat", "data.bz2"], stdout=subprocess.PIPE)
parser = et.iterparse(p.stdout, ...)
# at the end just check that p.returncode == 0 so there were no errors