I am trying to parse OpenStreetMap's planet.osm, compressed in bz2 format. Because it is already 41 GB, I don't want to decompress the file completely.
So I worked out how to parse portions of the planet.osm file using bz2 and lxml, with the following code:
    from lxml import etree as et
    from bz2 import BZ2File

    path = "where/my/fileis.osm.bz2"

    with BZ2File(path) as xml_file:
        # Stream the document: only 'end' events, one element at a time
        parser = et.iterparse(xml_file, events=('end',))
        for event, elem in parser:
            if elem.tag == "tag":
                continue
            if elem.tag == "node":
                pass  # (do something)

            ## Do some cleaning
            # Get rid of that element
            elem.clear()

            # Also eliminate now-empty references from the root node to node
            while elem.getprevious() is not None:
                del elem.getparent()[0]
This works perfectly with the Geofabrik extracts. However, when I try to parse planet-latest.osm.bz2 with the same script, I get the error:
xml.etree.XMLSyntaxError: Specification mandate value for attribute num_change, line 3684, column 60
Here are the things I tried:
- Checked the md5sum of planet-latest.osm.bz2.
- Checked planet-latest.osm at the position where the bz2 script stops. There is no apparent error there, and the attribute is called "num_changes", not "num_change" as indicated in the error.
- I also did something stupid, but the resulting error puzzled me: I opened planet-latest.osm.bz2 in mode 'rb' [c = BZ2File('file.osm.bz2', 'rb')] and then passed c.read() to iterparse(), which raised an error saying (very long string) cannot be opened. The strange thing is that (very long string) ends right where the "Specification mandate value" error points to...
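For the last attempt in the list above: as far as I understand, iterparse() expects a file name or a file-like object, so raw data passed as a string is treated as a path, which would explain the "cannot be opened" message. A minimal sketch of wrapping in-memory data in io.BytesIO instead is below. The helper name node_ids and the tiny sample document are made up for illustration, and this only makes sense for data small enough to fit in memory, not for the full planet file.

```python
import bz2
import io

try:
    from lxml import etree as et
except ImportError:
    # Fallback only so the sketch runs without lxml installed
    import xml.etree.ElementTree as et


def node_ids(xml_bytes):
    """Return the ids of all <node> elements in an in-memory XML document.

    Hypothetical helper, for illustration only.
    """
    ids = []
    # BytesIO turns the bytes into a file-like object that iterparse can read
    for event, elem in et.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "node":
            ids.append(elem.get("id"))
        elem.clear()
    return ids


# Tiny made-up sample, compressed and decompressed in memory
sample = bz2.compress(
    b'<osm><node id="1"/><node id="2"><tag k="a" v="b"/></node></osm>'
)
print(node_ids(bz2.decompress(sample)))
```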
Then I tried decompressing planet.osm.bz2 first, using a simple
bzcat planet.osm.bz2 > planet.osm
Then I ran the parser directly on planet.osm and it worked! I am very puzzled by this, and could not find any pointer to why this happens or how to solve it. My guess is that something goes wrong between the decompression and the parsing, but I am not sure. Please help me understand!
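One way to probe that guess would be to check whether the archive is a single bz2 stream or several streams concatenated together. This is only an assumption about the planet file, not something confirmed here. A reader that handles just one stream would stop at the first stream boundary, which could cut the XML in the middle of an attribute. The sketch below uses bz2.BZ2Decompressor and its eof and unused_data attributes (Python 3.3+); the function name count_bz2_streams and the limit parameter are hypothetical.

```python
import bz2
import io


def count_bz2_streams(fileobj, chunk_size=1 << 20, limit=2):
    """Count concatenated bz2 streams in a binary file object.

    Stops early once `limit` streams have been seen, so a huge
    archive does not have to be read to the end.
    """
    count = 0
    decomp = bz2.BZ2Decompressor()
    data = fileobj.read(chunk_size)
    while data:
        decomp.decompress(data)  # output is discarded, we only count
        if decomp.eof:
            # End of one stream: leftover bytes belong to the next one
            count += 1
            if count >= limit:
                return count
            data = decomp.unused_data
            decomp = bz2.BZ2Decompressor()
            if not data:
                data = fileobj.read(chunk_size)
        else:
            data = fileobj.read(chunk_size)
    return count


# Made-up two-stream archive: an XML document split across two streams
blob = bz2.compress(b'<osm><changeset num_change') + bz2.compress(b's="1"/></osm>')
print(count_bz2_streams(io.BytesIO(blob)))
```

On a real file this would be called as count_bz2_streams(open(path, "rb")). A result greater than 1 would mean the decompression step needs to handle multiple streams.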