Loading huge XML files and dealing with MemoryError

Posted 2019-04-06 20:27

Question:

I have a very large XML file (20GB to be exact, and yes, I need all of it). When I attempt to load the file, I receive this error:

Python(23358) malloc: *** mmap(size=140736680968192) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "file.py", line 5, in <module>
    code = xml.read()
MemoryError

This is the current code I have, to read the XML file:

from bs4 import BeautifulSoup
xml = open('pages_full.xml', 'r')
code = xml.read()
xml.close()
soup = BeautifulSoup(code)

Now, how would I go about eliminating this error so that I can continue working on the script? I could try splitting the file into separate files, but since I don't know how that would affect BeautifulSoup or the XML data, I'd rather not do that.

(The XML data is a database dump from a wiki I volunteer on; I'm using it to import data from different time periods, drawing directly on the contents of many pages.)

Answer 1:

Do not use BeautifulSoup to try to parse such a large XML file. Use the ElementTree API instead. Specifically, use the iterparse() function to parse your file as a stream, handle information as you are notified of elements, then delete the elements again:

from xml.etree import ElementTree as ET

parser = ET.iterparse(filename)

for event, element in parser:
    # element is a complete, fully parsed element
    if element.tag == 'yourelement':
        # do something with this element
        # then clean up
        element.clear()

With this event-driven approach you never need to hold the whole XML document in memory; you only extract what you need and discard the rest.
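For a file this size it can also help to clear already-processed children from the root element, so they are not kept alive between iterations. Here is a minimal sketch of that pattern; the tag name 'page' and the 'title' lookup are placeholders for whatever elements your wiki dump actually contains:

from xml.etree import ElementTree as ET

def iter_elements(filename, tag='page'):
    # Request both 'start' and 'end' events so we can grab the root element
    context = ET.iterparse(filename, events=('start', 'end'))
    _, root = next(context)  # the first 'start' event yields the root

    for event, element in context:
        if event == 'end' and element.tag == tag:
            yield element       # hand the fully parsed element to the caller
            element.clear()     # drop its children to free memory
            root.clear()        # drop references the root keeps to processed elements

# Hypothetical usage; adjust the tag and field names to your dump:
for page in iter_elements('pages_full.xml', tag='page'):
    title = page.findtext('title')
    # ... process the page here ...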

See the iterparse() tutorial and documentation.

Alternatively, you can also use the lxml library; it offers the same API in a faster and more feature-rich package.
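If you go the lxml route, its iterparse() additionally accepts a tag argument, so you are only handed the elements you care about. A rough sketch, with 'yourelement' again standing in for the real element name:

from lxml import etree

# 'yourelement' is a placeholder; substitute the element name from your dump.
for event, element in etree.iterparse('pages_full.xml', tag='yourelement'):
    # do something with the element, then free memory:
    element.clear()
    # also drop references held by already-processed preceding siblings
    while element.getprevious() is not None:
        del element.getparent()[0]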