I'm trying to parse the DMOZ content/structures XML files into MySQL, but all the existing scripts for this are very old and don't work well. How can I open a large (1 GB+) XML file in PHP for parsing?
I would suggest using a SAX-based parser rather than a DOM-based one.
Info on using SAX in PHP: http://www.brainbell.com/tutorials/php/Parsing_XML_With_SAX.htm
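To make the SAX suggestion concrete, here is a minimal sketch using PHP's expat-style xml_parser functions. The sample document and the element names (catalog, page, title) are made up for illustration and are not the DMOZ format; for a real file you would fopen() the file itself instead of the in-memory stream used here.

```php
<?php
// Hypothetical sample standing in for a large file; element names are assumptions.
$xml = '<catalog><page url="http://example.com/a"><title>First</title></page>'
     . '<page url="http://example.com/b"><title>Second</title></page></catalog>';

$titles  = [];
$current = '';
$inTitle = false;

$parser = xml_parser_create();
// Keep element names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    function ($p, $name, $attrs) use (&$inTitle, &$current) {
        if ($name === 'title') {
            $inTitle = true;
            $current = '';
        }
    },
    function ($p, $name) use (&$inTitle, &$current, &$titles) {
        if ($name === 'title') {
            $titles[] = $current;
            $inTitle  = false;
        }
    }
);
// Text may arrive in several pieces, so accumulate it until the end tag.
xml_set_character_data_handler($parser, function ($p, $data) use (&$inTitle, &$current) {
    if ($inTitle) {
        $current .= $data;
    }
});

// Feed the parser small pieces, as you would with fread() on a real file handle.
$stream = fopen('php://memory', 'r+');
fwrite($stream, $xml);
rewind($stream);
while (($chunk = fread($stream, 64)) !== false && $chunk !== '') {
    if (!xml_parse($parser, $chunk, false)) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }
}
xml_parse($parser, '', true); // signal end of input
fclose($stream);
unset($parser);

print_r($titles);
```

Only the current chunk and whatever you choose to keep in $titles live in memory, which is the point of the approach; in a real import you would write each record to MySQL in the end-element handler instead of collecting it.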
This is very similar to the question "Best way to process large XML in PHP", but it has a well-upvoted answer addressing the specific problem of parsing the DMOZ catalogue. However, since this is a good Google hit for large XML files in general, I will repost my answer from the other question as well:
My take on it:
https://github.com/prewk/XmlStreamer
A simple class that extracts all children of the XML root element while streaming the file. Tested on a 108 MB XML file from pubmed.com.
There are only two PHP APIs that are really suited for processing large files. The first is the old expat API, and the second is the newer XMLReader functions. These APIs read continuous streams rather than loading the entire tree into memory (which is what SimpleXML and DOM do).
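As a rough illustration of the XMLReader side, the sketch below walks a file node by node and only looks at the elements it cares about. The file contents and the page/url names are assumptions made up for the demo, not the DMOZ schema.

```php
<?php
// Write a small hypothetical document to a temp file so open() can be shown.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents(
    $path,
    '<catalog><page url="http://example.com/a"/><page url="http://example.com/b"/></catalog>'
);

$reader = new XMLReader();
$reader->open($path);

$urls = [];
// read() advances one node at a time; the whole tree is never built.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'page') {
        $urls[] = $reader->getAttribute('url');
    }
}

$reader->close();
unlink($path);

print_r($urls);
```

Memory use stays roughly constant regardless of file size, as long as you process each node when you see it rather than collecting everything in an array as this demo does.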
For an example, you might want to look at this partial parser of the DMOZ-catalog:
I've recently had to parse some pretty large XML documents, and needed a method to read one element at a time.
If you have the following file, complex-test.xml:
And wanted to return the <Object/>s:
PHP:
This is an old post, but it is the first Google search result, so I thought I'd post another solution based on this post:
http://drib.tech/programming/parse-large-xml-files-php
This solution uses both XMLReader and SimpleXMLElement:
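The code from the linked post is not reproduced here. A minimal sketch of how the two are commonly combined, assuming a hypothetical document whose records are <Object> elements with a <name> child, could look like this: XMLReader streams to each record, and SimpleXMLElement parses only that one record.

```php
<?php
// Hypothetical sample; for a real file use $reader->open('/path/to/file.xml').
$xml = '<root><Object><name>one</name></Object><Other/>'
     . '<Object><name>two</name></Object></root>';

$reader = new XMLReader();
$reader->XML($xml);

$names = [];

// Advance to the first <Object> element.
while ($reader->read() && $reader->name !== 'Object') {
    // keep reading
}

// Handle one <Object> at a time, then jump to the next one.
while ($reader->name === 'Object') {
    // Only this single element is expanded into a SimpleXMLElement.
    $node    = new SimpleXMLElement($reader->readOuterXml());
    $names[] = (string) $node->name;

    // next() skips the subtree that was just processed.
    $reader->next('Object');
}

$reader->close();

print_r($names);
```

This keeps the convenient SimpleXML syntax for each record while memory use is bounded by the size of the largest single record rather than the whole file.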
This isn't a great solution, but just to throw another option out there:
You can break many large XML files up into chunks, especially those that are really just lists of similar elements (as I suspect the file you're working with would be).
e.g., if your doc looks like:
You can read it in a meg or two at a time, artificially wrap the few complete <listing> tags you loaded in a root-level tag, and then load them via simplexml/domxml (I used domxml when taking this approach).
Frankly, I prefer this approach if you're using PHP < 5.1.2. With 5.1.2 and higher, XMLReader is available, which is probably the best option, but before that, you're stuck with either the above chunking strategy or the old SAX/expat lib. And I don't know about the rest of you, but I HATE writing/maintaining SAX/expat parsers.
Note, however, that this approach is NOT really practical when your document doesn't consist of many identical bottom-level elements (e.g., it works great for any sort of list of files, or URLs, etc., but wouldn't make sense for parsing a large HTML document)
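For what it's worth, the chunking idea can be sketched roughly as follows. This is only an illustration under strong assumptions: the records are flat <listing> elements with no attributes and no nesting, and the <listings> root, the <url> child, and the <chunk> wrapper are names I made up. It uses simplexml rather than domxml, and a tiny chunk size so the buffering is actually exercised.

```php
<?php
// Hypothetical list-style document built in memory for the demo.
$xml = '<listings>';
for ($i = 1; $i <= 5; $i++) {
    $xml .= "<listing><url>http://example.com/$i</url></listing>";
}
$xml .= '</listings>';

$stream = fopen('php://memory', 'r+');
fwrite($stream, $xml);
rewind($stream);

$urls     = [];
$buffer   = '';
$openTag  = '<listing>';
$closeTag = '</listing>';

while (!feof($stream)) {
    // 50 bytes for the demo; a megabyte or two for a real file.
    $buffer .= fread($stream, 50);

    // Find the last complete record currently in the buffer.
    $end = strrpos($buffer, $closeTag);
    if ($end === false) {
        continue; // no complete record yet, read more
    }
    $end  += strlen($closeTag);
    $start = strpos($buffer, $openTag);
    if ($start === false || $start >= $end) {
        continue;
    }

    // Cut out the complete records and keep the unfinished tail for later.
    $complete = substr($buffer, $start, $end - $start);
    $buffer   = substr($buffer, $end);

    // Wrap the records in an artificial root so they form a valid document.
    $doc = simplexml_load_string('<chunk>' . $complete . '</chunk>');
    if ($doc === false) {
        throw new RuntimeException('Chunk was not well-formed XML');
    }
    foreach ($doc->listing as $listing) {
        $urls[] = (string) $listing->url;
    }
}
fclose($stream);

print_r($urls);
```

The string matching is the fragile part: attributes on the record tag, nested records, or the tag text appearing inside CDATA would all break it, which is why a real streaming parser is usually the safer choice when one is available.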