How can one parse HTML/XML and extract information from it?
There are many ways to process HTML/XML DOM of which most have already been mentioned. Hence, I won't make any attempt to list those myself.
I merely want to add that I personally prefer using the DOM extension, and why:
And while I miss the ability to use CSS selectors for DOMDocument, there is a rather simple and convenient way to add this feature: subclassing DOMDocument and adding JS-like querySelectorAll and querySelector methods to your subclass.
For parsing the selectors, I recommend using the very minimalistic CssSelector component from the Symfony framework. This component just translates CSS selectors to XPath selectors, which can then be fed into a DOMXpath to retrieve the corresponding NodeList.
You can then use this (still very low-level) subclass as a foundation for more high-level classes, intended to e.g. parse very specific types of XML or add more jQuery-like behavior.
The code below comes straight out of my DOM-Query library and uses the technique I described.
For HTML parsing:
See also Parsing XML documents with CSS selectors by Symfony's creator Fabien Potencier on his decision to create the CssSelector component for Symfony and how to use it.
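As a rough illustration of the subclassing technique described above (this is not the DOM-Query code), here is a minimal sketch. The class name QueryableDocument and the helper cssToXPath() are hypothetical. The helper only understands "tag", "#id", ".class" and combinations like "tag.class", and stands in for a real translator; with Symfony's CssSelector installed, one would presumably call CssSelectorConverter::toXPath() there instead.

```php
<?php
declare(strict_types=1);

/**
 * Sketch of a DOMDocument subclass with JS-like query methods.
 * Hypothetical example; not taken from any existing library.
 */
class QueryableDocument extends DOMDocument
{
    /**
     * Translate a tiny CSS subset (tag, #id, .class) to XPath.
     * Assumption: a stand-in for a full CSS-to-XPath translator.
     */
    public static function cssToXPath(string $selector): string
    {
        $pattern = '/^([a-z][a-z0-9]*)?(?:#([\w-]+))?(?:\.([\w-]+))?$/i';
        if ($selector === '' || !preg_match($pattern, $selector, $m)) {
            throw new InvalidArgumentException("Unsupported selector: $selector");
        }

        // Unmatched groups are either empty strings or missing entirely.
        $tag   = ($m[1] ?? '') !== '' ? $m[1] : '*';
        $id    = $m[2] ?? '';
        $class = $m[3] ?? '';

        $xpath = '//' . $tag;
        if ($id !== '') {
            $xpath .= "[@id='$id']";
        }
        if ($class !== '') {
            // Match one class token inside a space-separated class attribute.
            $xpath .= "[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";
        }
        return $xpath;
    }

    /** Return all nodes matching the selector. */
    public function querySelectorAll(string $selector): DOMNodeList
    {
        $xpath = new DOMXPath($this);
        return $xpath->query(self::cssToXPath($selector));
    }

    /** Return the first matching node, or null if there is none. */
    public function querySelector(string $selector): ?DOMNode
    {
        return $this->querySelectorAll($selector)->item(0);
    }
}

libxml_use_internal_errors(true);
$doc = new QueryableDocument();
$doc->loadHTML('<div id="main"><p class="intro lead">Hi</p><p>Bye</p></div>');

echo $doc->querySelector('p.intro')->textContent, "\n";
echo $doc->querySelectorAll('p')->length, "\n";
```

Keeping the translation step in one method means the toy translator can be swapped for a full one without touching the query methods.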
If you're familiar with jQuery selectors, you can use ScarletsQuery for PHP.
This library usually takes less than 1 second to process offline HTML.
It also accepts invalid HTML or missing quotes on tag attributes.
I've created a library called HTML5DOMDocument that is freely available at https://github.com/ivopetkov/html5-dom-document-php
It also supports query selectors, which I think will be extremely helpful in your case. Here is some example code:
Just use DOMDocument->loadHTML() and be done with it. libxml's HTML parsing algorithm is quite good and fast, and contrary to popular belief, does not choke on malformed HTML.
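To make that concrete, here is a small sketch that feeds deliberately sloppy markup to DOMDocument->loadHTML(). The sample HTML and variable names are made up for illustration. The call to libxml_use_internal_errors() is my addition, so that parse warnings are collected quietly instead of being printed.

```php
<?php
declare(strict_types=1);

// Deliberately malformed: unclosed <li> elements, unquoted attribute
// values, and no <html>/<body> wrapper.
$html = '<ul><li><a href=/one>One</a><li><a href=/two>Two</a></ul>';

// Collect libxml's parse warnings instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Map each link target to its link text.
$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $links[$a->getAttribute('href')] = trim($a->textContent);
}

print_r($links);
```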
Why you shouldn't and when you should use regular expressions?
First off, a common misconception: regexps are not for "parsing" HTML. Regexes can, however, "extract" data. Extracting is what they're made for. The major drawbacks of regex HTML extraction compared with proper SGML toolkits or baseline XML parsers are the syntactic effort and varying reliability.
Consider that making a somewhat dependable HTML extraction regex:
is way less readable than a simple phpQuery or QueryPath equivalent:
There are however specific use cases where they can help.
HTML comments (<!--) are sometimes the more useful anchors for extraction. In particular, pseudo-HTML variations like <$var> or SGML residues are easy to tame with regexps.
It's sometimes even advisable to pre-extract a snippet of HTML using a regular expression such as /<!--CONTENT-->(.+?)<!--END-->/ and process the remainder using the simpler HTML parser frontends.
Note: I actually have this app, where I employ XML parsing and regular expressions alternatively. Just last week the PyQuery parsing broke, and the regex still worked. Yes, weird, and I can't explain it myself. But so it happened.
So please don't vote real-world considerations down just because they don't match the regex=evil meme. But let's also not vote this up too much. It's just a sidenote for this topic.
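The pre-extraction idea mentioned above (cut out a region with a regex, then hand only that region to a real parser) could look roughly like this. The sample page is invented, and the "s" modifier is my addition so that "." also matches newlines; treat it as a sketch rather than a recipe.

```php
<?php
declare(strict_types=1);

// Made-up page with comment markers around the interesting part.
$page = <<<HTML
<html><body>
<div class="nav">ignored</div>
<!--CONTENT-->
<h1>Title</h1>
<p>First paragraph.</p>
<!--END-->
<div class="footer">ignored</div>
</body></html>
HTML;

// Step 1: cut out the region between the two comment markers.
// The "s" modifier (an addition here) lets "." match newlines.
if (!preg_match('/<!--CONTENT-->(.+?)<!--END-->/s', $page, $m)) {
    throw new RuntimeException('Content markers not found');
}

// Step 2: hand only that snippet to the DOM parser.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($m[1]);
libxml_clear_errors();

$title = $doc->getElementsByTagName('h1')->item(0)->textContent;
$paragraphs = $doc->getElementsByTagName('p')->length;

echo $title, ' / ', $paragraphs, "\n";
```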