How can one parse HTML/XML and extract information from it?
相关问题
- Views base64 encoded blob in HTML with PHP
- Laravel Option Select - Default Issue
- Illegal to have multiple roots (start tag in epilo
- PHP Recursively File Folder Scan Sorted by Modific
- Correctly parse PDF paragraphs with Python
Native XML Extensions
I prefer using one of the native XML extensions since they come bundled with PHP, are usually faster than all the 3rd party libs and give me all the control I need over the markup.
DOM
DOM is capable of parsing and modifying real world (broken) HTML and it can do XPath queries. It is based on libxml.
It takes some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a language-agnostic interface, you'll find implementations in many languages, so if you need to change your programming language, chances are you will already know how to use that language's DOM API then.
A basic usage example can be found in Grabbing the href attribute of an A element and a general conceptual overview can be found at DOMDocument in php
How to use the DOM extension has been covered extensively on StackOverflow, so if you choose to use it, you can be sure most of the issues you run into can be solved by searching/browsing Stack Overflow.
XMLReader
XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML Parser Module, so chances are using XMLReader for parsing broken HTML might be less robust than using DOM where you can explicitly tell it to use libxml's HTML Parser Module.
A basic usage example can be found at getting all values from h1 tags using php
XML Parser
The XML Parser library is also based on libxml, and implements a SAX style XML push parser. It may be a better choice for memory management than DOM or SimpleXML, but will be more difficult to work with than the pull parser implemented by XMLReader.
SimpleXml
SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXml because it will choke.
A basic usage example can be found at A simple program to CRUD node and node values of xml file and there is lots of additional examples in the PHP Manual.
3rd Party Libraries (libxml based)
If you prefer to use a 3rd-party lib, I'd suggest using a lib that actually uses DOM/libxml underneath instead of string parsing.
FluentDom - Repo
HtmlPageDom
phpQuery (not updated for years)
Also see: https://github.com/electrolinux/phpquery
Zend_Dom
QueryPath
fDOMDocument
sabre/xml
FluidXML
3rd-Party (not libxml-based)
The benefit of building upon DOM/libxml is that you get good performance out of the box because you are based on a native extension. However, not all 3rd-party libs go down this route. Some of them listed below
PHP Simple HTML DOM Parser
I generally do not recommend this parser. The codebase is horrible and the parser itself is rather slow and memory hungry. Not all jQuery Selectors (such as child selectors) are possible. Any of the libxml based libraries should outperform this easily.
PHP Html Parser
Again, I would not recommend this parser. It is rather slow with high CPU usage. There is also no function to clear memory of created DOM objects. These problems scale particularly with nested loops. The documentation itself is inaccurate and misspelled, with no responses to fixes since 14 Apr 16.
Ganon
Never used it. Can't tell if it's any good.
HTML 5
You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you want to consider using a dedicated parser, like
html5lib
We might see more dedicated parsers once HTML5 is finalized. There is also a blogpost by the W3's titled How-To for html 5 parsing that is worth checking out.
WebServices
If you don't feel like programming PHP, you can also use Web services. In general, I found very little utility for these, but that's just me and my use cases.
YQL
ScraperWiki.
Regular Expressions
Last and least recommended, you can extract data from HTML with regular expressions. In general using Regular Expressions on HTML is discouraged.
Most of the snippets you will find on the web to match markup are brittle. In most cases they are only working for a very particular piece of HTML. Tiny markup changes, like adding whitespace somewhere, or adding, or changing attributes in a tag, can make the RegEx fails when it's not properly written. You should know what you are doing before using RegEx on HTML.
HTML parsers already know the syntactical rules of HTML. Regular expressions have to be taught for each new RegEx you write. RegEx are fine in some cases, but it really depends on your use-case.
You can write more reliable parsers, but writing a complete and reliable custom parser with regular expressions is a waste of time when the aforementioned libraries already exist and do a much better job on this.
Also see Parsing Html The Cthulhu Way
Books
If you want to spend some money, have a look at
I am not affiliated with PHP Architect or the authors.
There are several reasons to not parse HTML by regular expression. But, if you have total control of what HTML will be generated, then you can do with simple regular expression.
Above it's a function that parses HTML by regular expression. Note that this function is very sensitive and demands that the HTML obey certain rules, but it works very well in many scenarios. If you want a simple parser, and don't want to install libraries, give this a shot:
Third party alternatives to SimpleHtmlDom that use DOM instead of String Parsing: phpQuery, Zend_Dom, QueryPath and FluentDom.
Another option you can try is QueryPath. It's inspired by jQuery, but on the server in PHP and used in Drupal.
I created a library named PHPPowertools/DOM-Query, which allows you to crawl HTML5 and XML documents just like you do with jQuery.
Under the hood, it uses symfony/DomCrawler for conversion of CSS selectors to XPath selectors. It always uses the same DomDocument, even when passing one object to another, to ensure decent performance.
Example use :
Supported methods :
NOTE :
The library also includes its own zero-configuration autoloader for PSR-0 compatible libraries. The example included should work out of the box without any additional configuration. Alternatively, you can use it with composer.
XML_HTMLSax
is rather stable - even if it's not maintained any more. Another option could be to pipe you HTML through Html Tidy and then parse it with standard XML tools.