How can one parse HTML/XML and extract information from it?
phpQuery and QueryPath are extremely similar in replicating the fluent jQuery API. That's also why they're two of the easiest approaches to properly parse HTML in PHP.
Examples for QueryPath
Basically you first create a queryable DOM tree from an HTML string:
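A minimal sketch of that first step, assuming QueryPath is installed via Composer and using its documented `htmlqp()` entry point:

```php
<?php
// Sketch assuming QueryPath is available (composer require querypath/querypath).
require 'vendor/autoload.php';

$html = '<html><body><div id="content"><a href="/about">About</a></div></body></html>';

// htmlqp() builds a queryable DOM tree from an HTML string;
// qp() is its stricter XML-oriented counterpart.
$qp = htmlqp($html);
```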
The resulting object contains a complete tree representation of the HTML document. It can be traversed using DOM methods. But the common approach is to use CSS selectors like in jQuery:
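A hedged example of such a selector query, using QueryPath's documented `find()`, `attr()`, `text()` and `top()` methods (the sample markup is invented for illustration):

```php
<?php
require 'vendor/autoload.php'; // QueryPath assumed, as above

$html = '<div class="post"><h2>Title</h2><a class="more" href="/read">more</a></div>';

$qp = htmlqp($html);
// CSS selectors, jQuery-style:
echo $qp->find('a.more')->attr('href'), "\n";    // prints the link target
echo $qp->top()->find('.post h2')->text(), "\n"; // prints the heading text
```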
Mostly you want to use simple `#id`, `.class` or `DIV` tag selectors for `->find()`. But you can also use XPath statements, which are sometimes faster. Typical jQuery methods like `->children()`, `->text()` and particularly `->attr()` simplify extracting the right HTML snippets (and already have their SGML entities decoded).

QueryPath also allows injecting new tags into the stream (`->append`), and later outputting and prettifying the updated document (`->writeHTML`). It can not only parse malformed HTML, but also various XML dialects (with namespaces), and can even extract data from HTML microformats (XFN, vCard).
phpQuery or QueryPath?
Generally QueryPath is better suited for manipulating documents, while phpQuery also implements some pseudo-AJAX methods (just HTTP requests) to more closely resemble jQuery. phpQuery is said to often be faster than QueryPath (because it has fewer features overall).
For further information on the differences see this comparison on the wayback machine from tagbyte.org. (Original source went missing, so here's an internet archive link. Yes, you can still locate missing pages, people.)
And here's a comprehensive QueryPath introduction.
Advantages
Combined selector lists work too, for example `->find("a img, a object, div a")`.
We have created quite a few crawlers for our needs before. At the end of the day, simple regular expressions usually do the job best. While the libraries listed above are good for the purpose they were created for, if you know exactly what you are looking for, regular expressions can be a safer way to go, since they also handle invalid HTML/XHTML structures that would cause most parsers to fail.
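To illustrate the trade-off, here is a small regex-based extraction (the markup and pattern are invented for illustration): it tolerates sloppy attribute quoting that a strict parser might reject, but it will also break on edge cases a real parser handles.

```php
<?php
// Illustrative only: pulling href attributes out of anchor tags with a regex.
$html = '<p><a href="/one">one</a> <A HREF=\'/two\'>two</A></p>';

// Case-insensitive; tolerates single, double, or missing quotes.
preg_match_all('/<a\s[^>]*href=["\']?([^"\'\s>]+)/i', $html, $m);

$hrefs = $m[1]; // ['/one', '/two']
```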
This sounds like a good task description for W3C XPath technology. It's easy to express queries like "return all `href` attributes in `img` tags that are nested in `<foo><bar><baz>` elements." Not being a PHP buff, I can't tell you in what form XPath may be available. If you can call an external program to process the HTML file, you should be able to use a command line version of XPath. For a quick intro, see http://en.wikipedia.org/wiki/XPath.

I have written a general purpose XML parser that can easily handle GB files. It's based on XMLReader and it's very easy to use.
Here's the github repo: XmlExtractor
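For the XPath approach mentioned above, no external program is needed: PHP exposes XPath natively through its bundled DOM extension. A minimal sketch (sample markup invented for illustration):

```php
<?php
// Native XPath in PHP via the bundled DOM extension.
$html = '<div><p><img src="a.png"/></p><span><img src="b.png"/></span></div>';

$doc = new DOMDocument();
// Suppress warnings that real-world malformed HTML would trigger.
@$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// All src attributes of img tags nested anywhere under a div:
$srcs = [];
foreach ($xpath->query('//div//img/@src') as $attr) {
    $srcs[] = $attr->value;
}
// $srcs is now ['a.png', 'b.png']
```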
Try Simple HTML DOM Parser
Examples:
How to get HTML elements:
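A sketch based on the library's own documented examples, assuming `simple_html_dom.php` is available (the URL is a placeholder):

```php
<?php
include 'simple_html_dom.php'; // Simple HTML DOM assumed

// file_get_html() fetches and parses a page in one step.
$html = file_get_html('http://www.example.com/');

// jQuery-style selectors; attributes are plain properties:
foreach ($html->find('img') as $element) {
    echo $element->src, "\n";
}
foreach ($html->find('a') as $element) {
    echo $element->href, "\n";
}
```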
How to modify HTML elements:
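Again following the library's documented examples (assuming `simple_html_dom.php` is available), attributes and inner text can be assigned directly:

```php
<?php
include 'simple_html_dom.php'; // Simple HTML DOM assumed

$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

// find($selector, $index) returns the n-th match:
$html->find('div', 1)->class = 'bar';
$html->find('div[id=hello]', 0)->innertext = 'foo';

echo $html; // <div id="hello">foo</div><div id="world" class="bar">World</div>
```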
Extract content from HTML:
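The `plaintext` property strips all tags; a one-liner per the library's docs (URL is a placeholder, library assumed installed):

```php
<?php
include 'simple_html_dom.php'; // Simple HTML DOM assumed

// Dump the text content of an entire page:
echo file_get_html('http://www.example.com/')->plaintext;
```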
Scraping Slashdot:
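This mirrors the library's classic demo; Slashdot's markup has long since changed, so treat the selectors as purely illustrative:

```php
<?php
include 'simple_html_dom.php'; // Simple HTML DOM assumed

$html = file_get_html('http://slashdot.org/');

$articles = [];
foreach ($html->find('div.article') as $article) {
    $item['title']   = $article->find('div.title', 0)->plaintext;
    $item['intro']   = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);
```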
With FluidXML you can query and iterate XML using XPath and CSS Selectors.
https://github.com/servo-php/fluidxml
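A hedged sketch of an XPath query with FluidXML, based on its README; the `fluidify()` import function and `each()` callback signature are assumptions drawn from that documentation:

```php
<?php
require 'vendor/autoload.php'; // servo-php/fluidxml assumed

use function FluidXml\fluidify;

// fluidify() imports an existing XML string into a fluent, queryable object.
$doc = fluidify('<book><chapter>Hi</chapter><chapter>Bye</chapter></book>');

// query() accepts XPath; each() iterates the matched DOM nodes.
$doc->query('//chapter')
    ->each(function ($i, \DOMNode $node) {
        echo $i, ': ', $node->textContent, "\n";
    });
```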