How can one parse HTML/XML and extract information from it?
For HTML5, html5lib has been abandoned for years now. The only HTML5 library I can find with recent updates and maintenance records is html5-php, which was just brought to beta 1.0 a little over a week ago.
The Symfony framework has components that can parse HTML, and you can use CSS selectors to pick out DOM elements instead of writing XPath.
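As a minimal sketch of this, using Symfony's DomCrawler component together with the CssSelector component (both installable via Composer; the markup and selectors below are made up for the example):

```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = '<html><body><div class="post"><p>First</p><p>Second</p></div></body></html>';
$crawler = new Crawler($html);

// filter() takes a CSS selector (the CssSelector component translates it
// to XPath behind the scenes), so no hand-written XPath is needed.
$crawler->filter('div.post p')->each(function (Crawler $node) {
    echo $node->text(), "\n";
});
```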
Yes, you can use simple_html_dom for the purpose. However, I have worked quite a lot with simple_html_dom, particularly for web scraping, and have found it to be too fragile. It does the basic job, but I wouldn't recommend it anyway.
I have never used cURL for this purpose, but what I have learned is that cURL can fetch the page much more efficiently and is much more solid; you then parse the downloaded HTML separately. Kindly check out this link: scraping-websites-with-curl
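As a minimal sketch of that division of labour (the URL is illustrative): cURL downloads the markup, and PHP's built-in DOMDocument parses it.

```php
<?php
// Fetch the page with cURL.
$ch = curl_init('https://example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);

// Parse with DOMDocument; suppress warnings from messy real-world markup.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Example extraction: print every link's href attribute.
foreach ($dom->getElementsByTagName('a') as $a) {
    echo $a->getAttribute('href'), "\n";
}
```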
QueryPath is good, but be careful of "tracking state": if you don't realise what it means, you can waste a lot of debugging time trying to find out what happened and why the code doesn't work.

What it means is that each call on the result set modifies the result set stored inside the object; it's not chainable like jQuery, where each link in the chain is a new set. You have a single set, the results from your query, and each function call modifies that single set.

In order to get jQuery-like behaviour, you need to branch before you do a filter/modify-like operation; that way it will mirror what happens in jQuery much more closely, as the two sketches below show.
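First, the pitfall. qp() and find() are real QueryPath functions, but the markup here is illustrative, and the comments describe the state-tracking behaviour discussed above:

```php
<?php
require 'vendor/autoload.php'; // querypath/querypath via Composer

$html = "<div><p><input name='forename'/></p></div>"; // illustrative markup

$qp = qp($html);
$results = $qp->find('div p');

// Per the state-tracking behaviour described above, this find() narrows
// the one result set held inside the object, not a fresh copy.
$forename = $results->find("input[name='forename']");
```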
`$results` now contains the result set for `input[name='forename']`, NOT the original `"div p"` query.
This tripped me up a lot. What I found was that QueryPath tracks the filters and finds and everything else that modifies your results, and stores them in the object. You need to do this instead:
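Second, the fix. branch() is a real QueryPath method; the markup stays illustrative:

```php
<?php
require 'vendor/autoload.php'; // querypath/querypath via Composer

$html = "<div><p><input name='forename'/></p></div>"; // illustrative markup

$qp = qp($html);
$results = $qp->find('div p');

// branch() copies the current result set, so the find() below narrows
// the copy while $results keeps its "div p" matches.
$forename = $results->branch()->find("input[name='forename']");
```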
Now `$results` won't be modified, and you can reuse the result set again and again. Perhaps somebody with much more knowledge can clear this up a bit, but it's basically like this from what I've found.

Simple HTML DOM is a great open-source parser:
simplehtmldom.sourceforge
It treats DOM elements in an object-oriented way, and the new iteration has a lot of coverage for non-compliant code. There are also some great functions like you'd see in JavaScript, such as the find() function, which returns all instances of elements with a given tag name.
I've used this in a number of tools, testing it on many different types of web pages, and I think it works great.
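For illustration, a minimal sketch using Simple HTML DOM's documented entry points, file_get_html() and find() (the URL and selectors are made up):

```php
<?php
include 'simple_html_dom.php';

// Download and parse in one step.
$html = file_get_html('https://example.com/');

// find() returns all matching elements, jQuery-style.
foreach ($html->find('a') as $anchor) {
    echo $anchor->href, "\n";
}

// CSS-like selectors work too, e.g. images inside posts.
foreach ($html->find('div.post img') as $img) {
    echo $img->src, "\n";
}

$html->clear(); // free memory when done
```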
One general approach I haven't seen mentioned here is to run HTML through Tidy, which can be set to spit out guaranteed-valid XHTML. Then you can use any old XML library on it.
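A minimal sketch of that pipeline, assuming the PHP tidy extension is installed (the sample markup is illustrative):

```php
<?php
$messy = '<html><body><p>Unclosed paragraph<br></body></html>';

// Repair the markup into well-formed XHTML.
$tidy = new tidy();
$xhtml = $tidy->repairString($messy, ['output-xhtml' => true], 'utf8');

// Now any XML parser will accept it.
$dom = new DOMDocument();
$dom->loadXML($xhtml);

foreach ($dom->getElementsByTagName('p') as $p) {
    echo $p->textContent, "\n";
}
```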
But for your specific problem, you should take a look at this project: http://fivefilters.org/content-only/ -- it's a modified version of the Readability algorithm, designed to extract just the textual content (not the headers and footers) from a page.