How can one parse HTML/XML and extract information from it?
JSON and array from XML in three lines:
Ta da!
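The snippet this answer refers to is not preserved here. A minimal sketch of one way to do such a three-line conversion, assuming PHP's built-in SimpleXML and JSON extensions, might look like the following. The sample XML string and the variable names are made up for illustration.

```php
<?php
// Hypothetical sample input; any well-formed XML string would do.
$xmlString = '<books><book><title>PHP Basics</title><year>2010</year></book></books>';

$xml   = simplexml_load_string($xmlString); // parse into a SimpleXMLElement
$json  = json_encode($xml);                 // serialize the element tree as JSON
$array = json_decode($json, true);          // decode the JSON into nested associative arrays

print_r($array);
```

Note that this round trip is lossy for some documents: attributes end up under an `@attributes` key, and details such as namespaces and mixed content may not survive, so it is best suited to simple, data-like XML.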
For 1a and 2: I would vote for the new Symfony component DomCrawler. This class allows queries similar to CSS selectors. Take a look at this presentation for real-world examples: news-of-the-symfony2-world.
The component is designed to work standalone and can be used without Symfony.
The only drawback is that it will only work with PHP 5.3 or newer.
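As a rough sketch of what a CSS-selector-style query with DomCrawler can look like: the example below assumes the `symfony/dom-crawler` and `symfony/css-selector` packages have been installed with Composer (the latter is needed for `filter()`), and the HTML fragment and selector are invented for illustration.

```php
<?php
// Assumption: installed via `composer require symfony/dom-crawler symfony/css-selector`.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical HTML to query.
$html = '<html><body><ul class="nav">'
      . '<li><a href="/home">Home</a></li>'
      . '<li><a href="/about">About</a></li>'
      . '</ul></body></html>';

$crawler = new Crawler($html);

// Select every link inside the list with class "nav" and collect its href attribute.
$hrefs = $crawler->filter('ul.nav a')->each(function (Crawler $node) {
    return $node->attr('href');
});

print_r($hrefs);
```

The same crawler also offers `filterXPath()` for cases where an XPath expression is a better fit than a CSS selector.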
This is commonly referred to as screen scraping, by the way. The library I have used for this is Simple HTML DOM Parser.
You could try using something like HTML Tidy to clean up any "broken" HTML and convert it to XHTML, which you can then parse with an XML parser.
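The Tidy-then-XML-parser approach can be sketched as follows, assuming the `tidy` and SimpleXML extensions are available. The broken markup and the choice of configuration options are illustrative assumptions, not a recommended configuration.

```php
<?php
// Assumption: PHP is built with the tidy and SimpleXML extensions.
// Hypothetical broken HTML: unclosed <p> and <b> tags.
$brokenHtml = '<p>First<p>Second <b>bold';

// Ask Tidy to repair the markup and emit XHTML.
$xhtml = tidy_repair_string($brokenHtml, array(
    'output-xhtml'     => true,   // produce well-formed XHTML
    'numeric-entities' => true,   // avoid named entities an XML parser may not know
    'doctype'          => 'omit', // no DOCTYPE, so the XML parser needs no DTD
    'wrap'             => 0,      // do not re-wrap long lines
), 'utf8');

// The repaired document can now be handled by an ordinary XML parser.
$doc = simplexml_load_string($xhtml);

// Tidy's XHTML output lives in the XHTML namespace, so register a prefix for XPath.
$doc->registerXPathNamespace('x', 'http://www.w3.org/1999/xhtml');
$paragraphs = $doc->xpath('//x:p');

echo count($paragraphs), " paragraph(s) found\n";
```

Registering the namespace prefix is the step that is easiest to miss: without it, an XPath query such as `//p` matches nothing in the repaired document.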
I recommend PHP Simple HTML DOM Parser.
It really has nice features, like:
Advanced HTML Dom is a replacement for Simple HTML DOM that offers the same interface, but it is DOM-based, which means none of the associated memory issues occur.
It also has full CSS support, including jQuery extensions.