I would like to grab data from a table without using regular expressions. I've enjoyed using SimpleXML for parsing RSS feeds and would like to know whether it can be used to grab a table from another page.
E.g., fetch the page with cURL or simply file_get_contents(), then use SimpleXML to extract the contents?
You can use the loadHTML() method of DOMDocument from the DOM extension, and then import that DOM into SimpleXML via simplexml_import_dom():
$html = file_get_contents('http://example.com/');
$doc = new DOMDocument();
$doc->loadHTML($html);
$sxml = simplexml_import_dom($doc);
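Once the page is available as a SimpleXML object, the table itself can be reached with XPath. The following is only a minimal sketch: the inline markup, the table id "prices" and the $rows variable are made up for illustration and stand in for whatever the fetched page actually contains.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = <<<HTML
<html><body>
<table id="prices">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Apple</td><td>1.20</td></tr>
  <tr><td>Pear</td><td>0.95</td></tr>
</table>
</body></html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$sxml = simplexml_import_dom($doc);

// Collect the text of every header and data cell, row by row.
$rows = [];
foreach ($sxml->xpath('//table[@id="prices"]//tr') as $tr) {
    $cells = [];
    // The relative path is evaluated with the current <tr> as context node.
    foreach ($tr->xpath('th|td') as $cell) {
        $cells[] = trim((string) $cell);
    }
    $rows[] = $cells;
}

print_r($rows);
```

$rows then holds one array of cell strings per table row. Using //tr instead of a fixed child path keeps the query working whether or not the markup wraps its rows in a tbody element.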
If this is XHTML, then yes, it's definitely possible. True XHTML is just XML in the end, so it can be parsed with an XML parser.
SimpleXML, however, only accepts well-formed XML. If you can't get valid XHTML, it looks like passing it through the more lenient DOMDocument class first will do the trick (source here):
<?php
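$html = file_get_contents('http://...');
$doc = new DOMDocument();
// Do not throw DOMException on DOM errors.
$doc->strictErrorChecking = FALSE;
// loadHTML() uses libxml's lenient HTML parser, so malformed markup is repaired where possible.
$doc->loadHTML($html);
// Wrap the resulting DOM tree in a SimpleXMLElement.
$xml = simplexml_import_dom($doc);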
?>
My version, which tolerates markup errors and encoding problems:
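// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->strictErrorChecking = FALSE;
// Convert non-ASCII characters to HTML entities so loadHTML() does not misread UTF-8 input.
// Note: the 'HTML-ENTITIES' target of mb_convert_encoding() is deprecated as of PHP 8.2.
$doc->loadHTML(mb_convert_encoding($this->html_content, 'HTML-ENTITIES', 'UTF-8'));
// Discard the collected errors and restore the previous error-handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);
$xml = simplexml_import_dom($doc);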
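If you would rather see what the parser complained about than discard it silently, the collected errors can be read with libxml_get_errors() before they are cleared. This is a rough sketch under stated assumptions: the malformed markup is invented for illustration, and which problems (if any) get reported depends on the installed libxml version.

```php
<?php
// Hypothetical malformed markup: an unclosed <p>, an unknown tag and an unclosed <tr>.
$html = '<html><body><p>Unclosed paragraph<foo>bar</foo>'
      . '<table><tr><td>1</td></table></body></html>';

// Buffer libxml errors instead of turning them into PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Copy the buffered LibXMLError objects, then reset the buffer and the old mode.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Report whatever the parser recorded; the list may be empty.
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// The repaired tree is still usable from SimpleXML.
$xml = simplexml_import_dom($doc);
$cells = $xml->xpath('//td');
```

Saving the return value of libxml_use_internal_errors(true) and passing it back afterwards restores whatever mode the calling code had set, rather than forcing it to false.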
It may depend on the page. If the page is well-formed XHTML, any XML parser should do; otherwise look for an SGML parser or another error-tolerant parser. Here's a similar question you might be interested in: Error Tolerant HTML/XML/SGML parsing in PHP