Can simplexml be used to rifle through html?

Posted 2020-02-08 06:48

Question:

I would like to grab data from a table without using regular expressions. I've enjoyed using simplexml for parsing RSS feeds and would like to know if it can be used to grab a table from another page.

E.g., grab the page with curl or simply file_get_contents(), then use simplexml to grab the contents?

Answer 1:

You can use the loadHTML method of DOMDocument (part of the DOM extension), and then import the resulting document into SimpleXML via simplexml_import_dom:

$html = file_get_contents('http://example.com/');
$doc = new DOMDocument();
$doc->loadHTML($html);
$sxml = simplexml_import_dom($doc);
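
To tie this back to the original question about grabbing a table, here is a minimal sketch of the same approach followed by an XPath query. The sample markup, the table id "prices", and the variable names are assumptions made for illustration; a real page would be fetched with file_get_contents() or curl as above.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = '<html><body><table id="prices">'
      . '<tr><th>Item</th><th>Price</th></tr>'
      . '<tr><td>Apple</td><td>1.20</td></tr>'
      . '<tr><td>Pear</td><td>0.95</td></tr>'
      . '</table></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$sxml = simplexml_import_dom($doc);

// Select only rows that contain <td> cells, which skips the header row.
$rows = [];
foreach ($sxml->xpath('//table[@id="prices"]//tr[td]') as $tr) {
    $cells = [];
    foreach ($tr->td as $td) {
        $cells[] = trim((string) $td);
    }
    $rows[] = $cells;
}

print_r($rows);
```

Using //tr rather than table/tr keeps the query working whether or not the markup wraps its rows in a tbody element.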


Answer 2:

If this is XHTML, then yes, it's definitely possible. True XHTML is just XML in the end, so it can be parsed with an XML parser.
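
As a rough illustration of the well-formed case, the sketch below feeds a small XHTML string straight to simplexml_load_string(). The markup and the prefix "x" are assumptions chosen for the example. Because XHTML elements live in the http://www.w3.org/1999/xhtml namespace, XPath queries need a registered prefix.

```php
<?php
// Hypothetical, well-formed XHTML document.
$xhtml = '<?xml version="1.0" encoding="UTF-8"?>'
       . '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Demo</title></head>'
       . '<body><table><tr><td>one</td><td>two</td></tr></table></body></html>';

$sxml = simplexml_load_string($xhtml);
if ($sxml === false) {
    throw new RuntimeException('Input is not well-formed XML');
}

// Bind a prefix to the XHTML namespace so XPath can match the elements.
$sxml->registerXPathNamespace('x', 'http://www.w3.org/1999/xhtml');

$cells = [];
foreach ($sxml->xpath('//x:table//x:td') as $td) {
    $cells[] = (string) $td;
}

print_r($cells);
```

Without the namespace registration, a plain //td query would match nothing in this document.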

SimpleXML, however, only accepts well-formed XML. If you can't get valid XHTML, passing the markup through the more lenient DOMDocument class first appears to do the trick (source here):

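<?php
  $html = file_get_contents('http://...');
  $doc = new DOMDocument();
  // Do not throw DOMException on errors (the default is TRUE).
  $doc->strictErrorChecking = FALSE;
  // loadHTML() tolerates malformed markup, though it may still emit warnings.
  $doc->loadHTML($html);
  $xml = simplexml_import_dom($doc);
?>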


Answer 3:

My version, which is tolerant of parse errors and encoding problems:

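// Collect libxml parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->strictErrorChecking = FALSE;
// Convert non-ASCII characters to HTML entities so loadHTML() does not misread UTF-8 input.
// Note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2.
$doc->loadHTML(mb_convert_encoding($this->html_content, 'HTML-ENTITIES', 'UTF-8'));
// Discard the collected errors, then restore normal error reporting.
libxml_clear_errors();
libxml_use_internal_errors(false);
$xml = simplexml_import_dom($doc);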
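
If you would rather inspect the parse problems than silently drop them, something along these lines may help. It is only a sketch: the broken markup is made up for the example, and the exact messages depend on the libxml version installed.

```php
<?php
// Hypothetical broken markup: an unclosed <p> and a stray </div>.
$html = '<html><body><p>Unclosed paragraph'
      . '<table><tr><td>cell</td></tr></table></div></body></html>';

// Remember the previous setting so it can be restored afterwards.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Fetch whatever the parser complained about, then empty the buffer.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// The document is still usable despite the reported problems.
$xml = simplexml_import_dom($doc);
$cells = $xml->xpath('//td');
echo (string) $cells[0], "\n";
```

Restoring the previous value of libxml_use_internal_errors(), rather than hard-coding false, avoids changing the behaviour of other code that had already enabled internal error handling.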


Answer 4:

It may depend on the page. If the page is in XHTML (most web pages nowadays), then any XML parser should do; otherwise, look for an SGML parser. Here's a similar question you might be interested in: Error Tolerant HTML/XML/SGML parsing in PHP