Can SimpleXML be used to rifle through HTML?

Posted 2020-02-08 06:17

I would like to grab data from a table without using regular expressions. I've enjoyed using SimpleXML for parsing RSS feeds and would like to know whether it can also be used to grab a table from another page.

E.g., fetch the page with cURL or simply file_get_contents(), then use SimpleXML to extract the contents?

4 answers
虎瘦雄心在
Reply #2 · 2020-02-08 06:46

If this is XHTML, then yes, it's definitely possible. True XHTML is just XML in the end, so it can be parsed with an XML parser.
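As a rough sketch of the XHTML case: the markup below is a made-up sample, not a real page. One thing to watch for is that XHTML documents normally declare the namespace http://www.w3.org/1999/xhtml, so XPath queries need a registered prefix (the prefix `x` here is an arbitrary choice).

```php
<?php
// Hypothetical, well-formed XHTML standing in for a fetched page.
$xhtml = '<?xml version="1.0" encoding="UTF-8"?>'
       . '<html xmlns="http://www.w3.org/1999/xhtml">'
       . '<head><title>Sample</title></head>'
       . '<body><table><tr><td>Apple</td><td>1.20</td></tr></table></body>'
       . '</html>';

// simplexml_load_string() returns false if the input is not well-formed XML.
$sxml = simplexml_load_string($xhtml);
if ($sxml === false) {
    throw new RuntimeException('Input is not well-formed XML');
}

// Elements live in the XHTML namespace, so bind a prefix before using XPath.
$sxml->registerXPathNamespace('x', 'http://www.w3.org/1999/xhtml');

// Collect the text of every table cell.
$cells = [];
foreach ($sxml->xpath('//x:td') as $td) {
    $cells[] = (string) $td;
}

print_r($cells);
```

Without the registered prefix, a plain `//td` query would find nothing, because the cells are not in the empty namespace.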

SimpleXML, however, only accepts well-formed XML. If you can't get valid XHTML, it looks like passing the markup through the less strict DOMDocument library first will do the trick (source here):

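<?php
  $html = file_get_contents('http://...');
  // Collect libxml parse warnings internally instead of emitting PHP warnings.
  libxml_use_internal_errors(true);
  $doc = new DOMDocument();
  $doc->strictErrorChecking = false;
  // loadHTML() uses libxml's lenient HTML parser, so malformed markup is accepted.
  $doc->loadHTML($html);
  libxml_clear_errors();
  // Hand the resulting DOM tree over to SimpleXML.
  $xml = simplexml_import_dom($doc);
?>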
家丑人穷心不美
Reply #3 · 2020-02-08 06:57

You can use the loadHTML function from the DOM module, and then import that DOM into SimpleXML via simplexml_import_dom:

$html = file_get_contents('http://example.com/');
$doc = new DOMDocument();
$doc->loadHTML($html);
$sxml = simplexml_import_dom($doc);
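To tie this back to the question about grabbing a table, here is a minimal sketch of the step after the import. The sample markup and the id "prices" are invented for illustration; a real page would be fetched as shown above and would need its own XPath expression.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = '<html><body><table id="prices">'
      . '<tr><th>Item</th><th>Price</th></tr>'
      . '<tr><td>Apple</td><td>1.20</td></tr>'
      . '<tr><td>Pear</td><td>0.95</td></tr>'
      . '</table></body></html>';

// Keep parser warnings out of the output.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$sxml = simplexml_import_dom($doc);

// Select only rows that contain data cells, which skips the header row.
$rows = [];
foreach ($sxml->xpath('//table[@id="prices"]//tr[td]') as $tr) {
    $cells = [];
    foreach ($tr->td as $td) {
        $cells[] = trim((string) $td);
    }
    $rows[] = $cells;
}

print_r($rows);
```

The `//tr` step (rather than `/tr`) is deliberate, so the query still matches if the rows are wrapped in a `tbody` element.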
孤傲高冷的网名
Reply #4 · 2020-02-08 06:59

My version, which tolerates parse errors and encoding problems:

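// Suppress libxml warnings about malformed markup while parsing.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
// Convert non-ASCII characters to entities so loadHTML() does not misread UTF-8 input.
// Note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2.
$doc->loadHTML(mb_convert_encoding($this->html_content, 'HTML-ENTITIES', 'UTF-8'));
// Discard the collected errors, then restore normal error reporting.
libxml_clear_errors();
libxml_use_internal_errors(false);
$xml = simplexml_import_dom($doc);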
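If silently discarding the parse errors is too blunt, they can be read back before being cleared. The following is only a sketch with an invented, deliberately sloppy snippet of markup; which problems libxml actually reports depends on the libxml2 version, so the loop may print nothing.

```php
<?php
// Hypothetical malformed markup: the <p> element is never closed.
$html = '<div><p>Unclosed paragraph<table><tr><td>cell</td></tr></table></div>';

// Remember the previous setting so it can be restored afterwards.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Fetch whatever the parser complained about, then empty the error buffer.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Each entry is a LibXMLError object with line, code and message properties.
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// The document is still usable despite the sloppy markup.
$xml = simplexml_import_dom($doc);
$cell = (string) $xml->xpath('//td')[0];
echo $cell, "\n";
```

Restoring the previous value returned by libxml_use_internal_errors(), rather than hard-coding false, avoids changing the behaviour of surrounding code that relies on internal error handling.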
唯我独甜
Reply #5 · 2020-02-08 07:00

It may depend on the page. If the page is XHTML (as most web pages are nowadays), then any XML parser should do; otherwise, look for an SGML parser. Here's a similar question you might be interested in: Error Tolerant HTML/XML/SGML parsing in PHP
