Bulletproofing SimpleXMLElement

2019-03-20 17:31发布

问题:

Everyone knows that we should always use DOM techniques instead of regexes to extract content from HTML, but I get the feeling that I can never trust the SimpleXML extension or similar ones.

I'm coding a OpenID implementation right now, and I tried using SimpleXML to do the HTML discovery - but my very first test (with alixaxel.myopenid.com) yielded a lot of errors:

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 27: parser error : Opening and ending tag mismatch: link line 11 and head in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: </head> in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 64: parser error : Entity 'copy' not defined in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: &copy; 2008 <a href="http://janrain.com/">JanRain, Inc.</a> in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 66: parser error : Entity 'trade' not defined in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: myOpenID&trade; and the myOpenID&trade; website are in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 66: parser error : Entity 'trade' not defined in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: myOpenID&trade; and the myOpenID&trade; website are in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 77: parser error : Opening and ending tag mismatch: link line 8 and html in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: </html> in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 78: parser error : Premature end of data in tag head line 3 in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 78: parser error : Premature end of data in tag html line 2 in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6

I recall there was a way to make SimpleXML always parse a file, independently if the document contains errors or not - I can't remember the specific implementation though, but I think it involved using DOMDocument. What is the best way to make sure SimpleXML always parses any given document?

And please don't suggest using Tidy, I think the extension is slow and it's not available on many systems.

回答1:

You can load the HTML with DOM's loadHTML then import the result to SimpleXML.

IIRC, it will still choke on some stuff but it will accept pretty much anything that exists in the real world of broken websites.

$html = '<html><head><body><div>stuff & stuff</body></html>';

// disable PHP errors
$old = libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html);

// restore the old behaviour
libxml_use_internal_errors($old);

$sxe = simplexml_import_dom($dom);
die($sxe->asXML());


回答2:

you could always try a SAX parser... A bit more robust to errors.

May not be as efficient on large XML.