Reading in Malformed XML (unencoded XML entities)

Posted 2020-02-12 23:21

Question:

I'm having some trouble parsing malformed XML in PHP. In particular, I'm querying a third-party web service that returns data as XML without encoding the XML entities in the actual data. For example, one of the elements contains an ASCII heart, '<3' (without the quotes), which the XML parser sees as an opening tag. It should be '&lt;3'.

Right now I'm simply passing the XML string into a SimpleXMLElement, which, predictably, fails in these cases. I've done some looking around and it seems the PHP Tidy package might be able to help me, but the number of configuration options is overwhelming :(

Thus, I'm just wondering if anyone else has had a problem like this and, if so, how they were able to solve it.

Thanks!

Answer 1:

Try tidy.repairString:

php > $tidy = new tidy();
php > $repaired = $tidy->repairString("<foo>I <3 Philadelphia</foo>", array("input-xml"=>1));
php > print($repaired);
<foo>I &lt;3 Philadelphia</foo>
php > $el = new SimpleXMLElement($repaired);
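
The session above can be wrapped into a small helper that only calls tidy when the normal parse fails. This is a minimal sketch, not a definitive implementation: the function name parseLenientXml is hypothetical, the 'output-xml' option and the 'utf8' encoding argument are assumptions about what the feed needs, and the repair step is skipped when the tidy extension is not loaded.

```php
<?php
// Hypothetical helper: try a normal parse first, fall back to tidy repair.
function parseLenientXml(string $xml): ?SimpleXMLElement
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $el = simplexml_load_string($xml);

    if ($el === false && extension_loaded('tidy')) {
        libxml_clear_errors();
        $tidy = new tidy();
        // 'input-xml' tells tidy to treat the input as XML rather than HTML.
        // 'output-xml' and 'utf8' are assumptions for this sketch.
        $repaired = $tidy->repairString(
            $xml,
            ['input-xml' => true, 'output-xml' => true],
            'utf8'
        );
        if (is_string($repaired)) {
            $el = simplexml_load_string($repaired);
        }
    }

    // Restore the previous libxml error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $el === false ? null : $el;
}

$el = parseLenientXml('<foo>I <3 Philadelphia</foo>');
var_dump($el !== null);
```

Well-formed input never reaches tidy, so the repair cost is paid only for responses that actually fail to parse. A null return means the document could not be parsed even after the repair attempt.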


Answer 2:

  1. Read the content as a string.
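  2. Strip the control characters that XML 1.0 forbids and escape the special characters: htmlspecialchars(preg_replace('/[\x00-\x08\x0b-\x0c\x0e-\x1f]/','',$string)). Apply this to the element data only; run over the whole document, htmlspecialchars would escape the tags as well.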
  3. Load the transformed string in SimpleXMLElement

It has worked for me so far.
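
When the data cannot be separated from the markup before escaping, a narrower variant of these steps is to escape only the characters that cannot be part of valid markup. The sketch below is an assumption-laden illustration rather than the answerer's exact method: the function name escapeStrayMarkup is hypothetical, and it swaps htmlspecialchars for two regular expressions that treat a '<' as data unless it is followed by a letter, '_', '/', '!' or '?', and a '&' as data unless it starts an entity reference.

```php
<?php
// Hypothetical helper: escape only '<' and '&' that cannot be real markup.
function escapeStrayMarkup(string $xml): string
{
    // Remove control characters that XML 1.0 does not allow
    // (tab, line feed and carriage return are kept).
    $xml = preg_replace('/[\x00-\x08\x0b\x0c\x0e-\x1f]/', '', $xml);

    // Escape '&' unless it begins a named or numeric entity reference.
    $xml = preg_replace(
        '/&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/',
        '&amp;',
        $xml
    );

    // Escape '<' unless the next character could start a tag, a closing
    // tag, a comment/CDATA/DOCTYPE, or a processing instruction.
    $xml = preg_replace('/<(?![A-Za-z_\/!?])/', '&lt;', $xml);

    return $xml;
}

$fixed = escapeStrayMarkup('<foo>I <3 Philadelphia</foo>');
echo $fixed, "\n"; // <foo>I &lt;3 Philadelphia</foo>

$el = new SimpleXMLElement($fixed);
echo $el, "\n"; // I <3 Philadelphia
```

This is a heuristic: text such as 'a <b' still looks like a tag start and is left alone, so the result may remain unparseable for some inputs.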