Suggest proper approach to parse invalid XML response

Posted 2019-08-15 22:16

I am using PHP to parse the XML response of an API. Here is a sample response -

$xml = '<?xml version="1.0"?>
                    <q:response xmlns:q="http://api-url">
                        <q:impression>
                            <q:content>
                                <html>
                                        <meta name="HandheldFriendly" content="True">
                                        <meta name="viewport" content="width=device-width, user-scalable=no">
                                        <meta http-equiv="cleartype" content="on">
                                    </head>
                                    <body style="margin:0px;padding:0px;">
                                        <iframe scrolling="no" src="http://api-response-url/with/lots?of=parameters&somethingmore=someval" width="320px" height="50px" style="border:none;"></iframe>
                                    </body>
                                </html>
                            </q:content>
                            <q:cpc>0.02</q:cpc>
                        </q:impression>
                    </q:response>';

Note the following points -

The response has some invalid markup like this -

  • The opening <head> tag inside <html> is missing, but the closing </head> is present.
  • The <meta> tags inside <html> are not closed.
  • The iframe's src attribute contains a URL with multiple params separated by a raw &. So this, and any other URLs that may appear, need to be urlencoded before calling $dom->loadXML(); (see my code below).

Requirement

  • I need to read whatever is there inside the <q:content></q:content> tags.
  • I need to parse invalid markup (as I am getting) and properly read the content.
  • URLs need to be encoded for the characters listed in "What characters do I need to escape in XML documents?". This needs to be done within the logic I am currently following.

Current code

So far, I have this code, which works fine if the content inside the <q:content></q:content> tags is valid markup -

$dom = new DOMDocument;

$dom->loadXML($xml); // load the XML string defined above - works only if entire xml is valid 

$adHtml = "";

foreach ($dom->getElementsByTagNameNS('http://api-url', '*') as $element) 
{
    if($element->localName == "content")
    {
         $children = $element->childNodes; 

         foreach ($children as $child) 
         {
              $adHtml .= $child->ownerDocument->saveXML($child); 
         }

    }

}

echo $adHtml; //Have got necessary contents here

Check working code here (with valid markup and single param in iframe src).

What I am thinking now

Now, going with the solution given by @hakre in my previous question -

  • I tried with DOMDocument::loadHTML() and it fails as I expected. Gives warnings like - Warning: DOMDocument::loadHTML(): Tag q:response invalid in Entity, line: 2

  • escape a specific part of the string for characters listed in What characters do I need to escape in XML documents?.

Question

Finally, if I have to "escape a specific part of the string" (in my case, whatever sits between <q:content> and </q:content>) as suggested in that answer, just to urlencode its contents, then why shouldn't I look for those delimiters (<q:content></q:content>) in the first place and return what is between them? What, then, is the benefit of using DOMDocument::loadXML() in such cases? I guess this is a pretty common case...

So my question is: given this Requirement and the points listed under "Note the following points -", what is the most clever way to proceed?
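
For illustration only, the delimiter-based idea raised above could be sketched roughly as follows. This is not a recommended or confirmed solution: extractBetween() is a hypothetical helper, the payload is a shortened stand-in for the sample response, and the sketch assumes the literal strings <q:content> and </q:content> each appear exactly once, with no attributes on the opening tag. The extracted fragment is handed to the lenient HTML parser instead of the XML parser.

```php
<?php
// Shortened, hypothetical payload with the same defects as the sample response.
$xml = '<?xml version="1.0"?>
<q:response xmlns:q="http://api-url">
  <q:content>
    <html>
      <meta name="HandheldFriendly" content="True">
      </head>
      <body><iframe src="http://api-response-url/with/lots?of=parameters&somethingmore=someval"></iframe></body>
    </html>
  </q:content>
  <q:cpc>0.02</q:cpc>
</q:response>';

// Hypothetical helper: return the raw text between the first pair of delimiters,
// or null if either delimiter is missing.
function extractBetween(string $haystack, string $open, string $close): ?string
{
    $start = strpos($haystack, $open);
    if ($start === false) {
        return null;
    }
    $start += strlen($open);
    $end = strpos($haystack, $close, $start);
    return $end === false ? null : trim(substr($haystack, $start, $end - $start));
}

$fragment = extractBetween($xml, '<q:content>', '</q:content>');

// The HTML parser tolerates the stray </head> and the raw &; collect its
// complaints internally instead of emitting warnings.
libxml_use_internal_errors(true);
$html = new DOMDocument;
$html->loadHTML($fragment);
libxml_clear_errors();

$iframe = $html->getElementsByTagName('iframe')->item(0);
$src    = $iframe ? $iframe->getAttribute('src') : null;
echo $src, "\n";
```

The trade-off is the one the question already hints at: plain string matching ignores namespaces, prefixes and nesting, so it breaks as soon as the provider changes the q: prefix or adds an attribute to the tag.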

1 answer
戒情不戒烟
#2 · 2019-08-15 22:45

One can make many valid choices when implementing a standard. However, there are no valid choices in violating a standard. You need to present to those sending you this data some of their valid choices in implementing the XML standard.

One of those choices would be to place the HTML content within CDATA. Another would be to encode the HTML.
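
As a rough sketch of what the CDATA option would mean on the receiving side: the response below is hypothetical (the provider does not currently send this), with the HTML shortened and wrapped in a CDATA section. Under that assumption the document is well-formed, and the content can be read as plain text from the q:content element.

```php
<?php
// Hypothetical response: same structure as the sample, but the provider
// wraps the (still sloppy) HTML in a CDATA section.
$xml = <<<'XML'
<?xml version="1.0"?>
<q:response xmlns:q="http://api-url">
    <q:impression>
        <q:content><![CDATA[<html><meta name="viewport" content="width=device-width"></head><body><iframe src="http://api-response-url/with/lots?of=parameters&somethingmore=someval"></iframe></body></html>]]></q:content>
        <q:cpc>0.02</q:cpc>
    </q:impression>
</q:response>
XML;

$dom    = new DOMDocument;
$loaded = $dom->loadXML($xml);   // succeeds: the XML parser does not interpret CDATA

$adHtml = '';
foreach ($dom->getElementsByTagNameNS('http://api-url', 'content') as $element) {
    // textContent returns the CDATA payload as-is, including the raw &.
    $adHtml .= $element->textContent;
}

echo $adHtml;
```

The same loop should also cover the second option (entity-encoded HTML such as &lt;html&gt;), since textContent returns the decoded character data in that case as well.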

It is simply not acceptable for them to send you garbage and to call it XML. Maybe they don't realize that it's not valid XML, but it's simply not. If they don't believe you, then you should simply try to open the "XML" in a standard XML editor such as XMLSpy. Let them appeal to XMLSpy as a third party which can tell them whether their XML is valid.
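
If no XML editor is at hand, a similar third-party verdict can be produced with PHP's own libxml bindings. The following is a minimal sketch, assuming a shortened, hypothetical payload with the same defects as the sample; it collects the parser's errors so they can be forwarded to the provider.

```php
<?php
// Hypothetical, shortened payload with the same problems as the sample:
// an unclosed <meta>, a stray </head>, and a raw & in an attribute value.
$xml = '<?xml version="1.0"?>
<q:response xmlns:q="http://api-url">
  <q:content>
    <html>
      <meta name="HandheldFriendly" content="True">
      </head>
      <body><iframe src="http://api-response-url/with/lots?of=parameters&somethingmore=someval"></iframe></body>
    </html>
  </q:content>
</q:response>';

libxml_use_internal_errors(true);   // collect errors instead of emitting warnings

$dom = new DOMDocument;
$ok  = $dom->loadXML($xml);         // false when the document is not well-formed

$report = [];
foreach (libxml_get_errors() as $error) {
    $report[] = sprintf(
        'line %d, column %d: %s',
        $error->line,
        $error->column,
        trim($error->message)
    );
}
libxml_clear_errors();

echo $ok ? "well-formed\n" : "NOT well-formed\n";
echo implode("\n", $report), "\n";
```

The exact error messages depend on the installed libxml2 version, so the report is best forwarded verbatim and not paraphrased.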

They can then be free to choose how to produce valid XML, and you'll be required to handle their choice.
