I am using PHP to parse the XML response of an API. Here is a sample response:
$xml = '<?xml version="1.0"?>
<q:response xmlns:q="http://api-url">
<q:impression>
<q:content>
<html>
<meta name="HandheldFriendly" content="True">
<meta name="viewport" content="width=device-width, user-scalable=no">
<meta http-equiv="cleartype" content="on">
</head>
<body style="margin:0px;padding:0px;">
<iframe scrolling="no" src="http://api-response-url/with/lots?of=parameters&somethingmore=someval" width="320px" height="50px" style="border:none;"></iframe>
</body>
</html>
</q:content>
<q:cpc>0.02</q:cpc>
</q:impression>
</q:response>';
Note the following points:
- The response has some invalid markup:
  - The <head> start tag inside <html> is missing, but the tag is closed.
  - The <meta> tags inside <html> are not closed.
- The iframe's src attribute contains a URL with multiple params separated by &. So this and any other possible URLs need to be urlencoded before calling $dom->loadXML(); (see my code below).
Requirement
- I need to read whatever is inside the <q:content></q:content> tags.
- I need to parse the invalid markup (as I am receiving it) and properly read the content.
- URLs need to be encoded for the characters listed in "What characters do I need to escape in XML documents?". This needs to be done with the current logic I am following.
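As a minimal sketch of the third requirement, the snippet below escapes the XML special characters in the sample iframe URL and checks that the value survives a round trip through DOMDocument::loadXML(). It assumes the URL has already been isolated from the response (which is the hard part of the question). Note that this is XML escaping via htmlspecialchars() rather than urlencode(), which would percent-encode the URL and change it.

```php
<?php
// URL copied from the sample response; isolating it is assumed to be done already.
$url = 'http://api-response-url/with/lots?of=parameters&somethingmore=someval';

// ENT_XML1 escapes &, < and >; ENT_QUOTES additionally escapes both quote characters.
$escaped = htmlspecialchars($url, ENT_XML1 | ENT_QUOTES, 'UTF-8');
echo $escaped, PHP_EOL;

// Round trip: the escaped value is now safe inside an XML attribute.
$dom = new DOMDocument;
$dom->loadXML('<iframe src="' . $escaped . '"/>');
$roundTrip = $dom->documentElement->getAttribute('src');
echo $roundTrip, PHP_EOL; // the parser decodes &amp; back to &
```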
Current code
So far, I have this code, which works fine if the content inside the <q:content></q:content> tags is valid markup:
$dom = new DOMDocument;
$dom->loadXML($xml); // load the XML string defined above - works only if the entire XML is valid
$adHtml = "";
foreach ($dom->getElementsByTagNameNS('http://api-url', '*') as $element)
{
    if ($element->localName == "content")
    {
        // Serialize every child node of <q:content> back to markup
        $children = $element->childNodes;
        foreach ($children as $child)
        {
            $adHtml .= $child->ownerDocument->saveXML($child);
        }
    }
}
echo $adHtml; // Have got the necessary contents here
Check working code here (with valid markup and single param in iframe src).
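To see why this code stops working on the real response, the parser errors can be collected instead of being printed as warnings. The following is only a sketch: the $broken string is a shortened, hypothetical version of the response in which <meta> is never closed.

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Shortened, hypothetical response: the <meta> tag is not closed.
$broken = '<q:response xmlns:q="http://api-url">'
        . '<q:content><html><meta name="HandheldFriendly" content="True"></html></q:content>'
        . '</q:response>';

$dom = new DOMDocument;
$loaded = $dom->loadXML($broken); // false when the input is not well-formed

$errors = libxml_get_errors();
foreach ($errors as $error) {
    // Each entry is a LibXMLError object carrying the message and its position.
    printf("line %d: %s\n", $error->line, trim($error->message));
}
$errorCount = count($errors);
libxml_clear_errors();
```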
What I am thinking now
Now, going with the solution given by @hakre in my previous question:
- I tried DOMDocument::loadHTML() and, as I expected, it fails. It gives warnings like: Warning: DOMDocument::loadHTML(): Tag q:response invalid in Entity, line: 2
- The other option is to escape a specific part of the string for the characters listed in "What characters do I need to escape in XML documents?".
Question
Finally, if I have to "escape a specific part of the string" (in my case, whatever is between <q:content> and </q:content>) as suggested in that answer in order to urlencode it, then why shouldn't I look for those delimiters (<q:content></q:content>) in the first place and return what is between them? What, then, is the benefit of using DOMDocument::loadXML() in such cases? I guess this is a pretty common case...
So my question is: given this Requirement and the points listed under "Note the following points", what is the most clever way to proceed?
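For reference, the delimiter approach described above could look roughly like the sketch below. extractContent() is a hypothetical helper, not part of any library, and it assumes the namespace prefix is always q, the element carries no attributes, it occurs only once, and the payload never contains the closing delimiter itself.

```php
<?php
// Hypothetical helper: returns the raw text between the two delimiters, or null if either is missing.
function extractContent(string $xml, string $open = '<q:content>', string $close = '</q:content>'): ?string
{
    $start = strpos($xml, $open);
    if ($start === false) {
        return null;
    }
    $start += strlen($open);

    $end = strpos($xml, $close, $start);
    if ($end === false) {
        return null;
    }

    return trim(substr($xml, $start, $end - $start));
}

// Shortened version of the sample response, invalid markup included.
$xml = '<q:response xmlns:q="http://api-url"><q:impression><q:content>
<html><meta name="viewport" content="width=device-width">
<body><iframe src="http://api-response-url/with/lots?of=parameters&somethingmore=someval"></iframe></body></html>
</q:content><q:cpc>0.02</q:cpc></q:impression></q:response>';

$adHtml = extractContent($xml);
echo $adHtml, PHP_EOL;
```

This never invokes an XML parser, so it tolerates the broken markup, but it also gives up namespace handling and any guarantee that the rest of the response is what it appears to be.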
One can make many valid choices when implementing a standard. However, there are no valid choices in violating a standard. You need to present to those sending you this data some of their valid choices in implementing the XML standard.
One of those choices would be to place the HTML content within CDATA. Another would be to encode the HTML.
It is simply not acceptable for them to send you garbage and to call it XML. Maybe they don't realize that it's not valid XML, but it's simply not. If they don't believe you, then you should simply try to open the "XML" in a standard XML editor such as XMLSpy. Let them appeal to XMLSpy as a third party which can tell them whether their XML is valid.
They can then be free to choose how to produce valid XML, and you'll be required to handle their choice.
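To illustrate the CDATA option, here is a sketch of what the response could look like if the sender wrapped the HTML in a CDATA section, and how it would then be read. The response body is hypothetical (a shortened form of the sample in the question); the point is only that the unclosed <meta>, the stray </head> and the raw & no longer break DOMDocument::loadXML().

```php
<?php
// Hypothetical response in which the sender has wrapped the HTML in CDATA.
$xml = '<q:response xmlns:q="http://api-url">
<q:impression>
<q:content><![CDATA[
<html>
<meta name="viewport" content="width=device-width, user-scalable=no">
</head>
<body><iframe src="http://api-response-url/with/lots?of=parameters&somethingmore=someval"></iframe></body>
</html>
]]></q:content>
<q:cpc>0.02</q:cpc>
</q:impression>
</q:response>';

$dom = new DOMDocument;
$dom->loadXML($xml);

// The CDATA payload is plain character data, so textContent returns it unchanged.
$content = $dom->getElementsByTagNameNS('http://api-url', 'content')->item(0);
$adHtml = trim($content->textContent);
echo $adHtml, PHP_EOL;
```

One caveat with this choice: the payload must not itself contain the sequence ]]>, since that would end the CDATA section early.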