When to CDATA vs. Escape & Vice Versa?

Posted 2020-07-03 06:49

I'm creating XML documents with values fetched from a DB. Occasionally, due to a legacy implementation, I'll pull back a value containing a character that is invalid in XML unless properly escaped (& for example).

So the question becomes, should I CDATA or Escape? Are certain situations more appropriate for one vs. the other?

Examples:

<Email>foo&bar@domain.com</Email>

I'd lean towards CDATA here.

<Name>Bob & Tom</Name>

I'd lean towards escaping here.

I want to avoid blindly CDATA'ing every time, but from a performance perspective it seems like the logical choice: it will always be faster than scanning for an invalid char and wrapping only if one exists.

Thoughts?

5 answers
Lonely孤独者°
#2 · 2020-07-03 07:02

I think CDATA will be faster: the parser has to scan for the end sequence, copy from start to end, and pass that back, which is one copy. When reading escaped data, it has to use a buffer, append to it as it scans for escaped characters, and when it finishes, convert the buffer to a string and pass that back. So escaping will use more memory and require an extra copy. You will probably only notice a difference with large data sets and a high number of transactions, though. So if it's small fields, don't worry about it; use either.

chillily
#3 · 2020-07-03 07:04

I think there is no real difference. I prefer to use CDATA for everything because I don't have to care about which characters to escape; the only thing I must take care of is any "]]>" in the content, which, by the way, IS allowed if you split the content across multiple CDATA sections.

Example (in PHP)

<?php

function getXMLContent($content)
{
    if
    (
        (strpos($content, '<') !== false) ||
        (strpos($content, '>') !== false) ||
        (strpos($content, '&') !== false) ||
        (strpos($content, '"') !== false) ||
        (strpos($content, '\'') !== false)
    )
    {
        // If value contains ']]>', we need to break it into multiple CDATA tags
        return "<![CDATA[". str_replace(']]>', ']]]]><![CDATA[>', $content) ."]]>";
    }
    else
    {
        // Value does not contain any special characters which needs to be wrapped / encoded / escaped
        return $content;
    }
}

echo getXMLContent("Hello little world!");
echo PHP_EOL . PHP_EOL;
echo getXMLContent("This < is > a & hard \" test ' for ]]> XML!");

?>

Returns

Hello little world!

<![CDATA[This < is > a & hard " test ' for ]]]]><![CDATA[> XML!]]>

If you put that into an XML structure like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<test>
    <![CDATA[This < is > a & hard " test ' for ]]]]><![CDATA[> XML!]]>
</test>

... save it to a file (like test.xml) and open it with a browser, you'll see that the browser (or any other XML application / parser) shows the correct output string:

This < is > a & hard " test ' for ]]> XML!
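The same splitting trick can be checked outside a browser. Below is a minimal Python sketch, not part of the original answer; the helper name wrap_cdata is hypothetical, and unlike the PHP function it always wraps instead of first testing for special characters. It parses the result with the standard library's xml.etree.ElementTree to confirm that the text round-trips.

```python
import xml.etree.ElementTree as ET


def wrap_cdata(content: str) -> str:
    """Wrap content in CDATA (hypothetical helper, always wraps)."""
    # A literal ']]>' would end the section early, so close the section
    # after ']]' and reopen a new one before '>'.
    return "<![CDATA[" + content.replace("]]>", "]]]]><![CDATA[>") + "]]>"


original = "This < is > a & hard \" test ' for ]]> XML!"
document = "<test>" + wrap_cdata(original) + "</test>"

# The parser joins the adjacent CDATA sections back into one text value.
parsed = ET.fromstring(document).text
print(parsed == original)
```

If the round trip works, the final line prints True, which matches what the browser test above shows.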
做个烂人
#4 · 2020-07-03 07:05

Wrap with CDATA under these conditions: if you have doubtful data and are thinking of escaping it; if the data is used for display, because then the consuming app also has to unescape it; or if the same data element would be escaped repeatedly, since more rounds of parsing and escaping will hurt performance.

地球回转人心会变
#5 · 2020-07-03 07:20

CDATA is primarily useful, IMO, for human readability. As far as a machine is concerned, there's no difference between CDATA and escaped text other than the length, at most. Perhaps the escaped version will take a little bit longer to process, but I say perhaps, because this shouldn't be a significant factor unless your application is mostly IO-bound.
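To illustrate the point that a machine sees no difference, here is a small Python sketch (an illustration added here, using the standard library's xml.etree.ElementTree and the Name example from the question). Both spellings should yield the same text once parsed.

```python
import xml.etree.ElementTree as ET

cdata_doc = "<Name><![CDATA[Bob & Tom]]></Name>"
escaped_doc = "<Name>Bob &amp; Tom</Name>"

# After parsing, neither the CDATA markers nor the entity reference remain;
# the application only sees the character data.
cdata_text = ET.fromstring(cdata_doc).text
escaped_text = ET.fromstring(escaped_doc).text

print(cdata_text == escaped_text)
```

The choice between the two forms therefore only affects the serialized bytes, not what the consuming code receives.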

Are people likely to be reading the XML? If not, just let the XML parser do what it does and don't worry about CDATA vs escaped text. If people will be reading this XML, then perhaps CDATA can be the better choice.

If you're going to have an XML element whose value is XML, then for this case, CDATA may be the better choice.

For more information, see for example the XML FAQ's question "When should I use a CDATA Marked Section?"

Luminary・发光体
#6 · 2020-07-03 07:26

I've seen people use CDATA for the above, which is OK, and for wrapping things that are not XML, such as JSON or CSS, which is a better reason to use it. The problem arises when people use it to quote element-based markup such as HTML, and confusion follows.

People do not expect

<![CDATA[<foo>bar</foo>]]>

to be identical to

&lt;foo&gt;bar&lt;/foo&gt;

as far as XML systems are concerned.
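A short Python sketch of that equivalence, assuming the standard library's xml.etree.ElementTree as the "XML system" (the wrapping element name item is made up for the example): the CDATA and escaped forms both produce plain text, and neither produces a child element the way real nested markup does.

```python
import xml.etree.ElementTree as ET

cdata_elem = ET.fromstring("<item><![CDATA[<foo>bar</foo>]]></item>")
escaped_elem = ET.fromstring("<item>&lt;foo&gt;bar&lt;/foo&gt;</item>")
nested_elem = ET.fromstring("<item><foo>bar</foo></item>")

# Both quoted forms are just the string '<foo>bar</foo>' to the parser.
print(cdata_elem.text == escaped_elem.text)

# Only the unquoted markup creates an actual <foo> child element.
print(len(cdata_elem), len(escaped_elem), len(nested_elem))
```

A consumer that wants the inner markup as elements has to parse the text value a second time, which is where the extra escaping levels come from.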

See RSS tag soup for examples of the horror of escaping levels.

You also have to be sure that the character sequence ']]>' will never appear in your wrapped data since that's the terminator.
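As a rough Python illustration of the terminator problem (the payload string and the element name data are invented for the example, and xml.etree.ElementTree stands in for any conforming parser): naive wrapping of a value containing ']]>' ends the section early, while escaping the same value has no such special sequence to worry about.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

payload = "a ]]> b & c"  # made-up value containing the CDATA terminator

# Naive wrapping: the section ends at the first ']]>', leaving ' b & c]]>'
# as ordinary content, where the bare '&' is not well-formed.
naive_doc = "<data><![CDATA[" + payload + "]]></data>"
try:
    ET.fromstring(naive_doc)
    naive_ok = True
except ET.ParseError:
    naive_ok = False

# Escaping replaces '&', '<' and '>' with entity references instead.
escaped_doc = "<data>" + escape(payload) + "</data>"
escaped_text = ET.fromstring(escaped_doc).text

print(naive_ok, escaped_text == payload)
```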

So unless readability is paramount or you are wrapping non-element markup, I recommend avoiding CDATA.
