When is it required to escape characters in XML?

Posted 2020-06-11 06:16

Question:

When should we replace < > & " ' in XML with sequences like &lt etc.?

My understanding is that it's just to make sure that if the content part of the XML contains > or <, the parser will not treat it as the start or end of a tag.

Also, if I have XML like:

<hello>mor>ning<hello>

should this be replaced to either:

  • &lthello&gtmor&gtning&lthello&gt
  • &lthello&gtmor>ning&lthello&gt
  • <hello>mor&gtning<hello>

I don't understand why replacing is needed. When exactly is it required and what exactly (tags or text) should be replaced?

Answer 1:

<, >, &, " and ' all have special meanings in XML (such as "start of entity" or "attribute value delimiter").

In order to have those characters appear as data (instead of for their special meaning) they can be represented by entities (&lt; for < and so on).

Sometimes those special meanings are context sensitive (e.g. " doesn't mean "attribute delimiter" outside of a tag) and there are places where they can appear raw as data. Rather than worry about those exceptions, it is simplest to just always represent them as entities if you want to avoid their special meaning. Then the only gotcha is explicit CDATA sections, where the special meaning doesn't hold (and & won't start an entity).
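As a rough illustration of the "always escape" approach, here is a minimal Python sketch using the standard library's xml.sax.saxutils. The sample strings and the greeting attribute name are made up for this example; only escape() and quoteattr() come from the library.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Hypothetical sample data containing all five special characters.
text = "mor>ning & \"tea\" <now> it's"

# escape() handles &, < and > by default; the dict adds " and '.
escaped_text = escape(text, {'"': "&quot;", "'": "&apos;"})

# quoteattr() escapes an attribute value and wraps it in suitable quotes.
attr = quoteattr('say "hi" & wave')

doc = f"<hello greeting={attr}>{escaped_text}</hello>"
print(doc)

# Parsing the result gives back the original data unchanged.
root = ET.fromstring(doc)
print(root.text == text)
```

Escaping everything is slightly more than the parser strictly needs, but it means you never have to think about which context a character ends up in.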

"should this be replaced to either"

It shouldn't be represented as any of those. Entities must be terminated with a semicolon.

How you should represent it depends on which bit of your example is data and which is markup. You haven't said, for example, whether <hello> is supposed to be data or the start tag for a hello element.
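To make the data-versus-markup distinction concrete, here is a small Python sketch of the two readings of the example. The wrapping note element in the second case is a hypothetical container invented for illustration.

```python
from xml.sax.saxutils import escape

# Reading 1: <hello> is markup, so only the text content is escaped.
content = "mor>ning"
as_markup = f"<hello>{escape(content)}</hello>"

# Reading 2: the whole string is data to be carried inside another
# element (the <note> wrapper is an assumption for this example).
raw = "<hello>mor>ning</hello>"
as_data = f"<note>{escape(raw)}</note>"

print(as_markup)
print(as_data)
```

In both cases the markup that should stay markup is written raw, and every entity ends with a semicolon.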



Answer 2:

Section 2.4 of the XML Specification clearly states:

The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.
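One way to see these rules in practice is to feed a few variants to a parser. The sketch below uses Python's xml.etree.ElementTree; the is_well_formed helper and the sample strings are made up for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

samples = {
    "raw >": "<hello>mor>ning</hello>",          # allowed in content
    "raw <": "<hello>mor<ning</hello>",          # must be escaped
    "raw &": "<hello>mor&ning</hello>",          # must be escaped
    "escaped": "<hello>mor&lt;ning &amp; more</hello>",
    "no semicolon": "<hello>mor&gtning</hello>", # entity not terminated
}

results = {name: is_well_formed(xml) for name, xml in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'accepted' if ok else 'rejected'}")
```

Only the raw > and the properly escaped variants should be accepted, which matches the quoted text: & and < are mandatory to escape in content, while > is optional outside of "]]>".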



Answer 3:

You have to encode all characters that have a special meaning in XML but should not be interpreted by the parser.

Assuming your XML is

<hello>mor>ning</hello> 

you would encode it as

<hello>mor&gt;ning</hello>

or use a CDATA section:

<hello><![CDATA[mor>ning]]></hello>
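As a quick check that the two forms are equivalent to a parser, this Python sketch (standard library only) reads both and compares the resulting text. The third sample with < and & inside CDATA is an extra case added for illustration.

```python
import xml.etree.ElementTree as ET

escaped = ET.fromstring("<hello>mor&gt;ning</hello>")
cdata = ET.fromstring("<hello><![CDATA[mor>ning]]></hello>")

# Both documents carry the same character data.
print(escaped.text)
print(cdata.text)

# Inside CDATA, < and & are plain text as well and need no escaping.
mixed = ET.fromstring("<hello><![CDATA[a < b & c]]></hello>")
print(mixed.text)
```

The choice between the two is mostly about readability of the source: CDATA is convenient for long runs of text full of special characters, while entities are simpler for the occasional one.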


Answer 4:

Basically, characters like < and > are significant when parsing an XML document. If extra occurrences of these special characters appear in the text of a node or an attribute, the parser will not be able to understand the document properly. If you are sending XML to a web service, all of the special characters should be properly escaped.
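If the XML is generated by a program, one common way to get the escaping right is to let an XML library serialize the document instead of concatenating strings. A minimal Python sketch, with a made-up element, attribute, and values:

```python
import xml.etree.ElementTree as ET

# Hypothetical payload; the element and attribute names are assumptions.
root = ET.Element("hello", {"note": 'a "quoted" <value>'})
root.text = "mor>ning & <tea>"

# The serializer escapes text and attribute values on output.
payload = ET.tostring(root, encoding="unicode")
print(payload)

# The receiving side gets the original values back after parsing.
parsed = ET.fromstring(payload)
print(parsed.text)
print(parsed.get("note"))
```

With this approach the application code only ever handles unescaped values, and the escaping happens in one place.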



Answer 5:

https://github.com/savonrb/gyoku/blob/master/README.md

You can use Gyoku so that the characters in CDATA are not escaped.