When should we replace < > & " ' in XML with entities like &lt; etc.?
My understanding is that it's just to make sure that if the content part of the XML contains > or <, the parser will not treat it as the start or end of a tag.
Also, if I have XML like:
<hello>mor>ning<hello>
should this be replaced to either:
<hello>mor>ning<hello>
<hello>mor>ning<hello>
<hello>mor>ning<hello>
I don't understand why replacing is needed. When exactly is it required and what exactly (tags or text) should be replaced?
<, >, &, " and ' all have special meanings in XML (such as "start of entity" or "attribute value delimiter").
In order to have those characters appear as data (instead of for their special meaning) they can be represented by entities (&lt; for < and so on).
Sometimes those special meanings are context sensitive (e.g. " doesn't mean "attribute delimiter" outside of a tag) and there are places where they can appear raw as data. Rather than worry about those exceptions, it is simplest to just always represent them as entities if you want to avoid their special meaning. Then the only gotcha is explicit CDATA sections, where the special meaning doesn't hold (and & won't start an entity).
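As a rough illustration of the "always use entities" approach, here is a minimal Python sketch using the standard library's xml.sax.saxutils. The helper name escape_all and the QUOTES mapping are my own, made up for this example; escape() only handles &, < and > by default, so the two quote characters are passed in as extra entities.

```python
from xml.sax.saxutils import escape, unescape

# escape() covers &, < and > on its own; the quote characters are
# supplied as extra entities (this mapping is part of the sketch).
QUOTES = {'"': "&quot;", "'": "&apos;"}


def escape_all(text: str) -> str:
    """Replace all five XML special characters with predefined entities."""
    return escape(text, QUOTES)


raw = 'a < b && "c" > \'d\''
encoded = escape_all(raw)
print(encoded)

# Reversing the mapping gives the original data back.
decoded = unescape(encoded, {"&quot;": '"', "&apos;": "'"})
print(decoded == raw)
```

The ampersand is replaced first during escaping, so the & in the generated entities is not escaped a second time.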
should this be replaced to either
It shouldn't be represented as any of those. Entities must be terminated with a semicolon.
How you should represent it depends on which bit of your example is data and which is markup. You haven't said, for example, if <hello> is supposed to be data or the start tag for a hello element.
Section 2.4 of the XML Specification clearly states:
The ampersand character (&) and the left angle bracket (<) must not
appear in their literal form, except when used as markup delimiters,
or within a comment, a processing instruction, or a CDATA section. If
they are needed elsewhere, they must be escaped using either numeric
character references or the strings " &amp; " and " &lt; "
respectively. The right angle bracket (>) may be represented using the
string " &gt; ", and must, for compatibility, be escaped using either
" &gt; " or a character reference when it appears in the string " ]]> "
in content, when that string is not marking the end of a CDATA
section.
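To see the rule in practice, here is a small Python sketch that feeds a few documents to the standard library parser (xml.etree.ElementTree). The helper name is_well_formed is just something I made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(doc: str) -> bool:
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# A raw > in content is tolerated; a raw < or & is not.
print(is_well_formed("<hello>mor>ning</hello>"))
print(is_well_formed("<hello>mor<ning</hello>"))
print(is_well_formed("<hello>mor&ning</hello>"))
# Once < and & are escaped, the document parses.
print(is_well_formed("<hello>mor&lt;ning &amp; night</hello>"))
```

Only the second and third documents are rejected, which matches the quoted text: & and < must be escaped in content, while > merely may be.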
You have to encode all characters that have a special meaning in XML but should not be interpreted by the parser.
Assuming your XML is
<hello>mor>ning</hello>
you would encode it as
<hello>mor&gt;ning</hello>
or use a CDATA section [Wikipedia]:
<hello><![CDATA[mor>ning]]></hello>
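Both forms carry the same data as far as a parser is concerned. A quick Python check with the standard library parser, as a sketch (the variable names are mine):

```python
import xml.etree.ElementTree as ET

# The same text, once with an entity and once inside a CDATA section.
escaped = ET.fromstring("<hello>mor&gt;ning</hello>")
cdata = ET.fromstring("<hello><![CDATA[mor>ning]]></hello>")
print(escaped.text == cdata.text)

# Inside CDATA, < and & are plain data and entities are not expanded.
literal = ET.fromstring("<hello><![CDATA[a < b & c &gt; d]]></hello>")
print(literal.text)
```

Note that in the last document the &gt; stays as the five literal characters, because entity references are not recognised inside a CDATA section.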
Basically, characters like < and > are important when parsing the XML document. If extra special characters are included in the XML node text or attribute text, the parser will not be able to properly understand the document. If you are sending XML to some web service, all of the special characters should be properly escaped.
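If you build the XML with a library instead of concatenating strings, the escaping of node text and attribute text is normally done for you. A minimal Python sketch with xml.etree.ElementTree (the element name, attribute name and values are invented for the example):

```python
import xml.etree.ElementTree as ET

# Assign raw, unescaped data; the serializer escapes it on output.
hello = ET.Element("hello", {"note": 'a "quoted" <value>'})
hello.text = "mor>ning & <night>"

payload = ET.tostring(hello, encoding="unicode")
print(payload)

# Parsing the payload again returns the original data unchanged.
parsed = ET.fromstring(payload)
print(parsed.text == hello.text)
print(parsed.get("note") == hello.get("note"))
```

This way the payload sent to the web service is well-formed, and the receiving side gets the original characters back after parsing.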
https://github.com/savonrb/gyoku/blob/master/README.md
You can use Gyoku to avoid escaping the characters in CDATA.