The default encoding in Office Open XML is UTF-8, so Unicode is already fully supported. Nevertheless, Microsoft defines the following:
ECMA-376 Part 1, 22.4 Variant Types, 22.4.2.4 bstr (Basic String):
22.4.2.4 bstr (Basic String)
This element defines a binary basic string variant type, which can store any valid Unicode character. Unicode characters that cannot be directly represented in XML as defined by the XML 1.0 specification, shall be escaped using the Unicode numerical character representation escape character format _xHHHH_, where H represents a hexadecimal character in the character's value. [Example: The Unicode character 8 is not permitted in an XML 1.0 document, so it shall be escaped as _x0008_. end example] To store the literal form of an escape sequence, the initial underscore shall itself be escaped (i.e. stored as _x005F_). [Example: The string literal _x0008_ would be stored as _x005F_x0008_. end example] The possible values for this element are defined by the W3C XML Schema string datatype.
This extends the W3C XML Schema string datatype, so that the character sequence _xHHHH_ has a special meaning as a kind of entity, like &#xHHHH;. That means everyone who parses Office Open XML (*.xlsx, *.docx, *.pptx) must keep this in mind while parsing.
For example, if you enter "Text _x1234_ text" into an Excel cell, Excel stores it as "Text _x005F_x1234_ text" in the XML. So the string stored in the file differs from the string that was entered, and also from the string that Excel shows in the cell. Conversely, if you put "Text _x1234_ text" as string cell content directly into the XML, Excel shows "Text ሴ text" in the cell.
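To make the two directions concrete, here is a minimal Python sketch of the escaping rule as I read it from the quoted specification text. The function names `ooxml_escape` and `ooxml_unescape` are made up for illustration; this is not how Excel or Apache POI implement it, only an approximation of the behaviour described above.

```python
import re

# One escape sequence of the form _xHHHH_ (exactly four hex digits).
_ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")
# An underscore that starts something looking like an escape sequence.
_LITERAL_START = re.compile(r"_(?=x[0-9A-Fa-f]{4}_)")


def _is_xml10_char(code: int) -> bool:
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (
        code in (0x9, 0xA, 0xD)
        or 0x20 <= code <= 0xD7FF
        or 0xE000 <= code <= 0xFFFD
        or 0x10000 <= code <= 0x10FFFF
    )


def ooxml_escape(text: str) -> str:
    """Turn the entered string into the form stored in the XML (sketch)."""
    # Protect literal text that already looks like an escape sequence
    # by storing its leading underscore as _x005F_.
    protected = _LITERAL_START.sub("_x005F_", text)
    # Then escape every character that XML 1.0 cannot represent.
    return "".join(
        ch if _is_xml10_char(ord(ch)) else f"_x{ord(ch):04X}_"
        for ch in protected
    )


def ooxml_unescape(stored: str) -> str:
    """Turn the stored string back into the displayed string (sketch)."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), stored)
```

With these helpers, `ooxml_escape("Text _x1234_ text")` gives `"Text _x005F_x1234_ text"`, and `ooxml_unescape("Text _x1234_ text")` gives `"Text ሴ text"`, which matches the two Excel observations described above.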
See: XSSFCell in Apache POI encodes certain character sequences as unicode character
It is clear to me that there are some characters that cannot be directly represented in XML 1.0. But these are control characters, and other users of XML manage to comply with the restrictions without such extensions. They use other properly defined encodings (Base64, for example) when content containing control characters is needed.
So I am still looking for useful use cases for this _xHHHH_ within a string.
Questions:
Can someone enlighten me as to why this special Unicode numerical character representation escape character format _xHHHH_ in Office Open XML is necessary at all?
Can someone give any useful use cases for this _xHHHH_ within a string?