Useful use cases for escape character format _xHHH

Posted 2019-07-26 06:49

Question:

The default encoding in Office Open XML is UTF-8, so Unicode is already supported. Nevertheless, Microsoft defines the following in ECMA-376 Part 1, 22.4 Variant Types, 22.4.2.4 bstr (Basic String):

22.4.2.4 bstr (Basic String)

This element defines a binary basic string variant type, which can store any valid Unicode character. Unicode characters that cannot be directly represented in XML as defined by the XML 1.0 specification, shall be escaped using the Unicode numerical character representation escape character format _xHHHH_, where H represents a hexadecimal character in the character's value. [Example: The Unicode character 8 is not permitted in an XML 1.0 document, so it shall be escaped as _x0008_. end example] To store the literal form of an escape sequence, the initial underscore shall itself be escaped (i.e. stored as _x005F_). [Example: The string literal _x0008_ would be stored as _x005F_x0008_. end example]

The possible values for this element are defined by the W3C XML Schema string datatype.
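The writing side of the rule quoted above can be sketched in Python. This is only an illustration under assumptions, not how Excel or Apache POI implement it: the function name `ooxml_escape` is hypothetical, the set of "not representable" characters is assumed to be the C0 controls other than tab, LF and CR plus U+FFFE and U+FFFF, and uppercase hex digits are chosen arbitrarily.

```python
import re

# Assumption for this sketch: characters outside the XML 1.0 Char production
# that can occur in a Python str (surrogates are ignored here).
_INVALID_XML_CHAR = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")

# An underscore that starts something that already looks like _xHHHH_.
_LITERAL_ESCAPE = re.compile(r"_(?=x[0-9A-Fa-f]{4}_)")


def ooxml_escape(text: str) -> str:
    """Hypothetical helper: escape `text` following the bstr rule."""
    # Step 1: protect literal escape-like sequences by escaping their
    # initial underscore as _x005F_.
    text = _LITERAL_ESCAPE.sub("_x005F_", text)
    # Step 2: replace each non-representable character by _xHHHH_.
    # This must come second, otherwise step 1 would escape the
    # sequences produced here a second time.
    return _INVALID_XML_CHAR.sub(lambda m: "_x%04X_" % ord(m.group()), text)


print(ooxml_escape("a\x08b"))             # control character U+0008
print(ooxml_escape("Text _x1234_ text"))  # literal escape-like text
```

With these assumptions, the two examples from the specification text come out as described there: the character U+0008 becomes `_x0008_`, and the literal string `_x0008_` becomes `_x005F_x0008_`.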

This extends the W3C XML Schema string datatype, so that the character sequence _xHHHH_ has a special meaning, as a kind of entity like &#xHHHH;. That means everyone who needs to parse Office Open XML (*.xlsx, *.docx, *.pptx) must bear this in mind while parsing. For example, if you put "Text _x1234_ text" into an Excel cell, Excel stores it as "Text _x005F_x1234_ text" in the XML. So the string stored in the file differs from the string that was entered, and also from the string that Excel shows in the cell. Conversely, if you put "Text _x1234_ text" as string cell content directly into the XML, Excel shows "Text ሴ text" in the cell.
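What a parser has to do with such stored strings can be sketched as follows. This is a minimal illustration, not the actual decoding logic of Excel or Apache POI; the function name `ooxml_unescape` is hypothetical, and accepting lowercase hex digits is an assumption.

```python
import re

# Matches one escape sequence _xHHHH_ and captures the four hex digits.
_ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")


def ooxml_unescape(stored: str) -> str:
    """Hypothetical helper: turn the stored form back into the displayed string."""
    # re.sub scans left to right and never rescans replaced text, so the
    # "_" produced from _x005F_ cannot start a new escape sequence.
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), stored)


print(ooxml_unescape("Text _x005F_x1234_ text"))  # what Excel stored for typed text
print(ooxml_unescape("Text _x1234_ text"))        # written directly into the XML
```

The single left-to-right pass is the important design choice here: decoding `_x005F_` first and then searching the result again would wrongly turn the typed text `_x1234_` into the character U+1234.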

See: XSSFCell in Apache POI encodes certain character sequences as unicode character

It is clear to me that there are some characters that cannot be directly represented in XML 1.0. But these are control characters, and other users of XML manage to comply with the restrictions without such extensions. They use other properly defined encodings (Base64, for example) if content containing control characters is needed.

So I am still looking for useful use cases for this _xHHHH_ within a string.

Questions:

  1. Can someone enlighten me as to why this special Unicode numerical character representation escape character format _xHHHH_ in Office Open XML is necessary at all?

  2. Can someone give any useful use cases for this _xHHHH_ within a string?

Answer 1:

As a use case: all of our databases are required to be isolated from one another, and we need to test some jobs/crons/web services against different databases. So we export some data to an Excel file and feed it to the job as an input file for another database, to check whether it works as expected. Our architecture requires this because of some privilege restrictions.

Hope this is a useful case for you :)