XML encoding setup and specific charsets

Posted 2019-07-21 06:07

Question:

I have to read a large XML document (several gigabytes) that contains character references of the form &#XX;, where XX is less than 32. I am aware that these code points (below 32) are reserved for ASCII control characters.

The author of the file chose to use these characters inside the text, and changing that is out of my hands.

I have tried encoding declarations other than UTF-8 in the XML declaration (<?xml version="1.0" encoding ="UTF-8"?>), without success: my XML parser still fails to read the file.

To make the problem reproducible, consider the simple XML file below, which has the &#01; character reference after the name Fred:

<?xml version="1.0" encoding ="UTF-8"?> 
<TABLE> 
 <GRADES> 
 <STUDENT> Fred &#01; </STUDENT> 
 <TEST1> 1 </TEST1> 
 <TEST2> 2 </TEST2> 
 <FINAL> 3 </FINAL> 
 </GRADES> 
 <GRADES> 
 <STUDENT> Wilma </STUDENT> 
 <TEST1> 1 </TEST1> 
 <TEST2> 2 </TEST2> 
 <FINAL> 3 </FINAL> 
 </GRADES> 
</TABLE>

When I open this file in different browsers, I get the error:

error on line 4 at column 22: xmlParseCharRef: invalid xmlChar value 1

I know that one possible solution is to pre-process the original file, finding and replacing the characters that cause the error, but does anybody know another way to work around this problem? Is there any encoding that supports &#XX; character references (XX < 32)?
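For reference, the pre-processing route could look roughly like the Python sketch below. It is only an illustration under some assumptions: the file contains line breaks (so it can be streamed line by line instead of loaded whole), the offending characters always appear as numeric references rather than raw bytes, and simply dropping them is acceptable. The names strip_illegal_refs and _drop_if_illegal are made up for this example. Note that it would also rewrite matching text inside CDATA sections and comments.

```python
import re

# Matches decimal (&#1;) and hexadecimal (&#x1;) character references.
CHAR_REF = re.compile(rb"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

# Code points below 32 that XML 1.0 does accept: tab, line feed, carriage return.
ALLOWED_LOW = {0x09, 0x0A, 0x0D}


def _drop_if_illegal(match):
    """Return b"" for references to forbidden control characters."""
    hex_digits, dec_digits = match.groups()
    code = int(hex_digits, 16) if hex_digits else int(dec_digits)
    if code < 0x20 and code not in ALLOWED_LOW:
        return b""
    return match.group(0)  # leave every other reference untouched


def strip_illegal_refs(lines):
    """Yield each input line (bytes) with forbidden references removed."""
    for line in lines:
        yield CHAR_REF.sub(_drop_if_illegal, line)


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python clean.py < input.xml > cleaned.xml
    for cleaned in strip_illegal_refs(sys.stdin.buffer):
        sys.stdout.buffer.write(cleaned)
```

Because a character reference cannot contain whitespace, it never spans two lines, so processing one line at a time should not miss any.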

Answer 1:

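Not all characters are legal in XML 1.0, and writing one as a character reference does not get around that: apart from tab (9), line feed (10), and carriage return (13), code points below 32 are forbidden no matter which encoding is declared. (http://www.w3.org/TR/REC-xml/#charsets)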
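As a rough illustration, the set of characters XML 1.0 accepts can be written as a small Python predicate. The ranges follow the Char production in the specification linked above; the function name is_xml10_char is just a label chosen for this sketch.

```python
def is_xml10_char(code_point: int) -> bool:
    """Return True if the code point is a legal character in XML 1.0."""
    return (
        code_point in (0x9, 0xA, 0xD)           # tab, line feed, carriage return
        or 0x20 <= code_point <= 0xD7FF         # space up to just before the surrogates
        or 0xE000 <= code_point <= 0xFFFD       # skips surrogates, 0xFFFE and 0xFFFF
        or 0x10000 <= code_point <= 0x10FFFF    # supplementary planes
    )
```

With this, is_xml10_char(1) is False, which is why the parser rejects &#01; in the example file.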

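If your tools support XML 1.1, switching them into that mode will allow some of the previously forbidden characters: code points 1 through 31 become legal when written as character references, although &#0; is still not allowed. (http://www.w3.org/TR/xml11/#charsets)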

The usual solution is not to try to put control characters into an XML document. Instead, encode the binary data as hex or base64 or some other text representation, and let the application code convert it back to binary when needed.
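A minimal sketch of the base64 approach in Python, using only the standard library, might look like this. The element name is taken from the question; the encoding="base64" attribute and the two helper names are conventions invented for this example, not part of any standard.

```python
import base64
import xml.etree.ElementTree as ET


def encode_field(tag: str, raw: bytes) -> str:
    """Serialize raw bytes as base64 text inside an element."""
    elem = ET.Element(tag, {"encoding": "base64"})  # hypothetical marker attribute
    elem.text = base64.b64encode(raw).decode("ascii")
    return ET.tostring(elem, encoding="unicode")


def decode_field(xml_text: str) -> bytes:
    """Parse the element and convert its base64 text back to bytes."""
    elem = ET.fromstring(xml_text)
    return base64.b64decode(elem.text)


if __name__ == "__main__":
    original = b"Fred \x01"           # contains the control character 0x01
    xml_text = encode_field("STUDENT", original)
    assert decode_field(xml_text) == original
    print(xml_text)
```

The document then contains only plain ASCII letters, digits, and a few symbols, so any XML 1.0 parser can read it, and the application recovers the control character after parsing.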