I have to read a large XML document (several gigabytes) that contains character references of the form &#XX;, where XX is less than 32. I am aware that these code points (< 32) are reserved for ASCII control characters.
The author of the file decided to use these references inside the text, and changing that is out of my hands.
I have tried different encoding declarations besides UTF-8 in the XML declaration at the top of the file (for example <?xml version="1.0" encoding ="UTF-8"?>), but I have had no success getting my XML parser to read it.
To make the problem reproducible and clear, consider the simple XML file below (which, for example, has such a reference after the name Fred):
<?xml version="1.0" encoding ="UTF-8"?>
<TABLE>
<GRADES>
<STUDENT> Fred &#1; </STUDENT>
<TEST1> 1 </TEST1>
<TEST2> 2 </TEST2>
<FINAL> 3 </FINAL>
</GRADES>
<GRADES>
<STUDENT> Wilma </STUDENT>
<TEST1> 1 </TEST1>
<TEST2> 2 </TEST2>
<FINAL> 3 </FINAL>
</GRADES>
</TABLE>
When I open this file in different browsers, I get the error:
error on line 4 at column 22: xmlParseCharRef: invalid xmlChar value 1
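The failure is not specific to browsers. Here is a minimal sketch that reproduces it with Python's standard-library parser (my choice for illustration; it is a different parser from the one in the browsers, so the wording of the message differs). The name try_parse and the one-record sample are made up for this example, and "&#1;" stands in for the offending reference:

```python
import xml.etree.ElementTree as ET

# One record from the sample file; "&#1;" is the reference below code point 32.
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b"<TABLE><GRADES><STUDENT> Fred &#1; </STUDENT></GRADES></TABLE>"
)


def try_parse(data):
    """Return None if the document parses, otherwise the parser's message."""
    try:
        ET.fromstring(data)
        return None
    except ET.ParseError as exc:
        return str(exc)


if __name__ == "__main__":
    # Prints the parser's complaint about the character reference.
    print(try_parse(SAMPLE))
```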
I know that a possible solution is to pre-process the original file, finding and replacing the character references that cause the error, but does anybody know any other way to work around this problem? Is there any specific encoding that supports &#XX; references (XX < 32)?
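For reference, the pre-processing I would like to avoid looks roughly like the sketch below. It assumes the file is line-oriented like the sample (so it can be streamed line by line without loading gigabytes into memory) and that dropping the references is acceptable. The names strip_invalid_refs and clean_file are my own, and it would also rewrite matching text inside CDATA sections, which may not be wanted:

```python
import re

# Matches decimal (&#1;) and hexadecimal (&#x1;) character references.
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

# Below code point 32, XML 1.0 only allows tab, line feed and carriage return.
ALLOWED_CONTROLS = {0x9, 0xA, 0xD}


def strip_invalid_refs(text, replacement=""):
    """Replace references to disallowed control characters in text."""

    def fix(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if code < 32 and code not in ALLOWED_CONTROLS:
            return replacement
        return match.group(0)  # leave every other reference untouched

    return CHAR_REF.sub(fix, text)


def clean_file(src_path, dst_path, encoding="utf-8"):
    """Stream src_path to dst_path one line at a time, cleaning each line."""
    # newline="" keeps the original line endings unchanged.
    with open(src_path, encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        for line in src:
            dst.write(strip_invalid_refs(line))
```

Calling clean_file("grades.xml", "grades.clean.xml") (hypothetical file names) would write a copy that a standard XML 1.0 parser accepts, at the cost of an extra pass over the whole file.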