How to handle when some special UTF-8 characters a

Posted 2019-08-10 23:32

I have several XML files to process. A sample file is given below:

  <DOC>
  <DOCNO>2431.eng</DOCNO>
  <TITLE>The Hot Springs of Baños del Inca near Cajamarca</TITLE>
  <DESCRIPTION>view of several pools with steaming water; people, houses and 
   trees behind it, and a mountain range in the distant background;</DESCRIPTION>
   <NOTES>Until 1532 the place was called Pulltumarca, before it was renamed to
   "Baños  del Inca" (baths of the Inka) with the arrival of the Spaniards . 
   Today, Baños del Inca is the most-visited therapeutic bath of Peru.</NOTES>
   <LOCATION>Cajamarca, Peru</LOCATION>
   </DOC>        

When I use MATLAB's xmlread() function, I get the following error:

    [Fatal Error] 2431.eng:3:29: Invalid byte 2 of 4-byte UTF-8 sequence.
    ??? Java exception occurred:
    org.xml.sax.SAXParseException: Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)

    Error in ==> xmlread at 98
    parseResult = p.parse(fileName);

Any suggestions on how to get around this problem?

1 answer
我只想做你的唯一
Reply #2 · 2019-08-11 00:00

The sample you posted works just fine.

As the error message says, I think your actual files are incorrectly encoded. Remember that not all possible byte sequences are valid UTF-8 sequences: http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
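One way to confirm this without an XML parser is to try decoding the raw bytes as UTF-8 and report where decoding fails. The following is a minimal Python sketch. The helper name `first_invalid_utf8_offset` is made up for illustration, and the sample bytes are built in memory rather than read from the actual file.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first offending byte
        return exc.start
    return None


# "Ba\u00f1os" is "Baños"; the escape keeps this source file pure ASCII.
latin1_bytes = "Ba\u00f1os".encode("iso-8859-1")  # n-tilde stored as one byte, 0xF1
utf8_bytes = "Ba\u00f1os".encode("utf-8")         # n-tilde stored as 0xC3 0xB1

print(first_invalid_utf8_offset(latin1_bytes))
print(first_invalid_utf8_offset(utf8_bytes))
```

For a real file, the bytes would come from `open(path, "rb").read()`; a non-None result suggests the file is not valid UTF-8.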

A quick way to check is to open the file in Firefox. If the XML file has encoding problems, you'll see an error message like:

XML Parsing Error: not well-formed


EDIT:

So I took a look at the file. Your problem is that XML parsers treat files without an <?xml ... ?> declaration line (and without a byte-order mark) as UTF-8, but your file appears to be encoded as ISO-8859-1 (Latin-1) or Windows-1252 (CP-1252) instead.

For instance, the SAX parser choked on the following token: Baños. The character ñ ("n with tilde", U+00F1) has a different representation in the two encodings:

  • in ISO-8859-1, it is represented as one byte: 0xF1
  • in UTF-8, it is represented as two bytes: 0xC3 0xB1
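The two representations listed above can be checked directly. This is a small Python sketch; the variable names are only illustrative.

```python
# U+00F1 is LATIN SMALL LETTER N WITH TILDE
n_tilde = "\u00f1"

latin1 = n_tilde.encode("iso-8859-1")
cp1252 = n_tilde.encode("cp1252")
utf8 = n_tilde.encode("utf-8")

print(latin1.hex())  # f1
print(cp1252.hex())  # f1
print(utf8.hex())    # c3b1
```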

While UTF-8 was designed to be backward compatible with ASCII, the character ñ falls into the extended ASCII range (bytes 0x80 to 0xFF), whose characters are all represented as two or more bytes in UTF-8.

So when the substring ño, stored in Latin-1 as 11110001 01101111, is interpreted as UTF-8, the parser sees the first byte and recognizes it as the start of a 4-byte UTF-8 sequence of the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. But the second byte, 01101111, does not begin with the bits 10, so it is not a valid continuation byte and an error is thrown:

org.xml.sax.SAXParseException: Invalid byte 2 of 4-byte UTF-8 sequence.
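The bit-pattern reasoning behind that message can be sketched in Python. The two helper functions below are hypothetical names written for this illustration; they are not how Xerces implements the check, only the same rule applied by hand.

```python
def utf8_sequence_length(lead: int) -> int:
    """Length implied by a UTF-8 lead byte, or 0 if the byte cannot start a sequence."""
    if lead & 0b1000_0000 == 0b0000_0000:
        return 1  # 0xxxxxxx: plain ASCII
    if lead & 0b1110_0000 == 0b1100_0000:
        return 2  # 110xxxxx
    if lead & 0b1111_0000 == 0b1110_0000:
        return 3  # 1110xxxx
    if lead & 0b1111_1000 == 0b1111_0000:
        return 4  # 11110xxx
    return 0      # 10xxxxxx (continuation byte) or an invalid lead byte


def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx."""
    return byte & 0b1100_0000 == 0b1000_0000


# The substring "\u00f1o" (n-tilde followed by "o") as stored in Latin-1.
first, second = "\u00f1o".encode("iso-8859-1")

print(format(first, "08b"), utf8_sequence_length(first))
print(format(second, "08b"), is_continuation(second))
```

The first byte announces a 4-byte sequence, but the second byte is not a continuation byte, which matches "Invalid byte 2 of 4-byte UTF-8 sequence".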

Bottom line is: Always use an XML declaration! In your case, add the following line at the beginning of all your files:

<?xml version="1.0" encoding="ISO-8859-1"?>

or better yet, modify the program that generates these files to write the said line.
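If the generating program cannot be changed, the declaration can be prepended in bulk. The following is a rough Python sketch, assuming the files really are Latin-1 and carry no declaration yet. The function name `with_declaration` and the in-memory sample are made up for illustration.

```python
import xml.etree.ElementTree as ET

DECLARATION = b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'


def with_declaration(data: bytes) -> bytes:
    """Prepend an ISO-8859-1 XML declaration unless one is already present."""
    body = data.lstrip()
    if body.startswith(b"<?xml"):
        return data  # leave files that already declare an encoding untouched
    return DECLARATION + body


# A Latin-1 encoded fragment: 0xF1 is the n-tilde in "Baños".
sample = b"<DOC><TITLE>Ba\xf1os del Inca</TITLE></DOC>"

root = ET.fromstring(with_declaration(sample))
print(root.findtext("TITLE"))
```

For files on disk, read and write in binary mode (`"rb"` / `"wb"`) so that the bytes are not re-encoded along the way, and keep a backup until the parsed output has been checked.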

After this change, MATLAB (or really Java) should be able to read the XML file correctly:

>> doc = xmlread('2431.eng');
>> doc.saveXML([])
ans =
<?xml version="1.0" encoding="UTF-16"?>
<DOC>
<DOCNO>annotations/02/2431.eng</DOCNO>
<TITLE>The Hot Springs of Baños del Inca near Cajamarca</TITLE>
<DESCRIPTION>view of several pools with steaming water; people, houses and trees behind it, and a mountain range in the distant background;</DESCRIPTION>
<NOTES>Until 1532 the place was called Pulltumarca, before it was renamed to "Baños del Inca" (baths of the Inka) with the arrival of the Spaniards . Today, Baños del Inca is the most-visited therapeutic bath of Peru.</NOTES>
<LOCATION>Cajamarca, Peru</LOCATION>
<DATE>October 2002</DATE>
<IMAGE>images/02/2431.jpg</IMAGE>
<THUMBNAIL>thumbnails/02/2431.jpg</THUMBNAIL>
</DOC>

(Note: Apparently, once MATLAB reads the file, it internally re-encodes it as UTF-16.)
