I have several XML files to process. A sample file is given below:
<DOC>
<DOCNO>2431.eng</DOCNO>
<TITLE>The Hot Springs of Baños del Inca near Cajamarca</TITLE>
<DESCRIPTION>view of several pools with steaming water; people, houses and
trees behind it, and a mountain range in the distant background;</DESCRIPTION>
<NOTES>Until 1532 the place was called Pulltumarca, before it was renamed to
"Baños del Inca" (baths of the Inka) with the arrival of the Spaniards .
Today, Baños del Inca is the most-visited therapeutic bath of Peru.</NOTES>
<LOCATION>Cajamarca, Peru</LOCATION>
</DOC>
When I call MATLAB's xmlread() function on such a file, I get the following error:
[Fatal Error] 2431.eng:3:29: Invalid byte 2 of 4-byte UTF-8 sequence.
??? Java exception occurred:
org.xml.sax.SAXParseException: Invalid byte 2 of 4-byte UTF-8 sequence.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
Error in ==> xmlread at 98
parseResult = p.parse(fileName);
Any suggestions on how to get around this problem?
The sample you posted works just fine.
As the error message says, I think your actual files are incorrectly encoded. Remember that not all possible byte sequences are valid UTF-8 sequences: http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
A quick way to check is to open the file in Firefox. If the XML file has encoding problems, Firefox shows an XML parsing error instead of the document tree.
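If a browser is not handy, the same check can be done programmatically. The following is only a sketch: the helper name first_invalid_utf8 is made up for illustration, and the sample string is the title line from the question, encoded by hand in both ways.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, byte) of the first undecodable byte, or None if data is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the zero-based offset of the offending byte
        return err.start, data[err.start]
    return None


# Illustrative sample: the title line from the question, encoded two ways.
title = "<TITLE>The Hot Springs of Baños del Inca near Cajamarca</TITLE>"
print(first_invalid_utf8(title.encode("utf-8")))       # valid UTF-8
print(first_invalid_utf8(title.encode("iso-8859-1")))  # Latin-1 bytes read as UTF-8
```

For a real file, pass in the raw bytes, for example pathlib.Path("2431.eng").read_bytes(). With the Latin-1 sample above, the reported zero-based offset is 28, which appears to line up with column 29 in the parser's error message.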
EDIT:
So I took a look at the file. Your problem is that XML parsers treat files without the <?xml ... ?> declaration line as UTF-8, but your file appears to be encoded as ISO-8859-1 (Latin-1) or Windows-1252 (CP-1252) instead.

For instance, the SAX parser choked on the token "Baños". The character ñ ("n with tilde", U+00F1) has a different representation in the two encodings:

ISO-8859-1: 11110001 (0xF1)
UTF-8:      11000011 10110001 (0xC3 0xB1)

While UTF-8 was designed to be backward compatible with ASCII, the character ñ lies outside the 7-bit ASCII range (in the so-called extended ASCII range), and every character in that range is represented by two or more bytes in UTF-8.

So when the substring "ño", stored in Latin-1 as

11110001 01101111

is interpreted as UTF-8, the parser sees the first byte and recognizes it as the beginning of a 4-byte UTF-8 sequence of the form

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

But the second byte, 01101111, does not begin with the bits 10, so it is not a valid continuation byte, and an error is thrown:

Invalid byte 2 of 4-byte UTF-8 sequence.

Bottom line: always use an XML declaration! In your case, add the following line at the beginning of all your files:

<?xml version="1.0" encoding="ISO-8859-1"?>
or better yet, modify the program that generates these files so that it writes this declaration itself.
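If regenerating the files is not an option, the declaration can be added in bulk. Below is a minimal sketch, assuming the files really are ISO-8859-1 and that the declaration to add is <?xml version="1.0" encoding="ISO-8859-1"?>. The function names and the default "*.eng" pattern are made up for illustration (the pattern is only guessed from the file name in the question).

```python
from pathlib import Path

# Assumption: the files are ISO-8859-1; change the label if they are Windows-1252.
DECLARATION = b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'


def add_declaration(path: Path) -> bool:
    """Prepend the XML declaration unless the file already starts with one.

    Works on raw bytes, so the existing content is never re-encoded.
    Returns True if the file was modified.
    """
    data = path.read_bytes()
    if data.lstrip().startswith(b"<?xml"):
        return False
    path.write_bytes(DECLARATION + data)
    return True


def add_declaration_to_all(folder: Path, pattern: str = "*.eng") -> int:
    """Apply add_declaration to every matching file; return how many were changed."""
    return sum(add_declaration(p) for p in sorted(folder.glob(pattern)))
```

Running it twice is harmless, since files that already start with a declaration are skipped. It is still worth trying it on a copy of the data first.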
After this change, MATLAB (or really Java) should be able to read the XML file correctly.
(Note: Apparently once MATLAB reads the file, it internally re-encodes it as UTF-16)
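The effect of the declaration can be reproduced outside MATLAB. The sketch below uses Python's standard xml.etree.ElementTree parser rather than Xerces, so the error text differs, but the behaviour is analogous: the same Latin-1 bytes are rejected without a declaration and accepted with one. The shortened document and the try_parse helper are illustrative only.

```python
import xml.etree.ElementTree as ET

# A shortened stand-in for the real file, stored as ISO-8859-1 bytes.
body = "<DOC><TITLE>The Hot Springs of Baños del Inca</TITLE></DOC>".encode("iso-8859-1")
declaration = b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'


def try_parse(data: bytes) -> str:
    """Return the TITLE text, or a short description of the parse failure."""
    try:
        return ET.fromstring(data).findtext("TITLE")
    except ET.ParseError as err:
        return f"ParseError: {err}"


print(try_parse(body))                # no declaration: bytes are assumed to be UTF-8
print(try_parse(declaration + body))  # declared as ISO-8859-1: parses, ñ intact
```

Once parsed, the title is an ordinary Unicode string, independent of the file's original encoding, much like the internal re-encoding noted above for MATLAB.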