I want to know whether there is a quick way to check that an XML document is correctly encoded in UTF-8 and does not contain any characters that are not allowed in a UTF-8 encoded XML document.
<?xml version="1.0" encoding="utf-8"?>
thanks in advance, George
EDIT1: here is the content of my XML file, in both text form and in binary form.
http://tinypic.com/view.php?pic=2r2akvr&s=5
I have tried checking the file with tools like xmlstarlet. The result is correct (the file is invalid because a character is outside the allowed range), but the error message looks wrong to me: in the dump linked above, there is no character whose value is 0xDFDD. Any ideas?
BTW: I can send the XML file to anyone, but I did not find a way to upload the file as attachment here. If anyone needs this file for analysis, please feel free to let me know.
D:\xmlstarlet-1.0.1-win32\xmlstarlet-1.0.1>xml val a.xml
a.xml:2: parser error : Char 0xDFDD out of allowed range
<URL>student=1砜濏磦</URL>
^
a.xml:2: parser error : Char 0xDFDD out of allowed range
<URL>student=1砜濏磦</URL>
^
a.xml:2: parser error : internal error
<URL>student=1砜濏磦</URL>
^
a.xml:2: parser error : Extra content at the end of the document
<URL>student=1砜濏磦</URL>
^
a.xml - invalid
EDIT2: I have also tried to validate the XML file with libxml, but I got an error when starting the tool. Here is a screenshot. Any ideas?
http://tinypic.com/view.php?pic=2ildjpe&s=5
OS is Windows Server 2003 x64.
libxml2 can do it. It is available as a library (to integrate into your programs) or through the command-line tool xmllint. Here is an example with xmllint:
I don't know what's causing your problem, but it isn't a limitation of UTF-8 or an error in the encoding process. UTF-8 can encode every character known to Unicode, and the problematic byte sequences (ED BF 9D and ED B4 82) are structurally well-formed: the first byte starts with 1110 to indicate a three-byte sequence, and each of the other two bytes starts with 10, as continuation bytes are supposed to. It's the values they're trying to encode that are invalid.
Your problem characters are U+DFDD and U+DD02. The fact that there are two characters from the range used for surrogate pairs might seem to suggest that they were meant to be a surrogate pair, but that doesn't work. It's UTF-16 that employs surrogate pairs; UTF-8 would encode the character as a single, four-byte sequence.
Another possibility is modified UTF-8, which does encode each half of the surrogate pair separately. But that doesn't work either: a surrogate pair is always made up of one code unit from the high-surrogate range (U+D800..U+DBFF) followed by one from the low-surrogate range (U+DC00..U+DFFF). These characters are both from the low-surrogate range.
So it appears to be a matter of bad data rather than faulty encoding. It would help a lot if we knew what those characters were supposed to be. Failing that, some info about what kind of data you're expecting (what languages, for example), where the data came from, what's been done to it... that kind of thing.
I presume you want to do this programmatically? In that case, this is highly dependent on what programming language you're using - which language would it be?
For example, I have used this code before in PHP. preg_match allows a /u modifier (which I think is PHP-specific) that treats both the pattern and the string it is matched against as UTF-8. A side effect is that the whole string is checked for UTF-8 validity each time you do this. HTML/XHTML doesn't allow C0/C1 control codes apart from whitespace such as tab, line feed and carriage return, so I also added a way to check for those here too.
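The PHP code itself isn't shown, so here is a rough Python equivalent of the same two-step idea: a strict UTF-8 decode, followed by a scan for characters outside the XML 1.0 "Char" production. Note that this is an assumption about what to check; the XML 1.0 production still permits C1 controls, so it is a little looser than the HTML rule mentioned above. The name check_xml_utf8 is hypothetical.

```python
import re

# Matches any character NOT allowed by the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML10_ILLEGAL = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def check_xml_utf8(data: bytes) -> list:
    """Return a list of problems; an empty list means both checks passed."""
    try:
        # Strict decoding rejects truncated sequences, overlong forms
        # and encoded surrogates such as ED BF 9D.
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return [f"invalid UTF-8 at byte offset {exc.start}: {exc.reason}"]
    return [
        f"U+{ord(m.group()):04X} at character index {m.start()} "
        "is not allowed in XML 1.0"
        for m in _XML10_ILLEGAL.finditer(text)
    ]


if __name__ == "__main__":
    for sample in (b"<URL>student=1</URL>",
                   b"<URL>student=1\xed\xbf\x9d</URL>",
                   b"<URL>student=1\x01</URL>"):
        print(sample, "->", check_xml_utf8(sample) or "ok")
```

This only looks at the raw bytes; it does not tell you whether the document is well-formed XML, which still needs a real parser.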
Another way would be to use the DOM, which is available in many languages. The DOM document object has a LoadXML method, which loads the document from an XML-formatted string. This will fail if the input is not valid in whatever character encoding it declares. It won't specifically enforce UTF-8, but if loading succeeds you can then check the document object's "encoding" property to see which encoding was used.
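As a sketch of the DOM approach in Python (swapping in the standard library's minidom for the LoadXML method described above, which is an assumption on my part about the closest equivalent), you can parse the bytes and then read the declared encoding from the document object. The helper name load_and_report is made up.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def load_and_report(xml_bytes: bytes):
    """Return (True, declared_encoding) on success, (False, error_text) on failure."""
    try:
        doc = minidom.parseString(xml_bytes)
    except ExpatError as exc:
        # Raised for any well-formedness problem, including bytes that
        # are not valid in the encoding the document declares.
        return False, str(exc)
    # doc.encoding holds the value from the XML declaration,
    # or None if the document did not declare one.
    return True, doc.encoding


if __name__ == "__main__":
    good = b'<?xml version="1.0" encoding="utf-8"?><URL>student=1</URL>'
    bad = b'<?xml version="1.0" encoding="utf-8"?><URL>student=1\xed\xbf\x9d</URL>'
    print(load_and_report(good))
    print(load_and_report(bad))
```

A successful load with an encoding other than utf-8 (or None) tells you the file parsed, but not as declared UTF-8, so you would still want to compare the returned value yourself.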
The easiest way to do this is to simply run the XML through a command line utility to perform this check.
I always have a copy of XMLStarlet available for stuff like this. It will tell you immediately whether it can parse your XML, and thus whether the encoding is correct.
If you're looking for a coded method to do this, simply load the XML into your XML parser of choice. An encoding error will immediately trigger a parser exception (since the encoding is wrong, parsing can't take place, by definition).
e.g.
Next use the load method to load the XML document from the specified stream.
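For what it's worth, the "load it from a stream and see if the parser throws" idea could look like this in Python with the standard library's ElementTree. This is only a sketch under that assumption (the parser and language the answers above had in mind are not stated), and first_parse_error is a made-up helper name.

```python
import io
import xml.etree.ElementTree as ET


def first_parse_error(stream):
    """Return None if the stream parses, else (line, column, message)."""
    try:
        ET.parse(stream)
    except ET.ParseError as exc:
        # exc.position is a (line, column) tuple for the first error.
        line, column = exc.position
        return line, column, str(exc)
    return None


if __name__ == "__main__":
    data = (b'<?xml version="1.0" encoding="utf-8"?>\n'
            b"<URL>student=1\xed\xbf\x9d</URL>")
    # BytesIO stands in for an open file; open("a.xml", "rb") works the same way.
    print(first_parse_error(io.BytesIO(data)))
```

The line and column in the result point at where the parser gave up, which is usually enough to find the offending bytes in a hex dump.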
Try these out:
http://validator.w3.org/#validate_by_input
http://www.w3schools.com/XML/xml_validator.asp