XML encoding issue

I want to know whether there is a quick way to check that an XML document is correctly encoded in UTF-8 and does not contain any characters that are not allowed in UTF-8-encoded XML.

<?xml version="1.0" encoding="utf-8"?>

thanks in advance, George

EDIT1: Here is the content of my XML file, in both text and binary form.

http://tinypic.com/view.php?pic=2r2akvr&s=5

I have tried tools such as xmlstarlet to check the file. The verdict is correct (the file is invalid because a character is outside the allowed range), but the error message seems wrong: in the file shown at the link above, there is no character whose value is 0xDFDD. Any ideas?

BTW: I can send the XML file to anyone, but I did not find a way to upload it as an attachment here. If anyone needs the file for analysis, please let me know.

D:\xmlstarlet-1.0.1-win32\xmlstarlet-1.0.1>xml val a.xml
a.xml:2: parser error : Char 0xDFDD out of allowed range
<URL>student=1砜濏磦</URL>
              ^
a.xml:2: parser error : Char 0xDFDD out of allowed range
<URL>student=1砜濏磦</URL>
              ^
a.xml:2: parser error : internal error
<URL>student=1砜濏磦</URL>
              ^
a.xml:2: parser error : Extra content at the end of the document
<URL>student=1砜濏磦</URL>
              ^
a.xml - invalid

EDIT2: I have also tried the libxml tool to validate the XML file, but it fails with an error on startup. Here is a screenshot. Any ideas?

http://tinypic.com/view.php?pic=2ildjpe&s=5

OS is Windows Server 2003 x64.

Tags: .net xml unicode encoding utf-8

5 answers

【Aperson】

#2 · 2019-06-01 06:12

libxml2 can do it. It is available as a library (to integrate into your programs) or through the command-line tool xmllint. Here is an example with xmllint:

[Proper file] 
% head test.xml
<?xml version="1.0" encoding="utf-8"?>
<café>Ils s'étaient ...

% xmllint --noout test.xml
% 

[One byte in a multibyte character removed]
% xmllint --noout test.xml
test.xml:2: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xC3 0x74 0x61 0x69
<café>Ils s'Ãtaient ...
             ^
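
A similar check can be approximated without libxml2 by strictly decoding the raw bytes. The following is only a sketch in Python (the question does not name a language, so that choice is an assumption), and the helper name first_utf8_error is made up for illustration. It reports the offset of the first undecodable byte together with the next few bytes, in the spirit of the "Bytes:" line above.

```python
def first_utf8_error(data: bytes):
    """Return None if data is valid UTF-8, else (offset, next four bytes)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte that could not be decoded.
        return exc.start, data[exc.start:exc.start + 4]
    return None


good = "<café>Ils s'étaient ...".encode("utf-8")
# Drop the second byte of the "é" in "étaient", as in the broken file above.
bad = good.replace("étaient".encode("utf-8"), b"\xc3taient")

print(first_utf8_error(good))
print(first_utf8_error(bad))
```

Python's strict UTF-8 decoder also rejects encoded surrogates, so the byte sequences from the question would be flagged by this check as well.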


Lonely孤独者°

#3 · 2019-06-01 06:21

I don't know what's causing your problem, but it isn't a limitation of UTF-8 or an error in the encoding process. UTF-8 can encode every character known to Unicode, and the problematic byte sequences (ED BF 9D and ED B4 82) are structurally well formed: the first byte starts with 1110 to indicate a three-byte sequence, and each of the other two bytes starts with 10, as continuation bytes are supposed to. It's the values they're trying to encode that are invalid.

Unicode and ISO/IEC 10646 do not, and will never, assign characters to any of the code points in the U+D800–U+DFFF range, so an individual code value from a surrogate pair does not ever represent a character. -Wikipedia

Your problem characters are U+DFDD and U+DD02. The fact that there are two characters from the range used for surrogate pairs might seem to suggest that they were meant to be a surrogate pair, but that doesn't work. It's UTF-16 that employs surrogate pairs; UTF-8 would encode the character as a single, four-byte sequence.
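
To see where U+DFDD and U+DD02 come from, the payload bits of each three-byte sequence can be extracted by hand. This is a small Python sketch (the language and the helper name decode_three_byte are assumptions for illustration). It deliberately skips the surrogate check that a real UTF-8 decoder performs, so that the encoded value becomes visible.

```python
def decode_three_byte(seq: bytes) -> int:
    """Extract the code point from one 1110xxxx 10xxxxxx 10xxxxxx sequence.

    Unlike a conforming UTF-8 decoder, this does not reject surrogates.
    """
    if len(seq) != 3:
        raise ValueError("expected exactly three bytes")
    b0, b1, b2 = seq
    if b0 >> 4 != 0b1110 or b1 >> 6 != 0b10 or b2 >> 6 != 0b10:
        raise ValueError("not a three-byte lead byte plus two continuation bytes")
    # 4 payload bits from the lead byte, 6 from each continuation byte.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


for seq in (bytes.fromhex("EDBF9D"), bytes.fromhex("EDB482")):
    cp = decode_three_byte(seq)
    shown = " ".join(f"{b:02X}" for b in seq)
    print(f"{shown} -> U+{cp:04X}, in surrogate range: {0xD800 <= cp <= 0xDFFF}")
```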

Another possibility is modified UTF-8, which does encode each code unit of the surrogate pair separately. But that doesn't work either: a surrogate pair is always made up of one code unit from the high-surrogate range (U+D800..U+DBFF) followed by one from the low-surrogate range (U+DC00..U+DFFF). These two values are both from the low-surrogate range.
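
The pairing rule can be checked mechanically. In UTF-16, a high (leading) surrogate lies in U+D800..U+DBFF and a low (trailing) surrogate in U+DC00..U+DFFF. The Python sketch below (hypothetical helper name combine_surrogates) combines a well-formed pair and shows that the two values from the question cannot form one.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one code point, or raise ValueError."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"U+{high:04X} is not a high (leading) surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"U+{low:04X} is not a low (trailing) surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# A well-formed pair, chosen only as an example.
print(hex(combine_surrogates(0xD83D, 0xDE00)))

# The two values from the question, in the order they appear.
try:
    combine_surrogates(0xDFDD, 0xDD02)
except ValueError as err:
    print(err)
```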

So it appears to be a matter of bad data rather than faulty encoding. It would help a lot if we knew what those characters were supposed to be. Failing that, some info about what kind of data you're expecting (what languages, for example), where the data came from, what's been done to it... that kind of thing.


贼婆χ

#4 · 2019-06-01 06:22

I presume you want to do this programmatically? In that case, this is highly dependent on what programming language you're using - which language would it be?

For example, I have used this code before in PHP. preg_match allows a /u modifier (which I think is PHP-specific) which treats the pattern, and the string it is being matched against, as UTF-8. A side-effect is that the whole string is checked for UTF-8 validity each time you do this. HTML/XHTML doesn't allow C0/C1 control codes apart from tab, new line, space etc, so I also added a way to check for those here too.

// Returns true if $this->string is a valid UTF-8 string, false otherwise.
// If $allowcontrolcodes is false (the default), most C0 codes below 0x20, as
// well as DEL (127) and the C1 codes 128-159, are rejected; false is
// recommended for HTML/XML.
function validate($allowcontrolcodes = false)
{
    // An empty string is treated as not valid (there is nothing to match).
    if ($this->string === '') return false;
    // With /u, preg_match returns false when the subject is not valid UTF-8.
    return preg_match($allowcontrolcodes
        ? '/^[\x00-\x{d7ff}\x{e000}-\x{10ffff}]++$/u'
        : '/^[\x20-\x7e\x0a\x09\x0d\x{a0}-\x{d7ff}\x{e000}-\x{10ffff}]++$/u',
        $this->string) === 1;
}
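
For comparison, here is a rough Python counterpart that follows the Char production of the XML 1.0 specification rather than the PHP pattern above. It is a sketch, not a drop-in replacement: the names check_xml_text and _XML10_ILLEGAL are invented for illustration, and the allowed ranges differ slightly from the PHP version, because XML 1.0 discourages but still permits U+007F..U+009F and excludes U+FFFE and U+FFFF.

```python
import re

# Negation of the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML10_ILLEGAL = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def check_xml_text(data: bytes):
    """Return a list of problems found in data; an empty list means none."""
    try:
        # Strict decoding rejects malformed sequences and encoded surrogates.
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"invalid UTF-8 at byte {exc.start}: {exc.reason}"]
    return [
        f"U+{ord(m.group()):04X} at character {m.start()} is not allowed in XML 1.0"
        for m in _XML10_ILLEGAL.finditer(text)
    ]


print(check_xml_text("<URL>student=1</URL>".encode("utf-8")))
print(check_xml_text(b"<URL>student=1\xed\xbf\x9d</URL>"))
print(check_xml_text(b"<URL>\x01</URL>"))
```

This only inspects characters; it does not check that the markup itself is well formed, which still requires an XML parser.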

Another way would be to use the DOM, which is available in many languages. The DOM document object has a LoadXML method that loads the document from an XML-formatted string. Loading fails if the input is not valid in the character encoding it declares. It does not specifically enforce UTF-8, but if loading succeeds you can check the document object's "encoding" property to see which encoding was declared.
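
As one illustration of the DOM approach, Python's standard xml.dom.minidom can parse a byte string and expose the declared encoding. This is a sketch under the assumption that Python is acceptable; the function name declared_encoding is hypothetical, and other DOM implementations name these members differently.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def declared_encoding(data: bytes):
    """Parse data as XML.

    Returns (True, encoding from the XML declaration or None) on success,
    or (False, parser error message) if the document is rejected.
    """
    try:
        doc = minidom.parseString(data)
    except ExpatError as err:
        return False, str(err)
    return True, doc.encoding


ok_doc = b'<?xml version="1.0" encoding="utf-8"?><a>caf\xc3\xa9</a>'
# Same document with the continuation byte of the "é" replaced by "t".
broken_doc = b'<?xml version="1.0" encoding="utf-8"?><a>caf\xc3t</a>'

print(declared_encoding(ok_doc))
print(declared_encoding(broken_doc))
```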


Bombasti

#5 · 2019-06-01 06:22

The easiest way to do this is to simply run the XML through a command line utility to perform this check.

I always have a copy of XMLStarlet available for stuff like this. It'll indicate immediately whether it can parse your XML, and thus whether the encoding is correct.

If you're looking for a coded method to do this, simply load the XML into your XML parser of choice. An encoding error will immediately trigger a parser exception (since the encoding is wrong, parsing can't take place, by definition).

e.g.

XmlDocument xDoc = new XmlDocument();

Next, use the Load method to load the XML document from the specified file. If the encoding is wrong, the call throws an XmlException.

xDoc.Load("sampleXML.xml");
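
The same load-and-catch pattern, sketched in Python with the standard xml.etree.ElementTree module for readers not on .NET. The function name try_load is hypothetical. The parser's exception carries a line and column, which helps locate the offending bytes.

```python
import xml.etree.ElementTree as ET


def try_load(path: str):
    """Return None if the file parses, else a 'path:line:column: message' string."""
    try:
        ET.parse(path)
    except ET.ParseError as err:
        # position is a (line, column) tuple reported by the underlying parser.
        line, column = err.position
        return f"{path}:{line}:{column}: {err}"
    return None
```

A call such as try_load("sampleXML.xml") returns None for a parsable file and a message otherwise; a missing file still raises the usual OSError.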


做自己的国王

#6 · 2019-06-01 06:39

Try these out
