How do I properly handle ￿ in UTF-8 XML?

2019-05-22 22:18发布

问题:

I'm having some issues representing , a valid UTF-16 construct and also apparently valid in a Windows filename, in XML for consumption in SQL Server XML (2012).

Take this for example:

DECLARE @Xml xml;
SET @Xml = N'<?xml version="1.0"?><FileName>풜〣&#xFFFF;&#xFFFF;</FileName>'

-- Result: XML parsing: line 1, character 41, illegal xml character

This is, however, legal Unicode (a "Unicode Noncharacter" '￿'): http://www.fileformat.info/info/unicode/char/ffff/index.htm

So, I tried this:

DECLARE @Xml xml;
SET @Xml = N'<?xml version="1.0" encoding="utf-16"?><FileName>풜〣&#xFFFF;&#xFFFF;</FileName>'

-- Result: XML parsing: line 1, character 59, illegal xml character

So - how am I supposed to represent this file name accurately in XML? I can't just remove the characters, they are indeed &#xFFFF; characters as reported by Get-ChildItem, and I will need to retain a handle to this file.

I can get the XML parsing to work by replacing &#xFFFF; with &#xEF;&#xBF;&#xBF; which is the UTF-8 representation of \uFFFF according to this link I then try to take this XML and insert it to a nvarchar column, and I need this to be the proper representation of the filename.

DECLARE @Xml xml;
SET @Xml = N'<?xml version="1.0"?><FileName>풜〣&#xEF;&#xBF;&#xBF;&#xEF;&#xBF;&#xBF;</FileName>'
SELECT F.Item.value('.', 'nvarchar(2000)') FROM @Xml.nodes('//FileName') as F(Item)

-- Returns 풜〣ï¿¿ï¿¿ (not correct)

回答1:

&#xFFFF; (i.e. decimal 65535) is not a legal character, even according to the link provided in the question that states "This is, however, legal UTF-16". That link shows it to be illegal, non-character, and cannot be represented in any way (according to their test pages).

Also, according to Unicode.org:

Noncharacters
These codes are intended for process-internal uses.

FFFE ￾ <not a character>
• may be used to detect byte order by contrast with FEFF 
→ FEFF = zero width no-break space

FFFF ￿ <not a character>

And, according to W3C, the list of valid characters is:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */


In order to get this into XML (at least in terms of the SQL Server XML datatype), you need to first replace any occurrences of &#xFFFE; and &#xFFFF; with a custom escape sequence, such as \uFFFE; and \uFFFF; resectively. Then, when converting back to NVARCHAR you can simply replace \uFFFE; with NCHAR(65534) and \uFFFF; with NCHAR(65535) respectively.

OR, you can Base64 Encode the values (rather easy on the app code side) and decode on the way out. If you need to access this on the database side, you can create your own SQLCLR functions to Base64 Encode and Decode, or just grab the Free version of the SQL# library (which I am the author of) which includes Convert_ToBase64 and Convert_FromBase64 and can be used as follows:

DECLARE @Encoded NVARCHAR(200),
        @Decoded NVARCHAR(200);

SET @Encoded =
    SQL#.Convert_ToBase64(CONVERT(VARBINARY(200), N'f' + NCHAR(65535) + N'g'), 'None');

SELECT CONVERT(XML, N'<test>' + @Encoded + N'</test>');

SET @Decoded = SQL#.Convert_FromBase64(@Encoded);
SELECT @Encoded AS [Encoded],
       @Decoded AS [Decoded],
       DATALENGTH(@Decoded) AS [NumBytes], -- 6 bytes = 3 characters (most of the time)
       UNICODE(SUBSTRING(@Decoded, 2, 1)) AS [TaDa!] -- get value of middle character

Returns:

<test>ZgD//2cA</test>

and then:

Encoded     Decoded     NumBytes    TaDa!
ZgD//2cA    fg          6           65535

It seems that it should not be used as part of a filename (I know that is not your doing), or perhaps it is really a valid character that is being read incorrectly by Get-ChildItem.

I will need to retain a handle to this file.

Is it possible to rename the file to remove the invalid characters prior to importing the names/info into SQL Server? Just a thought.


And just FYI, you are not allowed to change the encoding via the xml declaration, at least not so easily: SQL Server 2008R2 and creating XML document



回答2:

The set of characters allowed in XML documents is defined by production 2 of the XML specification. It excludes U+FFFF (which is defined as a non-character by Unicode, and which at the time XML was being developed was not allowed in information interchange using Unicode).

That means you cannot represent U+FFFF literally in an XML document, or using an XML numeric character reference. You can of course invent your own escape mechanism, or use something like URI escaping, to encode the character in your data; before inserting the data into applications that allow U+FFFF, you will of course have to unescape your representation.

I wonder why non-characters are allowed in Windows file names.