I'm having some issues representing 
, a valid UTF-16 construct and also apparently valid in a Windows filename, in XML for consumption in SQL Server XML (2012).
Take this for example:
DECLARE @Xml xml;
SET @Xml = N'<?xml version="1.0"?><FileName>풜〣</FileName>'
-- Result: XML parsing: line 1, character 41, illegal xml character
This is, however, legal Unicode (a "Unicode Noncharacter" ''): http://www.fileformat.info/info/unicode/char/ffff/index.htm
So, I tried this:
DECLARE @Xml xml;
SET @Xml = N'<?xml version="1.0" encoding="utf-16"?><FileName>풜〣</FileName>'
-- Result: XML parsing: line 1, character 59, illegal xml character
So - how am I supposed to represent this file name accurately in XML? I can't just remove the characters, they are indeed 
characters as reported by Get-ChildItem, and I will need to retain a handle to this file.
I can get the XML parsing to work by replacing 
with ï¿¿
which is the UTF-8 representation of \uFFFF
according to this link I then try to take this XML and insert it to a nvarchar
column, and I need this to be the proper representation of the filename.
DECLARE @Xml xml;
SET @Xml = N'<?xml version="1.0"?><FileName>풜〣ï¿¿ï¿¿</FileName>'
SELECT F.Item.value('.', 'nvarchar(2000)') FROM @Xml.nodes('//FileName') as F(Item)
-- Returns 풜〣ï¿¿ï¿¿ (not correct)

(i.e. decimal 65535) is not a legal character, even according to the link provided in the question that states "This is, however, legal UTF-16". That link shows it to be illegal, non-character, and cannot be represented in any way (according to their test pages).
Also, according to Unicode.org:
Noncharacters
These codes are intended for process-internal uses.
FFFE <not a character>
• may be used to detect byte order by contrast with FEFF
→ FEFF = zero width no-break space
FFFF <not a character>
And, according to W3C, the list of valid characters is:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
In order to get this into XML (at least in terms of the SQL Server XML datatype), you need to first replace any occurrences of 
and 
with a custom escape sequence, such as \uFFFE;
and \uFFFF;
resectively. Then, when converting back to NVARCHAR you can simply replace \uFFFE;
with NCHAR(65534)
and \uFFFF;
with NCHAR(65535)
respectively.
OR, you can Base64 Encode the values (rather easy on the app code side) and decode on the way out. If you need to access this on the database side, you can create your own SQLCLR functions to Base64 Encode and Decode, or just grab the Free version of the SQL# library (which I am the author of) which includes Convert_ToBase64
and Convert_FromBase64
and can be used as follows:
DECLARE @Encoded NVARCHAR(200),
@Decoded NVARCHAR(200);
SET @Encoded =
SQL#.Convert_ToBase64(CONVERT(VARBINARY(200), N'f' + NCHAR(65535) + N'g'), 'None');
SELECT CONVERT(XML, N'<test>' + @Encoded + N'</test>');
SET @Decoded = SQL#.Convert_FromBase64(@Encoded);
SELECT @Encoded AS [Encoded],
@Decoded AS [Decoded],
DATALENGTH(@Decoded) AS [NumBytes], -- 6 bytes = 3 characters (most of the time)
UNICODE(SUBSTRING(@Decoded, 2, 1)) AS [TaDa!] -- get value of middle character
Returns:
<test>ZgD//2cA</test>
and then:
Encoded Decoded NumBytes TaDa!
ZgD//2cA fg 6 65535
It seems that it should not be used as part of a filename (I know that is not your doing), or perhaps it is really a valid character that is being read incorrectly by Get-ChildItem
.
I will need to retain a handle to this file.
Is it possible to rename the file to remove the invalid characters prior to importing the names/info into SQL Server? Just a thought.
And just FYI, you are not allowed to change the encoding via the xml declaration, at least not so easily: SQL Server 2008R2 and creating XML document
The set of characters allowed in XML documents is defined by production 2 of the XML specification. It excludes U+FFFF (which is defined as a non-character by Unicode, and which at the time XML was being developed was not allowed in information interchange using Unicode).
That means you cannot represent U+FFFF literally in an XML document, or using an XML numeric character reference. You can of course invent your own escape mechanism, or use something like URI escaping, to encode the character in your data; before inserting the data into applications that allow U+FFFF, you will of course have to unescape your representation.
I wonder why non-characters are allowed in Windows file names.