Why does casting a UTF-8 VARCHAR column to XML req

2019-07-31 10:48发布

问题:

I am trying to convert data in a varchar column to XML but I was getting errors with certain characters. Running this ...

-- This fails
DECLARE @Data VARCHAR(1000) = '<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test¦</NewDataSet>';
SELECT CAST(@Data AS XML) AS DataXml

... results in the following error

Msg 9420, Level 16, State 1, Line 3
XML parsing: line 1, character 55, illegal xml character

It appears that it's the broken pipe character that is causing the error but I thought that it was a valid character for UTF-8. Looking at the XML spec it appears to be valid.

When I change it to this ...

-- This works
DECLARE @Data VARCHAR(1000) = '<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test¦</NewDataSet>';
SELECT CAST(REPLACE(CAST(@Data AS NVARCHAR(MAX)), 'encoding="utf-8"', '') AS XML) AS DataXml

... it works without error (replacing encoding string to utf-16 also works). I'm using SQL Server 2008 R2 with SQL_Latin1_General_CP1_CI_AS Coallation.

Can anyone tell my why I need to convert to NVARCHAR and strip the encoding="utf-8" for this to work?

Thanks,

Edit

It appears that this also works ...

DECLARE @Data VARCHAR(1000) = '<?xml version="1.0" encoding="utf-8"?><NewDataSet>Test¦</NewDataSet>';
SELECT CAST(REPLACE(@Data, 'encoding="utf-8"', '') AS XML) AS DataXml

Removing the utf-8 encoding from the prolog is sufficient for SQL Server to do the conversion.

回答1:

Your pipe character is using Unicode codepoint U+00A6 BROKEN BAR instead of U+007C VERTICAL LINE. U+00A6 is outside of ASCII. VARCHAR does not support non-ASCII characters. That is why you have to use NVARCHAR instead, which is designed to handle Unicode data.