SQL Server - defining an XML type column with UTF-

2019-01-09 18:30发布

问题:

The default encoding for an XML type field defined in an SQL Server is UTF-16. I have no trouble inserting into that field with UTF-16 encoded XML streams.

But if I tried to insert into the field with UTF-8 encoded XML stream, the insert attempt would receive the error response
unable to switch encoding.

QUESTION: Is there a way to define a SQL Server column/field as having UTF-8 encoding?

Further info

The insertion operations are performed using Spring JDBCTemplate.

The XML Stream was produced by JAXB Marshaller set to UTF-8 or UTF-16 encoding.

private String marshall(myDAO myTao, JAXBEncoding jaxbEncoding)
throws JAXBException{
    JAXBContext jc = JAXBContext.newInstance(ObjectFactory.class);
    m = jc.createMarshaller();
    m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
    if (jaxbEncoding!=null)
        m.setProperty(Marshaller.JAXB_ENCODING, jaxbEncoding.toString());
    StringWriter strw = new StringWriter();
    m.marshal(myTao, strw);
    String strw.toString();
}

Where ...

public enum JAXBEncoding {
    UTF8("UTF-8"),
    UTF16("UTF-16")
    ;

    private String value;
    private JAXBEncoding(String value){
        this.value = value;
    }

    public String toString(){
        return this.value;
    }
}

回答1:

Is there a way to define a SQL Server column/field as having UTF-8 encoding?

No, the only Unicode encoding in SQL Server is UTF-16 Little Endian, which is how the NCHAR, NVARCHAR, NTEXT (deprecated as of SQL Server 2005 so don't use this in new development; besides, it sucks compared to NVARCHAR(MAX) anyway), and XML datatypes are handled. You do not get a choice of Unicode encodings like some other RDBMS's allow.

You can insert UTF-8 encoded XML into SQL Server, provided you follow these three rules:

  1. The incoming string has to be of datatype VARCHAR, not NVARCHAR (as NVARCHAR is always UTF-16 Little Endian, hence the error about not being able to switch the encoding).
  2. The XML has an XML declaration that explicitly states that the encoding of the XML is indeed UTF-8: <?xml version="1.0" encoding="UTF-8" ?>.
  3. The byte sequence needs to be the actual UTF-8 bytes.

For example, we can import a UTF-8 encoded XML document containing the screaming face emoji (and we can get the UTF-8 byte sequence for that Supplementary Character by following that link):

SET NOCOUNT ON;
DECLARE @XML XML = '<?xml version="1.0" encoding="utf-8"?><root><test>'
                    + CHAR(0xF0) + CHAR(0x9F) + CHAR(0x98) + CHAR(0xB1)
                    + '</test></root>';

SELECT @XML;
PRINT CONVERT(NVARCHAR(MAX), @XML);

Returns (in both "Results" and "Messages" tabs):

<root><test>