While researching options for storing mostly-English-but-sometimes-not data in a SQL Server database that can potentially be quite large, I'm leaning toward storing most string data as UTF-8 encoded.
However, Microsoft chose UCS-2 for reasons that I don't fully understand which is causing me to second-guess that leaning. The documentation for SQL Server 2012 does show how to create a UTF-8 UDT, but the decision for UCS-2 presumably pervades SQL Server.
Wikipedia (which interestingly notes that UCS-2 is obsolete in favor of UTF-16) notes that UTF-8 is a variable-width character set capable of encoding any Unicode data point and that it provides the de facto standard encoding for interchange of Unicode text
. So, it feels like any Unicode character can be represented in UTF-8, and since most text will be English, the representation will be nearly twice as compact as with UCS-2 (I know disk is "cheap", but disk cache isn't, and memory isn't in comparison to the data sizes I'm dealing with. Many operations degrade exponentially when the working set is larger than available RAM).
What problems might I encounter by swimming up the UCS-2 stream?
What do you mean by "swimming up the UCS-2 stream"?
Here are your options:
Not recommended but possible:
See the comments to read about the severe drawbacks that these approaches have.
Unlike some other RDBMS's that allow for choosing an encoding, SQL Server stores Unicode data only in UTF-16 (Little Endian), and non-Unicode data in an 8-bit encoding (Extended ASCII, DBCS, or EBCDIC) for whatever Code Page is implied by the Collation of the field.
Their decision to choose UCS-2 makes sense enough given that UTF-16 was introduced in mid-1996 and fully specified in 2000. A lot of other systems use (or used) it as well (please see: https://en.wikipedia.org/wiki/UTF-16#Usage). Their decision to continue with it might be more questionable, though it is probably due to Windows and .NET being UTF-16. The physical layout of the bytes is the same between UCS-2 and UTF-16, so upgrading systems from UCS-2 to support UTF-16 should be purely functional with no need to alter any existing data.
Um, no. Creating a custom User-Defined Type via SQLCLR is not, in any way, going to get you a replacement of any native type. It is very handy for creating something to handle specialized data. But strings, even of a different encoding, are far from specialized. Going this route for your string data would destroy any amount of usability of your system, not to mention performance as you wouldn't be able to use any built-in string functions. If you were able to save anything on disk space, those gains would be erased by what you would lose in overall performance. Storing a UDT is done by serializing it to a
VARBINARY
. So in order to do any string comparison OR sorting, outside of a "binary" / "ordinal" comparison, you would have to convert all other values, one by one, back to UTF-8 to then do the string compare that can account of linguistic differences.Also, that "documentation" is really just sample code / proof of concept stuff. The code was written in 2003 ( http://msftengprodsamples.codeplex.com/SourceControl/latest#Kilimanjaro_Trunk/Programmability/CLR/UTF8String/CS/UTF8String/Utf8String.cs ) for SQL Server 2005. I saw a script to test functionality, but nothing involving performance.
Yes, very much so. By default, the handling of the built-in functions is only for UCS-2. But starting in SQL Server 2012, you can get them to handle the full UTF-16 character set (well, as of Unicode Version 5 or 6, depending on your OS and version of the .NET Framework) by using one of the collations that has a name ending in
_SC
(i.e. Supplementary Characters).Correct. UTF-16 and UCS-2 both use 2-byte code points. But UTF-16 uses some of them in pairs (i.e. Surrogate Pairs) to map additional characters. The code points used for these pairs are reserved for this purpose in UCS-2 and hence are not used to map to any usable symbols. This is why you can store any Unicode character in SQL Server and it will be stored and retrieved correctly.
Correct, though misleading. Yes, UTF-8 is variable-width, but UTF-16 is also minorly variable since all of the Supplementary Characters are composed of two double-byte code points. Hence UTF-16 uses either 2 or 4 bytes per symbol, though UCS-2 is always 2 bytes. But that is not the misleading part. What is misleading is the implication that any other Unicode encoding isn't capable of encoding all other code points. While UCS-2 can hold them but not interpret them, both UTF-16 and UTF-32 can both map all Unicode code points, just like UTF-8.
This may be true, but it is entirely irrelevant from an operational perspective.
Again, true, but entirely irrelevant since UTF-16 and UTF-32 also map all Unicode code points.
Depending on circumstances this could very well be true, and you are correct to be concerned about such wasteful usage. However, as I mentioned in the question that lead to this one ( UTF-8 Support, SQL Server 2012 and the UTF8String UDT ), you have a few options to mitigate the amount of space wasted if most rows can fit into
VARCHAR
yet some need to beNVARCHAR
. The best option is to enable ROW COMPRESSION or PAGE COMPRESSION (Enterprise Editon only!). Starting in SQL Server 2008 R2, they allow non-MAXNVARCHAR
fields to use the "Standard Compression Scheme for Unicode" which is at least as good as UTF-8, and in some cases it is even better than UTF-8.NVARCHAR(MAX)
fields cannot use this fancy compression, but their IN ROW data can benefit from regular ROW and/or PAGE Compression. Please see the following for a description of this compression and a chart comparing data sizes for: raw UCS-2 / UTF-16, UTF-8, and UCS-2 / UTF-16 with data compression enabled.SQL Server 2008 R2 - UCS2 compression what is it - Impact on SAP systems
Please also see the MSDN page for Data Compression for more details as there are some restrictions (beyond it being available only in Enterprise Edition -- BUT made available to all editions starting with SQL Server 2016, SP1 !!) and some circumstances when compression might make things worse.
The veracity of that statement depends on how one defines "disk". If you are speaking in terms of commodity parts that you can purchase off the shelf at a store for use in your desktop / laptop, then sure. But, if speaking in terms of enterprise-level storage that will be used for your Production systems, then have fun explaining to whomever controls the budget that they shouldn't reject the million-plus-dollar SAN that you want because it is "cheap" ;-).
None that I can think of. Well, as long as you don't follow any horrible advice to do something like implementing that UDT, or converting all of the strings to
VARBINARY
, or usingNVARCHAR(MAX)
for all string fields ;-). But of all of the things you could worry about, SQL Server using UCS-2 / UTF-16 shouldn't be one of them.But, if for some reason this issue of no native support for UTF-8 is super important, then you might need to find another RDBMS to use that does allow for UTF-8.
UPDATE 2018-10-02
While this is not a viable option yet, SQL Server 2019 introduces native support for UTF-8 in
VARCHAR
/CHAR
datatypes. There are currently too many bugs with it for it to be used, but if they are fixed, then this is an option for some scenarios. Please see my post, "Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?", for a detailed analysis of this new feature.