Researching the pros and cons of SQL Server's VARCHAR vs. NVARCHAR for my particular application, I came to the realization that it would be ideal if SQL Server supported UTF-8 natively. Several SO posts indicate that it does not, e.g.:
Is VARCHAR like totally 1990s?
What are the main performance differences between varchar and nvarchar SQL Server data types?
However, I then came across this article in the MSDN documentation for SQL Server 2012 that shows how to create a UTF8String user-defined data type:
http://msdn.microsoft.com/en-us/library/ff877964(v=sql.110).aspx
It seems that this UDT would allow for the space (memory, disk) benefits of 8-bits-per-character while being flexible enough to store any string that can be represented in UTF-8. Is that correct? Are there downsides to this strategy (e.g. the performance cost of executing managed code for each row, ...)?
Creating a custom User-Defined Type via SQLCLR is not, in any way, going to get you a replacement for any native type. It is very handy for creating something to handle specialized data. But strings, even of a different encoding, are far from specialized. Going this route for your string data would destroy the usability of your system, not to mention performance, as you wouldn't be able to use any built-in string functions.
If you were able to save anything on disk space, those gains would be erased by what you would lose in overall performance. Storing a UDT is done by serializing it to VARBINARY. So in order to do any string comparison or sorting, outside of a "binary" / "ordinal" comparison, you would have to convert all other values, one by one, back to UTF-8 to then do the string compare that can account for linguistic differences. And that conversion would need to be done within the UDT. This means that, like the XML datatype, you would create the UDT to hold a particular value, and then expose a method of that UDT to accept a string parameter to do the comparison (i.e. Utf8String.Compare(alias.field1) or, if defining an operator for the type, then Utf8String1 = Utf8String2, and have the = operator get the string in the UTF-8 encoding and then do the CompareInfo.Compare()).

In addition to the above considerations, you also need to consider that passing values back and forth through the SQLCLR API has a cost, especially when using either NVARCHAR(MAX) or VARBINARY(MAX) as opposed to NVARCHAR(1 - 4000) and VARBINARY(1 - 4000), respectively (please do not confuse this distinction as implying anything about using SqlChars / SqlBytes vs SqlString / SqlBinary).

Finally (at least in terms of using a UDT), please do not look past the fact that the UDT being inquired about is sample code. The only testing noted is purely functional, nothing around scalability or "lessons learned after working with this for a year". The functional test code is shown at the following CodePlex page and should be looked at before proceeding with this decision, as it gives a sense of how you would need to write your queries in order to interact with it (which is fine for a field or two, but not for most / all string fields):
http://msftengprodsamples.codeplex.com/SourceControl/latest#Kilimanjaro_Trunk/Programmability/CLR/UTF8String/Scripts/Test.sql
Given the number of persisted computed columns and indexes added, was any space really saved? ;-)
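To make the usability point concrete, here is a sketch of what querying would start to look like. The table and column names are hypothetical; the Compare method follows the pattern described above, and ToString() is the standard way to render a CLR UDT as text:

```sql
-- Native NVARCHAR column: normal operators, built-in functions, sargable predicates
SELECT [Name]
FROM   dbo.Customers
WHERE  [Name] = N'López';

-- Hypothetical UTF8String UDT column: every comparison funnels through a
-- method call, which cannot use an index seek and forfeits LIKE, SUBSTRING(), etc.
SELECT [Name].ToString() AS [Name]
FROM   dbo.CustomersUtf8
WHERE  [Name].Compare(N'López') = 0;
```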
Where space (disk, memory, etc.) is the concern, you have three options:

1. If you are using SQL Server 2008 or newer, and are on Enterprise Edition, then you can enable Data Compression. Data Compression can (but won't "always") compress Unicode data in NCHAR and NVARCHAR fields. The determining factors are:

   - NCHAR(1 - 4000) and NVARCHAR(1 - 4000) use the Standard Compression Scheme for Unicode, but only starting in SQL Server 2008 R2, AND only for IN ROW data, not OVERFLOW! This appears to be better than the regular ROW / PAGE compression algorithm.
   - NVARCHAR(MAX) and XML (and I guess also VARBINARY(MAX), TEXT, and NTEXT) data that is IN ROW (not off row in LOB or OVERFLOW pages) can at least be PAGE compressed, and maybe also ROW compressed (not sure about this last one).

   A sketch of enabling compression is shown after this list.

2. If using a version older than 2008, or not on Enterprise Edition, you can have two fields: one VARCHAR and one NVARCHAR. For example, let's say you are storing URLs which are mostly all base ASCII characters (values 0 - 127) and hence fit into VARCHAR, but sometimes have Unicode characters. Your schema can include the following 3 fields (see the first schema sketch after this list). In this model you only SELECT from the [URL] computed column. For inserting and updating, you determine which field to use by seeing if converting alters the incoming value, which has to be of NVARCHAR type (see the second sketch after this list).

3. If you have fields that should only ever have characters that fit into a particular Code Page of an Extended ASCII character set, then just use VARCHAR.
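For option 1, Data Compression is enabled per table or index as a rebuild option. A minimal sketch, assuming a hypothetical dbo.Widgets table and index name:

```sql
-- ROW compression on the base table (heap or clustered index)
ALTER TABLE dbo.Widgets REBUILD WITH (DATA_COMPRESSION = ROW);

-- PAGE compression on a specific nonclustered index
ALTER INDEX IX_Widgets_Name ON dbo.Widgets REBUILD WITH (DATA_COMPRESSION = PAGE);
```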
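For option 2, the schema could look something like the following sketch (table name, column names, and the 2048 length are illustrative, not prescribed):

```sql
CREATE TABLE dbo.TableName
(
    [ID]   INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    [URLa] VARCHAR(2048) NULL,   -- URLs that survive the conversion to VARCHAR
    [URLu] NVARCHAR(2048) NULL,  -- URLs that actually need Unicode
    [URL]  AS (ISNULL(CONVERT(NVARCHAR(2048), [URLa]), [URLu])),
    -- exactly one of the two storage columns should be populated
    CONSTRAINT [CK_TableName_OneUrlMax]
        CHECK ( ([URLa] IS NOT NULL OR [URLu] IS NOT NULL)
            AND ([URLa] IS NULL     OR [URLu] IS NULL) )
);
```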
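And the insert / update logic, again as a sketch: convert the incoming NVARCHAR value to VARCHAR, and if the round-trip alters nothing, store the VARCHAR copy, else the NVARCHAR original (CASE is used here so the pattern also works on pre-2012 versions):

```sql
DECLARE @URL NVARCHAR(2048);
SET @URL = N'http://example.com/some/path';

INSERT INTO dbo.TableName ([URLa], [URLu])
VALUES (
    CASE WHEN CONVERT(VARCHAR(2048), @URL) = @URL
         THEN CONVERT(VARCHAR(2048), @URL) END,  -- ELSE NULL is implicit
    CASE WHEN CONVERT(VARCHAR(2048), @URL) = @URL
         THEN NULL ELSE @URL END
);
```

Doing the comparison under a binary collation (e.g. adding COLLATE Latin1_General_BIN2) would make the "did conversion alter anything" test stricter, since best-fit mappings could otherwise slip past an insensitive collation.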
P.S. Just to have this stated for clarity: the new _SC Collations that were introduced in SQL Server 2012 simply allow for: sorting and comparing Supplementary Characters using linguistic rules, and for the built-in functions (LEN, SUBSTRING, CHARINDEX, etc.) to treat a Surrogate Pair as a single character.

But, even without the new _SC Collations, you can still store any Unicode character into an XML or N-prefixed type, and retrieve it with no data loss. However, when using the older Collations (i.e. no version number in the name), all Supplementary Characters equate to each other. You need to use the _90 and _100 Collations, which at least get you binary / code point comparisons and sorting; they can't take into account linguistic rules since they have no particular mappings of the Supplementary Characters (and hence have no weights or normalization rules). To see the difference, try the following sketch (the two Supplementary Characters and the collation names here are illustrative choices):
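```sql
-- Two different Supplementary Characters, built as Surrogate Pairs
-- (U+24B62 and U+24B63; any two distinct Supplementary Characters will do)
DECLARE @A NVARCHAR(10) = NCHAR(55378) + NCHAR(57186), -- U+24B62 = 0xD852 0xDF62
        @B NVARCHAR(10) = NCHAR(55378) + NCHAR(57187); -- U+24B63 = 0xD852 0xDF63

-- Older collation (no version number): Supplementary Characters have no
-- weights, so these two different characters compare as "equal"
IF @A = @B COLLATE Latin1_General_CI_AS
    SELECT N'Equal under Latin1_General_CI_AS' AS [Result];

-- _100 collation: code point comparison tells them apart
IF @A <> @B COLLATE Latin1_General_100_CI_AS
    SELECT N'Distinct under Latin1_General_100_CI_AS' AS [Result];
```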