Is there a rule when we must use the Unicode types?
I have seen that most of the European languages (German, Italian, English, ...) are fine in the same database in VARCHAR columns.
I am looking for something like:
- If you have Chinese --> use NVARCHAR
- If you have German and Arabic --> use NVARCHAR
What about the collation of the server/database?
I don't want to use always NVARCHAR like suggested here What are the main performance differences between varchar and nvarchar SQL Server data types?
The real reason you want to use NVARCHAR is when you have different languages in the same column, you need to address the columns in T-SQL without decoding, you want to be able to see the data "natively" in SSMS, or you want to standardize on Unicode.
If you treat the database as dumb storage, it is perfectly possible to store wide strings and different (even variable-length) encodings in VARCHAR (for instance UTF-8). The problem comes when you are attempting to encode and decode, especially if the code page is different for different rows. It also means that the SQL Server will not be able to deal with the data easily for purposes of querying within T-SQL on (potentially variably) encoded columns.
Using NVARCHAR avoids all this.
I would recommend NVARCHAR for any column which will have user-entered data in it which is relatively unconstrained.
I would recommend VARCHAR for any column which is a natural key (like a vehicle license plate, SSN, serial number, service tag, order number, airport callsign, etc) which is typically defined and constrained by a standard or legislation or convention. Also VARCHAR for user-entered, and very constrained (like a phone number) or a code (ACTIVE/CLOSED, Y/N, M/F, M/S/D/W, etc). There is absolutely no reason to use NVARCHAR for those.
So for a simple rule:
VARCHAR when guaranteed to be constrained NVARCHAR otherwise
Josh says: "....Something to keep in mind when you are using Unicode although you can store different languages in a single column you can only sort using a single collation. There are some languages that use latin characters but do not sort like other latin languages. Accents is a good example of this, I can't remeber the example but there was a eastern european language whose Y didn't sort like the English Y. Then there is the spanish ch which spanish users expet to be sorted after h."
I'm a native Spanish Speaker and "ch" is not a letter but two "c" and "h" and the Spanish alphabet is like: abcdefghijklmn ñ opqrstuvwxyz We don't expect "ch" after "h" but "i" The alphabet is the same as in English except for the ñ or in HTML "ñ ;"
Alex
You should use NVARCHAR anytime you have to store multiple languages. I believe you have to use it for the Asian languages but don't quote me on it.
Here's the problem if you take Russian for example and store it in a varchar, you will be fine so long as you define the correct code page. But let's say your using a default english sql install, then the russian characters will not be handled correctly. If you were using NVARCHAR() they would be handled properly.
Edit
Ok let me quote MSDN and maybee I was to specific but you don't want to store more then one code page in a varcar column, while you can you shouldn't
Here's some more:
So Georgian or Hindi really need to be stored as nvarchar. Arabic is also a problem:
Something to keep in mind when you are using Unicode although you can store different languages in a single column you can only sort using a single collation. There are some languages that use latin characters but do not sort like other latin languages. Accents is a good example of this, I can't remeber the example but there was a eastern european language whose Y didn't sort like the English Y. Then there is the spanish ch which spanish users expet to be sorted after h.
All in all with all the issues you have to deal with when dealing with internalitionalization. It is my opinion that is easier to just use Unicode characters from the start, avoid the extra conversions and take the space hit. Hence my statement earlier.
TL;DR;
Unicode - (nchar, nvarchar, and ntext)
Non-unicode - (char, varchar, and text).
From MSDN
Assuming you are using default SQL collation
SQL_Latin1_General_CP1_CI_AS
then following script should print out all the symbols that you can fit inVARCHAR
since it uses one byte to store one character (256 total) if you don't see it on the list printed - you needNVARCHAR
.If you change collation to lets say japanese you will notice that all the weird European letters turned into normal and some symbols into
?
marks.Otherwise your sorting will go weird.
Greek would need UTF-8 on N column types: αβγ ;)