Is there a rule for when we must use the Unicode types?
I have seen that most of the European languages (German, Italian, English, ...) are fine in the same database in VARCHAR columns.
I am looking for something like:
- If you have Chinese --> use NVARCHAR
- If you have German and Arabic --> use NVARCHAR
What about the collation of the server/database?
I don't want to always use NVARCHAR, as suggested here:
What are the main performance differences between varchar and nvarchar SQL Server data types?
The real reasons to use NVARCHAR are: you have different languages in the same column, you need to address the columns in T-SQL without decoding, you want to be able to see the data "natively" in SSMS, or you want to standardize on Unicode.
If you treat the database as dumb storage, it is perfectly possible to store wide strings and different (even variable-length) encodings in VARCHAR (for instance UTF-8). The problem comes when you attempt to encode and decode, especially if the code page differs from row to row. It also means that SQL Server will not be able to deal with the data easily when querying (potentially variably) encoded columns in T-SQL.
Using NVARCHAR avoids all this.
I would recommend NVARCHAR for any column which will have user-entered data in it which is relatively unconstrained.
I would recommend VARCHAR for any column which is a natural key (like a vehicle license plate, SSN, serial number, service tag, order number, airport callsign, etc.) which is typically defined and constrained by a standard, legislation or convention. Also use VARCHAR for data that is user-entered but very constrained (like a phone number), or for a code (ACTIVE/CLOSED, Y/N, M/F, M/S/D/W, etc.). There is absolutely no reason to use NVARCHAR for those.
So for a simple rule:
- VARCHAR when guaranteed to be constrained
- NVARCHAR otherwise
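To make the rule concrete, here is a minimal sketch of a table that follows it. The table and column names are made up for illustration and the lengths are arbitrary assumptions, not a recommendation.

```sql
-- Hypothetical table: every name and length here is illustrative only.
CREATE TABLE dbo.Customer (
    CustomerId   INT IDENTITY(1,1) PRIMARY KEY,
    -- Free-form, user-entered text: any language may show up.
    FullName     NVARCHAR(200) NOT NULL,
    Notes        NVARCHAR(MAX) NULL,
    -- Constrained by a standard or convention: a single code page is enough.
    LicensePlate VARCHAR(10)   NULL,
    Phone        VARCHAR(20)   NULL,
    StatusCode   VARCHAR(6)    NOT NULL
        CONSTRAINT CK_Customer_StatusCode CHECK (StatusCode IN ('ACTIVE', 'CLOSED'))
);
```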
You should use NVARCHAR anytime you have to store multiple languages. I believe you have to use it for the Asian languages but don't quote me on it.
Here's the problem: if you take Russian, for example, and store it in a VARCHAR, you will be fine so long as you define the correct code page. But let's say you're using a default English SQL Server install; then the Russian characters will not be handled correctly. If you were using NVARCHAR they would be handled properly.
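A quick way to see this for yourself is the sketch below. It assumes the database default collation is a Latin1 one (for example SQL_Latin1_General_CP1_CI_AS, code page 1252); with a Cyrillic default collation both values would come back intact.

```sql
-- Assumption: the database default collation uses code page 1252.
DECLARE @v VARCHAR(20)  = N'Привет';  -- converted to code page 1252 on assignment
DECLARE @n NVARCHAR(20) = N'Привет';  -- kept as Unicode

SELECT @v AS varchar_value,   -- expected under that assumption: ??????
       @n AS nvarchar_value;  -- expected: Привет
```

Note the N prefix on the literals. Without it the literal itself is treated as non-Unicode and is converted to the code page before it ever reaches the NVARCHAR variable.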
Edit
OK, let me quote MSDN. Maybe I was too specific, but you don't want to store more than one code page in a VARCHAR column: while you can, you shouldn't.
When you deal with text data that is
stored in the char, varchar,
varchar(max), or text data type, the
most important limitation to consider
is that only information from a single
code page can be validated by the
system. (You can store data from
multiple code pages, but this is not
recommended.) The exact code page used
to validate and store the data depends
on the collation of the column. If a
column-level collation has not been
defined, the collation of the database
is used. To determine the code page
that is used for a given column, you
can use the COLLATIONPROPERTY
function, as shown in the following
code examples:
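The code examples from the article are not reproduced in this post. As a stand-in (my own sketch, not the MSDN code), a query along these lines shows the code page behind a few collations. The collation names are ones that ship with SQL Server as far as I know; adjust the list to whatever your instance has.

```sql
-- Look up the non-Unicode code page of a few collations.
SELECT name,
       COLLATIONPROPERTY(name, 'CodePage') AS code_page
FROM sys.fn_helpcollations()
WHERE name IN (N'SQL_Latin1_General_CP1_CI_AS',
               N'Cyrillic_General_CI_AS',
               N'Arabic_CI_AS',
               N'Japanese_CI_AS',
               N'Indic_General_90_CI_AS');
```

The Latin1, Cyrillic, Arabic and Japanese collations should report 1252, 1251, 1256 and 932 respectively. A Unicode-only collation such as the Indic one should report 0, meaning there is no code page for VARCHAR to use.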
Here's some more:
This example illustrates the fact that
many locales, such as Georgian and
Hindi, do not have code pages, as they
are Unicode-only collations. Those
collations are not appropriate for
columns that use the char, varchar, or
text data type
So Georgian or Hindi really need to be stored as nvarchar. Arabic is also a problem:
Another problem you might encounter is
the inability to store data when not
all of the characters you wish to
support are contained in the code
page. In many cases, Windows considers
a particular code page to be a "best
fit" code page, which means there is
no guarantee that you can rely on the
code page to handle all text; it is
merely the best one available. An
example of this is the Arabic script:
it supports a wide array of languages,
including Baluchi, Berber, Farsi,
Kashmiri, Kazakh, Kirghiz, Pashto,
Sindhi, Uighur, Urdu, and more. All of
these languages have additional
characters beyond those in the Arabic
language as defined in Windows code
page 1256. If you attempt to store
these extra characters in a
non-Unicode column that has the Arabic
collation, the characters are
converted into question marks.
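A small sketch of that last point, assuming (as far as I can tell) that the Pashto letter U+069A is not part of code page 1256. The table variable and column names are hypothetical.

```sql
-- Same character stored in a non-Unicode and a Unicode column,
-- both using an Arabic collation.
DECLARE @t TABLE (
    txt_varchar  VARCHAR(10)  COLLATE Arabic_CI_AS,
    txt_nvarchar NVARCHAR(10) COLLATE Arabic_CI_AS
);

-- NCHAR(1690) is U+069A, a letter used in Pashto.
INSERT INTO @t (txt_varchar, txt_nvarchar)
VALUES (NCHAR(1690), NCHAR(1690));

-- Per the passage above, txt_varchar is expected to come back as '?',
-- while txt_nvarchar keeps the original character.
SELECT txt_varchar, txt_nvarchar FROM @t;
```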
Something to keep in mind when you are using Unicode: although you can store different languages in a single column, you can only sort using a single collation. There are some languages that use Latin characters but do not sort like other Latin-script languages. Accents are a good example of this. I can't remember the exact example, but there was an Eastern European language whose Y didn't sort like the English Y. Then there is the Spanish ch, which Spanish users expect to be sorted after h.
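The collation can at least be chosen per query with a COLLATE clause in ORDER BY. The sketch below uses a throwaway table variable with made-up words and compares two Spanish collations; as far as I know, the traditional Spanish convention treats ch as its own letter placed after c.

```sql
DECLARE @words TABLE (word NVARCHAR(20));
INSERT INTO @words (word) VALUES (N'cosa'), (N'chico'), (N'dama');

-- Modern Spanish: "ch" is just c followed by h.
-- Expected order: chico, cosa, dama
SELECT word FROM @words ORDER BY word COLLATE Modern_Spanish_CI_AS;

-- Traditional Spanish: "ch" sorts as a separate letter after "c".
-- Expected order: cosa, chico, dama
SELECT word FROM @words ORDER BY word COLLATE Traditional_Spanish_CI_AS;
```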
All in all, given all the issues you have to deal with in internationalization, it is my opinion that it is easier to just use Unicode characters from the start, avoid the extra conversions and take the space hit. Hence my statement earlier.
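For a sense of the space hit, DATALENGTH returns the number of bytes used. A minimal sketch with plain ASCII text:

```sql
DECLARE @v VARCHAR(50)  = 'hello';
DECLARE @n NVARCHAR(50) = N'hello';

SELECT DATALENGTH(@v) AS varchar_bytes,   -- 5: one byte per character
       DATALENGTH(@n) AS nvarchar_bytes;  -- 10: two bytes per character
```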
Greek would need Unicode (which SQL Server stores as UTF-16, not UTF-8) in N column types: αβγ ;)
Josh says:
"....Something to keep in mind when you are using Unicode: although you can store different languages in a single column, you can only sort using a single collation. There are some languages that use Latin characters but do not sort like other Latin-script languages. Accents are a good example of this. I can't remember the exact example, but there was an Eastern European language whose Y didn't sort like the English Y. Then there is the Spanish ch, which Spanish users expect to be sorted after h."
I'm a native Spanish speaker, and "ch" is not a letter but two letters, "c" and "h". The Spanish alphabet is:
abcdefghijklmn ñ opqrstuvwxyz
We expect "i" after "h", not "ch".
The alphabet is the same as in English except for the ñ, or in HTML "&ntilde;".
Alex
TL;DR;
Unicode - (nchar, nvarchar, and ntext)
Non-unicode - (char, varchar, and text).
From MSDN
Collations in SQL Server provide sorting rules, case, and accent
sensitivity properties for your data. Collations that are used with
character data types such as char and varchar dictate the code page
and corresponding characters that can be represented for that data
type.
Assuming you are using the default SQL Server collation SQL_Latin1_General_CP1_CI_AS, the following script should print all the symbols that fit in VARCHAR, since with this collation it uses one byte per character (256 in total). If you don't see a character in the printed list, you need NVARCHAR.
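-- Print every single-byte value under two collations to compare their code pages.
DECLARE @i INT = 0;
WHILE (@i < 256)
BEGIN
    -- Western European code page
    PRINT CAST(@i AS VARCHAR(3)) + ' ' + CHAR(@i) COLLATE SQL_Latin1_General_CP1_CI_AS;
    -- Japanese code page: characters it lacks get replaced
    PRINT CAST(@i AS VARCHAR(3)) + ' ' + CHAR(@i) COLLATE Japanese_90_CI_AS;
    SET @i = @i + 1;
END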
If you change the collation to, say, Japanese, you will notice that the accented European letters turn into plain ones and some symbols turn into ? marks.
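Instead of eyeballing the printed list, you could also test a specific string with a round trip through VARCHAR. This is only a sketch: the sample text is arbitrary, and the cast uses the database default collation, so the verdict depends on your setup.

```sql
DECLARE @s NVARCHAR(100) = N'Ελληνικά';

-- Convert to VARCHAR and back, then compare byte-for-byte (binary collation)
-- so that best-fit replacements such as E for È still count as a loss.
SELECT CASE
         WHEN CAST(CAST(@s AS VARCHAR(100)) AS NVARCHAR(100))
              = @s COLLATE Latin1_General_BIN2
         THEN 'fits in VARCHAR'
         ELSE 'needs NVARCHAR'
       END AS verdict;
```

Under a Latin1 default collation the Greek sample should come out as 'needs NVARCHAR'.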
Unicode is a standard for mapping code points to characters. Because
it is designed to cover all the characters of all the languages of the
world, there is no need for different code pages to handle different
sets of characters. If you store character data that reflects multiple
languages, always use Unicode data types (nchar, nvarchar, and ntext)
instead of the non-Unicode data types (char, varchar, and text).
Otherwise your sorting will go weird.