- VARCHAR does not store Unicode characters.
- NVARCHAR does store Unicode characters.
- Today's applications should always be Unicode compatible.
- NVARCHAR takes twice the amount of space to store it.
- Point 4 doesn't matter because storage space is extremely inexpensive.
Ergo: When designing SQL Server databases today, one should always use NVARCHAR.
Is this sound reasoning? Does anyone disagree with any of the premises? Are there any reasons to choose VARCHAR over NVARCHAR today?
You should choose VARCHAR over NVARCHAR for many different types of columns, and the choice would be on a per-column basis.
Typical columns which would not require the extra overhead NVARCHAR incurs would be:
ID-type columns: License plates, SSNs, Patient Chart identifiers etc.
Code columns: International currency codes (USD, UKP, etc.), ISO country codes (US, UK, etc), Language codes (en-us, etc), accounting segment codes, etc
Postal code and zip code columns.
These sorts of questions always have the same answer: it depends. There is no magical rule that you should follow blindly. Even the use of GOTO in modern programming languages can be justified: Is it ever advantageous to use 'goto' in a language that supports loops and functions? If so, why?
So the answer is: use your head and think about the particular situation. In this particular instance keep in mind that you can always convert from varchar to nvarchar in the database if it turns out your requirements change.
I've seen some database where the indices (indexes?...different debate) have been larger than the data. If one can get away with half the storage demands (varchar) within the index then one assumes that equates to twice the hit density of a given page and more efficient fill-factoring leading to faster data retrieval/writing/locking & less storage requirements (already mentioned).
As others have pointed out, it's not just the cost of the storage.
The length of a column will affect the number of rows per page. Having fewer rows per page means that fewer can fit into your caches, which drops performance. I am assuming that in MSSQL, a NVARCHAR column which is indexed will use up more space in the index. Which means fewer index entries per block, therefore more blocks in the index, therefore more seeks when scanning (or searching) indexes, which slows down indexed access too.
So it loses you performance on every single front. If you genuinely don't care (or can measure the performance and are happy with it, of course), then that's fine. But if you have a genuine requirement to store unicode characters, of course, use NVARCHAR.
I may be that the maintainability gained by using NVARCHAR throughout your database outweighs any performance cost.
I'm no expert on the subject. But any reason why you couldn't use UTF-8 to get a combination of small space and unicode?
I believe that comparison of nvarchars is more costly than varchars so it's perfectly valid and even preferred in places where you really don't need unicode capabilities, i.e., for some internal IDs.
And storage cost still does matter. If you have billions of rows then those "small" differences get big pretty fast.