We are creating multi-language subsites on our website.
I would like to use the 2-letter language codes. Spanish and French are easy. They will get URLs like:
mydomain.com/es
mydomain.com/fr
but I run into a problem with Traditional and Simplified Chinese. Are there standards for which 2-letter codes to use for these languages?
mydomain.com/zh
mydomain.com/?
There is indeed a standard representation for this. As people have run into the exact same problem you are seeing -- same language, but different dialects or characters -- they've extended the two-letter language code with a two-letter region code. So you might have a universal French page at
mydomain.com/fr
but internationalizing for French Canadian readers might leave you with
mydomain.com/fr_CA
(Canada) and
mydomain.com/fr_FR
(France). Some platforms use a dash instead of an underscore to separate the language and region codes (hence fr-CA and fr-FR).
The standard locale for simplified Chinese is zh_CN. The standard locale for traditional Chinese is zh_TW.
I hesitate to point you towards the actual BCP 47 standards documents, as they're, uh, a little heavy on the detail and a little light on the readability. Just go with standard locale identifiers, like the ones used by Java, and you'll be fine.
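As a rough illustration of the language-plus-region convention described above, here is a minimal Python sketch that splits an identifier such as fr_CA or fr-CA into its two parts and builds a URL path from it. The names parse_locale and path_for are hypothetical, and the sketch assumes only the simple two-part form (2-letter language, optional 2-letter region), not the full BCP 47 grammar.

```python
def parse_locale(identifier):
    """Split 'fr', 'fr_CA' or 'fr-CA' into (language, region).

    Hypothetical helper: accepts only a 2-letter language code with an
    optional 2-letter region code, separated by '_' or '-'.
    """
    parts = identifier.replace("-", "_").split("_")
    if len(parts) > 2:
        raise ValueError(f"unsupported identifier: {identifier!r}")
    language = parts[0].lower()
    region = parts[1].upper() if len(parts) == 2 else None
    if len(language) != 2 or not language.isalpha():
        raise ValueError(f"bad language code in {identifier!r}")
    if region is not None and (len(region) != 2 or not region.isalpha()):
        raise ValueError(f"bad region code in {identifier!r}")
    return language, region


def path_for(identifier):
    """Build a site path such as '/fr' or '/fr_CA' (underscore style assumed)."""
    language, region = parse_locale(identifier)
    return f"/{language}" if region is None else f"/{language}_{region}"


print(parse_locale("fr-CA"))  # ('fr', 'CA')
print(path_for("zh_cn"))      # /zh_CN
```

Normalizing the separator and the letter case in one place means the rest of the site can accept either the underscore or the dash style without caring which one a visitor typed.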
Language is dependent upon where it is spoken (doh!), so language and locale codes reflect that reality.
zh is the basic language code, but because there are two major written forms of it, there are zh_Hans and zh_Hant, but they are still only language codes, not locales.
Location-specific
To fully specify which language is used in a particular location, the country code still has to be suffixed, giving zh_Hans_HK and zh_Hant_HK for simplified and traditional Chinese, respectively, both as used in Hong Kong.
Actually, the reality is that something more specific than a country code is often required in many countries, but that would greatly increase the complexity and maintenance of databases like CLDR, and the supporting infrastructure to feed into it, such as IP-to-location lookup, is not generally available or accurate enough.
Fixed text
Now, if the code is just to specify which set of fixed strings to use in the user interface, or even which whole set of pages on a site, a country suffix is not really necessary, unless there are more than a few places where the language varies significantly enough (location-based info) to bother creating a whole separate resource set.
The larger the resource set, the more likely that a language code based upon locale [in this context, just a language attribute, rather than a true locale, so you can call it what you like!] will be required, but at least you only have to do that when necessary.
On-the-fly values
However, if you want to format particular variable values, like dates, times, currencies and numbers, on-the-fly, locales become important, because all the tools that support such functionality (like those based upon Unicode CLDR data) expect them. The locale for these needs to be a separate setting from the code that selects which in-house-generated UI language set to use, unless you want to create a resource set for every known locale, and maintain them ad nauseam!
Browser language tools
Note that when you specify a language for a web page that has editable fields, such as input boxes, and spellcheck has been enabled for a field (for example via its spellcheck attribute), the browser's language tools will spellcheck that field according to the specified language.
Criteria
You have to be clear about what the resource set is providing, so consider:
Spreadsheet to minimise maintenance overhead
I use a spreadsheet to hold UI strings where each language code has a parent code, so that the cell for its version of a string has a formula that gets its string from the parent. To create a custom string for that language and string, I just overwrite the cell formula with the exact text. That minimises the amount of resource maintenance. I run a macro at the end that generates a complete resource file for each language.
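The same parent-lookup idea can be sketched outside a spreadsheet. The Python below is only an illustration under stated assumptions: the PARENTS and STRINGS tables, the string keys, and the function names resolve and build_resources are all hypothetical, standing in for the spreadsheet cells and the macro described above.

```python
# Hypothetical data: each language code names its parent (None for the root).
PARENTS = {"en": None, "fr": "en", "fr_CA": "fr"}
ROOT = "en"

# Only strings that differ from the parent are stored, like a spreadsheet
# cell whose formula has been overwritten with exact text.
STRINGS = {
    "en": {"save": "Save", "email": "Email"},
    "fr": {"save": "Enregistrer", "email": "E-mail"},
    "fr_CA": {"email": "Courriel"},
}


def resolve(code, key):
    """Walk up the parent chain until some language defines the string."""
    while code is not None:
        value = STRINGS.get(code, {}).get(key)
        if value is not None:
            return value
        code = PARENTS.get(code)
    raise KeyError(key)


def build_resources(code):
    """Produce the complete resource set for one language code."""
    return {key: resolve(code, key) for key in STRINGS[ROOT]}


print(build_resources("fr_CA"))  # {'save': 'Enregistrer', 'email': 'Courriel'}
```

Here fr_CA stores only the one string it customises and inherits the rest from fr, which is what keeps the maintenance overhead low.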
@dkarp gives an excellent general answer. I will add some additional specifics regarding Chinese:
There are several countries where Chinese is the main written language. The major difference between them is whether they use simplified or traditional characters, but there are also minor regional differences (in vocabulary, etc.). The standard way to distinguish these would be with a country code, e.g. zh_CN for mainland China, zh_SG for Singapore, zh_TW for Taiwan, or zh_HK for Hong Kong.
Mainland China and Singapore both use simplified characters, and the others use traditional characters. Since China and Taiwan are the two with the biggest populations, just zh_CN and zh_TW are often used to distinguish the simplified and traditional character versions of a website.
More correct, however, would be to use zh_Hans for (generic) simplified Chinese characters, and zh_Hant for traditional Chinese characters, except for rare cases when it is meaningful to distinguish different countries.
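If a site keeps only the two generic character sets, incoming country-based codes can be folded onto them. The following Python sketch does that for the four regions named above. The table SCRIPT_BY_REGION and the function chinese_script_tag are hypothetical names, and the mapping covers only the regions mentioned in this answer.

```python
# Region-to-script mapping taken from the regions discussed above.
SCRIPT_BY_REGION = {"CN": "Hans", "SG": "Hans", "TW": "Hant", "HK": "Hant"}


def chinese_script_tag(identifier):
    """Map a Chinese identifier to a generic script-based code.

    Hypothetical helper: an explicit script part wins; otherwise the
    script is inferred from the region. Raises ValueError if neither works.
    """
    parts = identifier.replace("-", "_").split("_")
    if parts[0].lower() != "zh":
        raise ValueError(f"not a Chinese identifier: {identifier!r}")
    # An explicit script (in any letter case) takes priority over the region.
    for part in parts[1:]:
        if part.title() in ("Hans", "Hant"):
            return f"zh_{part.title()}"
    for part in parts[1:]:
        script = SCRIPT_BY_REGION.get(part.upper())
        if script is not None:
            return f"zh_{script}"
    raise ValueError(f"cannot infer script for {identifier!r}")


print(chinese_script_tag("zh_CN"))       # zh_Hans
print(chinese_script_tag("zh-TW"))       # zh_Hant
print(chinese_script_tag("zh_Hans_HK"))  # zh_Hans
```

A bare zh is rejected rather than guessed, since the code alone does not say which character set the reader expects.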