I've read Joel's article on Unicode and I feel that I have at least a basic grasp of internationalization from a character set perspective. In addition to reading this question, I've also done some of my own research on internationalization with regard to design considerations, but I can't help suspecting that there is a lot more out there that I just don't know or don't know to ask.
Some of the things I've learned:
- Some languages read right-to-left instead of left-to-right.
- Calendar, dates, times, currency, and numbers are displayed differently from language to language.
- Design should be flexible enough to accommodate a lot more text because some languages are far more verbose than others.
- Don't take icons or colors for granted when it comes to their semantic meaning as this can vary from culture to culture.
- Geographical nomenclature varies from language to language.
Where I'm at:
- My design is flexible enough to accommodate a lot more text.
- I automatically translate each string, including error messages and help dialogs.
- I haven't come to a point yet where I've needed to display units of time, currency or numbers, but I'll be there shortly and will need to develop a solution.
- I'm using the UTF-8 character set across the board.
- My menus and various lists in the application are sorted alphabetically for each language for easier reading.
- I have a tag parser that extracts tags by filtering out stop words. The stop words list is language specific and can be swapped out.
What I'd like to know more about:
- I'm developing a downloadable PHP web application, so any advice specific to PHP would be greatly appreciated. I've developed my own framework and am not interested in using other frameworks at this time.
- I know very little about non-Western languages. Are there specific considerations that need to be taken into account that I haven't mentioned above? Also, how do PHP's array sorting functions handle non-Western characters?
- Are there any specific gotchas that you've experienced in practice? I'm looking in terms of both the GUI and the application code itself.
- Any specific advice for working with date and time displays? Is there a breakdown according to region or language?
- I've seen a lot of projects and sites let their communities provide translation for their applications and content. Do you recommend this and what are some good strategies for ensuring that you have a good translation?
- This question is basically the extent of what I know about internationalization. What don't I know that I don't know that I should look into further?
Edit: I added the bounty because I would like to have more real-world examples from experience.
I don't have a whole lot to add to the great answers so far, but here are a few things to consider and to check.
Okay, I had more to say than I thought...
When we worked on the i18n/l10n issues of Dreamfall and Age of Conan, we came across a few issues that are worth keeping in mind. Some of these we solved, some were solved for us, and some we worked around. Some we never solved...
Make sure all your tools and all your code support all the charsets you want to use, and double check that assumption twice during the course of the project and a couple more times to be sure.
Make sure you use a font that supports all the languages you want to use. Most fonts that claim to be Unicode are only Unicode in the sense that the characters they do have are at the correct codepoints. It does not mean they have usable glyphs for all codepoints.
Text wrapping is not only done at spaces, as some languages don't use spaces to separate words (Chinese comes to mind). Make sure your text-wrapping routines handle text without any spaces at all.
Handling plurals correctly is tricky in the easy cases, and damned hard in the hard cases. Make sure you know enough about the languages you'll be using to be able to write code that handles plurals correctly. Keep in mind that English (like the other "Western" languages) is among the easy ones.
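To make the plural point concrete, here is a minimal sketch of per-language plural selection. The function names `pluralIndex` and `pluralize` and the message strings are made up for illustration; the rules are the commonly documented gettext-style ones for these four languages, and a real application would need rules for every language it ships.

```php
<?php
// Hypothetical helper: returns the index of the plural form to use for $n.
// Only four languages are covered; anything else falls back to the English rule.
function pluralIndex(string $lang, int $n): int
{
    switch ($lang) {
        case 'ja': // no grammatical plural: always one form
            return 0;
        case 'fr': // 0 and 1 both take the singular form
            return $n > 1 ? 1 : 0;
        case 'ru': // three forms, chosen by the last one or two digits
            if ($n % 10 === 1 && $n % 100 !== 11) {
                return 0;
            }
            if ($n % 10 >= 2 && $n % 10 <= 4 && ($n % 100 < 10 || $n % 100 >= 20)) {
                return 1;
            }
            return 2;
        default:   // English-style rule: singular only for exactly 1
            return $n === 1 ? 0 : 1;
    }
}

// Hypothetical helper: $forms holds one sprintf template per plural form.
function pluralize(string $lang, int $n, array $forms): string
{
    return sprintf($forms[pluralIndex($lang, $n)], $n);
}

echo pluralize('en', 1, ['%d file', '%d files']), "\n";
echo pluralize('en', 3, ['%d file', '%d files']), "\n";
echo pluralize('ru', 22, ['%d файл', '%d файла', '%d файлов']), "\n";
```

The number of forms differs per language, so translators need to be able to supply as many templates as their language requires rather than a fixed singular/plural pair.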
Never split a sentence into fragments and concatenate them around a variable, as the variable might belong elsewhere in the sentence in a different language. Use placeholders.
Keep in mind that for some languages, the value of the placeholder might change how the rest of the sentence is written. Grammar is hard. Make sure you have a plan for dealing with it. (Specifically, make sure you have a way to classify the values you use in the placeholders according to gender, tense, etc.)
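As a rough illustration of the placeholder advice, the sketch below uses `sprintf`-style positional arguments (`%1$s`, `%2$s`) so that each translation can put the values wherever its grammar needs them. The catalogue layout, the message id `file_deleted`, and the helper `t()` are assumptions made for this example, not part of any particular library.

```php
<?php
// Hypothetical message catalogue: one full sentence per language, never fragments.
$messages = [
    'en' => ['file_deleted' => '%1$s deleted %2$s.'],
    'de' => ['file_deleted' => '%2$s wurde von %1$s gelöscht.'],
];

// Hypothetical lookup helper: falls back to English if a translation is missing.
function t(array $messages, string $lang, string $id, ...$args): string
{
    $template = $messages[$lang][$id] ?? $messages['en'][$id];
    // vsprintf fills positional placeholders from the argument list.
    return vsprintf($template, $args);
}

echo t($messages, 'en', 'file_deleted', 'Anna', 'report.txt'), "\n";
echo t($messages, 'de', 'file_deleted', 'Anna', 'report.txt'), "\n";
```

The calling code passes the same arguments in the same order for every language; only the template decides where they appear. Values that affect grammar (gender, for instance) could be handled by keying additional templates on a classification of the value, though the details depend on the languages involved.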
One thing I've learned the hard way: if you have several files that need to be translated, include an extra tag in the name, so that later you can search your whole folder for that tag.
E.g. instead of naming a file 'sample-database.txt', name the English version 'sample-database-loc-en.txt' and the Italian version 'sample-database-loc-it.txt'.
Lists should be sorted; menus shouldn't. Keep in mind that a given user might want to use your application in more than one language, and they should still find everything in the same place.
The same goes for shortcuts, if you have any: do not translate them.
Also, remember that internationalization and translation are two very different things; manage them separately.
PHP represents strings internally as byte streams, and assumes ISO-8859-1 in the cases where the encoding matters. For the most part, you can just use UTF-8 all over the place, and you'll be fine. One gotcha, if your site takes input from its users, is that you can never be 100% sure that they are submitting content in the proper encoding. You might want to use `mb_detect_encoding` (or `mb_check_encoding`, which tests against one specific encoding) to verify input, or use a hidden field with "exotic" characters to verify against.
Be aware that all string-related functions in PHP that work on a per-character basis assume that one character equals one byte. That means that you generally can't trust the plain string functions with multibyte text. Have a look at this page for more details.
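A short sketch of the byte-versus-character issue, assuming the mbstring extension is loaded. The helper name `isValidUtf8` is invented for this example, and it uses `mb_check_encoding` rather than `mb_detect_encoding`, since checking against one known encoding is a simpler question than guessing among several.

```php
<?php
// "naïve" written with explicit UTF-8 bytes so the example does not depend
// on how this source file itself is saved. The "ï" takes two bytes.
$word = "na\xC3\xAFve";

echo strlen($word), "\n";             // byte count
echo mb_strlen($word, 'UTF-8'), "\n"; // character count

// substr() cuts in the middle of the two-byte character and leaves invalid UTF-8;
// mb_substr() cuts on character boundaries.
$brokenCut = substr($word, 0, 3);
$cleanCut  = mb_substr($word, 0, 3, 'UTF-8');

// Hypothetical helper: reject input that is not well-formed UTF-8.
function isValidUtf8(string $input): bool
{
    return mb_check_encoding($input, 'UTF-8');
}

var_dump(isValidUtf8($word));         // the original string is valid
var_dump(isValidUtf8($brokenCut));    // the byte-based cut is not
var_dump(isValidUtf8($cleanCut));     // the character-based cut is
var_dump(isValidUtf8("\xE9t\xE9"));   // "été" in ISO-8859-1 bytes is not valid UTF-8
```

Validation only tells you the bytes are well-formed UTF-8; it cannot prove that the user actually intended that encoding, which is why the hidden-field trick is still a useful cross-check.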
Another good resource for PHP is Nick Nettleton's cheatsheet.
A subject that is very closely related to charsets/encodings is collation. You need your collations to match the language/culture that you are working with. At least in MySQL (and probably in other RDBMSes as well), you can specify the collation at different levels, such as per database, per table, per column, and even in the query itself.
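The same collation concern shows up when sorting on the PHP side. The sketch below shows that plain `sort()` compares bytes, so accented characters land after "z", and then, only if the intl extension happens to be installed, uses its `Collator` class for a locale-aware order. The word list and the `de_DE` locale are just illustrative choices.

```php
<?php
// "Äpfel" is written with explicit UTF-8 bytes (\xC3\x84 is "Ä").
$words = ['zebra', "\xC3\x84pfel", 'apple'];

// Byte-wise sort: every multibyte character compares greater than ASCII letters.
$byBytes = $words;
sort($byBytes, SORT_STRING);
print_r($byBytes); // "Äpfel" ends up last, after "zebra"

// Locale-aware sort, guarded because the intl extension is optional.
if (class_exists('Collator')) {
    $collator = new Collator('de_DE');
    $byLocale = $words;
    $collator->sort($byLocale);
    print_r($byLocale); // "Äpfel" now sorts among the "a" words
}
```

If the data comes from the database anyway, letting the query sort with the right collation keeps PHP and the database from disagreeing about the order.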