I would like to determine what the alphabet for a given locale is, preferably based on the browser's Accept-Language header values. Does anyone know how to do this, using a library if necessary?
Take a look at [LocaleData.getExemplarSet][1].
For example, for English this returns abcdefghijklmnopqrstuvwxyz.
[1]: http://icu-project.org/apiref/icu4j/com/ibm/icu/util/LocaleData.html#getExemplarSet(com.ibm.icu.util.ULocale, int)
If you just want to know the name of an appropriate character set for a user's locale, then you might try the java.nio.charset.Charset class.
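Keep in mind that a Charset describes an encoding rather than an alphabet, and as far as I know the JDK has no built-in locale-to-charset lookup. What you can do is ask whether a given charset is able to encode some text. A minimal sketch (the class and method names are made up for illustration):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetCoverageSketch {
    // Hypothetical helper: true if every character of text can be encoded in charset.
    static boolean canWrite(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String danish = "\u00C5rhus";      // "Århus", escaped to avoid source-encoding issues
        String kanji = "\u6587\u5B57";     // "文字"

        System.out.println(canWrite(danish, StandardCharsets.US_ASCII));   // false
        System.out.println(canWrite(danish, StandardCharsets.ISO_8859_1)); // true
        System.out.println(canWrite(kanji, StandardCharsets.ISO_8859_1));  // false
        System.out.println(canWrite(kanji, StandardCharsets.UTF_8));       // true
    }
}
```

This tells you what an encoding can represent, which is only loosely related to what a language's alphabet is.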
If you really want to use the Accept-Language header, then there's an old O'Reilly article on this matter which introduces a pretty handy class called LanguageNegotiator.
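If you are on Java 8 or later, the JDK can also parse the header itself via Locale.LanguageRange, so you may not need a separate class for that part. A rough sketch, assuming you already have the raw header string (the class and method names here are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class AcceptLanguageSketch {
    // Turns an Accept-Language header value into locales, most preferred first.
    static List<Locale> preferredLocales(String header) {
        List<Locale> result = new ArrayList<>();
        // parse() returns the ranges sorted by descending weight (q-value).
        for (Locale.LanguageRange range : Locale.LanguageRange.parse(header)) {
            if (!range.getRange().equals("*")) {   // skip the wildcard range
                result.add(Locale.forLanguageTag(range.getRange()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(preferredLocales("da, en-GB;q=0.8, en;q=0.7"));
    }
}
```

This only gets you from the header to a list of locales; mapping a locale to an alphabet is still a separate step.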
I think one of those will give you a decent enough start.
This is an English answer written in Århus. Yesterday, I heard some Germans say 'Blödheit, à propos, ist dumm'. However, one of them wore a shirt that said 'I know the difference between 文字 and الْعَرَبيّة'.
What's the answer to your question for this text? Is it allowed? Isn't this an English text?
The International Components for Unicode might help here. Specifically, the UScript class looks promising.
Out of curiosity: What do you need it for?
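If you would rather not pull in ICU just to experiment, the JDK has a rough analogue in Character.UnicodeScript. This is not UScript itself, only a sketch of the same idea using the standard library; the class and method names are invented for the example:

```java
import java.util.EnumSet;
import java.util.Set;

class ScriptScanSketch {
    // Collects the Unicode scripts of all letters in text.
    // Digits, punctuation and combining marks are ignored.
    static Set<Character.UnicodeScript> scriptsIn(String text) {
        Set<Character.UnicodeScript> scripts = EnumSet.noneOf(Character.UnicodeScript.class);
        text.codePoints()
            .filter(Character::isLetter)
            .forEach(cp -> scripts.add(Character.UnicodeScript.of(cp)));
        return scripts;
    }

    public static void main(String[] args) {
        // "Blödheit 文字", escaped to avoid source-encoding issues
        System.out.println(scriptsIn("Bl\u00F6dheit \u6587\u5B57")); // [LATIN, HAN]
    }
}
```

Note that this tells you which scripts occur in a piece of text, not which letters a language considers part of its alphabet.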
It depends on how specific you want to get. One place to look would be at the "Suppress-Script" properties in the IANA language registry.
Some languages have multiple "alphabets" that can be used for writing. For example, Azerbaijani can be written in Latin or Arabic script. Most languages, like English, are written almost exclusively in a single script, so the correct script goes without saying, and should be "suppressed" in language codes.
So, looking at the entry for Russian, you can tell that the preferred script is Cyrillic, while for Amharic, it is Ethiopic. But the entries for German, Norwegian, and English aren't more specific than "Latin". So, with this method, you'd have a hard time hiding umlauts and thorns from Americans, or offering any script to a Kashmiri writer.
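To make the lookup concrete, here is a small sketch. The table is a hand-copied excerpt of a few Suppress-Script values, not the real registry, which you would have to download and parse yourself; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class SuppressScriptSketch {
    // Hand-copied excerpt for illustration only; the IANA registry has many more entries.
    private static final Map<String, String> SUPPRESS_SCRIPT = new HashMap<>();
    static {
        SUPPRESS_SCRIPT.put("en", "Latn");
        SUPPRESS_SCRIPT.put("de", "Latn");
        SUPPRESS_SCRIPT.put("no", "Latn");
        SUPPRESS_SCRIPT.put("ru", "Cyrl");
        SUPPRESS_SCRIPT.put("am", "Ethi");
    }

    // Returns the ISO 15924 script code for a language tag: the explicit script
    // subtag if the tag has one, else the table entry, else null.
    static String scriptFor(String languageTag) {
        Locale locale = Locale.forLanguageTag(languageTag);
        if (!locale.getScript().isEmpty()) {
            return locale.getScript();
        }
        return SUPPRESS_SCRIPT.get(locale.getLanguage());
    }

    public static void main(String[] args) {
        System.out.println(scriptFor("ru-RU"));   // Cyrl
        System.out.println(scriptFor("az-Arab")); // Arab
        System.out.println(scriptFor("ks"));      // null (not in this excerpt)
    }
}
```

As said above, this only gets you as far as the script, never down to the individual letters a language uses.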