I'm integrating ICU into some in-house software. I'd like to be able to take a string such as "en_US" and get the script name "Latin" for it. (Though ultimately I actually want an ICU ScriptCode.)
I tried using ICU's Locale class, but this code:
Locale *ul = new Locale("en_US");
LOG(ul->getScript());
logs an empty string, despite the documentation indicating that this is the use case. I even tried it using the Locale class's static method Locale::getEnglish() and still got an empty string. I'm new to this internationalization stuff and to ICU. Is there something I'm missing? Seems like this should be a pretty straightforward task.
Edit: After reading the source code for Locale, it seems that the only time it can provide a script code is when the script is already part of the identifier passed to the constructor (i.e. "en_Latn_US"). Cheers for inadequate documentation. My overall question still stands.
EDIT: I've made a new and better answer. Use that. I had no luck finding anything, so I decided to do my best to make a table myself. After searching a bit, I found this gem. Clicking (almost) any language will give you the ISO 639-1 code (and more), and clicking any script category will give you the ISO 15924 code. I probably could have written something to tease the tables into C++, but I only needed a couple dozen and couldn't justify automating it (on my job's dime), so here's the table I made by hand:
Better: http://unicode.org/repos/cldr/trunk/common/supplemental/supplementalData.xml
C#, usings excluded for character count:
I've created a Java version that does this, available here. Basically, it takes the table in the above answer (with additional entries) and ports it to a Map<String, Map<String, String>> containing the useful information, then a simple lookup method is used. To use this class in your project, just call: to get the script for the default locale.
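For readers working in C++ rather than Java, the nested-map idea can be sketched roughly as follows. This is not the class linked above: the name lookupScript, the table layout (language, then region, then script) and the handful of entries are assumptions made for illustration only.

```cpp
#include <map>
#include <string>

// Hypothetical table: language -> (region -> ISO 15924 script code).
// The empty region key "" holds the default script for the language.
// Only a few illustrative entries; this is NOT the hand-made table
// or the Java class mentioned in the answers.
static const std::map<std::string, std::map<std::string, std::string>> kScriptTable = {
    {"en", {{"", "Latn"}}},
    {"ru", {{"", "Cyrl"}}},
    {"zh", {{"", "Hans"}, {"TW", "Hant"}, {"HK", "Hant"}}},
};

// Looks up the script for an identifier such as "en_US" or "zh_TW".
// Returns an empty string when the language is not in the table.
std::string lookupScript(const std::string& localeId) {
    const std::string::size_type sep = localeId.find('_');
    const std::string lang = localeId.substr(0, sep);
    const std::string region =
        (sep == std::string::npos) ? std::string() : localeId.substr(sep + 1);

    const auto langIt = kScriptTable.find(lang);
    if (langIt == kScriptTable.end()) return std::string();

    const auto& byRegion = langIt->second;
    auto it = byRegion.find(region);
    if (it == byRegion.end()) it = byRegion.find("");  // fall back to the language default
    if (it == byRegion.end()) return std::string();
    return it->second;
}
```

A table like this only knows what you put into it, so it suits the "couple dozen locales" case; for broader coverage the CLDR data is the better source.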
But en_US doesn't have a script tag in it, it's just an identifier. What would you suggest to improve the documentation here? If you want to guess what the script is likely to be, then you can use uloc_addLikelySubtags() (or the ICU4J equivalent), which will map en to en_Latn_US, but will leave zh_Hant_CN as zh_Hant_CN, using the CLDR likelySubtag data.
The C part of ICU4C contains a uscript_getCode() call that should do what you are looking for.