I am writing a server API for an iOS application. As part of the initialization process, the app should send the phone's interface language to the server via an API call.
The problem is that Apple uses something called an IETF BCP 47 language identifier in its NSLocale preferredLanguages function.
The returned values have different lengths (e.g. [aa, ab, ace, ach, ada, ady, ae, af, afa, afh, agq, ...]), and I found very few parsers that can convert these codes to a proper language identifier.
I would like to use the more common ISO 639-2 three-letter language identifier, which is ubiquitous, has parsers in many programming languages, and gives every language a standard 3-letter representation.
How can I convert an IETF BCP 47 language identifier to an ISO 639-2 three-letter language identifier, preferably in Python?
BCP 47 identifiers start with a 2 letter ISO 639-1 or 3 letter 639-2, 639-3 or 639-5 language code; see the RFC 5646 Syntax section:
    Language-Tag  = langtag             ; normal language tags
                  / privateuse          ; private use tag
                  / grandfathered       ; grandfathered tags

    langtag       = language
                    ["-" script]
                    ["-" region]
                    *("-" variant)
                    *("-" extension)
                    ["-" privateuse]

    language      = 2*3ALPHA            ; shortest ISO 639 code
                    ["-" extlang]       ; sometimes followed by
                                        ; extended language subtags
                  / 4ALPHA              ; or reserved for future use
                  / 5*8ALPHA            ; or registered language subtag
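To make the grammar concrete, here is a minimal sketch that pulls the leading subtags out of a tag with a regular expression. The name `split_langtag` and the pattern are illustrative assumptions, not part of any library, and variants, extensions and private-use subtags are deliberately ignored.

```python
import re

# Covers only the start of the 'langtag' production: language, optional
# extlang, optional script, optional region. Hypothetical helper pattern.
_LANGTAG_PREFIX = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:-(?P<extlang>[a-z]{3}(?:-[a-z]{3}){0,2}))?"
    r"(?:-(?P<script>[a-z]{4}))?"
    r"(?:-(?P<region>[a-z]{2}|[0-9]{3}))?"
    r"(?=-|$)",  # each matched subtag must end at a hyphen or at the end
    re.IGNORECASE,
)


def split_langtag(tag):
    """Return the leading subtags as a dict, or None if the tag does not
    start with a 2- or 3-letter language code."""
    match = _LANGTAG_PREFIX.match(tag)
    return match.groupdict() if match else None
```

For example, `split_langtag("zh-Hans-CN")` yields the language `zh`, the script `Hans` and the region `CN`, while a private-use tag such as `x-private` yields `None`.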
I don't expect Apple to use the privateuse or grandfathered forms, so you can assume that you are looking at ISO 639-1, ISO 639-2, ISO 639-3 or ISO 639-5 language codes here. Simply map the 2-letter ISO 639-1 codes to 3-letter ISO 639-* codes.
You can use the pycountry package for this:
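    import pycountry

    # Recent pycountry releases spell these alpha_2 and alpha_3 instead;
    # check which names your installed version uses.
    lang = pycountry.languages.get(alpha2=two_letter_code)
    three_letter_code = lang.terminology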
Demo:
>>> import pycountry
>>> lang = pycountry.languages.get(alpha2='aa')
>>> lang.terminology
u'aar'
where the terminology form is the preferred 3-letter code; there is also a bibliographic form, which differs for only 22 entries. See ISO 639-2 B and T codes. The package doesn't include entries from ISO 639-5, however; that list overlaps and conflicts with 639-2 in places, and I don't think Apple uses such codes at all.
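If adding a dependency is not an option, the same mapping can be sketched with a hand-written table. The table below is a tiny illustrative excerpt (a real one would cover every ISO 639-1 code), and the names `ISO_639_1_TO_2` and `to_three_letter` are made up for this example.

```python
# Illustrative excerpt only: ISO 639-1 code -> (terminology, bibliographic).
ISO_639_1_TO_2 = {
    "aa": ("aar", "aar"),
    "de": ("deu", "ger"),
    "en": ("eng", "eng"),
    "fr": ("fra", "fre"),
    "zh": ("zho", "chi"),
}


def to_three_letter(bcp47_tag, bibliographic=False):
    """Map the primary language subtag of a BCP 47 tag to a 3-letter code."""
    # Tags may carry further subtags (e.g. 'en-US'); only the first
    # subtag identifies the language.
    language = bcp47_tag.split("-", 1)[0].lower()
    if len(language) == 3:
        # Assumed to already be a 3-letter ISO 639 code: pass it through.
        return language
    # Raises KeyError for codes missing from the excerpt above.
    terminology, biblio = ISO_639_1_TO_2[language]
    return biblio if bibliographic else terminology
```

Here `to_three_letter("en-US")` gives `eng`, and `to_three_letter("de", bibliographic=True)` gives `ger` rather than the terminology form `deu`.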
From RFC5646/BCP47:
    Language-Tag  = langtag             ; normal language tags
                  / privateuse          ; private use tag
                  / grandfathered       ; grandfathered tags

    langtag       = language
                    ["-" script]
                    ["-" region]
                    *("-" variant)
                    *("-" extension)
                    ["-" privateuse]

    language      = 2*3ALPHA            ; shortest ISO 639 code
                    ["-" extlang]       ; sometimes followed by
                                        ; extended language subtags
                  / 4ALPHA              ; or reserved for future use
                  / 5*8ALPHA            ; or registered language subtag

    privateuse    = "x" 1*("-" (1*8alphanum))

    grandfathered = irregular           ; non-redundant tags registered
                  / regular             ; during the RFC 3066 era
It looks like the first segment of most BCP 47 codes should be a valid ISO 639 code, though it might not be the three-letter variant. A BCP 47 language code has a few variants that are not ISO 639 codes, namely those beginning with x- or i-, as well as a number of legacy codes that match the grandfathered portion of the grammar:
    irregular     = "en-GB-oed"         ; irregular tags do not match
                  / "sgn-BE-FR"         ; also includes i- prefixed codes
                  / "sgn-BE-NL"
                  / "sgn-CH-DE"

    regular       = "art-lojban"        ; these tags match the 'langtag'
                  / "cel-gaulish"       ; production, but their subtags
                  / "no-bok"            ; are not extended language
                  / "no-nyn"            ; or variant subtags: their meaning
                  / "zh-guoyu"          ; is defined by their registration
                  / "zh-hakka"          ; and all of these are deprecated
                  / "zh-min"            ; in favor of a more modern
                  / "zh-min-nan"        ; subtag or sequence of subtags
                  / "zh-xiang"
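These special cases can be screened out before looking at the first subtag. The sketch below hard-codes the grandfathered tags listed above and, as a simplifying assumption, treats every tag starting with i- as an irregular grandfathered tag; `is_special_tag` is a hypothetical helper name.

```python
# Grandfathered tags copied from the grammar above, lower-cased because
# BCP 47 tags are compared case-insensitively.
GRANDFATHERED = {
    "en-gb-oed", "sgn-be-fr", "sgn-be-nl", "sgn-ch-de",
    "art-lojban", "cel-gaulish", "no-bok", "no-nyn",
    "zh-guoyu", "zh-hakka", "zh-min", "zh-min-nan", "zh-xiang",
}


def is_special_tag(tag):
    """Return True for private-use (x-...) and grandfathered tags."""
    lowered = tag.lower()
    # Assumption: any i- prefixed tag is an irregular grandfathered tag.
    if lowered.startswith(("x-", "i-")):
        return True
    return lowered in GRANDFATHERED
```

A caller could check `is_special_tag(tag)` first and only treat the leading subtag as an ISO 639 code when it returns `False`.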
A good start would be something like the following:
    def extract_iso_code(bcp_identifier):
        # partition() also copes with tags that contain no hyphen, e.g. 'en'
        language, _, _ = bcp_identifier.partition('-')
        if 2 <= len(language) <= 3:
            # this is a valid ISO-639 code or is grandfathered
            return language
        else:
            # handle non-ISO codes
            raise ValueError(bcp_identifier)
Conversion from the 2-character variant to the 3-character variant should be easy enough to handle since the mapping is well known.