How to convert IETF BCP 47 language identifier to

Posted 2019-05-01 15:10

Question:

I am writing a server API for an iOS application. As a part of the initialization process, the app should send the phone interface language to server via an API call.

The problem is that Apple uses something called IETF BCP 47 language identifier in its NSLocale preferredLanguages function.

The returned values have different lengths (e.g. [aa, ab, ace, ach, ada, ady, ae, af, afa, afh, agq, ...]), and I found very few parsers that can convert these codes to a proper language identifier.

I would like to use the more common ISO-639-2 three-letter language identifier, which is ubiquitous, has parsers in many programming languages, and gives every language a standard 3-letter representation.

How can I convert an IETF BCP 47 language identifier to an ISO-639-2 three-letter language identifier, preferably in Python?

Answer 1:

BCP 47 identifiers start with a 2-letter ISO 639-1 code or a 3-letter ISO 639-2, 639-3 or 639-5 language code; see the RFC 5646 Syntax section:

Language-Tag  = langtag             ; normal language tags
              / privateuse          ; private use tag
              / grandfathered       ; grandfathered tags

langtag       = language
                ["-" script]
                ["-" region]
                *("-" variant)
                *("-" extension)
                ["-" privateuse]

language      = 2*3ALPHA            ; shortest ISO 639 code
                ["-" extlang]       ; sometimes followed by
                                    ; extended language subtags
              / 4ALPHA              ; or reserved for future use
              / 5*8ALPHA            ; or registered language subtag
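
The langtag production above can be approximated with a regular expression. This is only a rough sketch, not a full RFC 5646 validator: the names LANGTAG_RE and split_langtag are made up for illustration, and variants, extensions and private-use subtags are lumped together in a single "rest" group rather than parsed individually.

```python
import re

# Simplified take on the 'langtag' production. Only language, extlang,
# script and region are captured; everything after that goes into 'rest'.
LANGTAG_RE = re.compile(
    r"""
    ^(?P<language>[A-Za-z]{2,3})                         # 2*3ALPHA
    (?:-(?P<extlang>[A-Za-z]{3}(?:-[A-Za-z]{3}){0,2}))?  # optional extlang
    (?:-(?P<script>[A-Za-z]{4}))?                        # optional script
    (?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?               # optional region
    (?P<rest>(?:-[A-Za-z0-9]{1,8})*)$                    # variants, extensions, private use
    """,
    re.VERBOSE,
)


def split_langtag(tag):
    """Return the subtags found in a simple langtag as a dict."""
    match = LANGTAG_RE.match(tag)
    if match is None:
        # Private-use (x-...) and i- prefixed tags end up here.
        raise ValueError("not a simple langtag: %r" % tag)
    # Drop groups that did not match (None) or matched nothing ('').
    return {name: value for name, value in match.groupdict().items() if value}


print(split_langtag("zh-Hans-CN"))
```

For the conversion asked about in the question, only the "language" entry of the returned dict matters.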

I don't expect Apple to use the privateuse or grandfathered forms, so you can assume that you are looking at ISO 639-1, ISO 639-2, ISO 639-3 or ISO 639-5 language codes here. Simply map the 2-letter ISO-639-1 codes to 3-letter ISO 639-* codes.

You can use the pycountry package for this. The keyword and attribute names depend on the installed version: older releases used alpha2 and terminology, while current releases use alpha_2 and alpha_3:

import pycountry

# Current pycountry releases; older ones used get(alpha2=...) and lang.terminology
lang = pycountry.languages.get(alpha_2=two_letter_code)
three_letter_code = lang.alpha_3

Demo:

>>> import pycountry
>>> lang = pycountry.languages.get(alpha_2='aa')
>>> lang.alpha_3
'aar'

where the terminology form is the preferred 3-letter code; there is also a bibliographic form, which differs for only 22 entries. See ISO 639-2 B and T codes. The package doesn't include entries from ISO 639-5, however; that list overlaps and conflicts with 639-2 in places, and I don't think Apple uses such codes at all.
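
To make the terminology/bibliographic distinction concrete, here is a dependency-free sketch. The table ISO_639_2 and the function to_iso_639_2 are hypothetical names, and the table is a tiny hand-picked subset for illustration; a real service would load the full list from pycountry or the official ISO 639-2 code tables.

```python
# Illustrative subset only: ISO 639-1 code -> (terminology, bibliographic).
# For most languages both forms are identical.
ISO_639_2 = {
    "aa": ("aar", "aar"),
    "en": ("eng", "eng"),
    "de": ("deu", "ger"),
    "fr": ("fra", "fre"),
    "zh": ("zho", "chi"),
    "nl": ("nld", "dut"),
}


def to_iso_639_2(two_letter_code, bibliographic=False):
    """Map a 2-letter code to its 3-letter ISO 639-2 T (default) or B code."""
    # Raises KeyError for codes missing from the sample table.
    terminology, biblio = ISO_639_2[two_letter_code.lower()]
    return biblio if bibliographic else terminology


print(to_iso_639_2("de"))
print(to_iso_639_2("de", bibliographic=True))
```

Whichever form you pick, store it consistently on the server, since the two forms of the same language do not compare equal as strings.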



Answer 2:

From RFC5646/BCP47:

Language-Tag  = langtag             ; normal language tags
              / privateuse          ; private use tag
              / grandfathered       ; grandfathered tags

langtag       = language
                ["-" script]
                ["-" region]
                *("-" variant)
                *("-" extension)
                ["-" privateuse]

language      = 2*3ALPHA            ; shortest ISO 639 code
                ["-" extlang]       ; sometimes followed by
                                    ; extended language subtags
              / 4ALPHA              ; or reserved for future use
              / 5*8ALPHA            ; or registered language subtag

privateuse    = "x" 1*("-" (1*8alphanum))

grandfathered = irregular           ; non-redundant tags registered
              / regular             ; during the RFC 3066 era

It looks like the first segment of most BCP-47 codes should be a valid ISO-639 code, though it might not be the three-letter variant. A few kinds of BCP-47 language code do not start with an ISO-639 code, namely those beginning with x- or i-, as well as a number of legacy codes that match the grandfathered portion of the grammar:

irregular     = "en-GB-oed"         ; irregular tags do not match
              / "sgn-BE-FR"         ; also includes i- prefixed codes
              / "sgn-BE-NL"
              / "sgn-CH-DE"

regular       = "art-lojban"        ; these tags match the 'langtag'
              / "cel-gaulish"       ; production, but their subtags
              / "no-bok"            ; are not extended language
              / "no-nyn"            ; or variant subtags: their meaning
              / "zh-guoyu"          ; is defined by their registration
              / "zh-hakka"          ; and all of these are deprecated
              / "zh-min"            ; in favor of a more modern
              / "zh-min-nan"        ; subtag or sequence of subtags
              / "zh-xiang"

A good start would be something like the following:

def extract_iso_code(bcp_identifier):
    # partition() also copes with tags that contain no '-' at all, e.g. 'en'
    language, _, _ = bcp_identifier.partition('-')
    if 2 <= len(language) <= 3:
        # this is a valid ISO-639 code or is grandfathered
        return language
    else:
        # handle non-ISO codes
        raise ValueError(bcp_identifier)

Conversion from the 2-character variant to the 3-character variant should be easy enough to handle since the mapping is well known.
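
Putting the pieces together, a minimal end-to-end sketch might look like the following. The names GRANDFATHERED, TWO_TO_THREE and bcp47_to_iso639_2 are made up for this example, the 2-to-3 letter table is a small illustrative subset, and rejecting grandfathered tags outright is just one possible policy (you could instead map each one to its modern replacement).

```python
# Grandfathered tags copied from the grammar above, lowercased because
# BCP 47 tags are case-insensitive. The i- prefixed ones are caught by
# the prefix check below.
GRANDFATHERED = {
    "en-gb-oed", "sgn-be-fr", "sgn-be-nl", "sgn-ch-de",
    "art-lojban", "cel-gaulish", "no-bok", "no-nyn", "zh-guoyu",
    "zh-hakka", "zh-min", "zh-min-nan", "zh-xiang",
}

# Illustrative subset of the ISO 639-1 -> ISO 639-2/T mapping.
TWO_TO_THREE = {"aa": "aar", "en": "eng", "fr": "fra", "de": "deu", "zh": "zho"}


def bcp47_to_iso639_2(tag):
    """Return a 3-letter code for the primary language subtag of `tag`."""
    normalized = tag.strip().lower()
    if normalized in GRANDFATHERED or normalized.startswith(("x-", "i-")):
        raise ValueError("no usable ISO 639 language subtag in %r" % tag)
    language = normalized.partition("-")[0]
    if len(language) == 3:
        # Already 3 letters; passed through unchecked. Note it may be an
        # ISO 639-3 or 639-5 code rather than a 639-2 one.
        return language
    if len(language) == 2:
        # Raises KeyError for codes missing from the sample table.
        return TWO_TO_THREE[language]
    raise ValueError("unsupported language subtag in %r" % tag)


print(bcp47_to_iso639_2("en-US"))
```

In production the sample table would be replaced by a complete one, for example built once at startup from pycountry.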