How to detect text (string) language in iOS?

2020-02-08 17:59发布

For instance, given the following strings:

let textEN = "The quick brown fox jumps over the lazy dog"
let textES = "El zorro marrón rápido salta sobre el perro perezoso"
let textAR = "الثعلب البني السريع يقفز فوق الكلب الكسول"
let textDE = "Der schnelle braune Fuchs springt über den faulen Hund"

I want to detect the used language in each of them.

Let's assume the signature for the implemented function is:

func detectedLanguage<T: StringProtocol>(_ forString: T) -> String?

returns an Optional string in case of no detected language.

thus the appropriate result would be:

let englishDetectedLanguage = detectedLanguage(textEN) // => English
let spanishDetectedLanguage = detectedLanguage(textES) // => Spanish
let arabicDetectedLanguage = detectedLanguage(textAR) // => Arabic
let germanDetectedLanguage = detectedLanguage(textDE) // => German

Is there an easy approach to achieve it?

3条回答
老娘就宠你
2楼-- · 2020-02-08 18:32

I tried NSLinguisticTagger with short input text like hello, it always recognizes as Italian. Luckily, Apple recently added NLLanguageRecognizer available on iOS 12, and seems like it more accurate :D

import NaturalLanguage

if #available(iOS 12.0, *) {
    let languageRecognizer = NLLanguageRecognizer()
    languageRecognizer.processString(text)
    let code = languageRecognizer.dominantLanguage!.rawValue
    let language = Locale.current.localizedString(forIdentifier: code)
}
查看更多
啃猪蹄的小仙女
3楼-- · 2020-02-08 18:39

Latest versions (iOS 12+)

Briefly:

You could achieve it by using NLLanguageRecognizer, as:

import NaturalLanguage

func detectedLanguage(for string: String) -> String? {
    let recognizer = NLLanguageRecognizer()
    recognizer.processString(string)
    guard let languageCode = recognizer.dominantLanguage?.rawValue else { return nil }
    let detectedLanguage = Locale.current.localizedString(forIdentifier: languageCode)
    return detectedLanguage
}

Older versions (iOS 11+)

Briefly:

You could achieve it by using NSLinguisticTagger, as:

func detectedLanguage<T: StringProtocol>(for string: T) -> String? {
    let recognizer = NLLanguageRecognizer()
    recognizer.processString(String(string))
    guard let languageCode = recognizer.dominantLanguage?.rawValue else { return nil }
    let detectedLanguage = Locale.current.localizedString(forIdentifier: languageCode)
    return detectedLanguage
}

Details:

First of all, you should be aware of what are you asking about is mainly relates to the world of Natural language processing (NLP).

Since NLP is more than text language detection, the rest of the answer will not contains specific NLP information.

Obviously, implementing such a functionality is not that easy, especially when starting to care about the details of the process such as splitting into sentences and even into words, after that recognising names and punctuations etc... I bet you would think of "what a painful process! it is not even logical to do it by myself"; Fortunately, iOS does supports NLP (actually, NLP APIs are available for all Apple platforms, not only the iOS) to make what are you aiming for to be easy to be implemented. The core component that you would work with is NSLinguisticTagger:

Analyze natural language text to tag part of speech and lexical class, identify names, perform lemmatization, and determine the language and script.

NSLinguisticTagger provides a uniform interface to a variety of natural language processing functionality with support for many different languages and scripts. You can use this class to segment natural language text into paragraphs, sentences, or words, and tag information about those segments, such as part of speech, lexical class, lemma, script, and language.

As mentioned in the class documentation, the method that you are looking for - under Determining the Dominant Language and Orthography section- is dominantLanguage(for:):

Returns the dominant language for the specified string.

.

.

Return Value

The BCP-47 tag identifying the dominant language of the string, or the tag "und" if a specific language cannot be determined.

You might notice that the NSLinguisticTagger is exist since back to iOS 5. However, dominantLanguage(for:) method is only supported for iOS 11 and above, that's because it has been developed on top of the Core ML Framework:

. . .

Core ML is the foundation for domain-specific frameworks and functionality. Core ML supports Vision for image analysis, Foundation for natural language processing (for example, the NSLinguisticTagger class), and GameplayKit for evaluating learned decision trees. Core ML itself builds on top of low-level primitives like Accelerate and BNNS, as well as Metal Performance Shaders.

enter image description here

Based on the returned value from calling dominantLanguage(for:) by passing "The quick brown fox jumps over the lazy dog":

NSLinguisticTagger.dominantLanguage(for: "The quick brown fox jumps over the lazy dog")

would be "en" optional string. However, so far that is not the desired output, the expectation is to get "English" instead! Well, that is exactly what you should get by calling the localizedString(forLanguageCode:) method from Locale Structure and passing the gotten language code:

Locale.current.localizedString(forIdentifier: "en") // English

Putting all together:

As mentioned in the "Quick Answer" code snippet, the function would be:

func detectedLanguage<T: StringProtocol>(_ forString: T) -> String? {
    guard let languageCode = NSLinguisticTagger.dominantLanguage(for: String(forString)) else {
        return nil
    }

    let detectedLanguage = Locale.current.localizedString(forIdentifier: languageCode)

    return detectedLanguage
}

Output:

It would be as expected:

let englishDetectedLanguage = detectedLanguage(textEN) // => English
let spanishDetectedLanguage = detectedLanguage(textES) // => Spanish
let arabicDetectedLanguage = detectedLanguage(textAR) // => Arabic
let germanDetectedLanguage = detectedLanguage(textDE) // => German

Note That:

There still cases for not getting a language name for a given string, like:

let textUND = "SdsOE"
let undefinedDetectedLanguage = detectedLanguage(textUND) // => Unknown language

Or it could be even nil:

let rubbish = "000747322"
let rubbishDetectedLanguage = detectedLanguage(rubbish) // => nil

Still find it a not bad result for providing a useful output...


Furthermore:

About NSLinguisticTagger:

Although I will not going to dive deep in NSLinguisticTagger usage, I would like to note that there are couple of really cool features exist in it more than just simply detecting the language for a given a text; As a pretty simple example: using the lemma when enumerating tags would be so helpful when working with Information retrieval, since you would be able to recognize the word "driving" passing "drive" word.

Official Resources

Apple Video Sessions:

Also, for getting familiar with the CoreML:

查看更多
欢心
4楼-- · 2020-02-08 18:50

You can use NSLinguisticTagger's tagAt method. It support iOS 5 and later.

func detectLanguage<T: StringProtocol>(for text: T) -> String? {
    let tagger = NSLinguisticTagger.init(tagSchemes: [.language], options: 0)
    tagger.string = String(text)

    guard let languageCode = tagger.tag(at: 0, scheme: .language, tokenRange: nil, sentenceRange: nil) else { return nil }
    return Locale.current.localizedString(forIdentifier: languageCode)
}

detectLanguage(for: "The quick brown fox jumps over the lazy dog")              // English
detectLanguage(for: "El zorro marrón rápido salta sobre el perro perezoso")     // Spanish
detectLanguage(for: "الثعلب البني السريع يقفز فوق الكلب الكسول")                // Arabic
detectLanguage(for: "Der schnelle braune Fuchs springt über den faulen Hund")   // German
查看更多
登录 后发表回答