Detect language from string in PHP

Posted 2019-01-01 01:05

In PHP, is there a way to detect the language of a string? Suppose the string is in UTF-8 format.

16 answers
爱死公子算了
#2 · 2019-01-01 01:26
若你有天会懂
#3 · 2019-01-01 01:27

You can probably use the Google Translate API to detect the language and translate it if necessary.

春风洒进眼中
#4 · 2019-01-01 01:27

I would take sample documents in various languages and record which Unicode characters each language uses. You could then apply Bayesian reasoning to determine the language of a string from just the Unicode characters it contains. This would separate French from English or Russian.

I am not sure what else could be done, other than looking up the words in language dictionaries to determine the language (using a similar probabilistic approach).
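
To make the idea concrete, here is a minimal sketch of a naive Bayes classifier over letter frequencies. The function names (letters, train, classify) and the three training sentences are made up for illustration; a usable model would need far more training text per language, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helpers; the training samples below are toy data.

/** Split a UTF-8 string into lowercase letters, dropping everything else. */
function letters(string $text): array {
    $text = mb_strtolower($text, 'UTF-8');
    preg_match_all('/\p{L}/u', $text, $m);
    return $m[0];
}

/** Count how often each letter occurs in each language's sample text. */
function train(array $samples): array {
    $model = [];
    foreach ($samples as $lang => $text) {
        $model[$lang] = array_count_values(letters($text));
    }
    return $model;
}

/** Return the language whose letter distribution best explains $text, or null. */
function classify(array $model, string $text): ?string {
    $chars = letters($text);
    if (!$chars) {
        return null; // no letters to score
    }
    // Number of distinct letters seen in training, used for add-one smoothing.
    $vocab = [];
    foreach ($model as $counts) {
        $vocab += $counts;
    }
    $v = count($vocab);

    $best = null;
    $bestScore = -INF;
    foreach ($model as $lang => $counts) {
        $total = array_sum($counts);
        $score = 0.0;
        foreach ($chars as $ch) {
            // log P(letter | language) with add-one (Laplace) smoothing
            $score += log((($counts[$ch] ?? 0) + 1) / ($total + $v));
        }
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $lang;
        }
    }
    return $best;
}

$model = train([
    'en' => 'the quick brown fox jumps over the lazy dog and runs away with the whole show',
    'fr' => "où est la bibliothèque près de l'école, s'il vous plaît ? ça dépend du théâtre français",
    'ru' => 'съешь же ещё этих мягких французских булок да выпей чаю',
]);

echo classify($model, 'привет'), "\n"; // ru
```

All languages are treated as equally likely a priori, so the score is just the summed log-likelihood of the letters. Languages that share a script (English and French here) are told apart only by accented letters and frequency differences, which is why short strings are easy to misclassify.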

深知你不懂我心
#5 · 2019-01-01 01:29

As the Google Translate API is closing down as a free service, you can try this free alternative, which is a replacement for the Google Translate API:

http://detectlanguage.com

孤独寂梦人
#6 · 2019-01-01 01:29

The Text_LanguageDetect PEAR package produced terrible results: "luxury apartments downtown" is detected as Portuguese...

The Google API is still the best solution. They give $300 of free credit and warn you before charging anything.

Below is a very simple function that uses file_get_contents to fetch the language detected by the API, so there is no need to download or install any libraries.

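function guess_lang($str) {
    // urlencode() escapes every reserved character, not just spaces
    $url = "https://translation.googleapis.com/language/translate/v2/detect?key=YOUR_API_KEY&q=" . urlencode($str);
    $content = file_get_contents($url);
    if ($content === false) {
        return null; // the request failed
    }
    $lang = json_decode($content, true);
    // Return the top detection, or null if the response has an unexpected shape
    return $lang["data"]["detections"][0][0]["language"] ?? null;
}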

Execute:

echo guess_lang("luxury apartments downtown montreal"); // returns "en"

You can get your Google Translate API key here: https://console.cloud.google.com/apis/library/translate.googleapis.com/

This is a simple example for short phrases to get you going. For more complex applications you'll want to restrict your API key and use the official client library.

忆尘夕之涩
#7 · 2019-01-01 01:32

Try checking the character codes. I use this code to tell Russian from English (ru/en) in my social bot project:

function language($string) {
        // UTF-8 byte values of lowercase Cyrillic letters, concatenated (e.g. "208176" is "а")
        $ru = array("208","209","208176","208177","208178","208179","208180","208181","209145","208182","208183","208184","208185","208186","208187","208188","208189","208190","208191","209128","209129","209130","209131","209132","209133","209134","209135","209136","209137","209138","209139","209140","209141","209142","209143");
        // ASCII codes of a-z
        $en = array("97","98","99","100","101","102","103","104","105","106","107","108","109","110","111","112","113","114","115","116","117","118","119","120","121","122");
        $htmlcharacters = array("<", ">", "&amp;", "&lt;", "&gt;", "&");
        $string = str_replace($htmlcharacters, "", $string);
        //Strip out the slashes
        $string = stripslashes($string);
        $badthings = array("=", "#", "~", "!", "?", ".", ",", "<", ">", "/", ";", ":", '"', "'", "[", "]", "{", "}", "@", "$", "%", "^", "&", "*", "(", ")", "-", "_", "+", "|", "`");
        $string = str_replace($badthings, "", $string);
        $string = mb_strtolower($string, "UTF-8");
        $msgarray = explode(" ", $string);
        // First character of the first word; mb_substr keeps multi-byte characters whole
        $first = mb_substr($msgarray[0], 0, 1, "UTF-8");
        // The original called an undefined ToAscii() helper here; assuming it
        // returned the character's byte codes, build the same string with ord()
        $letters = implode("", array_map("ord", str_split($first)));
        if (in_array($letters, $ru)) {
            $result = 'Русский' ; //russian
        } elseif (in_array($letters, $en)) {
            $result = 'Английский'; //english
        } else {
            $result = 'ошибка' . $letters; //error
        }
        return $result;
}
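
The same first-letter check can probably be written more compactly with PCRE's Unicode script classes instead of hand-written byte tables. The sketch below is not from the original answer: first_letter_script is a made-up name, and the 'ru'/'en' labels are really "Cyrillic script" and "basic Latin letter", so Ukrainian or French text would get the same labels.

```php
<?php
// Hypothetical helper, shown as an alternative to the byte-code lookup above.
function first_letter_script(string $text): string {
    // Find the first letter of any script, skipping digits, spaces and punctuation.
    if (!preg_match('/\p{L}/u', $text, $m)) {
        return 'unknown';
    }
    if (preg_match('/^\p{Cyrillic}$/u', $m[0])) {
        return 'ru'; // Cyrillic script; could also be Ukrainian, Bulgarian, ...
    }
    if (preg_match('/^[A-Za-z]$/', $m[0])) {
        return 'en'; // basic Latin letter; could be any Latin-script language
    }
    return 'unknown';
}

echo first_letter_script('Привет, мир'), "\n";   // ru
echo first_letter_script('Hello, world'), "\n";  // en
echo first_letter_script('12345'), "\n";         // unknown
```

Like the function above, this looks at a single letter only, so it is suited to telling two scripts apart, not to general language detection.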