Detect language from string in PHP

Published 2019-01-01 01:05

In PHP, is there a way to detect the language of a string? Suppose the string is in UTF-8 format.

16 answers
君临天下
#2 · 2019-01-01 01:40

You could run Apache Tika's language-detection module from Java, write the results to a text file or a database, and then read them from PHP. If you don't have that much content, you could use Google's API instead, although keep in mind that calls are rate-limited and you can only send a restricted number of characters per request. At the time of writing I had tested version 1 of the API (which turned out to be not very accurate) and the Labs version 2 (which I ditched after reading that there is a 100,000-character cap per day).
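
For the Google route, a minimal sketch of a server-side call (this assumes the current Cloud Translation v2 "detect" endpoint and response format rather than the old v1/Labs APIs mentioned above; detectLanguageViaGoogle and $apiKey are placeholder names):

    function detectLanguageViaGoogle($text, $apiKey) {
      // Assumption: Cloud Translation API v2 detect endpoint; the older
      // v1/Labs v2 APIs referred to above worked differently and are retired.
      $url = 'https://translation.googleapis.com/language/translate/v2/detect'
           . '?key=' . urlencode($apiKey)
           . '&q=' . urlencode($text);
      $response = file_get_contents($url);
      if ($response === false) {
        return null; // network error, quota exceeded, invalid key, ...
      }
      $data = json_decode($response, true);
      // Expected shape: {"data":{"detections":[[{"language":"de", ...}]]}}
      return isset($data['data']['detections'][0][0]['language'])
          ? $data['data']['detections'][0][0]['language']
          : null;
    }

    // echo detectLanguageViaGoogle('Guten Morgen, wie geht es dir?', 'YOUR_API_KEY');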

冷夜・残月
#3 · 2019-01-01 01:45

I know this is an old post, but here is what I developed after not finding any viable solution.

  • Other suggestions are all too heavy and cumbersome for my situation.
  • I support a finite number of languages on my website (at the moment two: 'en' and 'de' - but the solution generalizes to more).
  • I need a plausible guess about the language of a user-generated string, and I have a fallback (the language setting of the user).
  • So I want a solution with minimal false positives - but I don't care so much about false negatives.

The solution counts occurrences of the 20 most common words of each language in the haystack, then compares the counts of the two highest-scoring languages. If the runner-up's count is less than 10% of the winner's, the winner takes it all; otherwise the default language is returned.

Code - Any suggestions for speed improvement are more than welcome!

    function getTextLanguage($text, $default) {
      $supported_languages = array(
          'en',
          'de',
      );
      // German word list
      // from http://wortschatz.uni-leipzig.de/Papers/top100de.txt
      $wordList['de'] = array ('der', 'die', 'und', 'in', 'den', 'von', 
          'zu', 'das', 'mit', 'sich', 'des', 'auf', 'für', 'ist', 'im', 
          'dem', 'nicht', 'ein', 'Die', 'eine');
      // English word list
      // from http://en.wikipedia.org/wiki/Most_common_words_in_English
      $wordList['en'] = array ('the', 'be', 'to', 'of', 'and', 'a', 'in', 
          'that', 'have', 'I', 'it', 'for', 'not', 'on', 'with', 'he', 
          'as', 'you', 'do', 'at');
      // clean out the input string - keep the German umlauts so words
      // like 'für' survive; extend the character class if your own
      // language word lists contain other non-ASCII characters
      $text = preg_replace('/[^A-Za-zäöüÄÖÜß]/u', ' ', $text);
      // pad with spaces so the first and last words are matched too
      $text = ' ' . $text . ' ';
      // count the occurrences of the most frequent words
      foreach ($supported_languages as $language) {
        $counter[$language] = 0;
      }
      for ($i = 0; $i < 20; $i++) {
        foreach ($supported_languages as $language) {
          $counter[$language] = $counter[$language] +
            // I believe this is way faster than fancy RegEx solutions
            substr_count($text, ' ' . $wordList[$language][$i] . ' ');
        }
      }
      // get max counter value
      // from http://stackoverflow.com/a/1461363
      $max = max($counter);
      $maxs = array_keys($counter, $max);
      // if there is a tie (or no hits at all) - fall back to default!
      if (count($maxs) == 1 && $max > 0) {
        $winner = $maxs[0];
        $second = 0;
        // get runner-up (second place)
        foreach ($supported_languages as $language) {
          if ($language != $winner) {
            if ($counter[$language] > $second) {
              $second = $counter[$language];
            }
          }
        }
        // apply arbitrary threshold of 10%
        if (($second / $max) < 0.1) {
          return $winner;
        }
      }
      return $default;
    }
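
A quick usage sketch (the sample sentences and the 'en' fallback are just illustrative):

    // German words from the list dominate -> 'de'
    echo getTextLanguage('Das ist ein kurzer Satz für den Test', 'en');
    // English words from the list dominate -> 'en'
    echo getTextLanguage('This is a short sentence for the test', 'de');
    // no hits at all -> falls back to the default, 'en'
    echo getTextLanguage('12345 !!!', 'en');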
刘海飞了
#4 · 2019-01-01 01:47

I have had good results with https://github.com/patrickschur/language-detection and am using it in production:

  • It uses n-grams to detect the most likely language (the longer your string / the more words it contains, the more accurate it will be), which is a solid, proven method.
  • 110 languages are supported, but you can also limit detection to only the languages you are interested in.
  • The Trainer and the Language detector can easily be improved / customized. The library uses the Universal Declaration of Human Rights in each language as the basis for detection, but if you know what kind of text you will be analyzing, you can easily extend or replace the sample texts per language and get better results fast. "Training" this library to become better is easy.
  • I would suggest increasing setMaxNgrams (I set it to 9000) in the Trainer, running it once, and then using the same setting in the Language detector class. Changing the n-gram count is a bit unintuitive (I had to read through the code to find out how it works), which is a drawback, and the default (310) is too low in my opinion. More n-grams make the guessing a lot better.
  • Because the library is very small, it was relatively easy to understand what is happening and how to tweak it.

My usage: I am analyzing emails for a CRM system to determine what language an email was written in, so sending the text to a third-party service was not an option. Even though the Universal Declaration of Human Rights is probably not the best basis for categorizing the language of emails (emails often contain formulaic parts such as greetings, which do not appear in the Declaration), it identifies the correct language in about 99% of cases, as long as the text contains at least 5 words.
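
A minimal detection sketch with this library (installed via Composer as patrickschur/language-detection; limiting it to 'en' and 'de' and the sample string are just illustrative, the class and method names are the ones from the library's README):

    require 'vendor/autoload.php';

    use LanguageDetection\Language;

    // restrict the detector to the languages we actually expect
    $ld = new Language(['en', 'de']);

    // bestResults() keeps only the top-scoring candidates,
    // close() turns the result object into a plain array of code => score
    $scores = $ld->detect('Vielen Dank für Ihre Nachricht')->bestResults()->close();

    $language = key($scores); // e.g. "de"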

Update: I managed to improve language recognition in emails to basically 100% when using the language-detection library with the following methods:

  • Add additional common phrases to the (relevant) language samples, such as "Greetings", "Best regards", "Sincerely". These kinds of expressions are not used in the Universal Declaration of Human Rights, but commonly used phrases help language recognition a lot, especially the formulaic ones people use all the time ("Hello", "Have a nice day") if you are analyzing human communication.
  • Set the maximum ngram length to 4 (instead of the default 3).
  • Keep the maxNgrams at 9000 as before.

These changes do make the library a bit slower, so I would suggest running detection asynchronously if possible and measuring the performance. In my case it is more than fast enough and much more accurate.
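
A sketch of what such a retraining run might look like (the setter names follow the library's Trainer API as far as I understand it, and the 9000 / 4 values are simply the ones mentioned above):

    require 'vendor/autoload.php';

    use LanguageDetection\Trainer;

    // regenerate the n-gram files after extending the per-language sample texts
    // (e.g. after appending common e-mail phrases like "Best regards")
    $t = new Trainer();
    $t->setMaxNgrams(9000); // keep far more n-grams than the default 310
    $t->setMaxLength(4);    // build n-grams up to length 4 instead of 3
    $t->learn();

    // remember to apply setMaxNgrams(9000) on the Language detector as well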

几人难应
#5 · 2019-01-01 01:51

I've used the Text_LanguageDetect PEAR package with reasonable results. It's dead simple to use, and it ships with a modest 52-language database. The downside is that it cannot detect East Asian languages.

    require_once 'Text/LanguageDetect.php';
    $l = new Text_LanguageDetect();
    // second argument: return at most the 4 best-matching languages
    $result = $l->detect($text, 4);
    if (PEAR::isError($result)) {
        echo $result->getMessage();
    } else {
        print_r($result);
    }

results in:

Array
(
    [german] => 0.407037037037
    [dutch] => 0.288065843621
    [english] => 0.283333333333
    [danish] => 0.234526748971
)
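
If you only need the single best guess, the package also has a detectSimple() method that returns just the top language name (a minimal sketch):

    require_once 'Text/LanguageDetect.php';
    $l = new Text_LanguageDetect();
    // detectSimple() skips the scores and returns only the most likely language
    echo $l->detectSimple($text); // e.g. "german"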