Detect language from string in PHP

Published 2019-01-01 01:05

In PHP, is there a way to detect the language of a string? Suppose the string is in UTF-8 format.

16 answers
君临天下
#2 · 2019-01-01 01:40

You could run Apache Tika's language-detection module from Java, write the results to a text file or a database, and then read them from PHP. If you don't have that much content, you could use Google's API instead, although keep in mind that calls are rate-limited and you can only send a restricted number of characters per request. At the time of writing I had tested version 1 of the API (which turned out to be not very accurate) and the Labs version 2 (which I ditched after reading that there is a 100,000-character cap per day).
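
For the Google route, a minimal sketch of a server-side call (this assumes the current Cloud Translation v2 "detect" endpoint and response format rather than the old v1/Labs APIs mentioned above; detectLanguageViaGoogle and $apiKey are placeholder names):

    function detectLanguageViaGoogle($text, $apiKey) {
      // Assumption: Cloud Translation API v2 detect endpoint; the older
      // v1/Labs v2 APIs referred to above worked differently and are retired.
      $url = 'https://translation.googleapis.com/language/translate/v2/detect'
           . '?key=' . urlencode($apiKey)
           . '&q=' . urlencode($text);
      $response = file_get_contents($url);
      if ($response === false) {
        return null; // network error, quota exceeded, invalid key, ...
      }
      $data = json_decode($response, true);
      // Expected shape: {"data":{"detections":[[{"language":"de", ...}]]}}
      return isset($data['data']['detections'][0][0]['language'])
          ? $data['data']['detections'][0][0]['language']
          : null;
    }

    // echo detectLanguageViaGoogle('Guten Morgen, wie geht es dir?', 'YOUR_API_KEY');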

冷夜・残月
#3 · 2019-01-01 01:45

I know this is an old post, but here is what I developed after not finding any viable solution.

  • Other suggestions are all too heavy and cumbersome for my situation.
  • I support a finite number of languages on my website (at the moment two: 'en' and 'de' - but the solution generalizes to more).
  • I need a plausible guess about the language of a user-generated string, and I have a fallback (the language setting of the user).
  • So I want a solution with minimal false positives - but I don't care so much about false negatives.

The solution counts occurrences of the 20 most common words of each language in the haystack, then compares the counts of the two highest-scoring languages. If the runner-up's count is less than 10% of the winner's, the winner takes it all; otherwise the default language is returned.

Code - Any suggestions for speed improvement are more than welcome!

    function getTextLanguage($text, $default) {
      $supported_languages = array(
          'en',
          'de',
      );
      // German word list
      // from http://wortschatz.uni-leipzig.de/Papers/top100de.txt
      $wordList['de'] = array ('der', 'die', 'und', 'in', 'den', 'von', 
          'zu', 'das', 'mit', 'sich', 'des', 'auf', 'für', 'ist', 'im', 
          'dem', 'nicht', 'ein', 'Die', 'eine');
      // English word list
      // from http://en.wikipedia.org/wiki/Most_common_words_in_English
      $wordList['en'] = array ('the', 'be', 'to', 'of', 'and', 'a', 'in', 
          'that', 'have', 'I', 'it', 'for', 'not', 'on', 'with', 'he', 
          'as', 'you', 'do', 'at');
      // clean out the input string - keep the German umlauts so words
      // like 'für' survive; extend the character class if your own
      // language word lists contain other non-ASCII characters
      $text = preg_replace('/[^A-Za-zäöüÄÖÜß]/u', ' ', $text);
      // pad with spaces so the first and last words are matched too
      $text = ' ' . $text . ' ';
      // count the occurrences of the most frequent words
      foreach ($supported_languages as $language) {
        $counter[$language] = 0;
      }
      for ($i = 0; $i < 20; $i++) {
        foreach ($supported_languages as $language) {
          $counter[$language] = $counter[$language] +
            // I believe this is way faster than fancy RegEx solutions
            substr_count($text, ' ' . $wordList[$language][$i] . ' ');
        }
      }
      // get max counter value
      // from http://stackoverflow.com/a/1461363
      $max = max($counter);
      $maxs = array_keys($counter, $max);
      // if there is a tie (or no hits at all) - fall back to default!
      if (count($maxs) == 1 && $max > 0) {
        $winner = $maxs[0];
        $second = 0;
        // get runner-up (second place)
        foreach ($supported_languages as $language) {
          if ($language != $winner) {
            if ($counter[$language] > $second) {
              $second = $counter[$language];
            }
          }
        }
        // apply arbitrary threshold of 10%
        if (($second / $max) < 0.1) {
          return $winner;
        }
      }
      return $default;
    }
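
A quick usage sketch (the sample sentences and the 'en' fallback are just illustrative):

    // German words from the list dominate -> 'de'
    echo getTextLanguage('Das ist ein kurzer Satz für den Test', 'en');
    // English words from the list dominate -> 'en'
    echo getTextLanguage('This is a short sentence for the test', 'de');
    // no hits at all -> falls back to the default, 'en'
    echo getTextLanguage('12345 !!!', 'en');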
刘海飞了
#4 · 2019-01-01 01:47

I have had good results with https://github.com/patrickschur/language-detection and am using it in production:

  • It uses n-grams to detect the most likely language (the longer your string / the more words it contains, the more accurate it will be), which is a solid, proven method.
  • 110 languages are supported, but you can also limit detection to only the languages you are interested in.
  • The Trainer and the Language detector can easily be improved / customized. The library uses the Universal Declaration of Human Rights in each language as the basis for detection, but if you know what kind of text you will be analyzing, you can easily extend or replace the sample texts per language and get better results fast. "Training" this library to become better is easy.
  • I would suggest increasing setMaxNgrams (I set it to 9000) in the Trainer, running it once, and then using the same setting in the Language detector class. Changing the n-gram count is a bit unintuitive (I had to read through the code to find out how it works), which is a drawback, and the default (310) is too low in my opinion. More n-grams make the guessing a lot better.
  • Because the library is very small, it was relatively easy to understand what is happening and how to tweak it.

My usage: I am analyzing emails for a CRM system to determine what language an email was written in, so sending the text to a third-party service was not an option. Even though the Universal Declaration of Human Rights is probably not the best basis for categorizing the language of emails (emails often contain formulaic parts such as greetings, which do not appear in the Declaration), it identifies the correct language in about 99% of cases, as long as the text contains at least 5 words.
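
A minimal detection sketch with this library (installed via Composer as patrickschur/language-detection; limiting it to 'en' and 'de' and the sample string are just illustrative, the class and method names are the ones from the library's README):

    require 'vendor/autoload.php';

    use LanguageDetection\Language;

    // restrict the detector to the languages we actually expect
    $ld = new Language(['en', 'de']);

    // bestResults() keeps only the top-scoring candidates,
    // close() turns the result object into a plain array of code => score
    $scores = $ld->detect('Vielen Dank für Ihre Nachricht')->bestResults()->close();

    $language = key($scores); // e.g. "de"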

Update: I managed to improve language recognition in emails to basically 100% when using the language-detection library with the following methods:

  • Add additional common phrases to the (relevant) language samples, such as "Greetings", "Best regards", "Sincerely". These kinds of expressions are not used in the Universal Declaration of Human Rights, but commonly used phrases help language recognition a lot, especially the formulaic ones people use all the time ("Hello", "Have a nice day") if you are analyzing human communication.
  • Set the maximum ngram length to 4 (instead of the default 3).
  • Keep the maxNgrams at 9000 as before.

These changes do make the library a bit slower, so I would suggest running detection asynchronously if possible and measuring the performance. In my case it is more than fast enough and much more accurate.
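
A sketch of what such a retraining run might look like (the setter names follow the library's Trainer API as far as I understand it, and the 9000 / 4 values are simply the ones mentioned above):

    require 'vendor/autoload.php';

    use LanguageDetection\Trainer;

    // regenerate the n-gram files after extending the per-language sample texts
    // (e.g. after appending common e-mail phrases like "Best regards")
    $t = new Trainer();
    $t->setMaxNgrams(9000); // keep far more n-grams than the default 310
    $t->setMaxLength(4);    // build n-grams up to length 4 instead of 3
    $t->learn();

    // remember to apply setMaxNgrams(9000) on the Language detector as well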

几人难应
#5 · 2019-01-01 01:51

I've used the Text_LanguageDetect PEAR package with reasonable results. It's dead simple to use, and it ships with a modest 52-language database. The downside is that it cannot detect East Asian languages.

    require_once 'Text/LanguageDetect.php';
    $l = new Text_LanguageDetect();
    // second argument: return at most the 4 best-matching languages
    $result = $l->detect($text, 4);
    if (PEAR::isError($result)) {
        echo $result->getMessage();
    } else {
        print_r($result);
    }

results in:

Array
(
    [german] => 0.407037037037
    [dutch] => 0.288065843621
    [english] => 0.283333333333
    [danish] => 0.234526748971
)
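
If you only need the single best guess, the package also has a detectSimple() method that returns just the top language name (a minimal sketch):

    require_once 'Text/LanguageDetect.php';
    $l = new Text_LanguageDetect();
    // detectSimple() skips the scores and returns only the most likely language
    echo $l->detectSimple($text); // e.g. "german"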