In PHP, is there a way to detect the language of a string? Suppose the string is in UTF-8 format.
You could run Apache Tika's language detection from Java, write the results to a text file, a database, etc., and then read them back with PHP. If you don't have that much content, you could use Google's API instead, but keep in mind that your calls will be limited and that you can only send a restricted number of characters to the API. At the time of writing I had finished testing version 1 of the API (which turned out to be not very accurate) and the Labs version 2 (which I ditched after reading that there is a cap of 100,000 characters per day).
I know this is an old post, but here is what I developed after not finding any viable solution.
The solution takes the 20 most common words of each language and counts how often they occur in the haystack. It then compares the counts of the two highest-scoring languages: if the runner-up's count is less than 10% of the winner's, the winner takes it all.
Code - Any suggestions for speed improvement are more than welcome!
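The approach described above can be sketched roughly as follows. This is a minimal illustration rather than the answerer's implementation: the function name `guessLanguage`, the `$maxRunnerUpRatio` parameter, and the shortened word lists (10 words per language instead of 20) are all assumptions made for the example, and it relies on the `mbstring` extension for UTF-8 lowercasing.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: guess the language of a UTF-8 string by counting
 * how often each language's most common words occur in it.
 *
 * @param array<string, string[]> $commonWords language code => lowercase common words
 * @return string|null winning language code, or null when there is no clear winner
 */
function guessLanguage(string $text, array $commonWords, float $maxRunnerUpRatio = 0.10): ?string
{
    // Split on anything that is not a letter; the /u flag makes the pattern UTF-8 aware.
    $tokens = preg_split('/[^\p{L}]+/u', mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
    if ($tokens === false || $tokens === []) {
        return null; // invalid UTF-8 or no words at all
    }

    // Count every token once so each lookup below is a cheap array access.
    $frequency = array_count_values($tokens);

    $scores = [];
    foreach ($commonWords as $language => $words) {
        $scores[$language] = 0;
        foreach ($words as $word) {
            $scores[$language] += $frequency[$word] ?? 0;
        }
    }

    arsort($scores); // highest count first, keys preserved
    $languages = array_keys($scores);
    $counts = array_values($scores);
    $winner = $counts[0] ?? 0;
    $runnerUp = $counts[1] ?? 0;

    if ($winner === 0) {
        return null; // none of the common words occurred
    }

    // The winner takes it all only if the runner-up stays below 10% of the winner.
    return $runnerUp < $winner * $maxRunnerUpRatio ? $languages[0] : null;
}

// Illustrative, shortened word lists (assumed for this example).
$commonWords = [
    'en' => ['the', 'of', 'and', 'to', 'that', 'is', 'was', 'for', 'with', 'it'],
    'de' => ['der', 'die', 'und', 'das', 'nicht', 'ist', 'mit', 'sich', 'auf', 'für'],
];

echo guessLanguage('The weather was nice and it is time to go for a walk with the dog', $commonWords), PHP_EOL; // en
```

Counting all tokens once with `array_count_values` avoids rescanning the haystack for every word of every language, which is one possible answer to the speed question. Note that the 10% rule is strict: a single shared word between two lists can be enough to make a short text come back as ambiguous (`null`).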
I have had good results with https://github.com/patrickschur/language-detection and am using it in production:
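A minimal usage sketch, assuming the package was installed with Composer (`composer require patrickschur/language-detection`) and that `vendor/autoload.php` sits next to the script. The sample sentence and the three-language whitelist are illustrative choices, not the setup used in production here.

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // assumed Composer autoloader location

use LanguageDetection\Language;

// Restricting the candidate languages (an illustrative choice) keeps detection faster.
$detector = new Language(['de', 'en', 'nl']);

// detect() returns a result object; close() converts it into an array of
// language code => score.
$scores = $detector->detect('Mag het een onsje meer zijn?')->close();

// The first key is taken as the best guess, assuming scores come back
// ordered from most to least likely.
$best = array_key_first($scores);
echo $best, PHP_EOL;
```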
My usage: I am analyzing emails for a CRM system to determine what language each email was written in, so sending the text to a third-party service was not an option. The Universal Declaration of Human Rights, which the library is trained on, is probably not the best basis for categorizing the language of emails (emails often have formulaic parts such as greetings, which do not appear in the Declaration). Even so, it identifies the correct language in about 99% of cases, provided the text contains at least 5 words.
Update: I managed to improve language recognition in emails to basically 100% when using the language-detection library with the following methods:
These do make the library a bit slower, so I would suggest running them asynchronously if possible and measuring the performance. In my case it is more than fast enough and much more accurate.
I've used the Text_LanguageDetect PEAR package with reasonable results. It's dead simple to use, and it has a modest database of 52 languages. The downside is that it does not detect East Asian languages.
results in: