I'm creating an application for detecting the language of short texts, with an average of < 100 characters and contains slang (e.g tweets, user queries, sms).
All the libraries I tested work well for normal web pages but not for very short text. The library that's giving the best results so far is Chrome's Language Detection (CLD) library which I had to build as a shared library.
CLD fails when the text is made of very short words. After looking at the source code of CLD, I see that it uses 4-grams so that could be the reason.
The approach I'm thinking of right now to improve the accuracy is:
- Remove brand names, numbers, urls and words like "software", "download", "internet"
- Use a dictionary When the text contains a number of short words above a threashold or when it contains too few words.
- The dictionary is created from wikipedia news articles + hunspell dictionaries.
What dataset is most suitable for this task? And how can I improve this approach?
So far I'm using EUROPARL and Wikipedia articles. I'm using NLTK for most of the work.
Also omit scientific names or names of medicines etc. Your approach seems quite fine to me. I think wikipedia is the best option for creating a dictionary as it contains standard language. If you are not running out of time, you can also use newspapers.
Language detection for very short texts is the topic of current research, so no conclusive answer can be given. An algorithm for Twitter data can be found in Carter, Tsagkias and Weerkamp 2011. See also the references there.
Yes, this is a topic of research and there is some progress that has been made.
For example, the author of "language-detection" at http://code.google.com/p/language-detection/ has created new profiles for short messages. Currently, it supports 17 languages.
I have compared it with Bing Language Detector on a collection of about 500 tweets which are mostly in English and Spanish. The accuracy is as follows:
For more information, you can check his blog out: http://shuyo.wordpress.com/2011/11/28/language-detection-supported-17-language-profiles-for-short-messages/