Language detection for very short text [closed]

I'm creating an application that detects the language of short texts: on average under 100 characters, and often containing slang (e.g. tweets, user queries, SMS).

All the libraries I tested work well for normal web pages but not for very short text. The library giving the best results so far is Chrome's Compact Language Detector (CLD), which I had to build as a shared library.

CLD fails when the text is made of very short words. After looking at the source code of CLD, I see that it uses 4-grams so that could be the reason.
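To illustrate why 4-grams might struggle here, below is a minimal sketch that counts character 4-grams per word. It is not CLD's actual feature extraction; the `char_ngrams` helper and the `_` boundary padding are assumptions made only for this illustration.

```python
def char_ngrams(text, n=4):
    """Return character n-grams taken inside each whitespace-separated word.

    Each word is padded with "_" on both sides to mark word boundaries
    (an assumption for this sketch, not CLD's real scheme).
    """
    grams = []
    for token in text.lower().split():
        padded = f"_{token}_"
        # A padded word shorter than n characters yields no n-grams at all.
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


if __name__ == "__main__":
    for sample in ("u r so gr8", "language detection"):
        print(repr(sample), len(char_ngrams(sample)))
```

With this padding, one-letter words produce no 4-grams and two-letter words produce exactly one, so a message made of very short words gives the classifier very little evidence.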

The approach I'm thinking of right now to improve the accuracy is:

Remove brand names, numbers, URLs and words like "software", "download", "internet".
Use a dictionary when the number of short words in the text exceeds a threshold, or when the text contains too few words.
The dictionary is created from Wikipedia news articles + hunspell dictionaries.
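The steps above could be sketched roughly as follows. Everything here is a hypothetical illustration: the function names, the thresholds, the noise-word list and the tiny word lists are assumptions, and real dictionaries would be built from Wikipedia and hunspell as described.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")

# Hypothetical noise list; brand names would be added here as well.
NOISE_WORDS = {"software", "download", "internet"}

# Toy word lists standing in for the Wikipedia + hunspell dictionaries.
DICTIONARIES = {
    "en": {"the", "is", "you", "are", "where", "my"},
    "es": {"el", "es", "tu", "donde", "mi", "esta"},
}


def clean(text):
    """Lowercase, strip URLs, drop digits and noise words; return a word list."""
    text = URL_RE.sub(" ", text.lower())
    # [^\W\d_]+ matches runs of letters only, so numbers are discarded here.
    words = re.findall(r"[^\W\d_]+", text)
    return [w for w in words if w not in NOISE_WORDS]


def needs_dictionary(words, max_short_ratio=0.5, min_words=4, short_len=3):
    """True when there are too few words or too many short ones (assumed thresholds)."""
    if len(words) < min_words:
        return True
    short = sum(1 for w in words if len(w) <= short_len)
    return short / len(words) > max_short_ratio


def dictionary_guess(words):
    """Return the language whose word list matches the most words, or None."""
    scores = {lang: sum(w in vocab for w in words)
              for lang, vocab in DICTIONARIES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    words = clean("donde esta mi download http://example.com 123")
    if needs_dictionary(words):
        print(words, dictionary_guess(words))
```

When `needs_dictionary` returns False, the text would go to the n-gram detector instead; the dictionary vote is only meant as a fallback for the cases where n-grams have too little to work with.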

What dataset is most suitable for this task? And how can I improve this approach?

So far I'm using EUROPARL and Wikipedia articles. I'm using NLTK for most of the work.

Tags: nlp nltk language-detection

3 Answers

够拽才男人

#2 · 2020-05-16 09:48

Also omit scientific names, names of medicines, and the like. Your approach seems fine to me. I think Wikipedia is the best option for creating a dictionary, as it contains standard language. If you have time, you can also use newspapers.

Melony?

#3 · 2020-05-16 09:57

Language detection for very short texts is a topic of current research, so no conclusive answer can be given. An algorithm for Twitter data can be found in Carter, Tsagkias and Weerkamp 2011. See also the references there.

狗以群分

#4 · 2020-05-16 09:57

Yes, this is an active research topic, and some progress has been made.

For example, the author of "language-detection" at http://code.google.com/p/language-detection/ has created new profiles for short messages. Currently, it supports 17 languages.

I have compared it with Bing Language Detector on a collection of about 500 tweets which are mostly in English and Spanish. The accuracy is as follows:

   Bing = 71.97%
   Language-Detection Tool with new profiles = 89.75%

For more information, see his blog post: http://shuyo.wordpress.com/2011/11/28/language-detection-supported-17-language-profiles-for-short-messages/
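A comparison like the one above boils down to running each detector over the same labeled tweets and counting matches. Here is a minimal sketch; the `accuracy` helper, the toy detector and the four sample messages are all hypothetical and are not the data or tools used for the figures above.

```python
def accuracy(detect, labeled):
    """Percentage of (text, language) pairs for which detect(text) is correct."""
    correct = sum(1 for text, lang in labeled if detect(text) == lang)
    return 100.0 * correct / len(labeled)


def toy_detect(text):
    """Stand-in detector: a real run would call the library under test here."""
    return "es" if "hola" in text else "en"


# Hypothetical hand-labeled sample, not the ~500 tweets mentioned above.
SAMPLE = [
    ("hola amigo", "es"),
    ("hello friend", "en"),
    ("buenos dias", "es"),
    ("good morning", "en"),
]

if __name__ == "__main__":
    print(f"{accuracy(toy_detect, SAMPLE):.2f}%")
```

Swapping `toy_detect` for a wrapper around each real detector gives numbers that are directly comparable, since both see exactly the same messages.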
