Detect language of text [duplicate]

This question already has an answer here:

How to detect the language of a string? 8 answers

Is there any C# library which can detect the language of a particular piece of text? For example, for the input text "This is a sentence", it should detect the language as "English", and for "Esto es una sentencia" it should detect the language as "Spanish".

I understand that language detection from text is not a deterministic problem. But both Google Translate and Bing Translator have an "Auto detect" option, which best-guesses the input language. Is there something similar available publicly, preferably in C#?

Tags: c# language-detection

7 answers

The star\"

#2 · 2019-01-18 03:09

Language detection is a pretty hard thing to do.

Some languages are much easier to detect than others simply due to the diacritics and digraphs/trigraphs used. For example, double-acute accents are used almost exclusively in Hungarian. The dotless i ‘ı’ is used mainly in Turkish (and in a few related languages such as Azerbaijani), t-comma (not t-cedilla) is used only in Romanian, and the eszett ‘ß’ occurs only in German.

Some digraphs, trigraphs and tetragraphs are also a good giveaway. For example, you'll find ‘eeuw’ and ‘ieuw’ primarily in Dutch, and ‘tsch’ and ‘dsch’ primarily in German.

More giveaways include common words or common prefixes/suffixes used in a particular language. Sometimes even the punctuation can help determine a language (quote style and usage, etc.).

If such a library exists I would like to know about it, since I'm working on one myself.
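A minimal sketch of the hint-based idea above, written in Python for brevity (it should port to C# without much trouble). The two tables and the function name `guess_by_hints` are illustrative assumptions built only from the examples mentioned in this answer; a usable detector would need many more hints plus a statistical fallback.

```python
from collections import Counter

# Hypothetical hint tables taken from the examples in the answer above.
CHAR_HINTS = {
    "ő": "Hungarian", "ű": "Hungarian",  # double-acute accents
    "\u0131": "Turkish",                 # dotless i (also used in some related languages)
    "\u021b": "Romanian",                # t-comma, not t-cedilla (U+0163)
    "ß": "German",                       # eszett
}
NGRAM_HINTS = {
    "eeuw": "Dutch", "ieuw": "Dutch",
    "tsch": "German", "dsch": "German",
}

def guess_by_hints(text):
    """Return the language with the most hint matches, or None if nothing matched."""
    lowered = text.lower()
    votes = Counter()
    for hint, lang in {**CHAR_HINTS, **NGRAM_HINTS}.items():
        votes[lang] += lowered.count(hint)
    best, hits = max(votes.items(), key=lambda kv: kv[1])
    return best if hits > 0 else None

print(guess_by_hints("Het nieuws van deze eeuw"))
print(guess_by_hints("This is a sentence"))
```

Returning `None` when no hint fires is deliberate: most English or Spanish text contains none of these markers, which is why hints like these can only supplement a frequency-based method.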


你好瞎i

#3 · 2019-01-18 03:12

You'll want a machine learning algorithm based on hidden Markov models, trained on a bunch of texts in different languages.

Then, when it gets an unidentified text, the language whose model gives the closest 'score' is the winner.
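As a rough illustration of the train-then-score idea, here is a sketch in Python that uses a plain character-level Markov chain rather than a true hidden Markov model, which is a simpler substitute and not necessarily what this answer had in mind. The two training sentences, the function names, and the smoothing constant `alphabet_size=256` are all assumptions chosen for the example; real models need far more training text.

```python
import math
from collections import Counter, defaultdict

# Toy training data (an assumption for this sketch, far too small for real use).
SAMPLES = {
    "English": "this is a sentence and the language of the text is english",
    "Spanish": "esto es una sentencia y el idioma del texto es el español",
}

def train(text):
    """Count how often each character is followed by each other character."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def log_likelihood(model, text, alphabet_size=256):
    """Log-probability of the text's transitions, with add-one smoothing."""
    total = 0.0
    for a, b in zip(text, text[1:]):
        row = model.get(a, {})
        total += math.log((row.get(b, 0) + 1) / (sum(row.values()) + alphabet_size))
    return total

MODELS = {lang: train(text) for lang, text in SAMPLES.items()}

def classify(text):
    """Return the language whose model assigns the text the highest score."""
    return max(MODELS, key=lambda lang: log_likelihood(MODELS[lang], text.lower()))

print(classify("The text"))
print(classify("El idioma"))
```

The smoothing keeps unseen transitions from producing a log of zero; without it, a single unfamiliar character pair would rule a language out entirely.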


甜甜的少女心

#4 · 2019-01-18 03:16

Here you have a simple detector based on bigram statistics (basically, it learns from a large corpus which bigrams occur most frequently in each language, then counts those in a piece of text and compares the counts against the previously learned values):

http://allantech.blogspot.com/2007/07/automatic-language-detection.html

This is probably good enough for many (most?) applications and doesn't require Internet access.

Of course it will perform worse than Google's or Bing's algorithm (which themselves aren't great). If you need excellent detection performance, you would have to do a lot of hard work over huge amounts of data.

The other option would be to leverage Google's or Bing's APIs if your app has Internet access.
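To make the bigram approach concrete, here is a small offline sketch in Python. It is not the code from the linked post; the sample sentences and the names `bigram_profile` and `detect` are assumptions for illustration, and the "learning" step is reduced to counting bigrams in one sentence per language.

```python
from collections import Counter

# Toy corpora (an assumption for this sketch); real profiles need far more text.
SAMPLES = {
    "English": "the quick brown fox jumps over the lazy dog and this is a sentence",
    "Spanish": "el rápido zorro marrón salta sobre el perro perezoso y esto es una sentencia",
}

def bigram_profile(text):
    """Relative frequency of every two-character sequence in a training text."""
    text = text.lower()
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}

PROFILES = {lang: bigram_profile(text) for lang, text in SAMPLES.items()}

def detect(text):
    """Pick the language whose frequent bigrams occur most often in the text."""
    text = text.lower()
    grams = [text[i:i + 2] for i in range(len(text) - 1)]
    scores = {
        lang: sum(profile.get(gram, 0.0) for gram in grams)
        for lang, profile in PROFILES.items()
    }
    return max(scores, key=scores.get)

print(detect("This is a sentence"))
print(detect("Esto es una sentencia"))
```

Each bigram in the input simply adds its learned frequency to a language's score, so very short inputs give weak evidence and ties become more likely.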


你好瞎i

#5 · 2019-01-18 03:19

I've found that "textcat" is very useful for this. I've used a PHP implementation, PHP Text Cat, based on this original implementation, and found it reliable. If you have a look at the sources, you'll find it's not a terrifyingly difficult thing to implement in the language of your choice. The hard work (the letter combinations that are relevant to a particular language) is all in there as data.
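TextCat is usually described as implementing the Cavnar and Trenkle rank-order n-gram method: each language profile is a list of its most frequent n-grams in rank order, and a text is assigned to the language whose ranking differs least from its own. A simplified Python sketch of that idea follows. The toy sample sentences, the limit of n = 1..3, and all function names are assumptions for the example, not TextCat's actual data or API.

```python
from collections import Counter

# Toy corpora (an assumption); TextCat ships real per-language profiles as data.
SAMPLES = {
    "English": "the quick brown fox jumps over the lazy dog and this is a sentence",
    "Spanish": "el rápido zorro marrón salta sobre el perro perezoso y esto es una sentencia",
}

def ranked_ngrams(text, max_n=3, top=300):
    """Map each of the `top` most frequent n-grams (n = 1..max_n) to its rank."""
    text = " ".join(text.lower().split())
    counts = Counter(
        text[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(text) - n + 1)
    )
    # Sort by descending count, then alphabetically so ties rank deterministically.
    ordered = sorted(counts, key=lambda g: (-counts[g], g))[:top]
    return {gram: rank for rank, gram in enumerate(ordered)}

def out_of_place(doc, lang, penalty=300):
    """Sum of rank differences; n-grams missing from the language cost `penalty`."""
    return sum(
        abs(rank - lang[gram]) if gram in lang else penalty
        for gram, rank in doc.items()
    )

PROFILES = {lang: ranked_ngrams(text) for lang, text in SAMPLES.items()}

def classify(text):
    """Return the language whose profile is closest to the text's profile."""
    doc = ranked_ngrams(text)
    return min(PROFILES, key=lambda lang: out_of_place(doc, PROFILES[lang]))

print(classify("This is a sentence"))
print(classify("Esto es una sentencia"))
```

Because only ranks are compared, the stored profile for a language is just an ordered list of strings, which fits the point above that the hard work lives in the data rather than the code.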


男人必须洒脱

#6 · 2019-01-18 03:20

Yes indeed, TextCat is very good for language identification, and it has many implementations in different programming languages.

There were no ports for .NET, so I have written one: NTextCat (NuGet, Online Demo).

It is a pure .NET Framework DLL plus a command-line interface to it. By default, it uses a profile of 14 languages.

Any feedback is much appreciated! New ideas and feature requests are welcome too :)


太酷不给撩

#7 · 2019-01-18 03:26

Please find a C# implementation based on 3-gram analysis here:

http://idsyst.hu/development/language_detector.html

