What's the best way to detect the language of a string?
Do a statistical analysis of the string: split it into words, get a dictionary (word list) for every language you want to test for, and then pick the language whose dictionary matches the most words.
In C#, every string in memory is Unicode (UTF-16), whatever encoding it was read from, so the string itself carries no hint about its source. Text files do not store their encoding either (sometimes there is only an indication of 8-bit or 16-bit).
If you only need to distinguish between two languages, you might find some simple tricks. For example, to tell English from Dutch: a string that contains the letter "y" is most likely English (unreliable, but fast).
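A minimal sketch of the word-count idea might look like the following. The class name `WordListDetector` and the two tiny word lists are made up for illustration; a usable detector would need far larger dictionaries loaded from files.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordListDetector
{
    // Tiny illustrative word lists (assumption: real dictionaries would be much larger).
    private static readonly Dictionary<string, HashSet<string>> Dictionaries =
        new Dictionary<string, HashSet<string>>
        {
            ["English"] = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
                { "the", "and", "is", "of", "to", "in", "it", "that" },
            ["Dutch"] = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
                { "de", "het", "een", "en", "is", "van", "dat", "niet" },
        };

    // Returns the language whose word list matches the most words,
    // or null when no word matches any list.
    public static string Detect(string text)
    {
        // Replace every non-letter with a space, then split into words.
        string[] words = new string(text.Select(c => char.IsLetter(c) ? c : ' ').ToArray())
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        string best = null;
        int bestCount = 0;
        foreach (var entry in Dictionaries)
        {
            int count = words.Count(w => entry.Value.Contains(w));
            if (count > bestCount)
            {
                best = entry.Key;
                bestCount = count;
            }
        }
        return best;
    }
}
```

Calling `WordListDetector.Detect("Het is een boek van de man")` should pick Dutch, since five of the seven words are in the Dutch list and only one is in the English list. Ties are not handled specially here.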
You may use the C# package for language identification from Microsoft Research:
Download the package from the above link.
If you mean the natural (i.e. human) language, this is in general a Hard Problem. What language is "server": English or Turkish? What language is "chat": English or French? What language is "uno": Italian or Spanish (or Latin!)?
Without paying attention to context and doing some hard natural language processing (that is the phrase to search for), you haven't got a chance.
You might enjoy a look at Frengly - it's a nice UI onto the Google Translate service which attempts to guess the language of the input text...
We can use
Regex.IsMatch(text, "[\\uxxxx-\\uxxxx]+")
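to detect a specific language by its script. Here each xxxx is the 4-digit hexadecimal Unicode code point of a character, and the two values bound the range to match. To detect Arabic, for example, use the range 0600 to 06FF.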
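As a hedged example, the Arabic check could be written as below. The class and method names are hypothetical. The range U+0600 to U+06FF is the basic Arabic Unicode block; note that a match only says the text uses Arabic script, which is shared by other languages such as Persian and Urdu.

```csharp
using System.Text.RegularExpressions;

public static class ScriptDetector
{
    // True if the text contains at least one character
    // from the Arabic block (U+0600 to U+06FF).
    public static bool ContainsArabic(string text)
    {
        return Regex.IsMatch(text, @"[\u0600-\u06FF]");
    }

    // The same check using .NET's named Unicode block.
    public static bool ContainsArabicNamedBlock(string text)
    {
        return Regex.IsMatch(text, @"\p{IsArabic}");
    }
}
```

For example, `ScriptDetector.ContainsArabic("مرحبا")` is true, while `ScriptDetector.ContainsArabic("hello")` is false.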
A statistical approach using digraphs or trigraphs is a very good indicator. For example, here are the most common digraphs in English in order: http://www.letterfrequency.org/#digraph-frequency (one can find better or more complete lists). This method may have a better success rate than word analysis for short snippets of text because there are more digraphs in text than there are complete words.
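One way to sketch the digraph approach is to count letter pairs in the input and compare them with counts taken from a sample text per language. The class `DigraphDetector`, the use of cosine similarity as the comparison measure, and the idea of passing raw sample texts are all assumptions made for this illustration; a real detector would use precomputed frequency tables from large corpora.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DigraphDetector
{
    // Counts adjacent letter pairs, lower-cased.
    // Pairs that include a non-letter (space, digit, punctuation) are skipped.
    public static Dictionary<string, int> CountDigraphs(string text)
    {
        var counts = new Dictionary<string, int>();
        string lower = text.ToLowerInvariant();
        for (int i = 0; i + 1 < lower.Length; i++)
        {
            if (!char.IsLetter(lower[i]) || !char.IsLetter(lower[i + 1])) continue;
            string pair = lower.Substring(i, 2);
            counts.TryGetValue(pair, out int n);
            counts[pair] = n + 1;
        }
        return counts;
    }

    // Cosine similarity between two digraph count vectors:
    // 0 means no digraph in common, values near 1 mean nearly identical proportions.
    public static double Similarity(Dictionary<string, int> a, Dictionary<string, int> b)
    {
        double dot = a.Where(p => b.ContainsKey(p.Key)).Sum(p => (double)p.Value * b[p.Key]);
        double normA = Math.Sqrt(a.Values.Sum(v => (double)v * v));
        double normB = Math.Sqrt(b.Values.Sum(v => (double)v * v));
        return (normA == 0 || normB == 0) ? 0 : dot / (normA * normB);
    }

    // Picks the language whose sample text has the most similar digraph profile.
    // 'samples' maps a language name to a representative text in that language.
    public static string Detect(string text, IDictionary<string, string> samples)
    {
        var input = CountDigraphs(text);
        return samples
            .OrderByDescending(s => Similarity(input, CountDigraphs(s.Value)))
            .First().Key;
    }
}
```

Extending `CountDigraphs` to take a length parameter would give trigraphs with the same scoring code.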
Fast answer: NTextCat (NuGet, Online Demo)
Long answer:
Currently the best way seems to be to use classifiers trained to assign a piece of text to one (or more) languages from a predefined set.
There is a Perl tool called TextCat. It has language models for the 74 most popular languages. There is a huge number of ports of this tool to different programming languages.
There were no ports to .NET, so I wrote one: NTextCat on GitHub.
It is a pure .NET Framework DLL plus a command-line interface to it. By default, it uses a profile of 14 languages.
Any feedback is much appreciated! New ideas and feature requests are welcome too :)
An alternative is to use one of the numerous online services (e.g. the one from Google mentioned above, detectlanguage.com, langid.net, etc.).