How to detect language of user entered text? [clos

Posted 2020-01-24 03:54

I am working on an application that accepts user input in different languages (currently a fixed set of 3 languages). The requirement is that users can just enter text without having to select the language via a checkbox provided in the UI.

Is there an existing Java library to detect the language of a text?

I want something like this:

text = "To be or not to be thats the question."

// returns ISO 639 Alpha-2 code
language = detect(text);

print(language);

result:

EN

I don't want to know how to create a language detector myself (I have seen plenty of blogs trying to do that). The library should provide a simple API and also work completely offline. Open source or commercial closed source doesn't matter.

I also found these questions on SO (and a few more):

How to detect language
How to detect language of text?

7 answers
甜甜的少女心
#3 · 2020-01-24 04:16

This Language Detection Library for Java should give more than 99% accuracy for 53 languages.

Alternatively, there is Apache Tika, a library for content analysis that offers much more than just language detection.

趁早两清
#4 · 2020-01-24 04:20

An alternative is JLangDetect, but it's not very robust and has a limited language base. The good thing is that it's under an Apache license, so if it satisfies your requirements, you can use it. Version 0.2 has been released here.

As of version 0.4 it is very robust. I have been using it in many of my own projects and never had any major problems. Also, when it comes to speed, it is comparable to very specialized language detectors (e.g., ones that handle only a few languages).

一纸荒年 Trace。
#5 · 2020-01-24 04:27

Google offers an API that can do this for you. I just stumbled across this yesterday and didn't keep a link, but if you, umm, Google for it you should manage to find it.

This was somewhere near the description of their translation API, which will translate text for you into any language you like. There's another call just for guessing the input language.

Google is among the world's leaders in machine translation; they base their stuff on extremely large corpora of text (most of the Internet, kinda) and a statistical approach that usually "gets" it right simply by virtue of having a huge sample space.

EDIT: Here's the link: http://code.google.com/apis/ajaxlanguage/

EDIT 2: If you insist on "offline": a well-upvoted answer suggested Guess-Language. It's a C++ library and handles about 60 languages.
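
To make the statistical idea above a bit more concrete, here is a minimal offline sketch in Java. It is not how Google's API or Guess-Language work internally; it only counts hits against tiny hand-picked stop-word lists and returns the ISO 639 two-letter code with the most hits. The class name ToyLanguageGuesser, the word lists, and the choice of English, German and French are all assumptions made for illustration (the question does not say which three languages are involved). For anything serious, use one of the libraries mentioned in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class ToyLanguageGuesser {

    // Hypothetical, hand-picked stop-word lists (assumption for illustration).
    // A real detector learns its statistics from large corpora instead.
    private static final Map<String, Set<String>> STOP_WORDS = new LinkedHashMap<>();
    static {
        STOP_WORDS.put("en", Set.of("the", "to", "be", "or", "not", "is", "and", "of", "that"));
        STOP_WORDS.put("de", Set.of("der", "die", "das", "und", "ist", "nicht", "oder", "sein", "zu"));
        STOP_WORDS.put("fr", Set.of("le", "la", "les", "et", "est", "ne", "pas", "ou", "que"));
    }

    /**
     * Returns the two-letter code of the language with the most stop-word
     * hits, or "und" (the ISO 639-2 code for "undetermined") if no word matches.
     */
    public static String detect(String text) {
        // Lower-case the input and split on anything that is not a letter.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "und";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : STOP_WORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            // Strictly greater: on a tie the language listed first wins.
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // The sample sentence from the question.
        System.out.println(detect("To be or not to be thats the question."));
    }
}
```

With the sample sentence from the question this prints en. Short inputs, or inputs without any of the listed words, come back as "und", which is exactly the kind of weakness the real libraries address with much larger language profiles.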

兄弟一词,经得起流年.
#6 · 2020-01-24 04:27

The Detect Language API also provides a Java client.

Example:

List<Result> results = DetectLanguage.detect("Hello world");

Result result = results.get(0);

System.out.println("Language: " + result.language);
System.out.println("Is reliable: " + result.reliable);
System.out.println("Confidence: " + result.confidence);
Juvenile、少年°
#7 · 2020-01-24 04:31

Here is another option: Language Detection Library for Java

It is a library written in Java.
