I am working on an application that accepts user input in several languages (currently a fixed set of three). The requirement is that users can simply enter text without having to select the language via a checkbox provided in the UI.
Is there an existing Java library to detect the language of a text?
I want something like this:
text = "To be or not to be thats the question."
// returns ISO 639 Alpha-2 code
language = detect(text);
print(language);
result:
EN
I don't want to know how to create a language detector myself (I have seen plenty of blogs trying to do that). The library should provide a simple API and also work completely offline. Open source or commercial closed source doesn't matter.
I also found these questions on SO (and a few more):
How to detect language
How to detect language of text?
This Language Detection Library for Java should give more than 99% accuracy for 53 languages.
Alternatively, there is Apache Tika, a library for content analysis that offers much more than just language detection.
Google offers an API that can do this for you. I just stumbled across this yesterday and didn't keep a link, but if you, umm, Google for it you should manage to find it.
This was somewhere near the description of their translation API, which will translate text for you into any language you like. There's another call just for guessing the input language.
Google is among the world's leaders in machine translation; they base their stuff on extremely large corpora of text (most of the Internet, kinda) and a statistical approach that usually "gets" it right simply by virtue of having a huge sample space.
EDIT: Here's the link: http://code.google.com/apis/ajaxlanguage/
EDIT 2: If you insist on "offline": A well upvoted answer was the suggestion of Guess-Language. It's a C++ library and handles about 60 languages.
The Detect Language API also provides a Java client.
Example:
List<Result> results = DetectLanguage.detect("Hello world");
Result result = results.get(0);
System.out.println("Language: " + result.language);
System.out.println("Is reliable: " + result.reliable);
System.out.println("Confidence: " + result.confidence);
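If you only want to act on a confident guess, something like the following might work. It is a minimal sketch, not the client's API: DetectionResult is a hypothetical stand-in that mirrors only the three fields printed above, so the snippet compiles without the client library, and ReliableLanguagePicker, pick and the fallback value are names made up for illustration. It also assumes the list is ordered best guess first, as the results.get(0) above suggests.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the client's Result type; it mirrors only the
// three fields used in the example above (language, reliable, confidence).
class DetectionResult {
    final String language;
    final boolean reliable;
    final double confidence;

    DetectionResult(String language, boolean reliable, double confidence) {
        this.language = language;
        this.reliable = reliable;
        this.confidence = confidence;
    }
}

class ReliableLanguagePicker {
    // Returns the language of the first result flagged as reliable, or the
    // fallback when the list is empty or no result is reliable.
    static String pick(List<DetectionResult> results, String fallback) {
        for (DetectionResult r : results) {
            if (r.reliable) {
                return r.language;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Sample values are invented for the demo, not real API output.
        List<DetectionResult> results = Arrays.asList(
                new DetectionResult("nl", false, 2.1),
                new DetectionResult("en", true, 11.9));
        System.out.println(pick(results, "de")); // prints en
    }
}
```

Falling back to a default language (or asking the user) when nothing is reliable tends to matter most for very short inputs, where any detector has little to go on.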
An alternative is JLangDetect, but it's not very robust and has a limited language base. The good thing is that it's under an Apache license, so if it satisfies your requirements, you can use it. Version 0.2 has been released here.
In version 0.4 it is very robust. I have been using it in many projects of my own and never had any major problems. Also, when it comes to speed, it is comparable to very specialized language detectors (e.g., ones that handle only a few languages).
Here is another option: Language Detection Library for Java. It is a library written in Java.
Here is working code based on the already available solution from Cybozu Labs:
package com.et.generate;

import java.util.ArrayList;

import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;
import com.cybozu.labs.langdetect.Language;

public class LanguageCodeDetection {

    public void init(String profileDirectory) throws LangDetectException {
        DetectorFactory.loadProfile(profileDirectory);
    }

    public String detect(String text) throws LangDetectException {
        Detector detector = DetectorFactory.create();
        detector.append(text);
        return detector.detect();
    }

    public ArrayList<Language> detectLangs(String text) throws LangDetectException {
        Detector detector = DetectorFactory.create();
        detector.append(text);
        return detector.getProbabilities();
    }

    public static void main(String[] args) {
        try {
            LanguageCodeDetection ld = new LanguageCodeDetection();
            String profileDirectory = "C:/profiles/";
            ld.init(profileDirectory);
            String text = "Кремль россий";
            System.out.println(ld.detectLangs(text));
            System.out.println(ld.detect(text));
        } catch (LangDetectException e) {
            e.printStackTrace();
        }
    }
}
Output:
[ru:0.9999983255911719]
ru
Profiles can be downloaded from:
https://language-detection.googlecode.com/files/langdetect-09-13-2011.zip
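Since the question mentions a fixed set of three languages and an upper-case code such as EN, the probabilities printed above could be narrowed down roughly like this. This is only a sketch under assumptions: SupportedLanguageResolver and resolve are hypothetical names, the three codes (en, de, ru) and the 0.8 threshold are placeholders, and the probabilities are passed in as a plain Map so the snippet runs without the detection library on the classpath. You would fill that map from whatever your detector returns.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper: keeps only the languages the application supports
// and reports the winner as an upper-case two-letter code.
class SupportedLanguageResolver {
    private final Set<String> supported = new HashSet<>();
    private final double minProbability;

    SupportedLanguageResolver(Set<String> supportedCodes, double minProbability) {
        // Locale.getISOLanguages() lists the two-letter ISO 639 codes known to the JDK.
        Set<String> iso = new HashSet<>(Arrays.asList(Locale.getISOLanguages()));
        for (String code : supportedCodes) {
            String lower = code.toLowerCase(Locale.ROOT);
            if (!iso.contains(lower)) {
                throw new IllegalArgumentException("Not a two-letter ISO 639 code: " + code);
            }
            supported.add(lower);
        }
        this.minProbability = minProbability;
    }

    // Returns the most probable supported language at or above the threshold,
    // or an empty Optional when no supported language qualifies.
    Optional<String> resolve(Map<String, Double> probabilities) {
        String best = null;
        double bestProbability = minProbability;
        for (Map.Entry<String, Double> e : probabilities.entrySet()) {
            String code = e.getKey().toLowerCase(Locale.ROOT);
            if (supported.contains(code) && e.getValue() >= bestProbability) {
                best = code;
                bestProbability = e.getValue();
            }
        }
        return best == null ? Optional.empty() : Optional.of(best.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Placeholder set of three supported languages and threshold.
        SupportedLanguageResolver resolver = new SupportedLanguageResolver(
                new HashSet<>(Arrays.asList("en", "de", "ru")), 0.8);

        // Value copied from the output shown above.
        Map<String, Double> probabilities = new LinkedHashMap<>();
        probabilities.put("ru", 0.9999983255911719);

        System.out.println(resolver.resolve(probabilities).orElse("unknown")); // prints RU
    }
}
```

An empty result is a natural place to fall back to a default language or to ask the user after all.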