I need sample code to help me detect Farsi-language web pages with the Apache Tika toolkit.
LanguageIdentifier identifier = new LanguageIdentifier("فارسی");
String language = identifier.getLanguage();
I have downloaded the Apache Tika jar files and added them to the classpath. This code works for English but gives an error for Farsi. How can I add Farsi to Tika's LanguageIdentifier?
Tika doesn't ship with a language profile for Farsi yet. As of version 1.0, 27 languages are supported out of the box.
In your example the input is misdetected as lt (Lithuanian) with a distance of 0.41, which is above the certainty threshold of 0.022, so the result is not considered reasonably certain. See the source code for more information on the inner workings of LanguageIdentifier.

The Farsi language (Persian, ISO 639-1 two-letter code fa) is not recognized by default. If you want Tika to recognize another language, you have to create a language profile first. The following steps are necessary:
1. Find a text corpus for your language. I found the Hamshahri Collection, which should be sufficient. Download the corpus or parts of it and create a plain text file out of the XML.
2. Create an n-gram file for the language identifier. This can be done using TikaCLI:
   java -jar tika-app-1.0.jar --create-profile=fa -eUTF-8 fa-corpus.txt
   This will create a file called fa.ngp, which contains the n-grams.
3. Configure Tika so that it recognizes the new language. Either do this programmatically using LanguageIdentifier.initProfiles() or put a property file with the name tika.language.override.properties into the classpath. Make sure the n-gram file is in the classpath as well.

If you now run Tika, it should correctly detect your language.
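For step 1, the conversion from XML to plain text could look roughly like the following. This is only a minimal sketch using the JDK's StAX parser. The class name CorpusToTextSketch, its method names, and the element name passed on the command line are assumptions for illustration: which element holds the article body depends on the corpus's actual XML layout, so check that first.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Tika.
class CorpusToTextSketch {

    /** Streams the XML and writes the text inside every matching element, one element per line. */
    static void copyText(InputStream xml, Writer out, String tag) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        int depth = 0; // greater than 0 while we are inside a matching element
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    // Enter a matching element, or track nesting inside one.
                    if (depth > 0 || reader.getLocalName().equals(tag)) {
                        depth++;
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                    if (depth > 0) {
                        out.write(reader.getText());
                    }
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    // Leaving the outermost matching element ends the current line.
                    if (depth > 0 && --depth == 0) {
                        out.write('\n');
                    }
                    break;
                default:
                    break;
            }
        }
        reader.close();
    }

    /** Convenience wrapper for small inputs held in memory. */
    static String extractText(String xml, String tag) {
        try (InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))) {
            StringWriter out = new StringWriter();
            copyText(in, out, tag);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Usage: java CorpusToTextSketch <corpus.xml> <element-name> <output.txt>
    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]));
             Writer out = Files.newBufferedWriter(Paths.get(args[2]), StandardCharsets.UTF_8)) {
            copyText(in, out, args[1]);
        }
    }
}
```

If the article text were stored in an element called TEXT (a hypothetical name), running java CorpusToTextSketch corpus.xml TEXT fa-corpus.txt would produce a UTF-8 file that can be passed to the --create-profile command from step 2.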
Update: Detailed the steps necessary to create a language profile.
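To make the distance and certainty threshold mentioned above more concrete, here is a rough sketch of the idea behind n-gram language profiles. It is not Tika's actual implementation. The class TrigramDistanceSketch, its methods, and the choice of plain character trigrams with a Euclidean distance over relative frequencies are assumptions made for illustration. The point is only that a text is compared against every known profile, and when no profile exists for its language, even the closest one can be far away.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only; names and distance measure are assumptions, not Tika's code.
class TrigramDistanceSketch {

    /** Counts the overlapping character trigrams of a text. */
    static Map<String, Integer> trigramCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= text.length(); i++) {
            counts.merge(text.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Sum of all counts, at least 1 so that an empty profile does not divide by zero. */
    static double total(Map<String, Integer> counts) {
        int sum = 0;
        for (int c : counts.values()) {
            sum += c;
        }
        return Math.max(1, sum);
    }

    /** Euclidean distance between the relative trigram frequencies of two profiles. */
    static double distance(Map<String, Integer> a, Map<String, Integer> b) {
        double totalA = total(a);
        double totalB = total(b);
        Set<String> allTrigrams = new HashSet<>(a.keySet());
        allTrigrams.addAll(b.keySet());
        double sumOfSquares = 0.0;
        for (String trigram : allTrigrams) {
            double diff = a.getOrDefault(trigram, 0) / totalA
                        - b.getOrDefault(trigram, 0) / totalB;
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        Map<String, Integer> english = trigramCounts("the quick brown fox jumps over the lazy dog");
        Map<String, Integer> similar = trigramCounts("the lazy dog jumps over the quick fox");
        // The input string from the question, written with Unicode escapes
        // so that this source file stays ASCII-only.
        Map<String, Integer> farsi = trigramCounts("\u0641\u0627\u0631\u0633\u06CC");

        System.out.println("english vs similar english: " + distance(english, similar));
        System.out.println("english vs farsi:           " + distance(english, farsi));
    }
}
```

In this toy setup, an input in an unsupported language behaves like the second comparison: it shares few or no trigrams with any available profile, so every candidate is a poor match. That is the kind of situation in which the reported distance ends up above the certainty threshold.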