Could anybody who has managed to do this please explain how? :-)
Do I need to get n-gram files for the language I want to add?
Is it a matter of creating tika.language.override.properties, adding the extra language codes, and putting a lang-code.ngp n-gram file on the classpath? If so, where do I get that file, and why doesn't Tika support more languages if that is all it takes?
Tika currently supports these languages for language detection:
da,de,et,el,en,es,fi,fr,hu,is,it,lt,nl,no,pl,pt,ru,sv,th
and Tika uses the traditional n-gram notation:
er_ 132232
_de 103517
en_ 82666
et_ 80661
for 65286
_fo 57945
de_ 51382
der 44049
at_ 41915
det 41381
_og 40344
_at 39482
ing 38707
den 36795
og_ 36577
_me 34924
nde 34528
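For illustration, here is a minimal sketch of how counts in this "n-gram, space, count" notation could be produced from a text sample. It is not Tika's or Nutch's actual profile builder. The class and method names (TrigramProfileSketch, countTrigrams, toProfileLines) are made up for this example, and the rules it applies (lowercasing, splitting on non-letters, padding each word with a single `_`, trigrams only) are assumptions inferred from the sample above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Tika or Nutch.
public class TrigramProfileSketch {

    // Count character trigrams per word, marking word boundaries with '_'
    // (assumed convention, based on entries such as "er_" and "_de").
    static Map<String, Integer> countTrigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (word.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            String padded = "_" + word + "_";
            for (int i = 0; i + 3 <= padded.length(); i++) {
                counts.merge(padded.substring(i, i + 3), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Render the counts as "ngram count" lines, most frequent first;
    // ties are broken alphabetically so the output is deterministic.
    static List<String> toProfileLines(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            lines.add(e.getKey() + " " + e.getValue());
        }
        return lines;
    }

    public static void main(String[] args) {
        // A real profile would be built from a large corpus, not one sentence.
        for (String line : toProfileLines(countTrigrams("det er for den der"))) {
            System.out.println(line);
        }
    }
}
```

A file produced this way only mirrors the visible layout of the sample; whether Tika accepts it as a .ngp profile without further tweaking is not something this sketch establishes.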
This language detection application currently supports the following languages, but its n-gram files use a somewhat different format:
af bg cs de en fa fr he hr id ja ko ml ne no pl ro sk sq sw te tl uk vi zh-tw ar bn da el es fi gu hi hu it kn mk mr nl pa pt ru so sv ta th tr ur zh-cn
in JSON notation:
{"freq":{"D":9246,"E":2445,"F":2510,"G":3299,"A":6930,"B":3706,"C":2451,"L":2519,"M":3951,"N":3334,"O":2514,"H" ....
It looks like, as of TIKA-490, it should be possible to add new language profiles. TIKA-546 seems to indicate that it isn't yet as easy as it might be, and in the meantime you'll need to start with Nutch's NGramProfile tool and tweak the output.
I'd suggest you try using the Nutch tool to generate the files, then look at the comments on TIKA-490 for details on how to use them.