Adding a language profile to Apache Tika

Published 2019-06-17 04:55

Question:

Could anybody who has managed to do this please explain how? :-)

Do I need to get n-gram files for the language I want to add?

Is it just a matter of creating tika.language.override.properties, adding the extra language codes, and putting a lang-code.ngp n-gram file on the classpath? If so, where do I get that file, and why doesn't Tika support more languages if that is all it takes?

These languages are currently supported for language detection:

da,de,et,el,en,es,fi,fr,hu,is,it,lt,nl,no,pl,pt,ru,sv,th

and Tika uses the traditional n-gram notation:

er_ 132232
_de 103517
en_ 82666
et_ 80661
for 65286
_fo 57945
de_ 51382
der 44049
at_ 41915
det 41381
_og 40344
_at 39482
ing 38707
den 36795
og_ 36577
_me 34924
nde 34528
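
To make the notation above concrete, here is a minimal sketch of how such "n-gram count" lines could be produced from raw text. It assumes that `_` marks a word boundary and that only trigrams are counted, which is what the excerpt shows; the function names are hypothetical, and this is not Tika's or Nutch's actual code, whose real profiles may use other n-gram lengths or extra header lines.

```python
from collections import Counter
import re


def trigram_counts(text):
    """Count character trigrams, padding each word with '_' as a boundary marker."""
    counts = Counter()
    # Assumption: lowercase the text and treat any run of letters as one word.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = f"_{word}_"
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts


def format_profile(counts):
    """Render counts as 'ngram count' lines, most frequent first."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return "\n".join(f"{gram} {n}" for gram, n in ordered)


if __name__ == "__main__":
    print(format_profile(trigram_counts("det er det")))
```

Run on a large corpus in the target language, something along these lines would give a frequency-ordered list in the same shape as the excerpt, though it would still need checking against a real .ngp file before use.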

This language detection application currently supports the following languages, but its n-gram files are in a rather different format:

af bg cs de en fa fr he hr id ja ko ml ne no pl ro sk sq sw te tl uk vi zh-tw ar bn da el es fi gu hi hu it kn mk mr nl pa pt ru so sv ta th tr ur zh-cn

in JSON notation

{"freq":{"D":9246,"E":2445,"F":2510,"G":3299,"A":6930,"B":3706,"C":2451,"L":2519,"M":3951,"N":3334,"O":2514,"H" ....
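
As a rough illustration of how the two notations relate, the sketch below flattens a JSON `freq` map into "n-gram count" lines. The function name and the three-entry sample are hypothetical (the sample reuses the first values from the excerpt above), and replacing spaces with `_` is an assumption about how word boundaries map between the formats. Whether Tika would accept the result is unverified, since the two tools may count and normalise n-grams differently.

```python
import json


def freq_json_to_lines(profile_json):
    """Turn a {"freq": {ngram: count}} JSON profile into 'ngram count' lines."""
    freq = json.loads(profile_json)["freq"]
    # Most frequent first; ties broken alphabetically for a stable order.
    ordered = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    # Assumption: a space in the JSON key corresponds to '_' in the other notation.
    return [f"{gram.replace(' ', '_')} {count}" for gram, count in ordered]


if __name__ == "__main__":
    sample = '{"freq": {"D": 9246, "E": 2445, "F": 2510}}'
    print("\n".join(freq_json_to_lines(sample)))
```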

Answer 1:

As of TIKA-490, it looks like it should be possible to add new language profiles. TIKA-546 seems to indicate that it isn't yet as easy as it might be, and in the meantime you'll need to start with Nutch's NGramProfile tool and tweak the output.

I'd suggest you try using the Nutch tool to generate the files, then look at the comments on TIKA-490 for details on how to use them.