How to detect language

Posted 2019-02-03 09:47

Are there any good open-source engines for detecting which language a text is written in, ideally with a probability score? I need one that runs locally and doesn't query Google or Bing. I'd like to detect the language of each page in about 15 million pages of OCR'ed text.

Not all documents will contain languages which use the Latin alphabet.

7 answers
混吃等死
#2 · 2019-02-03 10:25

You can certainly build your own, given statistics on letter frequencies, digraph (two-letter sequence) frequencies, etc., for your target languages.

Then release it as open source. And voilà, you have an open source engine for detecting the language of text!
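To make the frequency idea concrete, here is a minimal sketch in Python. It builds a digraph-frequency profile per language and ranks languages by cosine similarity to the input text. The function names (`bigram_profile`, `cosine`, `detect`) and the tiny `SAMPLES` sentences are made up for illustration; a usable detector would need profiles trained on large corpora for each target language, and the score here is just a normalized similarity, not a calibrated probability.

```python
from collections import Counter
import math


def bigram_profile(text):
    """Return relative frequencies of character digraphs in text."""
    # Keep letters from any script; turn everything else into a space.
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    cleaned = " ".join(cleaned.split())  # collapse runs of whitespace
    counts = Counter(cleaned[i:i + 2] for i in range(len(cleaned) - 1))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(gram, 0.0) for gram, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


def detect(text, profiles):
    """Return (language, score) pairs, best first; scores sum to 1
    unless the text shares no digraphs with any profile."""
    target = bigram_profile(text)
    sims = {lang: cosine(target, prof) for lang, prof in profiles.items()}
    total = sum(sims.values())
    if total == 0:
        return [(lang, 0.0) for lang in sims]
    return sorted(
        ((lang, s / total) for lang, s in sims.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


# Toy training text (an assumption for this sketch, far too small for real use).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then it runs through the forest",
    "de": "der schnelle braune fuchs springt über den faulen hund und läuft dann durch den wald",
    "ru": "быстрая коричневая лиса прыгает через ленивую собаку и затем бежит через лес",
}
PROFILES = {lang: bigram_profile(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    for page in ["the dog runs over the fox", "собака бежит через лес"]:
        print(page, "->", detect(page, PROFILES))
```

Because the profile is built from whatever letters appear, the same code handles non-Latin scripts; OCR noise will lower the scores, so very short or very noisy pages may need a minimum-confidence cutoff.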
