Are there any good, open source engines out there for detecting what language a text is in, perhaps with a probability metric? One that I can run locally and doesn't query Google or Bing? I'd like to detect language for each page in about 15 million pages of OCR'ed text.
Not all of the documents will be in languages that use the Latin alphabet.
I don't think you need anything very sophisticated. For example, to detect whether a document is in English with a pretty high level of certainty, simply test whether it contains the N most common English words, along the lines of the sketch below.
If it contains all of them, I would say it is almost definitely English.
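A minimal sketch of that idea in Python; the ten-word list, the scoring, and the sample page are illustrative assumptions, not part of the original suggestion:

```python
# Toy "most common words" check for English. The ten-word list is an
# illustrative assumption, not a tuned or definitive choice.
COMMON_ENGLISH = {"the", "be", "to", "of", "and", "a", "in", "that", "have", "it"}

def probably_english(text):
    words = set(text.lower().split())
    # Fraction of the common words that appear; 1.0 means all were found.
    score = len(COMMON_ENGLISH & words) / len(COMMON_ENGLISH)
    return score == 1.0, score

page = "It was the best of times, and we have a right to be in that number."
print(probably_english(page))  # (True, 1.0) for this sample sentence
```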
Check out Franc on GitHub. It's written in JavaScript, so you could use it in a browser and maybe in Node too.
Depending on what you're doing, you might want to check out the Python Natural Language Toolkit (NLTK), which has some support for Bayesian learning algorithms.
In general, letter and word frequencies would probably be the fastest check, but NLTK (or a Bayesian learning algorithm in general) will probably be useful if you need to do anything beyond identifying the language. Bayesian methods will also probably be useful if you discover the first two methods have too high an error rate.
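As a rough illustration of that route, here is a sketch using NLTK's NaiveBayesClassifier over character-bigram features; the tiny training set and the feature function are assumptions for demonstration only, not a production model:

```python
import nltk

def char_bigrams(text):
    # Character-bigram presence features: a crude stand-in for letter frequencies.
    text = text.lower()
    return {"bigram=" + text[i:i + 2]: True for i in range(len(text) - 1)}

# Tiny illustrative training set; a real detector would train on far more text per language.
train = [
    (char_bigrams("the quick brown fox jumps over the lazy dog"), "en"),
    (char_bigrams("it is a truth universally acknowledged"), "en"),
    (char_bigrams("der schnelle braune fuchs springt ueber den faulen hund"), "de"),
    (char_bigrams("es ist eine allgemein anerkannte wahrheit"), "de"),
]

classifier = nltk.NaiveBayesClassifier.train(train)

sample = "the rain in spain stays mainly in the plain"
dist = classifier.prob_classify(char_bigrams(sample))
print(dist.max(), dist.prob(dist.max()))  # predicted language code and its probability
```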
Try CLD2:
Installation
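The original answer's exact command isn't shown here; assuming the pycld2 Python bindings, installation would look something like:

```
pip install pycld2
```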
Run
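A minimal sketch, assuming the pycld2 binding; the sample text is arbitrary:

```python
import pycld2 as cld2

# detect() returns a reliability flag, the number of text bytes analysed,
# and a tuple of the top candidate languages.
is_reliable, bytes_found, details = cld2.detect("Dies ist ein kurzer Beispieltext.")
print(is_reliable, details[0])
```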
Gives
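Output roughly of this shape (values are illustrative, not a captured run):

```
True ('GERMAN', 'de', 97, 1024.0)
```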
Others
You could alternatively try Ruby's WhatLanguage gem; it's nice and simple and I've used it for Twitter data analysis. Check out http://www.youtube.com/watch?v=lNqZ2cqOReo&list=UUJ_3fstMOH-g4yBxtvgAWkw&index=0&feature=plcp for a quick demo.
For future reference, the engine I ended up using is libtextcat, which is under a BSD license but seems not to have been maintained since 2003. Still, it does a good job and integrates easily into my toolchain.