Automatically determine the natural language of a website page given its URL

Posted 2019-02-04 23:39

I'm looking for a way to automatically determine the natural language used by a website page, given its URL.

In Python, a function like:

def LanguageUsed(url):
    # stuff

which returns a language specifier (e.g. 'en' for English, 'ja' for Japanese, and so on).

Summary of results: I have a reasonable solution working in Python using the oice.langdet package from PyPI. It does a decent job of discriminating English vs. non-English, which is all I require at the moment. Note that you have to fetch the HTML yourself using Python's urllib. Also note that oice.langdet is GPL-licensed.
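For reference, a minimal sketch of the fetching step, written for Python 3's urllib.request (the tag-stripping is deliberately crude); the oice.langdet call itself is omitted, since its exact API should be checked against the package's docs:

import re
import urllib.request

def fetch_visible_text(url):
    # Fetch the raw HTML (Python 3; the original answer predates urllib.request)
    html = urllib.request.urlopen(url).read().decode('utf-8', errors='replace')
    # Crudely drop script/style blocks, then strip the remaining tags
    html = re.sub(r'(?is)<(script|style).*?</\1>', ' ', html)
    return re.sub(r'<[^>]+>', ' ', html)

text = fetch_visible_text('http://example.com/')
# feed `text` to the language detector of your choice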

For a more general solution using trigrams in Python, as others have suggested, see this Python Cookbook recipe from ActiveState.

The Google natural language detection API works very well (it may be the best I've seen). However, it is JavaScript, and Google's TOS forbids automating its use.

Tags: python url web nlp
7 Answers
疯言疯语
#2 · 2019-02-04 23:58

Your best bet really is to use Google's natural language detection API. It returns an ISO code for the page language, along with a probability index.

See http://code.google.com/apis/ajaxlanguage/documentation/

叼着烟拽天下
#3 · 2019-02-05 00:02

You might want to try n-gram-based detection.

TextCat DEMO (LGPL) seems to work pretty well (it recognizes almost 70 languages). There is a Python port by Thomas Mangin here, using the same corpus.

Edit: the TextCat competitors page provides some interesting links too.

Edit 2: I wonder whether making a Python wrapper for http://www.mnogosearch.org/guesser/ would be difficult...

Fickle 薄情
#4 · 2019-02-05 00:03

In Python, the langdetect package (found here) can do this. It is based on Google's language detection and supports 55 languages by default.

It is installed with

pip install langdetect

Then, for example, running

from langdetect import detect

detect("War doesn't show who's right, just who's left.")
detect("Ein, zwei, drei, vier")

will return 'en' and 'de' respectively.
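Two details worth adding: langdetect is non-deterministic by default, so you can pin its random seed to make repeated runs agree, and detect_langs returns the candidate languages with probabilities:

from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0  # pin the seed so repeated runs agree

print(detect_langs("War doesn't show who's right, just who's left."))
# e.g. [en:0.99995...]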

欢心
#5 · 2019-02-05 00:10

nltk might help, if you have to get down to dealing with the page's text, i.e. if the headers and the URL itself don't determine the language well enough for your purposes. I don't think NLTK directly offers a "tell me which language this text is in" function (though NLTK is large and continuously growing, so it may in fact have one), but you can try analyzing the given text according to various possible natural languages and checking which ones give the most sensible parse, word set, etc., according to the rules for each language; a sketch of one such approach follows.
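One concrete way to do this with NLTK is the classic stopword-overlap trick: tokenize the text and see which language's stopword list it matches best. A minimal sketch (the stopwords corpus must be downloaded first):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize

nltk.download('stopwords')  # one-time corpus download

def guess_language(text):
    words = set(wordpunct_tokenize(text.lower()))
    # Score each language by how many of its stopwords appear in the text
    scores = {lang: len(words & set(stopwords.words(lang)))
              for lang in stopwords.fileids()}
    return max(scores, key=scores.get)

print(guess_language("El rápido zorro marrón salta sobre el perro perezoso"))
# likely 'spanish'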

#6 · 2019-02-05 00:11

This is usually accomplished using character n-gram models. You can find a state-of-the-art language identifier for Java here. If you need some help converting it to Python, just ask; a rough sketch of the underlying trigram idea is below. Hope it helps.
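For the curious, here is a bare-bones Python sketch of the character-trigram idea (the Cavnar-Trenkle "out-of-place" measure most such identifiers build on). The reference profiles would be built from sample text in each candidate language, which is not shown here:

from collections import Counter

def trigram_profile(text, top=300):
    # Rank the most frequent character trigrams in the text
    text = ' '.join(text.lower().split())
    counts = Counter(text[i:i+3] for i in range(len(text) - 2))
    return [gram for gram, _ in counts.most_common(top)]

def out_of_place(doc_profile, ref_profile):
    # Sum of rank differences; trigrams absent from the reference
    # profile get the maximum penalty
    rank = {gram: i for i, gram in enumerate(ref_profile)}
    return sum(abs(rank.get(gram, len(ref_profile)) - i)
               for i, gram in enumerate(doc_profile))

def classify(text, profiles):
    # profiles: {'en': [...], 'de': [...]}, each built with trigram_profile
    # from a large chunk of training text in that language
    doc = trigram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))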

闹够了就滚
#7 · 2019-02-05 00:14

There's no general method that will work on URLs alone. You can check the top-level domain for a hint, look for portions of the URL that might indicate a language (like "en" or "es" between two slashes), and assume anything unknown is in English, but it isn't a perfect solution.

So far as I know, the only general way to determine the natural language used by a page is to grab the page's text and check for certain common words in each language. For example, if "a", "an", and "the" appear several times in the page, it's likely that it includes English text; "el" and "la" might suggest Spanish; and so on.
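A hedged sketch combining both heuristics; the TLD table and word lists below are illustrative stand-ins, since ccTLDs and tiny vocabularies are weak evidence at best:

import re
from urllib.parse import urlparse

# Illustrative only: country-code TLDs are at best a weak hint
CCTLD_HINTS = {'fr': 'fr', 'de': 'de', 'es': 'es', 'jp': 'ja', 'it': 'it'}
COMMON_WORDS = {'en': {'the', 'a', 'an', 'and'},
                'es': {'el', 'la', 'los', 'las'}}

def guess_from_url(url):
    parsed = urlparse(url)
    # Language codes between slashes, e.g. /en/ or /es/
    m = re.search(r'/([a-z]{2})/', parsed.path)
    if m:
        return m.group(1)
    tld = parsed.netloc.rsplit('.', 1)[-1]
    return CCTLD_HINTS.get(tld, 'en')  # assume English when unknown, as above

def guess_from_text(text):
    # Score each language by how many of its common words appear in the text
    words = set(text.lower().split())
    scores = {lang: len(words & vocab) for lang, vocab in COMMON_WORDS.items()}
    return max(scores, key=scores.get)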
