I'm looking for a way to automatically determine the natural language used by a web page, given its URL.
In Python, a function like:
    def LanguageUsed(url):
        # fetch the page and return its language code
which returns a language specifier (e.g. 'en' for English, 'ja' for Japanese, etc.)
Summary of Results: I have a reasonable solution working in Python using the oice.langdet package from PyPI. It does a decent job of discriminating English from non-English, which is all I require at the moment. Note that you have to fetch the HTML yourself using Python's urllib. Also, oice.langdet is GPL-licensed.
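For illustration, here is a minimal sketch of the fetch-then-detect shape described above, using Python 3's urllib.request. Since the oice.langdet API isn't reproduced in this post, the detector below is a stand-in: a crude stopword-frequency check for English vs. non-English. Treat detect_english, the stopword list, and the 0.05 threshold as illustrative assumptions, not the package's interface.

    import re
    import urllib.request

    # Crude stand-in detector: a real solution would call oice.langdet here.
    ENGLISH_STOPWORDS = {"the", "and", "of", "to", "in", "is",
                         "that", "for", "it", "with"}

    def detect_english(text):
        """Return 'en' if common English stopwords dominate, else 'other'."""
        words = re.findall(r"[a-z']+", text.lower())
        if not words:
            return "other"
        hits = sum(1 for w in words if w in ENGLISH_STOPWORDS)
        # 0.05 is an arbitrary cutoff chosen for this sketch.
        return "en" if hits / len(words) > 0.05 else "other"

    def language_used(url):
        # Fetch the HTML ourselves, as noted above; the detector works on text.
        html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
        # Strip tags very roughly so markup doesn't swamp the word counts.
        text = re.sub(r"<[^>]+>", " ", html)
        return detect_english(text)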
For a more general solution using trigrams in Python, as others have suggested, see this Python Cookbook recipe from ActiveState.
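The core trigram idea is simple enough to sketch: build a character-trigram frequency profile per language from sample text, then score unknown text against each profile. The tiny training samples and cosine-similarity scoring below are illustrative choices of mine, not the recipe's exact method; real profiles need much more training text.

    import math
    from collections import Counter

    def trigram_profile(text):
        """Character-trigram frequency profile of a text."""
        text = " ".join(text.lower().split())
        return Counter(text[i:i + 3] for i in range(len(text) - 2))

    def cosine_similarity(p, q):
        common = set(p) & set(q)
        dot = sum(p[t] * q[t] for t in common)
        norm = (math.sqrt(sum(v * v for v in p.values()))
                * math.sqrt(sum(v * v for v in q.values())))
        return dot / norm if norm else 0.0

    # Tiny illustrative training samples; a real profile needs far more text.
    PROFILES = {
        "en": trigram_profile("the quick brown fox jumps over the lazy dog"),
        "fr": trigram_profile("le renard brun saute par dessus le chien paresseux"),
    }

    def guess_language(text):
        """Return the language whose trigram profile best matches the text."""
        profile = trigram_profile(text)
        return max(PROFILES, key=lambda lang: cosine_similarity(profile, PROFILES[lang]))

    print(guess_language("the dog runs over the brown fox"))  # -> 'en'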
The Google Natural Language Detection API works very well (it is the best I've seen). However, it is JavaScript, and its TOS forbids automating its use.
There is nothing about the URL itself that will indicate language.
One option would be to use a natural language toolkit to try to identify the language based on the content, but even if you get the NLP part working, it will be fairly slow. It also may not be reliable: remember, most user agents pass an Accept-Language header, something like

    Accept-Language: en-us,en;q=0.5

with each request, and many large websites will serve different content based on that header. Smaller sites will be more reliable because they won't pay attention to the language headers.
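If you do go the content-analysis route, it can help to pin that header yourself when fetching, so a content-negotiating site serves you a predictable variant rather than one chosen from your locale. A small sketch, assuming urllib.request; whether you pin it to English or omit the header entirely is a judgment call, the point is only that the header influences what you receive:

    import urllib.request

    # Pin Accept-Language so content-negotiating sites return a
    # predictable variant instead of one inferred from our locale.
    req = urllib.request.Request(
        "https://example.com/",  # placeholder URL
        headers={"Accept-Language": "en-us,en;q=0.5"},
    )
    html = urllib.request.urlopen(req).read().decode("utf-8", errors="replace")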
You could also use server location (i.e. which country the server is in) as a proxy for language using GeoIP. It's obviously not perfect, but it is much better than using the TLD.
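A sketch of the GeoIP approach, assuming MaxMind's geoip2 package and a local GeoLite2-Country.mmdb database; the database path and the country-to-language mapping are illustrative assumptions, and as noted above, CDNs and foreign hosting will break the server-country-implies-language inference.

    import socket
    from urllib.parse import urlparse

    import geoip2.database

    # Illustrative country -> likely-language mapping; real coverage
    # would need to be far broader.
    COUNTRY_TO_LANG = {"US": "en", "GB": "en", "FR": "fr", "JP": "ja", "DE": "de"}

    def language_from_server_location(url, db_path="GeoLite2-Country.mmdb"):
        """Guess a page's language from the hosting server's country."""
        ip = socket.gethostbyname(urlparse(url).hostname)
        with geoip2.database.Reader(db_path) as reader:
            country = reader.country(ip).country.iso_code
        return COUNTRY_TO_LANG.get(country, "unknown")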