Python urlparse — extract domain name without subd

Need a way to extract a domain name without the subdomain from a url using Python urlparse.

For example, I would like to extract "google.com" from a full url like "http://www.google.com".

The closest I can seem to come with urlparse is the netloc attribute, but that includes the subdomain, which in this example would be www.google.com.
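For reference, a minimal sketch of what urlparse gives back here. It assumes Python 3, where urlparse lives in urllib.parse; the second URL with a port is made up for illustration.

```python
from urllib.parse import urlparse  # Python 3 location of urlparse

plain = urlparse("http://www.google.com")
with_port = urlparse("http://www.google.com:8080/search?q=x")  # illustrative URL

print(plain.netloc)        # host part, subdomain still attached
print(with_port.netloc)    # netloc also carries the port when one is given
print(with_port.hostname)  # hostname drops the port but keeps the subdomain
```

Neither attribute knows where the subdomain ends, which is the gap the question is about.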

I know that it is possible to write some custom string manipulation to turn www.google.com into google.com, but I want to avoid by-hand string transforms or regex in this task. (The reason for this is that I am not familiar enough with url formation rules to feel confident that I could consider every edge case required in writing a custom parsing function.)

Or, if urlparse can't do what I need, does anyone know any other Python url-parsing libraries that would?

Tags: python parsing url urlparse

7 answers

劳资没心，怎么记你

Reply #2 · 2019-01-10 14:51

from tld import get_fld  # tld >= 0.9; older releases returned this value from get_tld
from tld.utils import update_tld_names
update_tld_names()  # refresh the local copy of the suffix list

result = get_fld('http://www.google.com')
print(result)

Input: http://www.google.com

Result: google.com

Anthone

Reply #3 · 2019-01-10 14:52

This is not a standard decomposition of URLs.

You cannot rely on the www. prefix being present, nor on it being optional. In a lot of cases it is not.

If you do want to assume that only the last two components are relevant (which won't work for the UK, e.g. www.google.co.uk), you can do a split('.')[-2:].

Or, which is actually less error-prone, strip a www. prefix.

But either way you cannot assume that the www. is optional, because it will NOT work every time!
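A small sketch of both shortcuts, to show where each one breaks. The helper names are made up for illustration and the hostnames are just examples.

```python
def last_two_labels(host):
    # Keep only the last two dot-separated components.
    return ".".join(host.split(".")[-2:])

def strip_www(host):
    # Remove a leading "www." and leave everything else untouched.
    return host[4:] if host.startswith("www.") else host

for host in ("www.google.com", "www.google.co.uk", "mail.google.com"):
    print(host, "->", last_two_labels(host), "|", strip_www(host))
```

The split collapses www.google.co.uk to co.uk, and stripping www. leaves mail.google.com unchanged, so neither works in general.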

Here is a list of common suffixes for domains. You can try to keep the suffix + one component.

https://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
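The "suffix + one component" idea could look roughly like this. It is only a sketch: SAMPLE_SUFFIXES is a tiny hand-picked set rather than the real list, registrable_domain is a made-up name, and the wildcard and exception rules of the real list are ignored.

```python
# Tiny hand-picked sample; the real list is far longer and also has
# wildcard and exception rules that this sketch does not handle.
SAMPLE_SUFFIXES = {"com", "uk", "co.uk", "it", "co.it"}

def registrable_domain(host, suffixes=SAMPLE_SUFFIXES):
    """Return the matched suffix plus one more label, or None if there is none."""
    labels = host.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so that "co.uk" is matched before "uk".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            # i == 0 means the host itself is a bare suffix.
            return ".".join(labels[i - 1:]) if i > 0 else None
    return None

print(registrable_domain("www.google.co.uk"))
print(registrable_domain("mail.google.com"))
```

Matching the longest suffix first is what keeps co.uk intact instead of cutting at uk.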

But how do you plan to handle for example first.last.name domains? Assume that all the users with the same last name are the same company? Initially, you would only be able to get third-level domains there. By now, you apparently can get second level, too. So for .name there is no general rule.

forever°为你锁心

Reply #4 · 2019-01-10 14:55

There are multiple Python modules which encapsulate the (once Mozilla) Public Suffix List in a library, several of which don't require the input to be a URL. Even though the question asks about URL normalization specifically, my requirement was to handle just domain names, and so I'm offering a tangential answer for that.

The relative merits of publicsuffix2 over publicsuffixlist or publicsuffix are unclear, but they all seem to offer the basic functionality.

publicsuffix2:

>>> import publicsuffix  # sic
>>> publicsuffix.PublicSuffixList().get_public_suffix('www.google.co.uk')
u'google.co.uk'

Supposedly a more packaging-friendly fork of publicsuffix.

publicsuffixlist:

>>> import publicsuffixlist
>>> publicsuffixlist.PublicSuffixList().privatesuffix('www.google.co.uk')
'google.co.uk'

Advertises IDNA support, which I have not tested, however.

publicsuffix:

>>> import publicsuffix
>>> publicsuffix.PublicSuffixList(publicsuffix.fetch()).get_public_suffix('www.google.co.uk')
'google.co.uk'

The requirement to handle updating and caching the downloaded file yourself is a bit of a complication.
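A rough idea of what that do-it-yourself caching could involve, using only the standard library. Everything here is an assumption for illustration: load_cached and its parameters are made-up names, the one-week interval is arbitrary, and fetch stands for whatever function actually downloads the list.

```python
import os
import time

MAX_AGE_SECONDS = 7 * 24 * 3600  # arbitrary one-week refresh interval

def load_cached(path, fetch, max_age=MAX_AGE_SECONDS, now=None):
    """Return the cached text, calling fetch() first if the file is missing or stale."""
    now = time.time() if now is None else now
    stale = not os.path.exists(path) or now - os.path.getmtime(path) > max_age
    if stale:
        text = fetch()  # caller-supplied download function
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(text)
        return text
    with open(path, encoding="utf-8") as fh:
        return fh.read()
```

Passing the download step in as a function keeps the cache logic testable without network access.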

甜甜的少女心

Reply #5 · 2019-01-10 14:59

You probably want to check out tldextract, a library designed to do this kind of thing.

It uses the Public Suffix List to try to get a decent split based on known suffixes, but do note that this is just a brute-force list, nothing special, so it can get out of date (although hopefully it's curated so as not to).

>>> import tldextract
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')

So in your case:

>>> extracted = tldextract.extract('http://www.google.com')
>>> "{}.{}".format(extracted.domain, extracted.suffix)
'google.com'

▲ chillily

Reply #6 · 2019-01-10 14:59

Using tldextract works fine, but it apparently has a problem parsing blogspot.com subdomains and makes a mess of them. If you would like to go ahead with that library, make sure to add an if condition or something similar to handle an empty string being returned for the subdomain.
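One way such a guard might look, as a sketch. Parts is a hypothetical stand-in with the same three fields as the ExtractResult shown in the tldextract answer, and the blogspot values assume the library treats blogspot.com as a suffix, which is a guess at the cause rather than something confirmed here.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the subdomain/domain/suffix fields.
Parts = namedtuple("Parts", ["subdomain", "domain", "suffix"])

def join_parts(parts):
    """Join the non-empty fields with dots so an empty subdomain adds no stray dot."""
    return ".".join(p for p in parts if p)

# Assumed result when blogspot.com is treated as a suffix.
blog = Parts(subdomain="", domain="myblog", suffix="blogspot.com")
print(join_parts(blog))
```

Filtering out empty fields avoids a leading dot when the subdomain comes back empty.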

傲

Reply #7 · 2019-01-10 15:03

This is an update, based on the bounty request for an updated answer.

Start by using the tld package. A description of the package:

Extracts the top level domain (TLD) from the URL given. List of TLD names is taken from Mozilla http://mxr.mozilla.org/mozilla/source/netwerk/dns/src/effective_tld_names.dat?raw=1

from tld import get_fld  # tld >= 0.9; in older releases get_tld returned the same value
from tld.utils import update_tld_names
update_tld_names()  # sync the local suffix list before parsing

print(get_fld("http://www.google.co.uk"))
print(get_fld("http://zap.co.it"))
print(get_fld("http://google.com"))
print(get_fld("http://mail.google.com"))
print(get_fld("http://mail.google.co.uk"))
print(get_fld("http://google.co.uk"))

This outputs

google.co.uk
zap.co.it
google.com
google.com
google.co.uk
google.co.uk

Notice that it correctly handles country-level TLDs by leaving co.uk and co.it intact, but properly removes the www and mail subdomains for both .com and .co.uk.

The update_tld_names() call at the beginning of the script is used to update/sync the tld names with the most recent version from Mozilla.
