How would you extract the domain name from a URL, excluding any subdomains?
My initial simplistic attempt was:
'.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])
This works for http://www.foo.com, but not for http://www.foo.com.au. Is there a way to do this properly without using special knowledge about valid TLDs (Top Level Domains) or country codes (because they change)?
Thanks.
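The failure described in the question can be reproduced with a short sketch. It uses Python 3's urllib.parse in place of the Python 2 urlparse module from the question, and the function name naive_domain is only an illustrative label:

```python
from urllib.parse import urlparse


def naive_domain(url):
    # Keep only the last two dot-separated labels of the host.
    return ".".join(urlparse(url).netloc.split(".")[-2:])


print(naive_domain("http://www.foo.com"))     # foo.com
print(naive_domain("http://www.foo.com.au"))  # com.au
```

The second result is the suffix rather than the registered domain, which is why a fixed "last two labels" rule is not enough.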
Using this file of effective TLDs, which someone else found on Mozilla's website:
results in:
I'd appreciate it if someone let me know which bits of the above could be rewritten in a more Pythonic way. For example, there must be a better way of iterating over the last_i_elements list, but I couldn't think of one. I also don't know if ValueError is the best thing to raise. Comments?
No, there is no "intrinsic" way of knowing that (e.g.) zap.co.it is a subdomain (because Italy's registrar DOES sell domains such as co.it) while zap.co.uk isn't (because the UK's registrar DOESN'T sell domains such as co.uk, but only ones like zap.co.uk).
You'll just have to use an auxiliary table (or online source) to tell you which TLDs behave peculiarly like the UK's and Australia's. There's no way of divining that from just staring at the string without such extra semantic knowledge (of course it can change eventually, but if you can find a good online source, that source will also change accordingly, one hopes!).
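As a rough illustration of the auxiliary-table approach, here is a minimal sketch. The SAMPLE_SUFFIXES set is a tiny hand-picked sample based only on the examples in this thread, not real registry data, and the name registered_domain is hypothetical. A real table would be loaded from a maintained source, and this sketch ignores the wildcard and exception rules that such lists may contain:

```python
from urllib.parse import urlparse

# Illustrative sample only; a real table would come from a maintained list.
SAMPLE_SUFFIXES = {"com", "au", "com.au", "uk", "co.uk", "it"}


def registered_domain(url, suffixes=SAMPLE_SUFFIXES):
    host = urlparse(url).hostname
    if not host:
        raise ValueError("no hostname in %r" % url)
    labels = host.lower().rstrip(".").split(".")
    # Candidates run from the whole host down to the last label,
    # so the first hit is the longest suffix found in the table.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            if i == 0:
                raise ValueError("%r is itself a suffix" % host)
            # The registered domain is the suffix plus one more label.
            return ".".join(labels[i - 1:])
    raise ValueError("no known suffix for %r" % host)


print(registered_domain("http://www.foo.com.au"))  # foo.com.au
print(registered_domain("http://zap.co.it"))       # co.it
print(registered_domain("http://zap.co.uk"))       # zap.co.uk
```

The answer for any given host depends entirely on the contents of the table, which is the point made above: the string alone does not carry that information.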
There are many, many TLDs. Here's the list:
http://data.iana.org/TLD/tlds-alpha-by-domain.txt
Here's another list:
http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains
Here's another list:
http://www.iana.org/domains/root/db/
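If you want to load a list like the first one into Python, a sketch along these lines may help. It assumes the file holds one upper-case TLD per line, with comment lines starting with "#". The function name parse_tld_list and the sample text are made up for illustration, and no download is performed here:

```python
def parse_tld_list(text):
    """Return a set of lower-case TLDs from a one-per-line text listing."""
    tlds = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        tlds.add(line.lower())
    return tlds


# Made-up sample in the assumed format, not the real file contents.
sample = "# sample header line\nCOM\nAU\nUK\n"
print(sorted(parse_tld_list(sample)))  # ['au', 'com', 'uk']
```

Note that a list of top-level domains alone contains entries such as au and uk but not second-level suffixes such as com.au or co.uk, so by itself it cannot separate foo.com.au from its subdomains.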
Using the Python tld package:
https://pypi.python.org/pypi/tld
Install
Get the TLD name as string from the URL given
or without protocol
Get the TLD as an object
Get the first level domain name as string from the URL given
Until get_tld is updated for all the new TLDs, I pull the TLD from the error. Sure, it's bad code, but it works.
Here's a great Python module someone wrote to solve this problem after seeing this question: https://github.com/john-kurkowski/tldextract
The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers.
Quote: