Get subdomain from URL using Python

2019-02-21 15:51发布

问题:

For example, the address is:

Address = http://lol1.domain.com:8888/some/page

I want to save the subdomain into a variable so i could do like so;

print SubAddr
>> lol1

回答1:

urlparse.urlparse will split the URL into protocol, location, port, etc. You can then split the location by . to get the subdomain.

url = urlparse.urlparse(address)
subdomain = url.hostname.split('.')[0]


回答2:

Package tldextract makes this task very easy, and then you can use urlparse as suggested if you need any further information:

>> import tldextract
>> tldextract.extract("http://lol1.domain.com:8888/some/page"
ExtractResult(subdomain='lol1', domain='domain', suffix='com')
>> tldextract.extract("http://sub.lol1.domain.com:8888/some/page"
ExtractResult(subdomain='sub.lol1', domain='domain', suffix='com')
>> urlparse.urlparse("http://sub.lol1.domain.com:8888/some/page")
ParseResult(scheme='http', netloc='sub.lol1.domain.com:8888', path='/some/page', params='', query='', fragment='')

Note that tldextract properly handles sub-domains.



回答3:

Modified version of the fantastic answer here: How to extract top-level domain name (TLD) from URL

You will need the list of effective tlds from here

from __future__ import with_statement
from urlparse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tldFile:
    tlds = [line.strip() for line in tldFile if line[0] not in "/\n"]

class DomainParts(object):
    def __init__(self, domain_parts, tld):
        self.domain = None
        self.subdomains = None
        self.tld = tld
        if domain_parts:
            self.domain = domain_parts[-1]
            if len(domain_parts) > 1:
                self.subdomains = domain_parts[:-1]

def get_domain_parts(url, tlds):
    urlElements = urlparse(url).hostname.split('.')
    # urlElements = ["abcde","co","uk"]
    for i in range(-len(urlElements),0):
        lastIElements = urlElements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(lastIElements) # abcde.co.uk, co.uk, uk
        wildcardCandidate = ".".join(["*"]+lastIElements[1:]) # *.co.uk, *.uk, *
        exceptionCandidate = "!"+candidate

        # match tlds: 
        if (exceptionCandidate in tlds):
            return ".".join(urlElements[i:]) 
        if (candidate in tlds or wildcardCandidate in tlds):
            return DomainParts(urlElements[:i], '.'.join(urlElements[i:]))
            # returns ["abcde"]

    raise ValueError("Domain not in global list of TLDs")

domain_parts = get_domain_parts("http://sub2.sub1.example.co.uk:80",tlds)
print "Domain:", domain_parts.domain
print "Subdomains:", domain_parts.subdomains or "None"
print "TLD:", domain_parts.tld

Gives you:

Domain: example
Subdomains: ['sub2', 'sub1']
TLD: co.uk


回答4:

What you are looking for is in: http://docs.python.org/library/urlparse.html

for example: ".".join(urlparse('http://www.my.cwi.nl:80/%7Eguido/Python.html').netloc.split(".")[:-2])

Will do the job for you (will return "www.my")



回答5:

A very basic approach, without any sanity checking could look like:

address = 'http://lol1.domain.com:8888/some/page'

host = address.partition('://')[2]
sub_addr = host.partition('.')[0]

print sub_addr

This of course assumes that when you say 'subdomain' you mean the first part of a host name, so in the following case, 'www' would be the subdomain:

http://www.google.com/

Is that what you mean?



回答6:

For extracting the hostname, I'd use urlparse from urllib2:

>>> from urllib2 import urlparse
>>> a = "http://lol1.domain.com:8888/some/page"
>>> urlparse.urlparse(a).hostname
'lol1.domain.com'

As to how to extract the subdomain, you need to cover for the case that there FQDN could be longer. How you do this would depend on your purposes. I might suggest stripping off the two right most components.

E.g.

>>> urlparse.urlparse(a).hostname.rpartition('.')[0].rpartition('.')[0]
'lol1'


回答7:

We can use https://github.com/john-kurkowski/tldextract for this problem...

It's easy.

>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> (ext.subdomain, ext.domain, ext.suffix)
('forums', 'bbc', 'co.uk')