For example, the address is:
Address = http://lol1.domain.com:8888/some/page
I want to save the subdomain into a variable so i could do like so;
print SubAddr
>> lol1
For example, the address is:
Address = http://lol1.domain.com:8888/some/page
I want to save the subdomain into a variable so i could do like so;
print SubAddr
>> lol1
urlparse.urlparse
will split the URL into protocol, location, port, etc. You can then split the location by .
to get the subdomain.
url = urlparse.urlparse(address)
subdomain = url.hostname.split('.')[0]
Package tldextract makes this task very easy, and then you can use urlparse as suggested if you need any further information:
>> import tldextract
>> tldextract.extract("http://lol1.domain.com:8888/some/page"
ExtractResult(subdomain='lol1', domain='domain', suffix='com')
>> tldextract.extract("http://sub.lol1.domain.com:8888/some/page"
ExtractResult(subdomain='sub.lol1', domain='domain', suffix='com')
>> urlparse.urlparse("http://sub.lol1.domain.com:8888/some/page")
ParseResult(scheme='http', netloc='sub.lol1.domain.com:8888', path='/some/page', params='', query='', fragment='')
Note that tldextract properly handles sub-domains.
Modified version of the fantastic answer here: How to extract top-level domain name (TLD) from URL
You will need the list of effective tlds from here
from __future__ import with_statement
from urlparse import urlparse
# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tldFile:
tlds = [line.strip() for line in tldFile if line[0] not in "/\n"]
class DomainParts(object):
def __init__(self, domain_parts, tld):
self.domain = None
self.subdomains = None
self.tld = tld
if domain_parts:
self.domain = domain_parts[-1]
if len(domain_parts) > 1:
self.subdomains = domain_parts[:-1]
def get_domain_parts(url, tlds):
urlElements = urlparse(url).hostname.split('.')
# urlElements = ["abcde","co","uk"]
for i in range(-len(urlElements),0):
lastIElements = urlElements[i:]
# i=-3: ["abcde","co","uk"]
# i=-2: ["co","uk"]
# i=-1: ["uk"] etc
candidate = ".".join(lastIElements) # abcde.co.uk, co.uk, uk
wildcardCandidate = ".".join(["*"]+lastIElements[1:]) # *.co.uk, *.uk, *
exceptionCandidate = "!"+candidate
# match tlds:
if (exceptionCandidate in tlds):
return ".".join(urlElements[i:])
if (candidate in tlds or wildcardCandidate in tlds):
return DomainParts(urlElements[:i], '.'.join(urlElements[i:]))
# returns ["abcde"]
raise ValueError("Domain not in global list of TLDs")
domain_parts = get_domain_parts("http://sub2.sub1.example.co.uk:80",tlds)
print "Domain:", domain_parts.domain
print "Subdomains:", domain_parts.subdomains or "None"
print "TLD:", domain_parts.tld
Gives you:
Domain: example Subdomains: ['sub2', 'sub1'] TLD: co.uk
What you are looking for is in: http://docs.python.org/library/urlparse.html
for example:
".".join(urlparse('http://www.my.cwi.nl:80/%7Eguido/Python.html').netloc.split(".")[:-2])
Will do the job for you (will return "www.my")
A very basic approach, without any sanity checking could look like:
address = 'http://lol1.domain.com:8888/some/page'
host = address.partition('://')[2]
sub_addr = host.partition('.')[0]
print sub_addr
This of course assumes that when you say 'subdomain' you mean the first part of a host name, so in the following case, 'www' would be the subdomain:
http://www.google.com/
Is that what you mean?
For extracting the hostname, I'd use urlparse from urllib2:
>>> from urllib2 import urlparse
>>> a = "http://lol1.domain.com:8888/some/page"
>>> urlparse.urlparse(a).hostname
'lol1.domain.com'
As to how to extract the subdomain, you need to cover for the case that there FQDN could be longer. How you do this would depend on your purposes. I might suggest stripping off the two right most components.
E.g.
>>> urlparse.urlparse(a).hostname.rpartition('.')[0].rpartition('.')[0]
'lol1'
We can use https://github.com/john-kurkowski/tldextract for this problem...
It's easy.
>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> (ext.subdomain, ext.domain, ext.suffix)
('forums', 'bbc', 'co.uk')