Get the subdomain from a URL

Published 2019-01-01 07:09

Getting the subdomain from a URL sounds easy at first.

http://www.domain.example

Scan for the first period, then return whatever sits between the "http://" and it ...

Then you remember

http://super.duper.domain.example

Oh. So then you think, okay, find the last period, go back a word and get everything before!

Then you remember

http://super.duper.domain.co.uk

And you're back to square one. Anyone have any great ideas besides storing a list of all TLDs?
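
To make the failure modes concrete, here is a quick Python sketch of both naive approaches (illustrative only; the expected answers assume domain.example and domain.co.uk are the registrable domains):

from urllib.parse import urlparse

def subdomain_by_first_dot(url):
    # "Scan for the first period": everything before it is the subdomain.
    host = urlparse(url).hostname
    return host.split('.', 1)[0]

def subdomain_by_last_dots(url):
    # "Find the last period, go back a word": treat the last two labels
    # as the domain and everything before them as the subdomain.
    host = urlparse(url).hostname
    return '.'.join(host.split('.')[:-2])

print(subdomain_by_first_dot('http://www.domain.example'))          # 'www' - correct
print(subdomain_by_first_dot('http://super.duper.domain.example'))  # 'super' - wrong, should be 'super.duper'
print(subdomain_by_last_dots('http://super.duper.domain.example'))  # 'super.duper' - correct
print(subdomain_by_last_dots('http://super.duper.domain.co.uk'))    # 'super.duper.domain' - wrong again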

15 Answers
浅入江南
#2 · 2019-01-01 07:24

Anyone have any great ideas besides storing a list of all TLDs?

No, because each TLD differs on what counts as a subdomain, second level domain, etc.

Keep in mind that there are top level domains, second level domains, and subdomains. Technically speaking, everything except the TLD is a subdomain.

In the domain.co.uk example, domain is a subdomain, co is a second level domain, and uk is the TLD.

So the question is more complex than it looks at first blush, and the answer depends on how each TLD is managed. You'll need a database of all the TLDs together with how each one is partitioned - what counts as a second level domain and what counts as a subdomain. There aren't too many TLDs, so the list is reasonably manageable, but collecting all that information isn't trivial. There may already be such a list available.

Looks like http://publicsuffix.org/ is one such list - all the common suffixes (.com, .co.uk, etc) in a list suitable for searching. It still won't be easy to parse it, but at least you don't have to maintain the list.

A "public suffix" is one under which Internet users can directly register names. Some examples of public suffixes are ".com", ".co.uk" and "pvt.k12.wy.us". The Public Suffix List is a list of all known public suffixes.

The Public Suffix List is an initiative of the Mozilla Foundation. It is available for use in any software, but was originally created to meet the needs of browser manufacturers. It allows browsers to, for example:

  • Avoid privacy-damaging "supercookies" being set for high-level domain name suffixes
  • Highlight the most important part of a domain name in the user interface
  • Accurately sort history entries by site

Looking through the list, you can see it's not a trivial problem. I think a list is the only correct way to accomplish this...
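
For example, a minimal Python sketch of the lookup against such a list (the suffix set here is hard-coded purely for illustration; a real implementation would load the full file from publicsuffix.org and also handle its wildcard and exception rules):

# Toy suffix set - a real implementation loads the full Public Suffix List.
public_suffixes = {'example', 'uk', 'co.uk'}

def split_host(host):
    labels = host.split('.')
    # Check the longest candidate suffix first.
    for i in range(len(labels)):
        candidate = '.'.join(labels[i:])
        if candidate in public_suffixes:
            suffix = candidate
            registrable = '.'.join(labels[i - 1:]) if i > 0 else None
            subdomain = '.'.join(labels[:i - 1]) if i > 1 else ''
            return subdomain, registrable, suffix
    return None

print(split_host('super.duper.domain.co.uk'))
# ('super.duper', 'domain.co.uk', 'co.uk')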

-Adam

与风俱净
#3 · 2019-01-01 07:26

As already said, the Public Suffix List is the only way to parse a domain correctly. For PHP you can try TLDExtract. Here is sample code:

require 'vendor/autoload.php'; // Composer autoloader for the TLDExtract package

$extract = new LayerShifter\TLDExtract\Extract();

$result = $extract->parse('super.duper.domain.co.uk');
$result->getSubdomain(); // will return (string) 'super.duper'
$result->getSubdomains(); // will return (array) ['super', 'duper']
$result->getHostname(); // will return (string) 'domain'
$result->getSuffix(); // will return (string) 'co.uk'
听够珍惜
#4 · 2019-01-01 07:28

If you're looking to extract subdomains and/or domains from an arbitrary list of URLs, this python script may be helpful. Be careful though, it's not perfect. This is a tricky problem to solve in general and it's very helpful if you have a whitelist of domains you're expecting.

  1. Get top level domains from publicsuffix.org
import requests

url = 'https://publicsuffix.org/list/public_suffix_list.dat'
page = requests.get(url)

domains = []
for line in page.text.splitlines():
    # Skip comment lines; note that exception rules ('!...') are kept verbatim.
    if line.startswith('//'):
        continue
    domain = line.strip()
    if domain:
        domains.append(domain)

# Strip the leading '*.' from wildcard rules so they can be matched literally.
domains = [d[2:] if d.startswith('*.') else d for d in domains]
print('found {} domains'.format(len(domains)))
  2. Build regex
import re

# Escape each suffix and join with '|'; this also avoids the trailing
# empty alternative that naive string concatenation would produce.
_regex = '|'.join(re.escape(domain) for domain in domains)

subdomain_regex = r'/([^/]*)\.[^/.]+\.({})/.*$'.format(_regex)
domain_regex = r'([^/.]+\.({}))/.*$'.format(_regex)
  3. Use regex on list of URLs
import pandas as pd

FILE_NAME = ''   # put CSV file name here
URL_COLNAME = '' # put URL column name here

df = pd.read_csv(FILE_NAME)
urls = df[URL_COLNAME].astype(str) + '/'  # note: adding '/' as a hack to help the regex

df['sub_domain_extracted'] = urls.str.extract(pat=subdomain_regex, expand=True)[0]
df['domain_extracted'] = urls.str.extract(pat=domain_regex, expand=True)[0]

df.to_csv('extracted_domains.csv', index=False)
初与友歌
#5 · 2019-01-01 07:30

As Adam says, it's not easy, and currently the only practical way is to use a list.

Even then there are exceptions - for example, .uk has a handful of domains that are valid immediately at that level but aren't under .co.uk, so those have to be listed as exceptions.
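
For illustration, the list encodes those cases with wildcard and exception rules: a wildcard rule such as *.ck says every label directly under .ck is a public suffix, and an exception rule such as !www.ck carves one name back out as registrable. Here is a minimal Python sketch of how a lookup might apply the three rule types (rule sets hard-coded purely for illustration):

# Toy rule sets - the real list at publicsuffix.org has thousands of entries.
plain_rules = {'uk', 'co.uk'}
wildcard_rules = {'ck'}        # stands for '*.ck'
exception_rules = {'www.ck'}   # stands for '!www.ck'

def public_suffix(host):
    labels = host.split('.')
    for i in range(len(labels)):
        candidate = '.'.join(labels[i:])
        if candidate in exception_rules:
            # An exception rule wins: drop its leftmost label to get the suffix.
            return '.'.join(labels[i + 1:])
        if candidate in plain_rules:
            return candidate
        if i > 0 and candidate in wildcard_rules:
            # A wildcard rule '*.foo' covers one extra label to the left of 'foo'.
            return '.'.join(labels[i - 1:])
    return labels[-1]  # implicit default rule '*'

print(public_suffix('something.else.ck'))  # 'else.ck'  (via *.ck)
print(public_suffix('www.ck'))             # 'ck'       (via !www.ck)
print(public_suffix('domain.co.uk'))       # 'co.uk'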

This is how mainstream browsers currently handle it - it's necessary to ensure that example.co.uk can't set a cookie for .co.uk, which would then be sent to every other website under .co.uk.

The good news is that there's already a list available at http://publicsuffix.org/.

There's also some work in the IETF to create some sort of standard to allow TLDs to declare what their domain structure looks like. This is slightly complicated, though, by the likes of .uk.com, which is operated as if it were a public suffix but isn't sold by the .com registry.

深知你不懂我心
#6 · 2019-01-01 07:32

It's not an exact way of working it out, but you could maybe get a useful answer by trying to fetch the domain piece by piece and checking the response, i.e. fetch 'http://uk', then 'http://co.uk', then 'http://domain.co.uk'. When you get a non-error response you've got the domain, and the rest is the subdomain.

Sometimes you just gotta try it :)

Edit:

Tom Leys points out in the comments that some domains are set up only on the www subdomain, which would give us an incorrect answer in the above test. Good point! Maybe the best approach would be to check each part with 'http://www.' as well as 'http://', and count a hit on either as a hit for that section of the domain name. We'd still miss some 'alternative' arrangements such as 'web.domain.com', but I haven't run into one of those for a while :)
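
A rough Python sketch of that probing idea (using the requests library; purely illustrative - real code would need to distinguish DNS failures from HTTP errors and handle timeouts much more carefully):

import requests

def guess_domain(host):
    # Probe progressively longer suffixes; the first one that answers,
    # either bare or with a 'www.' prefix, is treated as the domain.
    labels = host.split('.')
    for i in range(len(labels) - 1, -1, -1):
        candidate = '.'.join(labels[i:])
        for probe in (candidate, 'www.' + candidate):
            try:
                requests.head('http://' + probe, timeout=3, allow_redirects=True)
                return candidate  # got some response, call this the domain
            except requests.RequestException:
                pass
    return None

domain = guess_domain('super.duper.domain.co.uk')
# whatever is left of `domain` in the original host would be the subdomain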

唯独是你
#7 · 2019-01-01 07:33

For a C library (with data table generation in Python), I wrote http://code.google.com/p/domain-registry-provider/ which is both fast and space efficient.

The library uses ~30kB for the data tables and ~10kB for the C code. There is no startup overhead since the tables are constructed at compile time. See http://code.google.com/p/domain-registry-provider/wiki/DesignDoc for more details.

To better understand the table generation code (Python), start here: http://code.google.com/p/domain-registry-provider/source/browse/trunk/src/registry_tables_generator/registry_tables_generator.py

To better understand the C API, see: http://code.google.com/p/domain-registry-provider/source/browse/trunk/src/domain_registry/domain_registry.h
