Get the subdomain from a URL

Posted 2019-01-01 07:09

Getting the subdomain from a URL sounds easy at first.

http://www.domain.example

Scan for the first period then return whatever came after the "http://" ...

Then you remember

http://super.duper.domain.example

Oh. So then you think, okay, find the last period, go back a word and get everything before!

Then you remember

http://super.duper.domain.co.uk

And you're back to square one. Anyone have any great ideas besides storing a list of all TLDs?

15 answers
无色无味的生活
#2 · 2019-01-01 07:13

Publicsuffix.org seems the way to go. There are plenty of implementations out there that parse the contents of the publicsuffix data file easily.
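
To make the idea concrete, here is a minimal JavaScript sketch of suffix matching against such a list. The rule set SAMPLE_RULES is a tiny hand-picked sample and splitHost is a hypothetical helper name, not part of any library; a real implementation would load the full data file from publicsuffix.org.

```javascript
// Tiny hand-picked sample of Public Suffix List rules, for illustration only.
// A real implementation would load the full list from publicsuffix.org.
const SAMPLE_RULES = new Set(['com', 'example', 'uk', 'co.uk', '*.ck', '!www.ck']);

// Returns { suffix, domain, subdomain } for a hostname, or null when the
// hostname is itself a public suffix or matches no rule in the sample.
function splitHost(hostname, rules = SAMPLE_RULES) {
  const labels = hostname.toLowerCase().split('.');
  let suffixLen = 0; // number of trailing labels that form the public suffix

  for (let i = labels.length - 1; i >= 0; i--) {
    const candidate = labels.slice(i).join('.');
    const wildcard = ['*', ...labels.slice(i + 1)].join('.');

    if (rules.has('!' + candidate)) {
      // Exception rule: the suffix is the candidate minus its leftmost label.
      suffixLen = labels.length - i - 1;
      break;
    }
    if (rules.has(candidate) || rules.has(wildcard)) {
      suffixLen = labels.length - i; // later (longer) matches overwrite earlier ones
    }
  }

  if (suffixLen === 0 || suffixLen >= labels.length) return null;

  const cut = labels.length - suffixLen;
  return {
    suffix: labels.slice(cut).join('.'),
    domain: labels.slice(cut - 1).join('.'),
    subdomain: labels.slice(0, cut - 1).join('.'),
  };
}

console.log(splitHost('super.duper.domain.co.uk'));
```

With the sample rules, the subdomain of super.duper.domain.co.uk comes out as super.duper and the registrable domain as domain.co.uk.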

梦醉为红颜
#3 · 2019-01-01 07:15
echo tld('http://www.example.co.uk/test?123'); // co.uk

/**
 * http://publicsuffix.org/
 * http://www.alandix.com/blog/code/public-suffix/
 * http://tobyinkster.co.uk/blog/2007/07/19/php-domain-class/
 */
function tld($url_or_domain = null)
{
    $domain = $url_or_domain ?: $_SERVER['HTTP_HOST'];
    preg_match('/^[a-z]+:\/\//i', $domain) and 
        $domain = parse_url($domain, PHP_URL_HOST);
    $domain = mb_strtolower($domain, 'UTF-8');
    if (strpos($domain, '.') === false) return null;

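    // Canonical location of the Public Suffix List (the old mxr.mozilla.org URL is no longer served).
    // Fetching it on every call is slow; cache a local copy in real use.
    $url = 'https://publicsuffix.org/list/public_suffix_list.dat';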

    if (($rules = file($url)) !== false)
    {
        // Drop blank lines and "//" comment lines from the list.
        $rules = array_filter(array_map('trim', $rules), function ($v) {
            return $v !== '' && strpos($v, '//') !== 0;
        });

        $segments = '';
        foreach (array_reverse(explode('.', $domain)) as $s)
        {
            $wildcard = rtrim('*.'.$segments, '.');
            $segments = rtrim($s.'.'.$segments, '.');

            if (in_array('!'.$segments, $rules))
            {
                $tld = substr($wildcard, 2);
                break;
            }
            elseif (in_array($wildcard, $rules) or 
                    in_array($segments, $rules))
            {
                $tld = $segments;
            }
        }

        if (isset($tld)) return $tld;
    }

    return false;
}
美炸的是我
#4 · 2019-01-01 07:16

As already said by Adam and John, publicsuffix.org is the correct way to go. But if for any reason you cannot use this approach, here's a heuristic based on an assumption that works for 99% of all domains:

There is one property that distinguishes (not all, but nearly all) "real" domains from subdomains and TLDs: the DNS MX record. You could create an algorithm that searches for it: remove the parts of the hostname one by one, starting from the left, and query DNS until you find an MX record. Example:

super.duper.domain.co.uk => no MX record, proceed
duper.domain.co.uk       => no MX record, proceed
domain.co.uk             => MX record found! assume that's the domain

Here is an example in PHP:

function getDomainWithMX($url) {
    //parse hostname from URL 
    //http://www.example.co.uk/index.php => www.example.co.uk
    $urlParts = parse_url($url);
    if ($urlParts === false || empty($urlParts["host"])) 
        throw new InvalidArgumentException("Malformed URL");

    //find first partial name with MX record
    $hostnameParts = explode(".", $urlParts["host"]);
    // Stop before the bare TLD; the original do/while also queried an empty hostname.
    while (count($hostnameParts) > 1) {
        $hostname = implode(".", $hostnameParts);
        if (checkdnsrr($hostname, "MX")) return $hostname;
        array_shift($hostnameParts);
    }

    throw new DomainException("No MX record found");
}
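
The same walk can be sketched in JavaScript. To keep it independent of any particular DNS client, the lookup is passed in as hasMx, a hypothetical async callback taking a hostname and resolving to true or false; findDomainByMx is likewise just an illustrative name.

```javascript
// Walks from the full hostname toward the TLD and returns the first name for
// which `hasMx` resolves to true, or null if none does.
// `hasMx` is a hypothetical async callback: (hostname) => Promise<boolean>.
async function findDomainByMx(hostname, hasMx) {
  const labels = hostname.toLowerCase().split('.');
  // Stop before the bare TLD: a single label cannot be a registrable domain.
  while (labels.length > 1) {
    const candidate = labels.join('.');
    if (await hasMx(candidate)) return candidate;
    labels.shift(); // drop the leftmost label and try again
  }
  return null;
}

// Offline demonstration with a fake lookup table instead of real DNS.
const fakeMx = new Set(['domain.co.uk']);
findDomainByMx('super.duper.domain.co.uk', async (h) => fakeMx.has(h))
  .then((domain) => console.log(domain));
```

In Node.js, hasMx could wrap dns.promises.resolveMx and treat a rejected lookup as "no MX record". The heuristic still misidentifies subdomains that have their own MX records, so it is a fallback rather than a replacement for the public suffix list.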
萌妹纸的霸气范
#5 · 2019-01-01 07:17

Use the URIBuilder, get the URIBuilder.host attribute, and split it into an array on ".". You now have an array with the domain split out.
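
The answer does not say which platform's URIBuilder is meant, so as a stand-in here is a rough JavaScript equivalent using the standard URL class; hostLabels is a hypothetical helper name.

```javascript
// Parse the URL with the standard URL class and split the hostname into labels.
function hostLabels(url) {
  return new URL(url).hostname.split('.');
}

console.log(hostLabels('http://super.duper.domain.co.uk/index.php'));
```

Note that the split alone does not say where the subdomain ends: for super.duper.domain.co.uk you still need to know that co.uk is the suffix.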

怪性笑人.
#6 · 2019-01-01 07:19

I just wrote a program for this in Clojure, based on the info from publicsuffix.org:

https://github.com/isaksky/url_dom

For example:

(parse "sub1.sub2.domain.co.uk") 
;=> {:public-suffix "co.uk", :domain "domain.co.uk", :rule-used "*.uk"}
美炸的是我
#7 · 2019-01-01 07:19

You can use the tld.js library, a JavaScript API for working with complex domain names, subdomains and URIs.

tldjs.getDomain('mail.google.co.uk');
// -> 'google.co.uk'
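
Once the registrable domain is known, the subdomain is whatever precedes it in the hostname. A minimal sketch, where subdomainOf is a hypothetical helper and not part of tld.js:

```javascript
// Given a hostname and its registrable domain (for example the value
// returned by a getDomain-style call), return the subdomain part.
// Returns '' when there is no subdomain, and null when the hostname does
// not belong to the given domain at all.
function subdomainOf(hostname, registrableDomain) {
  const host = hostname.toLowerCase();
  const domain = registrableDomain.toLowerCase();
  if (host === domain) return '';
  if (!host.endsWith('.' + domain)) return null;
  // Strip the domain and the dot that separates it from the subdomain.
  return host.slice(0, host.length - domain.length - 1);
}

console.log(subdomainOf('mail.google.co.uk', 'google.co.uk')); // 'mail'
```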

If you need the root domain in the browser, you can use the AngusFu/browser-root-domain library.

var KEY = '__rT_dM__' + (+new Date());
var R = new RegExp('(^|;)\\s*' + KEY + '=1');
var Y1970 = (new Date(0)).toUTCString();

module.exports = function getRootDomain() {
  var domain = document.domain || location.hostname;
  var list = domain.split('.');
  var len = list.length;
  var temp = '';
  var temp2 = '';

  while (len--) {
    temp = list.slice(len).join('.');
    temp2 = KEY + '=1;domain=.' + temp;

    // try to set cookie
    document.cookie = temp2;

    if (R.test(document.cookie)) {
      // clear
      document.cookie = temp2 + ';expires=' + Y1970;
      return temp;
    }
  }
};

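Using cookies this way is a trick: browsers refuse to set a cookie on a public suffix such as co.uk, so the shortest domain that accepts the cookie is the root domain.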
