Extract registered domain from URL based on Public Suffix List

Posted 2020-02-26 00:14

Question:

Given a URL, how do I extract the registered domain using the Public Suffix List (list of effective TLDs, e.g. this list)?

For instance, considering a.bg is a valid public suffix:

http://www.test.start.a.bg/hello.html -> start.a.bg 
http://test.start.a.bg/               -> start.a.bg
http://test.start.abc.bg/             -> abc.bg (.bg is the public suffix)

This cannot be done using simple string manipulation because the public suffix can consist of multiple levels depending on the TLD.
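To make the multi-level problem concrete, here is a minimal sketch of longest-suffix matching. The function name registeredDomain() and the tiny $suffixes array are hypothetical and only for illustration; the real Public Suffix List is far larger and also has wildcard and exception rules, which this sketch ignores.

```php
<?php
// Illustrative subset only, NOT the real Public Suffix List.
$suffixes = ['bg' => true, 'a.bg' => true, 'com' => true, 'uk' => true, 'co.uk' => true];

/**
 * Hypothetical helper: return the registered domain of $host,
 * or null if the host is itself a suffix or no known suffix matches.
 */
function registeredDomain(string $host, array $suffixes): ?string
{
    $labels = explode('.', strtolower(trim($host, '.')));
    $count = count($labels);

    // Try candidates from longest to shortest, so the first hit is the longest match.
    for ($i = 0; $i < $count; $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (isset($suffixes[$candidate])) {
            // $i === 0 means the whole host is a public suffix.
            return $i === 0 ? null : implode('.', array_slice($labels, $i - 1));
        }
    }

    return null; // simplification: the real algorithm falls back to a default rule
}

echo registeredDomain('www.test.start.a.bg', $suffixes), "\n"; // start.a.bg
echo registeredDomain('test.start.abc.bg', $suffixes), "\n";   // abc.bg
```

The registered domain is the matched suffix plus exactly one more label to its left, which is why the number of labels to keep cannot be fixed in advance.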

P.S. It doesn't matter how I read the list (database or flat file), but the list should be accessible locally so I'm not always dependent on external services.
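For the flat-file option, a local copy of the list could be read roughly as follows. This is a sketch under assumptions: parsePublicSuffixRules() is a hypothetical name, and the inline sample stands in for a downloaded file. It relies only on the published list syntax: lines starting with // are comments, a rule ends at the first whitespace, *. marks a wildcard rule and ! marks an exception rule.

```php
<?php
/**
 * Hypothetical loader: split PSL-formatted text into normal,
 * wildcard and exception rules, keyed by rule for fast lookups.
 */
function parsePublicSuffixRules(string $text): array
{
    $rules = ['normal' => [], 'wildcard' => [], 'exception' => []];

    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if ($line === '' || strpos($line, '//') === 0) {
            continue; // skip blank lines and comments
        }

        // Only the text up to the first whitespace is the rule.
        $rule = strtolower(preg_split('/\s+/', $line)[0]);

        if ($rule[0] === '!') {
            $rules['exception'][substr($rule, 1)] = true;
        } elseif (strpos($rule, '*.') === 0) {
            $rules['wildcard'][substr($rule, 2)] = true;
        } else {
            $rules['normal'][$rule] = true;
        }
    }

    return $rules;
}

// Inline sample in PSL syntax. With a local file, the text would come from
// file_get_contents() on wherever the list is stored (location is up to you).
$sample = <<<'PSL'
// illustrative excerpt
bg
a.bg

*.ck
!www.ck
PSL;

$rules = parsePublicSuffixRules($sample);
print_r(array_keys($rules['normal']));
```

Storing the rules as array keys keeps each lookup cheap, and the same structure could just as well be filled from a database table.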

Answer 1:

You can use parse_url() to extract the hostname, then use the library provided by regdom to determine the registered domain name (dn + eTLD). For example:

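// Both files ship with the regdom library.
require_once("effectiveTLDs.inc.php");
require_once("regDomain.inc.php");

$url =  'http://www.metu.edu.tr/dhasjkdas/sadsdds/sdda/sdads.html';
// parse_url() isolates the hostname; getRegisteredDomain() reduces it to the registered domain.
echo getRegisteredDomain(parse_url($url, PHP_URL_HOST));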

That will print out metu.edu.tr.

Other examples I've tried:

http://www.xyz.start.bg/hello   ->   start.bg
http://www.start.a.bg/world     ->   start.a.bg  (a.bg is a listed eTLD)
http://xyz.ma219.metu.edu.tr    ->   metu.edu.tr
http://www.google.com/search    ->   google.com
http://google.co.uk/search?asd  ->   google.co.uk
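One caveat with the parse_url() step: when the input has no scheme (for example www.google.com/search), parse_url() treats the whole string as a path and returns no host. A small wrapper can guard against that. hostFromUrl() below is a hypothetical helper, not part of regdom, and is only a sketch.

```php
<?php
/**
 * Hypothetical helper: return the lowercased hostname of $url,
 * or null if none can be found.
 */
function hostFromUrl(string $url): ?string
{
    // Without a scheme, parse_url() sees only a path, so prepend "//"
    // to make it parse the string as a scheme-relative URL.
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url) && strpos($url, '//') !== 0) {
        $url = '//' . $url;
    }

    $host = parse_url($url, PHP_URL_HOST);

    return is_string($host) && $host !== '' ? strtolower($host) : null;
}

echo hostFromUrl('http://www.metu.edu.tr/dhasjkdas/sadsdds/sdda/sdads.html'), "\n";
echo hostFromUrl('google.co.uk/search?asd'), "\n";
```

The result can then be passed to whichever registered-domain lookup you use.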

UPDATE: These libraries have been moved to: https://github.com/leth/registered-domains-php



Answer 2:

This question is a bit old, but there's a new solution: https://github.com/jeremykendall/php-domain-parser

This library does exactly what you want. Here's the setup:

// Build a parser from the library's local copy of the Public Suffix List.
$pslManager = new Pdp\PublicSuffixListManager();
$parser = new Pdp\Parser($pslManager->getList());
echo $parser->getRegisterableDomain('www.scottwills.co.uk');

This will print "scottwills.co.uk".



Answer 3:

I recommend TLDExtract; it has a regularly updatable database generated from the PSL.

$extract = new LayerShifter\TLDExtract\Extract();

$result = $extract->parse('shop.github.com');
$result->getFullHost(); // will return (string) 'shop.github.com'
$result->getRegistrableDomain(); // will return (string) 'github.com'
$result->isValidDomain(); // will return (bool) true
$result->isIp(); // will return (bool) false