How can I parse the domain from a URL in PHP? It seems that I need a database of country domains.
Examples:
http://mail.google.com/hfjdhfjd/jhfjd.html -> google.com
http://www.google.bg/jhdjhf/djfhj.html -> google.bg
http://www.google.co.uk/djhdjhf.php -> google.co.uk
http://www.tsk.tr/jhjgc.aspx -> tsk.tr
http://subsub.sub.nic.tr/ -> nic.tr
http://subsub.sub.google.com.tr -> google.com.tr
http://subsub.sub.itoy.info.tr -> itoy.info.tr
Can it be done with a whois request?
Edit: Only a few domain names sit directly under .tr (www.nic.tr, www.tsk.tr); the others are, as you know, of the form www.something.com.tr or www.something.org.tr.
Also, there is no www.something.com.bg or www.something.org.bg; they are www.something.bg, like the Germans' .de.
But there are www.something.a.bg and www.something.b.bg, thus a.bg, b.bg, c.bg and so on (a.bg is like co.uk).
There must be a list of these top domain names somewhere on the net.
Check how the URL http://www.agrotehnika97.a.bg/ is coloured in Internet Explorer.
Check also:
www.google.co.uk
www.google.com.tr
www.nic.tr
www.tsk.tr
The domain is stored in $_SERVER['HTTP_HOST'].
EDIT: I believe this returns the whole host name, subdomains included. To get just the registrable domain (e.g. google.co.uk rather than www.google.co.uk), you could do this:
// Add the second-level labels that act as top-level domains here (e.g. 'co.cc' or 'co.uk').
// Use the last label ('cc' and 'uk' in the examples above) as the array key
// and the preceding labels as the sub-array elements for that key.
$allowed_subdomains = array(
    'cc' => array('co'),
    'uk' => array('co'),
);
$domain = $_SERVER['HTTP_HOST'];
$parts = explode('.', $domain);
$top_level = array_pop($parts);
// Take care of allowed subdomains
if (isset($allowed_subdomains[$top_level]) && in_array(end($parts), $allowed_subdomains[$top_level]))
{
    $top_level = array_pop($parts).'.'.$top_level;
}
// Prepend the registered name, if a label is left (a bare host such as "localhost" has none)
if ($parts)
{
    $top_level = array_pop($parts).'.'.$top_level;
}
You can use parse_url() to split it up and get what you want.
Here's an example...
$url = 'http://www.google.com/search?hl=en&source=hp&q=google&btnG=Google+Search&meta=lr%3D&aq=&oq=dasd';
print_r(parse_url($url));
Will echo...
Array
(
[scheme] => http
[host] => www.google.com
[path] => /search
[query] => hl=en&source=hp&q=google&btnG=Google+Search&meta=lr%3D&aq=&oq=dasd
)
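If only the host is needed, parse_url() also accepts a component constant as its second argument. A minimal sketch, using one of the URLs from the question; note that this yields the full host name, not the registrable domain:

```php
<?php
// Ask parse_url() for the host component only.
$url = 'http://mail.google.com/hfjdhfjd/jhfjd.html';
$host = parse_url($url, PHP_URL_HOST);

echo $host, "\n"; // mail.google.com
```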
I reckon you'll need a list of all suffixes used after a domain name.
http://publicsuffix.org/list/ provides an up-to-date (or so they claim) list of all suffixes currently in use.
The list is actually here
Now the idea would be to parse that list into a nested structure, splitting each suffix on the dots and starting from the last label.
So, for instance, for the suffixes:
com.la
com.tr
com.lc
you'd end up with:
[la]=>[com]
[lc]=>[com]
etc...
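One way to build such a structure might look like the sketch below. The function name build_suffix_tree and the sample lines are made up for illustration. It assumes one suffix per line with "//" comment lines, and it ignores the wildcard ("*.") and exception ("!") rules that the real list also contains.

```php
<?php
// Hypothetical helper: build a nested array keyed by label, last label first.
// $lines stands in for the lines of the downloaded list file.
function build_suffix_tree(array $lines): array
{
    $tree = array();
    foreach ($lines as $line) {
        $line = trim($line);
        // Skip blank lines and "//" comments.
        if ($line === '' || strpos($line, '//') === 0) {
            continue;
        }
        // Walk down from the root, creating one level per label.
        $node = &$tree;
        foreach (array_reverse(explode('.', $line)) as $label) {
            if (!isset($node[$label])) {
                $node[$label] = array();
            }
            $node = &$node[$label];
        }
        unset($node); // drop the reference before handling the next line
    }
    return $tree;
}

$tree = build_suffix_tree(array('// sample entries', 'la', 'com.la', 'tr', 'com.tr', 'lc', 'com.lc'));
print_r($tree);
```

Each key is a label, and an empty array marks a suffix with nothing registered beneath it in the sample.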
Then you'd get the host from the URL (using parse_url()) and explode it on the dots. You then match the labels against your structure, starting with the last one.
So for google.com.tr you'd match tr, then com, and then fail to find a match at google, which is what you want: the first label that doesn't match, plus the suffix matched so far, is the domain.
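The matching step could be sketched roughly as follows. The function name registrable_domain and the tiny hand-written $tree are assumptions for illustration only. A real tree would be built from the full list, which also has wildcard and exception rules that are not handled here.

```php
<?php
// Hypothetical helper: returns the registrable domain of $url, or null.
// $tree is a nested array of suffix labels, last label first.
function registrable_domain(string $url, array $tree): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null;
    }
    $labels = explode('.', strtolower($host));
    $suffix = array();
    $node = $tree;
    // Consume labels from the right while they still match the suffix tree.
    while ($labels && isset($node[end($labels)])) {
        $label = array_pop($labels);
        $node = $node[$label];
        array_unshift($suffix, $label);
    }
    // No suffix matched, or the host is itself a bare suffix.
    if (!$suffix || !$labels) {
        return null;
    }
    // The first label that failed to match is the registered name.
    return array_pop($labels) . '.' . implode('.', $suffix);
}

// Tiny hand-written tree for illustration only.
$tree = array(
    'com' => array(),
    'tr'  => array('com' => array(), 'info' => array()),
    'uk'  => array('co' => array()),
    'bg'  => array('a' => array()),
);

echo registrable_domain('http://subsub.sub.google.com.tr', $tree), "\n";            // google.com.tr
echo registrable_domain('http://www.tsk.tr/jhjgc.aspx', $tree), "\n";               // tsk.tr
echo registrable_domain('http://mail.google.com/hfjdhfjd/jhfjd.html', $tree), "\n"; // google.com
```

Returning null for a bare suffix such as co.uk is a design choice of this sketch; you may prefer to return the host unchanged in that case.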
Regex and parse_url() alone aren't a solution for you.
You need a package that uses the Public Suffix List; only that way can you correctly extract domains with two- and three-level suffixes (co.uk, a.bg, b.bg, etc.). I recommend TLD Extract.
Here is an example:
$extract = new LayerShifter\TLDExtract\Extract();
$result = $extract->parse('http://subsub.sub.google.com.tr');
$result->getRegistrableDomain(); // will return (string) 'google.com.tr'