This function is great, but its main flaw is that it doesn't handle domains ending with .co.uk or .com.au. How can it be modified to handle this?
function parseUrl($url) {
    $r  = "^(?:(?P<scheme>\w+)://)?";
    $r .= "(?:(?P<login>\w+):(?P<pass>\w+)@)?";
    $r .= "(?P<host>(?:(?P<subdomain>[-\w\.]+)\.)?" . "(?P<domain>[-\w]+\.(?P<extension>\w+)))";
    $r .= "(?::(?P<port>\d+))?";
    $r .= "(?P<path>[\w/-]*/(?P<file>[\w-]+(?:\.\w+)?)?)?";
    $r .= "(?:\?(?P<arg>[\w=&]+))?";
    $r .= "(?:#(?P<anchor>\w+))?";
    $r = "!$r!";
    preg_match($r, $url, $out);
    return $out;
}
To clarify: my reason for looking for something other than parse_url() is that I want to strip out (possibly multiple) subdomains as well.
print_r(parse_url('http://sub1.sub2.test.co.uk'));
Results in:
Array
(
    [scheme] => http
    [host] => sub1.sub2.test.co.uk
)
What I want to extract is "test.co.uk" (sans subdomains), so first using parse_url is a pointless extra step where the output is the same as the input.
parse_url() does not work to extract subdomains and domain name extensions. You have to invent your own solution here.
I think a proper implementation would have to include a library of all domain name extensions, updated regularly.
https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains
http://www.iana.org/domains/root/db
https://publicsuffix.org/list/public_suffix_list.dat
https://raw.githubusercontent.com/publicsuffix/list/master/public_suffix_list.dat
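As a rough sketch of what a list-based approach could look like: the function name splitHost() and the tiny hard-coded $suffixes sample below are my own placeholders, not part of any library. A real implementation would load the suffixes from public_suffix_list.dat and would also need to handle that file's wildcard (*.) and exception (!) rules, which this sketch ignores.

```php
<?php
// Hypothetical helper: split a hostname into subdomain, registrable domain and
// extension, given a set of known public suffixes (suffix => anything).
function splitHost(string $host, array $suffixes): ?array
{
    $labels = explode('.', strtolower(rtrim($host, '.')));
    $count  = count($labels);

    // Walk from the longest candidate suffix to the shortest,
    // so that "co.uk" is found before "uk".
    for ($i = 0; $i < $count; $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (isset($suffixes[$candidate])) {
            if ($i === 0) {
                return null; // the host is itself a bare suffix, e.g. "co.uk"
            }
            return [
                'subdomain' => implode('.', array_slice($labels, 0, $i - 1)),
                'domain'    => implode('.', array_slice($labels, $i - 1)),
                'extension' => $candidate,
            ];
        }
    }
    return null; // no known suffix found
}

// Tiny illustrative sample; the real list has thousands of entries.
$suffixes = array_flip(['com', 'uk', 'co.uk', 'au', 'com.au']);

$parts = splitHost('sub1.sub2.test.co.uk', $suffixes);
echo $parts['domain'], "\n";    // test.co.uk
echo $parts['subdomain'], "\n"; // sub1.sub2
```

If the input is a full URL rather than a bare hostname, parse_url($url, PHP_URL_HOST) can supply the $host argument first.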
Replace this bit:
With:
Where the (?:...) part is a non-capturing group, with the ? making it optional.
I'd probably go a step further and change that bit to this:
Since extensions don't contain numbers or underscores, and are usually just 2 or 3 letters (I think .museum is the longest, at 6... so 10 is probably a safe maximum).
If you do that, you might want to add a case-insensitive flag (or put A-Z in as well).
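To make that concrete, here is a small sketch that tests an extension pattern on its own. The exact pattern, [a-z]{2,10}(?:\.[a-z]{2,10})?, is my assumption of what a "letters only, 2 to 10 characters, optional second part" extension might look like. The ^...$ anchors are added only so the fragment can be tested in isolation.

```php
<?php
// Assumed extension pattern: letters only, 2-10 chars, optional ".xx" second part.
$extension = '[a-z]{2,10}(?:\.[a-z]{2,10})?';

// The "i" flag makes it case-insensitive, so COM.AU is accepted too.
$pattern = '!^(?:' . $extension . ')$!i';

$candidates = ['uk', 'co.uk', 'COM.AU', 'museum', 'c0m', 'photography'];
foreach ($candidates as $candidate) {
    $result = preg_match($pattern, $candidate) ? 'matches' : 'no match';
    printf("%-12s %s\n", $candidate, $result);
}
```

The first four match. c0m fails on the digit, and photography (11 letters) fails on the length cap, so the maximum of 10 may need raising for newer, longer extensions.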
Based on your comment, you want to make the subdomain part of the match 'lazy' (only match if it has to), and thus allow the extension to capture both parts.
To do that, simply add a ? to the end of the quantifier, changing:
to
And (in theory - haven't got PHP handy to test) that will only make the subdomain longer if it has to, so should allow the extension group to match appropriately.
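For anyone who does have PHP handy, here is a sketch comparing the greedy and lazy subdomain quantifier. It rests on several assumptions of mine: the pattern is cut down to the host part only, it is anchored with ^...$, it uses a two-part letters-only extension group, and hostPattern() and domainOf() are hypothetical helper names.

```php
<?php
// Hypothetical helper: builds a host-only pattern. $lazy toggles the extra "?"
// on the subdomain quantifier ([-\w.]+ versus [-\w.]+?).
function hostPattern(bool $lazy): string
{
    $sub = '[-\w.]+' . ($lazy ? '?' : '');
    $ext = '[a-z]{2,10}(?:\.[a-z]{2,10})?';
    return '!^(?:(?P<subdomain>' . $sub . ')\.)?'
         . '(?P<domain>[-\w]+\.(?P<extension>' . $ext . '))$!i';
}

// Returns the "domain" group, or null when the host does not match at all.
function domainOf(string $host, bool $lazy): ?string
{
    return preg_match(hostPattern($lazy), $host, $m) ? $m['domain'] : null;
}

echo domainOf('sub1.sub2.test.co.uk', false), "\n"; // co.uk
echo domainOf('sub1.sub2.test.co.uk', true), "\n";  // test.co.uk
echo domainOf('www.example.com', true), "\n";       // example.com
echo domainOf('sub.www.example.com', true), "\n";   // www.example.com
```

The last line shows the catch: the pattern cannot tell that "example.com" is not a two-part extension like "co.uk". So the lazy version fixes the .co.uk case but misreads hosts with more than one subdomain under a single-part extension.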
Update:
Ok, assuming you've extracted the full hostname already (using parse_url as suggested in other Q/comments), try this for matching subdomain, domain, and extension parts:
This will leave a . on the end of the subdomain (and on the start of the extension), but you can use substr($string,0,-1) or similar to remove that.
Expanded form for readability:
(can add comments to explain any of that, if necessary?)
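A minimal sketch of the dot clean-up mentioned above; the $subdomain and $extension values are made-up stand-ins for what such a match might capture. substr() does the job, while rtrim()/ltrim() are an alternative that also copes with an empty capture.

```php
<?php
$subdomain = 'sub1.sub2.'; // hypothetical capture with a trailing dot
$extension = '.co.uk';     // hypothetical capture with a leading dot

// substr(): drop the last character, or skip the first one.
echo substr($subdomain, 0, -1), "\n"; // sub1.sub2
echo substr($extension, 1), "\n";     // co.uk

// rtrim()/ltrim(): strip dots only if present, safe on empty strings.
echo rtrim($subdomain, '.'), "\n";    // sub1.sub2
echo ltrim($extension, '.'), "\n";    // co.uk
```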
What's wrong with the built-in parse_url?
This may or may not be of interest, but here's a regex I wrote that mostly conforms to RFC3986 (it's actually slightly stricter, as it disallows some of the more unusual URI syntaxes):
The named components are:
And here's the code that generates it (along with variants defined by some options):