How to make this PHP URL parsing function nearly p

2019-08-18 06:43发布

问题:

This function is great, but its main flaw is that it doesn't handle domains ending with .co.uk or .com.au. How can it be modified to handle this?

function parseUrl($url) {
    $r  = "^(?:(?P<scheme>\w+)://)?";
    $r .= "(?:(?P<login>\w+):(?P<pass>\w+)@)?";
    $r .= "(?P<host>(?:(?P<subdomain>[-\w\.]+)\.)?" . "(?P<domain>[-\w]+\.(?P<extension>\w+)))";
    $r .= "(?::(?P<port>\d+))?";
    $r .= "(?P<path>[\w/-]*/(?P<file>[\w-]+(?:\.\w+)?)?)?";
    $r .= "(?:\?(?P<arg>[\w=&]+))?";
    $r .= "(?:#(?P<anchor>\w+))?";
    $r = "!$r!";

    preg_match ( $r, $url, $out );

    return $out;
}

To clarify my reason for looking for something other than parse_url() is that I want to strip out (possibly multiple) subdomains as well.

print_r(parse_url('sub1.sub2.test.co.uk'));

Results in:

Array(
[scheme] => http
[host] => sub1.sub2.test.co.uk
)

What I want to extract is "test.co.uk" (sans subdomains), so first using parse_url is a pointless extra step where the output is the same as the input.

回答1:

This may or may not be of interest, but here's a regex I wrote that mostly conforms to RFC3986 (it's actually slightly stricter, as it disallows some of the more unusual URI syntaxes):

~^(?:(?:(?P<scheme>[a-z][0-9a-z.+-]*?)://)?(?P<authority>(?:(?P<userinfo>(?P<username>(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=])*)?:(?P<password>(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=])*)?|(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:)*?)@)?(?P<host>(?P<domain>(?:[a-z](?:[0-9a-z-]*(?:[0-9a-z]))?\.)+(?:[a-z](?:[0-9a-z-]*(?:[0-9a-z]))?))|(?P<ip>(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d).(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d).(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d).(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)))(?::(?P<port>\d+))?(?=/|$)))?(?P<path>/?(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)+/)*(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)+/?)?)(?:\?(?P<query>(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)|/|\?)*?))?(?:#(?P<fragment>(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)|/|\?)*))?$~i

The named components are:

scheme
authority
  userinfo
    username
    password
  domain
  ip
path
query
fragment

And here's the code that generates it (along with variants defined by some options):

public static function validateUri($uri, &$components = false, $flags = 0)
{
    if (func_num_args() > 3)
    {
        $flags = array_slice(func_get_args(), 2);
    }

    if (is_array($flags))
    {
        $flagsArray = $flags;
        $flags = array();
        foreach ($flagsArray as $flag)
        {
            if (is_int($flag))
            {
                $flags |= $flag;
            }
        }
    }

    // Set options.
    $requireScheme = !($flags & self::URI_ALLOW_NO_SCHEME);
    $requireAuthority = !($flags & self::URI_ALLOW_NO_AUTHORITY);
    $isRelative = (bool)($flags & self::URI_IS_RELATIVE);
    $requireMultiPartDomain = (bool)($flags & self::URI_REQUIRE_MULTI_PART_DOMAIN);

    // And we're away…

    // Some character types (taken from RFC 3986: http://tools.ietf.org/html/rfc3986).
    $hex = '[\da-f]'; // Hexadecimal digit.
    $pct = "(?:%$hex{2})"; // "Percent-encoded" value.
    $gen = '[\[\]:/?#@]'; // Generic delimiters.
    $sub = '[!$&\'()*+,;=]'; // Sub-delimiters.
    $reserved = "(?:$gen|$sub)"; // Reserved characters.
    $unreserved = '[\w.\~-]'; // Unreserved characters.
    $pChar = "(?:$unreserved|$pct|$sub|:|@)"; // Path characters.
    $qfChar = "(?:$pChar|/|\?)"; // Query/fragment characters.

    // Other entities.
    $octet = '(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)';
    $label = '[a-z](?:[0-9a-z-]*(?:[0-9a-z]))?';

    $scheme = '(?:(?P<scheme>[a-z][0-9a-z.+-]*?)://)';

    // Authority components.
    $userInfo = "(?:(?P<userinfo>(?P<username>(?:$unreserved|$pct|$sub)*)?:(?P<password>(?:$unreserved|$pct|$sub)*)?|(?:$unreserved|$pct|$sub|:)*?)@)?";
    $ip = "(?P<ip>$octet.$octet.$octet.$octet)";
    if ($requireMultiPartDomain)
    {
        $domain = "(?P<domain>(?:$label\.)+(?:$label))";
    }
    else
    {
        $domain = "(?P<domain>(?:$label\.)*(?:$label))";
    }
    $host = "(?P<host>$domain|$ip)";
    $port = '(?::(?P<port>\d+))?';

    // Primary hierarchical URI components.
    $authority = "(?P<authority>$userInfo$host$port(?=/|$))";
    $path = "(?P<path>/?(?:$pChar+/)*(?:$pChar+/?)?)";

    // Final bits.
    $query = "(?:\?(?P<query>$qfChar*?))?";
    $fragment = "(?:#(?P<fragment>$qfChar*))?";

    // Construct the final pattern.
    $pattern = '~^';

    // Only include scheme and authority if the path is not relative.
    if (!$isRelative)
    {
        if ($requireScheme)
        {
            // If the scheme is required, then the authority must also be there.
            $pattern .= $scheme . $authority;
        }
        else if ($requireAuthority)
        {
            $pattern .= "$scheme?$authority";
        }
        else
        {
            $pattern .= "(?:$scheme?$authority)?";
        }
    }
    else
    {
        // Disallow that optional slash we put in $path.
        $pattern .= '(?!/)';
    }

    // Now add standard elements and terminate the pattern.
    $pattern .= $path . $query . $fragment . '$~i';

    // Finally, validate that sucker!
    $components = array();
    $result = (bool)preg_match($pattern, $uri, $matches);
    if ($result)
    {
        // Filter out all of the useless numerical matches.
        foreach ($matches as $key => $value)
        {
            if (!is_int($key))
            {
                $components[$key] = $value;
            }
        }

        return true;
    }
    else
    {
        return false;
    }
}


回答2:

What's wrong with the built-in parse_url?



回答3:

Replace this bit:

(?P<extension>\w+)

With:

(?P<extension>\w+(?:\.\w+)?)

Where there (?:...) part is a non-capturing group, with the ? making it optional.


I'd probably go a step further and change that bit to this:

(?P<extension>[a-z]{2,10}(?:\.[a-z]{2,10})?)

Since the extension don't contain number or underscore, and are usually just 2/3 letters (I think .museum is longest, at 6... so 10 is probably a safe maximum).

If you do that, you might want a case-insensitive flag added, (or put A-Z in also).


Based on your comment, you want to make the subdomain part of the match 'lazy' (only match if it has to), and thus allow the extension to capture both parts.

To do that, simply add a ? to the end of the quanitifer, changing:

(?P<subdomain>[-\w\.]+)

to

(?P<subdomain>[-\w\.]+?)

And (in theory - haven't got PHP handy to test) that will only make the subdomain longer if it has to, so should allow the extension group to match appropriately.


Update:
Ok, assuming you've extracted the full hostname already (using parse_url as suggested in other Q/comments), try this for matching subdomain, domain, and extension parts:

^(?P<subdomains>(?:[\w-]+\.)*?)(?P<domain>[\w-]+(?P<extension>(?:\.[a-z]{2,10}){1,2}))$

This will leave a . on the end of the subdomain (and on the start of the extensio)n, but you can use a substr($string,0,-1) or similar to remove that.

Expanded form for readability:

^
(?P<subdomains>
  (?:[\w-]+\.)*?
)
(?P<domain>
  [\w-]+
  (?P<extension>
     (?:\.[a-z]{2,10}){1,2}
   )
)$

(can add comments to explain any of that, if necessary?)



回答4:

parse_url() does not work to extract subdomains and domain name extensions. You have to invent your own solution here.

I think a proper implementation would have to include a library of all domain name extensions, updated regularily.

https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains

http://www.iana.org/domains/root/db

https://publicsuffix.org/list/public_suffix_list.dat

https://raw.githubusercontent.com/publicsuffix/list/master/public_suffix_list.dat