Using PHP, I need to check if a string contains an IPv6 address - and then extract that IPv6 address if it does.
I've got a regex that is matching a string if it's exactly an IPv6:
$matches = [];
$regex = '/^(((?=.*(::))(?!.*\3.+\3))\3?|([\dA-F]{1,4}(\3|:\b|$)|\2))(?4){5}((?4){2}|(((2[0-4]|1\d|[1-9])?\d|25[0-5])\.?\b){4})\z/i';
preg_match($regex, $ipv6, $matches);
What I'm stuck with is being able to add a wildcard on either side, so I can match things like:
- http://2001:0db8:85a3:0000:0000:8a2e:0370:7334/something/page.html
- http://2001:0db8:85a3:0000:0000:8a2e:0370:7334
- 2001:0db8:85a3:0000:0000:8a2e:0370:7334/something/page.html
Ultimately I need to do this so I can wrap square brackets around an IPv6 address, so it conforms to RFC 3986 (e.g http://[2001:0db8:85a3:0000:0000:8a2e:0370:7334]/something/page.html
).
You need to use another regexp like this:
(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))
After it, you can wrap ipv6 in your link:
<?php
$ipv6 = 'http://2001:0db8:85a3:0000:0000:8a2e:0370:7334/something/page.html';
$matches = [];
$regex = '(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))';
if (preg_match($regex, $ipv6, $matches)) {
$result = str_replace($matches[0], '[' . $matches[0] . ']', $ipv6);
}
You don't need difficult to read and understand regex
in order to verify if a string is a valid IPv6 address. The PHP function filter_var()
can do the heavylifting for you:
echo(filter_var('2001:0db8:85a3:0000:0000:8a2e:0370:7334', FILTER_VALIDATE_IP));
# 2001:0db8:85a3:0000:0000:8a2e:0370:7334
echo(filter_var('2001:0db8:85a3::8a2e:0370:7334', FILTER_VALIDATE_IP));
# 2001:0db8:85a3::8a2e:0370:7334
echo(filter_var('192.168.0.1', FILTER_VALIDATE_IP));
# 192.168.0.1
var_dump(filter_var('192.168.0.1', FILTER_VALIDATE_IP, FILTER_FLAG_IPV6));
# bool(false)
It returns the input value if it is valid (according to the filter passed as the second argument and the options passed as the third argument) or FALSE otherwise.
If the IP address is the domain of an URL then the PHP function parse_url()
can be used to extract it:
print_r(parse_url('http://2001:0db8:85a3:0000:0000:8a2e:0370:7334/something/page.html'));
# Array
# (
# [scheme] => http
# [host] => 2001:0db8:85a3:0000:0000:8a2e:0370:7334
# [path] => /something/page.html
# )
The last string in your example (2001:0db8:85a3:0000:0000:8a2e:0370:7334/something/page.html
) is not an URL. It is just some random text that happens to look like an incomplete (and invalid) URL. I don't have a simple solution for it :-(
Brief
I haven't fully tested the code so I can't be 100% sure it works, but, I ran it against a few different URLs and it seems to be working correctly.
I've taken parts of these answers:
- Looking to build some regex to validate domain names (RFC 952/ RFC 1123)
- What is the RFC complicant and working regular expression to check if a string is a valid URL
- Regular expression that matches valid IPv6 addresses
Answer
Answer with domain names
This is what I've come up with:
(?(DEFINE)
(?<scheme>[a-z][a-z0-9+.-]*)
(?<userpass>([^:@\/](:[^:@\/])?@))
(?<domain>[a-z0-9]+(-[a-z0-9]+)*(\.[a-z0-9]+(-[a-z0-9]+)*)+)
(?<ip>(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])))
(?<host>((?&domain)|(?&ip)))
(?<port>(:[\d]{1,5}))
(?<path>([^?;\#]*))
(?<query>(\?[^\#;]*))
(?<anchor>(\#.*))
)
^(?:(?&scheme):\/\/)?(?&userpass)?(?<address>(?&host))(?&port)?\/?(?&path)?(?&query)?(?&anchor)?$
Follow this link to see it in use
Answer without domain names
The above regex will match URLs containing valid domains (whether it be a domain name or address). If you want to match only IP addresses, use the following regex (which includes a simple change in the definition group named host
- I removed the reference to the definition group named domain
)
(?(DEFINE)
(?<scheme>[a-z][a-z0-9+.-]*)
(?<userpass>([^:@\/](:[^:@\/])?@))
(?<domain>[a-z0-9]+(-[a-z0-9]+)*(\.[a-z0-9]+(-[a-z0-9]+)*)+)
(?<ip>(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])))
(?<host>(?&ip))
(?<port>(:[\d]{1,5}))
(?<path>([^?;\#]*))
(?<query>(\?[^\#;]*))
(?<anchor>(\#.*))
)
^(?:(?&scheme):\/\/)?(?&userpass)?(?<address>(?&host))(?&port)?\/?(?&path)?(?&query)?(?&anchor)?$
Follow this link to see it in use
For those who like a nice long non-legible query you can use the following regex which is equivalent to the one above.
^(?:[a-z][a-z0-9+.-]*:\/\/)?(?:[^:@\/](?::[^:@\/])?@)?(?<address>(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9]))(?::[\d]{1,5})?\/?(?:[^?;\#]*)?(?:\?[^\#;]*)?(?:\#.*)?$
Note: Both answers use the i
(case insensitive) and x
(ignore whitespace) modifiers