I had this url regex pattern in place:
$pattern = "@\b(https?://[^\s()<>\[\]\{\}]{1,".$max_length_allowed_for_each_url."}(?:\([\w\d]+\)|([^[:punct:]\s]|/)))@";
It seemed to work pretty well at validating any URL I threw at it, until I realized that it accepts https://http://google.com as a valid URL, when it certainly is not. (Apparently even Stack Overflow considers it valid: it made that URL clickable, not me, although it did remove one of the colons. So perhaps I am out of luck?)
I did a little research and learned that I should be using filter_var instead of a regex for PHP URL validation anyway, and was disappointed to find that it too accepts this very same URL.
I could easily conquer it with:
$url = str_replace(array("https://http://","http://https://"), array("http://","https://"), $url);
But... that just seems so wrong.
Well, it is a valid URI. Technically. Look at the RFC for URIs if you don't believe me.
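The URL reads as scheme https, host http, an empty port, and path //google.com, and each piece is legal:
The path is allowed to contain //.
http is a valid host name.
The port is allowed to be empty when : is present (it's specified as *digit, not 1*digit). (This is why Stack Overflow removed the colon -- it thought you were using the default port, so it removed it from the URI.)
I suggest writing a special case for this. In a separate step, check to see if the URI starts with https?://https?:// and fix it.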
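A minimal sketch of that separate step, assuming you want to keep the inner scheme (which is what the str_replace call in the question does). The function name collapse_doubled_scheme is made up for illustration; preg_replace is the only library call used.

```php
<?php
// Hypothetical helper (name is illustrative, not from the original post):
// if the URL starts with two scheme prefixes in a row, drop the first one.
function collapse_doubled_scheme(string $url): string
{
    // Anchored with ^ so only a leading "https?://https?://" is touched;
    // a URL embedded later in the path or query string is left alone.
    // The capture group keeps the second (inner) scheme. That choice is
    // an assumption; swap the group to keep the outer scheme instead.
    $fixed = preg_replace('@^https?://(https?://)@i', '$1', $url);

    // preg_replace() returns null on a regex error; fall back to the input.
    return $fixed ?? $url;
}
```

Run it before validation, for example collapse_doubled_scheme('https://http://google.com') gives 'http://google.com', which can then go through filter_var or the regex as usual. It only removes one extra prefix per call, so a tripled scheme would need a loop.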