First, let's define a "URL" according to my requirements.
The only protocols optionally allowed are http:// and https://,
then a mandatory domain name like stackoverflow.com,
then optionally the rest of the URL components (path, query, hash, ...).
For reference, here is a list of valid and invalid URLs according to my requirements:
VALID
- stackoverflow.com
- stackoverflow.com/questions/ask
- https://stackoverflow.com/questions/ask
- http://www.amazon.com/Computers-Internet-Books/b/ref=bhp_bb0309A_comint2?ie=UTF8&node=5&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=browse&pf_rd_r=0AH7GM29WF81Q72VPFDH&pf_rd_t=101&pf_rd_p=1273387142&pf_rd_i=283155
- amazon.com/Computers-Internet-Books/b/ref=bhp_bb0309A_comint2?ie=UTF8&node=5&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=browse&pf_rd_r=0AH7GM29WF81Q72VPFDH&pf_rd_t=101&pf_rd_p=1273387142&pf_rd_i=283155
- http://test-site.com (filter_var rejects this! I have domain names with dashes)
INVALID
- http://www (PHP's filter_var allows this; yes, I know it is a valid URL)
- http://www..des (PHP's filter_var allows this)
- Any URL with disallowed characters in the domain name
For completeness, here is my PHP version: 5.3.2-1ubuntu4.2
You could use parse_url to break up the address into its components. While it's explicitly not built to validate a URL, analyzing the resulting components and matching them against your requirements would at least be a start.
It may vary, but in most cases you don't really need to check the validity of a URL.
If it's vital information and you trust your users enough to let them supply it through a URL, you can trust them enough to supply a valid URL.
If it isn't vital information, then you just have to check for XSS attempts and display the URL that the user wanted.
You can manually add "http://" if you don't detect one, to avoid navigation problems.
I know I'm not giving you an alternative as a solution, but maybe the best way to solve performance and validity problems is simply to avoid unnecessary checks.
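To make the two suggestions above concrete (prepending "http://" when no scheme is present, then breaking the address up with parse_url), here is a minimal sketch. The helper names normalize_url and split_url are made up for illustration, and the checks are only the basic ones from the question (http/https scheme, non-empty host); it does not validate the characters of the domain itself.

```php
<?php
// Hypothetical helper: prepend "http://" when the input carries no scheme,
// so that parse_url() reports a host instead of treating it all as a path.
function normalize_url($input)
{
    $input = trim($input);
    if (!preg_match('#^[a-z][a-z0-9+.\-]*://#i', $input)) {
        $input = 'http://' . $input;
    }
    return $input;
}

// Hypothetical helper: returns the parse_url() components, or false when
// the scheme is not http/https or no host could be extracted.
function split_url($input)
{
    $parts = parse_url(normalize_url($input));
    if ($parts === false || empty($parts['scheme']) || empty($parts['host'])) {
        return false;
    }
    if (!in_array(strtolower($parts['scheme']), array('http', 'https'))) {
        return false;
    }
    return $parts;
}

$parts = split_url('stackoverflow.com/questions/ask');
// $parts['host'] is 'stackoverflow.com', $parts['path'] is '/questions/ask'

// When the URL is later printed into HTML, escape it first (XSS).
$safe = htmlspecialchars(normalize_url('stackoverflow.com/questions/ask'), ENT_QUOTES, 'UTF-8');
```

Note that this still accepts hosts such as "www", so the domain part needs its own check if that matters to you.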
As a starting point you can use this one; it's for JS, but it's easy to convert it to work with PHP's preg_match.
For PHP, this one should work:
This regexp validates only the domain part anyway, but you can build on it, or split the URL at the first slash '/' (after "://") and validate the domain part and the rest separately.
BTW: it would also validate "http://www.domain.com.com", but this is not an error, because a subdomain URL could look like "http://www.subdomain.domain.com", and that's valid! There is almost no way (or at least no operationally easy way) to validate for a proper TLD with a regex, because you would have to write every possible TLD into your regex ONE BY ONE, like this:
(this last one, for instance, would validate only domains ending in .com/.net/.de/.it/.co.uk). New TLDs come out all the time, so you would have to adjust your regex every time a new TLD comes out. That's a pain in the neck!
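As an illustration of the split-then-validate idea, here is a rough sketch. It is not the regex referred to above: DOMAIN_PATTERN and is_valid_domain_url are hypothetical names, and the pattern is only an assumption about what "allowed characters" means (letters, digits and inner dashes per label, plus an alphabetic TLD of two or more letters, with no TLD list).

```php
<?php
// Hypothetical pattern: one or more dot-terminated labels (letters, digits,
// dashes only in the middle) followed by an alphabetic TLD of 2+ letters.
// The TLD is deliberately not checked against any list.
define('DOMAIN_PATTERN', '/^(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}$/i');

function is_valid_domain_url($url)
{
    // Drop an optional http:// or https:// prefix.
    $rest = preg_replace('#^https?://#i', '', trim($url));
    // Everything before the first "/", "?" or "#" is treated as the domain.
    $pieces = preg_split('@[/?#]@', $rest, 2);
    return preg_match(DOMAIN_PATTERN, $pieces[0]) === 1;
}

var_dump(is_valid_domain_url('http://test-site.com')); // bool(true)
var_dump(is_valid_domain_url('http://www'));           // bool(false)
var_dump(is_valid_domain_url('http://www..des'));      // bool(false)
```

This sketch ignores ports, userinfo, IP addresses and internationalized domain names, and it leaves the path, query and hash unchecked, so treat it as a base to adapt rather than a complete validator.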