Not really an answer to your question, but validating URLs is a serious p.i.t.a.
You're probably better off validating just the domain name and leaving the query part of the URL alone. That is my experience.
You could also resort to pinging the URL and seeing if it returns a valid response, but that might be too much for such a simple task.
Regular expressions to detect URLs are abundant; google it :)
All valid characters that can be used in a URI (a URL is a type of URI) are defined in RFC 3986.
All other characters can be used in a URL provided that they are "URL encoded" first. This involves replacing the invalid character with a specific code (usually the percent symbol (%) followed by two hexadecimal digits).
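As a rough illustration of the encoding step, here is a minimal sketch using Python's standard `urllib.parse` module. The sample string `raw` is made up for the example, and passing `safe=""` is an assumption that you want every reserved character encoded, including "/".

```python
from urllib.parse import quote, unquote

# Hypothetical input containing a space, a slash, a non-ASCII letter and "?".
raw = "a b/é?"

# safe="" asks quote() to percent-encode "/" as well (it is left alone by default).
encoded = quote(raw, safe="")
print(encoded)

# unquote() reverses the encoding.
decoded = unquote(encoded)
print(decoded)
```

Non-ASCII characters are first encoded as UTF-8 bytes, so a single character can turn into several %XX codes.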
The URL specification at http://url.spec.whatwg.org/ defines "URL code points" as:
ASCII alphanumeric, "!", "$", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", ":", ";", "=", "?", "@", "_", "~", and code points in the ranges U+00A0 to U+D7FF, U+E000 to U+FDCF, U+FDF0 to U+FFFD, U+10000 to U+1FFFD, U+20000 to U+2FFFD, U+30000 to U+3FFFD, U+40000 to U+4FFFD, U+50000 to U+5FFFD, U+60000 to U+6FFFD, U+70000 to U+7FFFD, U+80000 to U+8FFFD, U+90000 to U+9FFFD, U+A0000 to U+AFFFD, U+B0000 to U+BFFFD, U+C0000 to U+CFFFD, U+D0000 to U+DFFFD, U+E1000 to U+EFFFD, U+F0000 to U+FFFFD, U+100000 to U+10FFFD.
The term "URL code points" is then used in the statement:
If c is not a URL code point and not "%", parse error.
in several parts of the parsing algorithm, including the scheme, authority, relative path, query and fragment states: so basically the entire URL.
Also, the validator http://validator.w3.org/ passes URLs like "你好", and does not pass URLs containing characters like spaces, e.g. "a b".
Of course, as mentioned by Stephen C, it is not just about characters but also about context: you have to understand the entire algorithm. But since the class "URL code points" is used at key points of the algorithm, it gives a good idea of what you can use or not.
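To make the character class concrete, here is a minimal sketch of a per-character check built from the ranges quoted above. The function names `is_url_code_point` and `first_parse_error` are made up for this example, and this is only the character test, not the spec's full parsing algorithm (which also depends on context).

```python
import string

# ASCII alphanumerics plus the punctuation listed in the definition above.
_ASCII_OK = set(string.ascii_letters + string.digits + "!$&'()*+,-./:;=?@_~")

# Non-ASCII ranges from the definition above.
_RANGES = [(0x00A0, 0xD7FF), (0xE000, 0xFDCF), (0xFDF0, 0xFFFD)]
# U+10000..U+1FFFD through U+D0000..U+DFFFD follow one pattern per plane.
_RANGES += [(p << 16, (p << 16) | 0xFFFD) for p in range(0x1, 0xE)]
_RANGES += [(0xE1000, 0xEFFFD), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def is_url_code_point(ch):
    """Return True if the single character ch is in the quoted set."""
    if ch in _ASCII_OK:
        return True
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in _RANGES)


def first_parse_error(text):
    """Index of the first character that is neither a URL code point
    nor "%", or None if every character passes."""
    for i, ch in enumerate(text):
        if ch != "%" and not is_url_code_point(ch):
            return i
    return None


print(first_parse_error("你好"))  # no offending character
print(first_parse_error("a b"))   # the space is flagged
```

Under this check the square brackets in something like file[/].html are flagged too, since "[" and "]" are not in the list.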
In your supplementary question you asked if www.example.com/file[/].html is a valid URL.
That URL isn't valid because a URL is a type of URI and a valid URI must have a scheme like http: (see RFC 3986).
If you meant to ask if http://www.example.com/file[/].html is a valid URL then the answer is still no because the square bracket characters aren't valid there.
The square bracket characters are reserved for URLs in this format: http://[2001:db8:85a3::8a2e:370:7334]/foo/bar (i.e. an IPv6 literal instead of a host name).
It's worth reading RFC 3986 carefully if you want to understand the issue fully.
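The two objections above (no scheme, brackets outside the host) can be sketched with Python's standard `urllib.parse.urlsplit`. The helper name `check` and its messages are made up for this example; it only tests these two points and is nowhere near a full RFC 3986 validator.

```python
from urllib.parse import urlsplit


def check(url):
    """Return a list of problems found (hypothetical helper, two checks only)."""
    problems = []
    parts = urlsplit(url)

    # A URI needs a scheme such as "http".
    if not parts.scheme:
        problems.append("missing scheme")

    # "[" and "]" are only expected around an IP literal in the host,
    # so look for them in the path, query and fragment.
    rest = parts.path + parts.query + parts.fragment
    if "[" in rest or "]" in rest:
        problems.append("square bracket outside the host")

    return problems


for candidate in (
    "www.example.com/file[/].html",
    "http://www.example.com/file[/].html",
    "http://[2001:db8:85a3::8a2e:370:7334]/foo/bar",
):
    print(candidate, check(candidate))
```

Note that `urlsplit` itself is lenient and accepts all three strings; the checks are what flag the first two.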
The HTML URL Encoding Reference contains a list of the encodings for invalid characters.
Several Unicode character ranges are valid in HTML5, although it might still not be a good idea to use them.
E.g., the href docs at http://www.w3.org/TR/html5/links.html#attr-hyperlink-href refer to a "valid URL", and the definition of "valid URL" points to http://url.spec.whatwg.org/. That document is where the definition of URL code points quoted above comes from.
See also: Unicode characters in URLs