But those block choices seem bizarre and arbitrary given the latest Unicode block definitions; this is probably because the blocks have been added to in the decade since RFC 3987 was written. The WHATWG's in-progress spec has a more generous list:
U+00A0 to U+D7FF, U+E000 to U+FDCF, U+FDF0 to U+FFFD, U+10000 to U+1FFFD, U+20000 to U+2FFFD, U+30000 to U+3FFFD, U+40000 to U+4FFFD, U+50000 to U+5FFFD, U+60000 to U+6FFFD, U+70000 to U+7FFFD, U+80000 to U+8FFFD, U+90000 to U+9FFFD, U+A0000 to U+AFFFD, U+B0000 to U+BFFFD, U+C0000 to U+CFFFD, U+D0000 to U+DFFFD, U+E0000 to U+EFFFD, U+F0000 to U+FFFFD, U+100000 to U+10FFFD
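As a rough illustration, the ranges listed above can be turned into a small membership check. This is only a sketch in Python: the names RANGES and in_listed_ranges are made up for this example, and it covers just the non-ASCII ranges from the list, not the ASCII rules.

```python
# Non-ASCII code point ranges copied from the list above.
RANGES = [(0x00A0, 0xD7FF), (0xE000, 0xFDCF), (0xFDF0, 0xFFFD)]
# Planes 1 to 16: U+10000-U+1FFFD up to U+100000-U+10FFFD.
RANGES += [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 0x11)]


def in_listed_ranges(ch: str) -> bool:
    """Return True if the single character ch falls in one of the listed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in RANGES)


print(in_listed_ranges("ü"))        # inside U+00A0 to U+D7FF
print(in_listed_ranges("\ufdd0"))   # falls in the gap between U+FDCF and U+FDF0
```

ASCII characters always come back False here, because the list only describes what is allowed beyond ASCII.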
Excluded US-ASCII Characters disallowed within the URI syntax:
control = <US-ASCII coded characters 00-1F and 7F hexadecimal>
space = <US-ASCII coded character 20 hexadecimal>
delims = "<" | ">" | "#" | "%" | <">
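I came up with a couple of regular expressions for PHP that convert URLs in text to anchor tags. The first converts every www. URL to an http:// URL; the second wraps every URL that starts with http:// or https:// in an <a href="..."> HTML link.
// Step 1: prefix bare www. addresses with http:// (only when preceded by whitespace).
$string = preg_replace('/(\s)((www\.)([!#$&-;=?\-\[\]_a-z~%]+))/sim', '$1http://$2', $string);
// Step 2: wrap every http:// or https:// URL in an anchor tag.
$string = preg_replace('/(https?:\/\/)([!#$&-;=?\-\[\]_a-z~%]+)/sim', '<a href="$1$2">$2</a>', $string);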
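For anyone not on PHP, here is roughly the same two-step idea sketched in Python. The function name linkify and the constant URL_CHARS are my own for this example; the character class is copied from the PHP patterns above, and I only kept the case-insensitive flag since the other two flags don't affect these patterns.

```python
import re

# Character class copied from the PHP patterns; IGNORECASE makes a-z cover A-Z too.
URL_CHARS = r"[!#$&-;=?\-\[\]_a-z~%]+"


def linkify(text: str) -> str:
    # Step 1: give bare "www." addresses an http:// scheme (only after whitespace).
    text = re.sub(r"(\s)(www\." + URL_CHARS + ")", r"\1http://\2",
                  text, flags=re.IGNORECASE)
    # Step 2: wrap each http:// or https:// URL in an anchor tag.
    return re.sub(r"(https?://)(" + URL_CHARS + ")", r'<a href="\1\2">\2</a>',
                  text, flags=re.IGNORECASE)


print(linkify("see www.example.com/a?b=1 now"))
```

Two limitations carry over from the original patterns: a www. address at the very start of the string is not picked up, because step 1 requires whitespace in front of it, and a trailing full stop gets swallowed into the link, because "." is in the character class.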
In general, URIs as defined by RFC 3986 (see Section 2: Characters) may contain any of the following characters:
A-Z, a-z, 0-9, -, ., _, ~, :, /, ?, #, [, ], @, !, $, &, ', (, ), *, +, ,, ;, = and %
Note that this list doesn't state where in the URI these characters may occur.
Any other character needs to be encoded with percent-encoding (%hh). Each part of the URI has further restrictions about which characters need to be represented by a percent-encoded word.
Most of the existing answers here are impractical because they totally ignore the real-world usage of addresses like:
Okay, so according to RFC 3986, such addresses are not URIs (and therefore not URLs, since a URL is a type of URI). If we consider ourselves beholden to the terminology of existing IETF standards, then we should properly call them IRIs (Internationalized Resource Identifiers), as defined in RFC 3987, which are technically not URIs but can be converted to URIs simply by percent-encoding all non-ASCII characters in the IRI. Normal people, though, have never heard of IRIs and simply call these URIs or URLs (and indeed there's a WHATWG effort underway to create a new, broader URL spec that simply classifies all "URIs" and "IRIs" as "URLs" to align with modern usage of those terms in the real world).
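To make the "percent-encode all non-ASCII characters" conversion concrete, here is a minimal sketch in Python. The function name iri_to_uri, the SAFE constant and the sample address are my own illustrations. It relies on urllib.parse.quote, which encodes non-ASCII characters as percent-encoded UTF-8 bytes.

```python
from urllib.parse import quote

# RFC 3986 reserved characters plus "%", left untouched so that the URL
# structure and any existing percent-escapes survive the conversion.
SAFE = ":/?#[]@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Percent-encode everything outside the reserved/unreserved ASCII set."""
    return quote(iri, safe=SAFE)


# Illustrative address, not taken from any spec.
print(iri_to_uri("http://www.example.com/düsseldorf?neighbourhood=Lörick"))
```

Note that this treats the whole string uniformly. As far as I know, real-world software usually handles a non-ASCII host name separately (via IDNA) instead of percent-encoding it, so take this as a sketch for the path and query parts.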
Suppose we want to adopt this meaning of URL immediately (which puts us at odds with the IETF specs, but aligns us with everyday usage). In that case, what characters are valid in a URL?
First of all, we have two types of RFC 3986 reserved characters:
:/?#[]@, which are part of the generic syntax for a URI defined in RFC 3986
!$&'()*+,;=, which aren't part of the RFC's generic syntax, but are reserved for use as syntactic components of particular URI schemes. For instance, semicolons and commas are used as part of the syntax of data URIs, and & and = are used as part of the ubiquitous ?foo=bar&qux=baz format in query strings (which isn't specified by RFC 3986).
Any of the reserved characters above can be legally used in a URI without encoding, either to serve their syntactic purpose or just as literal characters in data in some places where such use could not be misinterpreted as the character serving its syntactic purpose. (For example, although / has syntactic meaning in a URL, you can use it unencoded in a query string, because it doesn't have meaning in a query string.)
RFC 3986 also specifies some unreserved characters, which can always be used simply to represent data without any encoding:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~
Finally, the % character itself is allowed for percent-encodings.
That leaves only the following ASCII characters that are forbidden from appearing in a URL:
"<>\^`{|}
Every other printable ASCII character can legally feature in a URL. (Control characters and the space are not allowed either and have to be percent-encoded.)
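The arithmetic above is easy to double-check in code. The following Python sketch builds the allowed set from the reserved characters, the unreserved characters and %, then subtracts it from the printable ASCII range. All names in it (ALLOWED, FORBIDDEN_PRINTABLE, illegal_chars) are made up for this example.

```python
import string

GEN_DELIMS = ":/?#[]@"          # generic-syntax reserved characters
SUB_DELIMS = "!$&'()*+,;="      # scheme-specific reserved characters
UNRESERVED = string.ascii_letters + string.digits + "-._~"
ALLOWED = set(GEN_DELIMS + SUB_DELIMS + UNRESERVED + "%")

# Printable, non-space ASCII is 0x21 through 0x7E.
PRINTABLE = {chr(c) for c in range(0x21, 0x7F)}
FORBIDDEN_PRINTABLE = PRINTABLE - ALLOWED


def illegal_chars(url: str) -> list:
    """Return the ASCII characters in url that may not appear unencoded."""
    return [ch for ch in url if ord(ch) < 0x80 and ch not in ALLOWED]


print(sorted(FORBIDDEN_PRINTABLE))
print(illegal_chars("http://example.com/a b|c"))
```

The set difference comes out as exactly the nine characters listed above. Bear in mind that illegal_chars only says whether a character may appear anywhere at all; it does not check that the character is in a legal position.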
Then RFC 3987 extends that set of unreserved characters with the following Unicode character ranges:
But those block choices seem bizarre and arbitrary given the latest Unicode block definitions; this is probably because the blocks have been added to in the decade since RFC 3987 was written. The WHATWG's in-progress spec has a more generous list:
Of course, it should be noted that simply knowing which characters can legally appear in a URL isn't sufficient to recognise whether some given string is a legal URL or not, since some characters are only legal in particular parts of the URL. For example, the reserved characters [ and ] are legal as part of an IPv6 literal host in a URL like http://[1080::8:800:200C:417A]/foo but aren't legal in any other context, so the OP's example of http://example.com/file[/].html is illegal.
Use urlencode to allow arbitrary characters in your URL.
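As an illustration of encoding arbitrary characters, here is a short Python sketch using urllib.parse.quote, which plays a similar role there. The variable names are my own, and the file name is the one from the example URL above.

```python
from urllib.parse import quote, unquote

# A file name containing characters that are not legal in a URL path as-is.
segment = "file[/].html"

# safe="" makes quote() encode "/" as well, so the whole string stays one path segment.
encoded = quote(segment, safe="")
url = "http://example.com/" + encoded

print(url)
print(unquote(encoded))  # decoding gives the original file name back
```

Encoding each segment separately, before joining with "/", is what keeps a literal slash in the data from being read as a path separator.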
I needed to pick a character for splitting the URLs in a string, so I decided to compile my own list of characters that cannot occur in a URL:
So, the possible choices are the newline, tab, space, backslash and "<>{}^|. I guess I'll go with the space or newline. :)
To add some clarification and directly address the question above, there are several classes of characters that cause problems for URLs and URIs.
There are some characters that are disallowed and should never appear in a URL/URI, reserved characters (described below), and other characters that may cause problems in some cases, but are marked as "unwise" or "unsafe". Explanations for why the characters are restricted are clearly spelled out in RFC-1738 (URLs) and RFC-2396 (URIs). Note the newer RFC-3986 (update to RFC-1738) defines the construction of what characters are allowed in a given context but the older spec offers a simpler and more general description of which characters are not allowed with the following rules.
Excluded US-ASCII Characters disallowed within the URI syntax:
Unwise characters, which are allowed but may cause problems:
Characters that are reserved within a query component and/or have special meaning within a URI/URL:
The "reserved" syntax class above refers to those characters that are allowed within a URI, but which may not be allowed within a particular component of the generic URI syntax. Characters in the "reserved" set are not reserved in all contexts. The authority part, for example, can contain an optional username in front of the hostname, so it could be something like ftp://user@hostname/ where the '@' character has special meaning.
Here is an example of a URL that has invalid and unwise characters (e.g. '$', '[', ']') and should be properly encoded:
Some of the character restrictions for URIs/URLs are programming-language dependent. For example, the '|' (0x7C) character, although only marked as "unwise" in the URI spec, causes the Java java.net.URI constructor to throw a URISyntaxException, so a URL like http://api.google.com/q?exp=a|b is not allowed and must instead be encoded as http://api.google.com/q?exp=a%7Cb if using Java with a URI object instance.
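To avoid this class of problem regardless of language, it helps to build the query string with an encoder and not by hand. Here is a small sketch in Python (the variable names are my own, and the URL is the one from the example above), using urllib.parse.urlencode:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base = "http://api.google.com/q"

# urlencode() percent-encodes the "|" in the value as %7C.
url = base + "?" + urlencode({"exp": "a|b"})
print(url)

# Parsing the query again recovers the original value.
print(parse_qs(urlsplit(url).query))
```

The encoded form is plain ASCII from the allowed set, so a stricter parser such as the Java one mentioned above should have nothing to complain about.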