Possible Duplicate:
Which characters make a url invalid?
I'm trying to remove the non-URL parts of a big string. Most of the regexes I found look like [A-Za-z0-9-_.!~*'()], but a URL can contain more than that, for example http://127.0.0.1:8080/test?v=123#this
So what is the current set of characters that are valid in a URL?
EDIT:
They seem to be:
A-Za-z0-9-._~:/?#[]@!$&'()*+,;= and % followed by two hex digits
All the gory details can be found in the current RFC on the topic: RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax)
Based on this related answer, you are looking at a list like this (entries separated by spaces):
A-Z a-z 0-9 - . _ ~ : / ? # [ ] @ ! $ & ' ( ) * + , ; =
Everything else must be URL-encoded. Also, some of these characters can only appear in very specific spots in a URI; the RFC has all of those specifics.
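As a rough sketch of how that list could be turned into a regex, here is one way to do it in Python. The names URL_CHARS, PCT_ENCODED, has_only_url_chars and url_like_runs are made up for this example, and the pattern only checks which characters appear, not whether they sit in a position the RFC allows.

```python
import re

# Character class built from the list above (hypothetical name).
# "-", "[" and "]" are escaped because they are special inside a class.
URL_CHARS = r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=]"

# "%" is only allowed as the start of a percent-encoded octet: % plus two hex digits.
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

# One or more allowed characters or percent-encoded octets.
URL_RUN = re.compile(rf"(?:{URL_CHARS}|{PCT_ENCODED})+")


def has_only_url_chars(text: str) -> bool:
    """Return True if the whole string consists of URL-legal characters."""
    return URL_RUN.fullmatch(text) is not None


def url_like_runs(text: str) -> list[str]:
    """Return every maximal run of URL-legal characters found in text."""
    return URL_RUN.findall(text)


if __name__ == "__main__":
    print(has_only_url_chars("http://127.0.0.1:8080/test?v=123#this"))
    print(has_only_url_chars("not a url"))
    print(url_like_runs("see http://127.0.0.1:8080/test?v=123#this now"))
```

Note that ordinary words are made of URL-legal characters too, so url_like_runs also returns plain words from the surrounding text. To pull actual URLs out of a big string you would still need to anchor on something like a scheme (http://, https://) in front of this character set.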