I need to test for general URLs using any protocol (http, https, shttp, ftp, svn, mysql and things I don't know about).
My first pass is this:
\w+://(\w+\.)+[\w+](/[\w]+)(\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?
(PCRE and .NET so nothing to fancy)
I need to test for general URLs using any protocol (http, https, shttp, ftp, svn, mysql and things I don't know about).
My first pass is this:
\w+://(\w+\.)+[\w+](/[\w]+)(\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?
(PCRE and .NET so nothing to fancy)
According to RFC2396:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
adding that RegEx as a wiki answer:
[\w+-]+://([a-zA-Z0-9]+\.)+[[a-zA-Z0-9]+](/[%\w]+)(\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?
option 2 (Re CMS)
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
But that's to lax for anything sane so trimmed to make it more restrictive and to differentiate off other things.
proto :// name : pass @ server :port /path ? args
^([^:/?#]+)://(([^/?#@:]+(:[^/?#@:]+)?@)?[^/?#@:]+(:[0-9]+)?)(/[^?#]*)(\?([^#]*))?
I came at this from a slightly different direction. I wanted to emulate gchats ability to match something.co.uk
and linkify it. So I went with a regex that looks for a .
without either a following period or a space on either side and then grabs everything around it until it hits whitespace. It does match a period at the end of a URI but I'm taking that off later. So this could be an option if you would prefer false positives over missing some potentials
url_re = re.compile(r"""
[^\s] # not whitespace
[a-zA-Z0-9:/\-]+ # the protocol and domain name
\.(?!\.) # A literal '.' not followed by another
[\w\-\./\?=&%~#]+ # country and path components
[^\s] # not whitespace""", re.VERBOSE)
url_re.findall('http://thereisnothing.com/a/path adn some text www.google.com/?=query#%20 https://somewhere.com other-countries.co.nz. ellipsis... is also a great place to buy. But try text-hello.com ftp://something.com')
['http://thereisnothing.com/a/path',
'www.google.com/?=query#%20',
'https://somewhere.com',
'other-countries.co.nz.',
'text-hello.com',
'ftp://something.com']