I'm building an app on Google App Engine. I'm incredibly new to Python and have been beating my head against the following problem for the past 3 days.
I have a class to represent an RSS Feed and in this class I have a method called setUrl. Input to this method is a URL.
I'm trying to use the Python re module to validate the URL against the regex from RFC 3986 (http://www.ietf.org/rfc/rfc3986.txt).
Below is a snippet which should work?
    p = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')
    m = p.match(url)
    if m:
        self.url = url
        return url
An easy way to parse (and validate) URLs is the urlparse module (urllib.parse in Python 3). A regex is too much work.
There's no "validate" method because almost anything is a valid URL. There are some punctuation rules for splitting it up. Absent any punctuation, you still have a valid URL.
Check the RFC carefully and see if you can construct an "invalid" URL. The rules are very flexible.
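To see how permissive the splitting rules are, here is a minimal sketch that runs the RFC 3986 Appendix B expression from the question against a few strings. The helper name split_uri and the sample inputs are made up for illustration. Because every component in the expression is optional, the match never fails, so testing "if m:" cannot reject anything.

```python
import re

# Generic URI-splitting expression from RFC 3986, Appendix B.
URI_SPLIT = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'
)

def split_uri(text):
    """Return (scheme, authority, path, query, fragment) for any string.

    Hypothetical helper, not part of the standard library.
    """
    m = URI_SPLIT.match(text)
    # Every component is optional, so m is never None here.
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)

# Sample inputs are illustrative only.
for candidate in ["http://example.com/feed?x=1#top", ":::::", "", "not a url"]:
    print(repr(candidate), split_uri(candidate))
```

Components that are absent come back as None; the path is always a string, possibly empty.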
For example, ":::::" is a valid URL. The path is ":::::". A pretty stupid filename, but a valid filename.

Also, "/////" is a valid URL. The netloc ("hostname") is "". The path is "///". Again, stupid. Also valid. This URL normalizes to "///", which is the equivalent.

Something like "bad://///worse/////" is perfectly valid. Dumb but valid.

Bottom line: parse it, and look at the pieces to see if they're displeasing in some way.
Do you want the scheme to always be "http"? Do you want the netloc to always be "www.somename.somedomain"? Do you want the path to look unix-like? Or windows-like? Do you want to remove the query string? Or preserve it?
These are not RFC-specified validations. These are validations unique to your application.
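As a rough sketch of what such application-specific checks could look like, the function below parses the URL and then inspects the pieces. It uses Python 3's urllib.parse (the urlparse module in Python 2). The function name check_feed_url and the two rules are assumptions chosen for illustration; your own rules will differ.

```python
from urllib.parse import urlparse

# Assumption: the application only fetches feeds over HTTP(S).
ALLOWED_SCHEMES = {"http", "https"}

def check_feed_url(url):
    """Return a list of application-level complaints (empty if acceptable).

    Hypothetical helper; the rules here are examples, not RFC requirements.
    """
    parts = urlparse(url)
    problems = []
    if parts.scheme not in ALLOWED_SCHEMES:
        problems.append("scheme must be http or https")
    if not parts.netloc:
        problems.append("missing host name")
    return problems

print(check_feed_url("http://example.com/rss.xml"))
print(check_feed_url("bad://///worse/////"))
```

A setUrl method could then store the URL only when the returned list is empty, and report the complaints otherwise.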
Here's the complete regexp to parse a URL.
Given its complexity, I think you should go the urlparse way.
For completeness, here's the pseudo-BNF of the above regex (as documentation):
I admit, I find your regular expression totally incomprehensible. I wonder if you could use urlparse instead? Something like:
It might be slower, and maybe you'll miss conditions, but it seems (to me) a lot easier to read and debug than a regular expression for URLs.
Note: Lepl is no longer maintained or supported.
RFC 3696 defines "best practices" for URL validation - http://www.faqs.org/rfcs/rfc3696.html
The latest release of Lepl (a Python parser library) includes an implementation of RFC 3696. You would use it something like:
Although the validators are defined in Lepl, which is a recursive descent parser, they are largely compiled internally to regular expressions. That combines the best of both worlds - a (relatively) easy to read definition that can be checked against RFC 3696 and an efficient implementation. There's a post on my blog showing how this simplifies the parser - http://www.acooke.org/cute/LEPLOptimi0.html
Lepl is available at http://www.acooke.org/lepl and the RFC 3696 module is documented at http://www.acooke.org/lepl/rfc3696.html
This is completely new in this release, so may contain bugs. Please contact me if you have any problems and I will fix them ASAP. Thanks.
I'm using the one used by Django and it seems to work pretty well:
You can always check the latest version here: https://github.com/django/django/blob/master/django/core/validators.py#L74
urlparse quite happily takes invalid URLs; it is more a string-splitting library than any kind of validator. For example:

Depending on the situation, this might be fine.
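A minimal sketch of that leniency, using Python 3's urllib.parse; the helper name describe and the sample strings are made up for illustration. None of these inputs raises an error. A few malformed inputs, such as an unbalanced IPv6 bracket in the host, do raise ValueError, but most junk is simply split into pieces.

```python
from urllib.parse import urlparse

def describe(text):
    """Return (scheme, netloc, path) as urlparse sees them (hypothetical helper)."""
    parts = urlparse(text)
    return parts.scheme, parts.netloc, parts.path

# Sample strings are illustrative only; none looks like a usable web address.
for text in ["not a url at all", "bad://///worse/////", "http:"]:
    print(repr(text), describe(text))
```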
If you mostly trust the data, and just want to verify the protocol is HTTP, then urlparse is perfect.
If you want to make sure the URL is actually a legal URL, use the ridiculous regex.
If you want to make sure it's a real web address,