I'd like to know do I normalize a URL in python.
For example, If I have a url string like : "http://www.example.com/foo goo/bar.html"
I need a library in python that will transform the extra space (or any other non normalized character) to a proper URL.
Real fix in Python 2.7 for that problem
Right solution was:
For more information see Issue918368: "urllib doesn't correct server returned urls"
I encounter such an problem: need to quote the space only.
fullurl = quote(fullurl, safe="%/:=&?~#+!$,;'@()*[]")
do help, but it's too complicated.So I used a simple way:
url = url.replace(' ', '%20')
, it's not perfect, but it's the simplest way and it works for this situation.Have a look at this module: werkzeug.utils. (now in
werkzeug.urls
)The function you are looking for is called "url_fix" and works like this:
It's implemented in Werkzeug as follows:
Just FYI, urlnorm has moved to github: http://gist.github.com/246089
Because this page is a top result for Google searches on the topic, I think it's worth mentioning some work that has been done on URL normalization with Python that goes beyond urlencoding space characters. For example, dealing with default ports, character case, lack of trailing slashes, etc.
When the Atom syndication format was being developed, there was some discussion on how to normalize URLs into canonical format; this is documented in the article PaceCanonicalIds on the Atom/Pie wiki. That article provides some good test cases.
I believe that one result of this discussion was Mark Nottingham's urlnorm.py library, which I've used with good results on a couple projects. That script doesn't work with the URL given in this question, however. So a better choice might be Sam Ruby's version of urlnorm.py, which handles that URL, and all of the aforementioned test cases from the Atom wiki.
use
urllib.quote
orurllib.quote_plus
From the urllib documentation:
EDIT: Using urllib.quote or urllib.quote_plus on the whole URL will mangle it, as @ΤΖΩΤΖΙΟΥ points out:
@ΤΖΩΤΖΙΟΥ provides a function that uses urlparse.urlparse and urlparse.urlunparse to parse the url and only encode the path. This may be more useful for you, although if you're building the URL from a known protocol and host but with a suspect path, you could probably do just as well to avoid urlparse and just quote the suspect part of the URL, concatenating with known safe parts.