I need to uniquely identify and store some URLs. The problem is that they sometimes contain ".." segments, like http://somedomain.com/foo/bar/../../some/url, which is basically http://somedomain.com/some/url, if I'm not wrong.
Is there a Python function or a trick to resolve these URLs?
I wanted to comment on the `resolveComponents` function in the top response. Notice that if your path is `/`, the code will add another one, which can be problematic. I therefore changed the `IF` condition to:

There's a simple solution using `urlparse.urljoin`:
However, if there is no trailing slash (the last component is a file, not a directory), the last component will be removed.
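A quick way to see both behaviours is the sketch below. It assumes Python 3, where the `urlparse` module lives in `urllib.parse`, and it assumes `'.'` as the relative reference passed to `urljoin`; the variable names are only illustrative.

```python
from urllib.parse import urljoin

# Base URL ends with a slash: the last component is treated as a directory.
with_slash = urljoin('http://somedomain.com/foo/bar/../../some/url/', '.')

# No trailing slash: the last component is treated as a file and dropped.
without_slash = urljoin('http://somedomain.com/foo/bar/../../some/url', '.')

print(with_slash)     # http://somedomain.com/some/url/
print(without_slash)  # http://somedomain.com/some/
```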
This fix uses the `urlparse` function to extract the path, then uses (the `posixpath` version of) `os.path` to normalize the components. It compensates for a mysterious issue with trailing slashes, then joins the URL back together. The following is doctestable:

According to RFC 3986, this should happen as part of the "relative resolution" process, so the answer could be `urlparse.urljoin(url, '')`. But due to a bug, `urlparse.urljoin` does not remove dot segments when the second argument is an empty URL. You can use yurl, an alternative URL manipulation library. It does this right:

Those are file paths. Look at `os.path.normpath`:
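As a small illustration, the path part of the example URL can be normalized like this. The sketch assumes Python 3 and calls `posixpath` and `ntpath` directly (these are what `os.path` resolves to on Unix-like systems and on Windows respectively) so that the result does not depend on the machine it runs on.

```python
import ntpath
import posixpath

path = '/foo/bar/../../some/url'

# What os.path.normpath does on Unix-like systems.
unix_result = posixpath.normpath(path)

# What os.path.normpath does on Windows: the result uses backslashes,
# so they are converted back to forward slashes afterwards.
windows_result = ntpath.normpath(path).replace('\\', '/')

print(unix_result)     # /some/url
print(windows_result)  # /some/url
```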
EDIT:
If this is on Windows, your input path will use backslashes instead of slashes. In this case, you still need `os.path.normpath` to get rid of the `..` patterns (and `//` and `/./` and whatever else is redundant), then convert the backslashes to forward slashes:

EDIT 2:
If you want to normalize URLs, do it (before you strip off the method and such) with the `urlparse` module, as shown in the answer to this question.
EDIT 3:
It seems that `urljoin` doesn't normalize the base path it's given:

`normpath` by itself doesn't quite cut it either:

Note the initial double slash got eaten.
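Both observations can be reproduced with a short sketch. It assumes Python 3's `urllib.parse.urljoin` and an empty second argument, which may not be the exact calls originally used here.

```python
import posixpath
from urllib.parse import urljoin

url = 'http://somedomain.com/foo/bar/../../some/url'

# With an empty second argument, urljoin hands the base back untouched.
joined = urljoin(url, '')

# normpath treats the whole URL as a file path and collapses 'http://'.
flattened = posixpath.normpath(url)

print(joined)     # http://somedomain.com/foo/bar/../../some/url
print(flattened)  # http:/somedomain.com/some/url
```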
So we have to make them join forces:
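One way to combine them, as a rough sketch rather than the exact code this answer used: split the URL, run only the path through `posixpath.normpath`, and put the pieces back together. The name `normalize_url` is hypothetical, and Python 3's `urllib.parse` is assumed in place of the `urlparse` module.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Normalize only the path component of a URL (illustrative helper)."""
    parts = urlsplit(url)
    if not parts.path:
        # normpath('') would return '.', so leave an empty path alone.
        return url
    path = posixpath.normpath(parts.path)
    # normpath drops a trailing slash; restore it if the input had one.
    if parts.path.endswith('/') and not path.endswith('/'):
        path += '/'
    return urlunsplit(parts._replace(path=path))


print(normalize_url('http://somedomain.com/foo/bar/../../some/url'))
# http://somedomain.com/some/url
print(normalize_url('http://somedomain.com/foo/bar/../../some/dir/'))
# http://somedomain.com/some/dir/
```

Because only the path goes through `normpath`, the `//` after the scheme survives. Repeated slashes inside the path are still collapsed.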
Usage:
`urljoin` won't work, as it only resolves dot segments if the second argument isn't absolute(!?) or empty. Not only that, it doesn't handle excessive `..`s properly according to RFC 3986 (they should be removed; `urljoin` doesn't do so). `posixpath.normpath` can't be used either (much less `os.path.normpath`), since it resolves multiple slashes in a row to only one (e.g. `/////` becomes `/`), which is incorrect behavior for URLs.

The following short function resolves any URL path string correctly. It shouldn't be used with relative paths, however, since additional decisions about its behavior would then need to be made (Raise an error on excessive `..`s? Remove `.` in the beginning? Leave them both?). Instead, join URLs before resolving if you know you might handle relative paths. Without further ado:

This handles trailing dot segments (that is, without a trailing slash) and consecutive slashes correctly. To resolve an entire URL, you can then use the following wrapper (or just inline the path resolution function into it).
You can then call it like this:
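A minimal sketch of the approach described above, covering the path resolver, the whole-URL wrapper, and a few calls. The names `resolve_url_path` and `resolve_url` are hypothetical, raising `ValueError` on relative paths is an assumption made for this sketch, and Python 3's `urllib.parse` is assumed in place of the `urlparse` module.

```python
from urllib.parse import urlsplit, urlunsplit


def resolve_url_path(path):
    """Remove '.' and '..' segments from an absolute URL path.

    Assumption for this sketch: relative paths are rejected rather than
    guessed at, so join relative URLs onto a base before calling this.
    """
    if not path.startswith('/'):
        raise ValueError('expected an absolute path, got %r' % path)

    segments = path.split('/')[1:]  # drop the '' before the leading slash
    resolved = []
    for segment in segments:
        if segment == '.':
            continue                # '.' adds nothing
        if segment == '..':
            if resolved:
                resolved.pop()      # go up one level; excess '..'s are dropped
            continue
        resolved.append(segment)    # '' is kept, so '//' is preserved

    # A trailing '.' or '..' names a directory, so keep the final slash.
    if segments[-1] in ('.', '..'):
        resolved.append('')
    return '/' + '/'.join(resolved)


def resolve_url(url):
    """Resolve dot segments in the path of a complete URL."""
    parts = urlsplit(url)
    if not parts.path:              # e.g. 'http://somedomain.com'
        return url
    return urlunsplit(parts._replace(path=resolve_url_path(parts.path)))


print(resolve_url('http://somedomain.com/foo/bar/../../some/url'))
# http://somedomain.com/some/url
print(resolve_url('http://somedomain.com/a//b/../c/.'))
# http://somedomain.com/a//c/
print(resolve_url('http://somedomain.com/../../x'))
# http://somedomain.com/x
```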
Correct URL resolution has more than a few pitfalls, it turns out!