Canonicalize / normalize a URL?

Posted 2019-01-19 00:12

Question:

I am looking for a library function that normalizes a URL in Python: one that removes "./" or "../" parts from the path, adds a default port, escapes special characters, and so on. The result should be a string that is identical for any two URLs pointing to the same web page. For example, http://google.com and http://google.com:80/a/../ should return the same result.

I would prefer Python 3 and have already looked through the urllib module. It offers functions to split URLs but nothing to canonicalize them. Java has the URI.normalize() function, which does something similar (though it does not treat the default port 80 as equal to no port), but is there something like this in Python?

Answer 1:

How about this:

In [1]: from urllib.parse import urljoin

In [2]: urljoin('http://example.com/a/b/c/../', '.')
Out[2]: 'http://example.com/a/b/'

Inspired by answers to this question. It doesn't normalize ports, but it should be simple to whip up a function that does.
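
A minimal sketch of such a port-normalizing function, using only urllib.parse. The name strip_default_port and the DEFAULT_PORTS table are made up for illustration, and only http and https are covered here; treat it as a starting point rather than a complete solution.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical table of default ports; extend it for other schemes as needed.
DEFAULT_PORTS = {'http': 80, 'https': 443}

def strip_default_port(url):
    """Drop an explicit port if it is the default port for the URL's scheme."""
    parts = urlsplit(url)
    netloc = parts.netloc
    # parts.port is None when the URL carries no explicit port.
    if parts.port is not None and parts.port == DEFAULT_PORTS.get(parts.scheme):
        # Cut off the trailing ":<port>", keeping userinfo and host as written.
        netloc = netloc.rsplit(':', 1)[0]
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

Note that this only touches the port. The path is left as it is, so "../" parts still have to be resolved separately.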



Answer 2:

This is what I use and it's worked so far. You can get urlnorm from pip.

Notice that I sort the query parameters. I've found this to be essential.

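# Python 2 imports; on Python 3 these names live in urllib.parse.
from urlparse import urlsplit, urlunsplit, parse_qsl
from urllib import urlencode
import urlnorm

def canonizeurl(url):
    split = urlsplit(urlnorm.norm(url))
    # Keep only the part of the path before the first space.
    path = split[2].split(' ')[0]

    # Drop leading "/.." segments that would climb above the root.
    while path.startswith('/..'):
        path = path[3:]

    # Drop trailing percent-encoded spaces.
    while path.endswith('%20'):
        path = path[:-3]

    # Sort the query parameters so their order does not matter; the fragment is discarded.
    qs = urlencode(sorted(parse_qsl(split.query)))
    return urlunsplit((split.scheme, split.netloc, path, qs, ''))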
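
To show the query-sorting step on its own, here is a sketch for Python 3 that uses only the standard library. The function name sort_query is hypothetical, and passing keep_blank_values=True is my own choice (the code above does not do that); like the code above, it drops the fragment.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def sort_query(url):
    """Return url with its query parameters sorted by (name, value)."""
    parts = urlsplit(url)
    # keep_blank_values=True so that "?b=&a=1" does not silently lose "b".
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    # The fragment is discarded, since it is not sent to the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(pairs), ''))
```

With this, 'http://example.com/p?b=2&a=1' and 'http://example.com/p?a=1&b=2' both come out as the same string.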


Answer 3:

The urltools module normalizes multiple slashes, . and .. components without messing up the double slash in http://.

Once you have run pip install urltools, the usage is as follows:

import urltools

print(urltools.normalize('http://domain.com:80/a////b/../c'))
# prints: http://domain.com/a/c


Answer 4:

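Following the good start, I composed a function that covers most of the cases commonly found on the web.

from urllib.parse import urljoin, urlparse, urlunsplit

def urlnorm(base, link=''):
  '''Normalizes a URL or a link relative to a base URL. URLs that point to the same resource will return the same string.'''
  # Resolve the link against the base URL, then lowercase the whole result.
  # Note: lower() also lowercases the path and query, which may be case-sensitive.
  new = urlparse(urljoin(base, link).lower())
  # Append port 80 when no port is given (this assumes plain http).
  netloc = new.netloc if new.port is not None else new.hostname + ':80'
  return urlunsplit((new.scheme, netloc, new.path, new.query, ''))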
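
One caveat: urljoin returns the base unchanged when the link is empty, so with the default link='' any "./" or "../" parts in the base stay in the path. Below is a rough sketch of removing them by hand, loosely following the dot-segment rules of RFC 3986. The names remove_dot_segments and normalize_path are mine, and the code assumes the path is absolute or empty, which is the case for URLs that have a host.

```python
from urllib.parse import urlsplit, urlunsplit

def remove_dot_segments(path):
    """Resolve "." and ".." segments in an absolute (or empty) URL path."""
    segments = path.split('/')
    out = []
    for i, seg in enumerate(segments):
        last = i == len(segments) - 1
        if seg == '..':
            # Go up one level, but never above the root (the leading '').
            if len(out) > 1:
                out.pop()
            if last:
                out.append('')  # "/a/b/.." becomes "/a/", keeping the trailing slash
        elif seg == '.':
            if last:
                out.append('')  # "/a/." becomes "/a/"
        else:
            out.append(seg)
    return '/'.join(out)

def normalize_path(url):
    """Return url with the dot segments of its path resolved."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=remove_dot_segments(parts.path)))
```

Repeated slashes are left alone here; collapsing them is a separate decision.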