Canonicalize / normalize a URL?

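I am looking for a library function to normalize a URL in Python, i.e. to remove "./" or "../" parts in the path, add a default port, escape special characters, and so on. The result should be a string that is unique for two URLs pointing to the same web page. For example, http://google.com and http://google.com:80/a/../ should return the same result.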

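I would prefer Python 3 and have already looked through the urllib module. It offers functions to split URLs, but nothing to canonicalize them. Java has the URI.normalize() function, which does something similar (though it does not treat the default port 80 as equal to no port being given). Is there something like this in Python?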

Answer 1:

How about this:

In [1]: from urllib.parse import urljoin

In [2]: urljoin('http://example.com/a/b/c/../', '.')
Out[2]: 'http://example.com/a/b/'

Inspired by answers to a related question. It doesn't normalize ports, but it should be simple to whip up a function that does.
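A minimal sketch of such a function, using only the standard library. The names `normalize_url`, `remove_dot_segments` and `DEFAULT_PORTS` are hypothetical, and the sketch makes some simplifying assumptions: it only knows the default ports for http and https, it drops userinfo and the fragment, and it does not handle IPv6 host literals. It resolves dot segments by hand rather than through `urljoin`, because `urljoin(url, '.')` also discards the last path segment when the URL does not end in a slash.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports this sketch knows about (an assumption; extend as needed).
DEFAULT_PORTS = {"http": 80, "https": 443}

def remove_dot_segments(path):
    """Resolve '.' and '..' segments in an absolute URL path."""
    segments = path.split("/")
    output = []
    for seg in segments:
        if seg == ".":
            continue
        if seg == "..":
            # Never pop the leading empty segment that stands for the root.
            if len(output) > 1:
                output.pop()
        else:
            output.append(seg)
    # A path that ends in a dot segment refers to a directory.
    if segments[-1] in (".", ".."):
        output.append("")
    return "/".join(output) or "/"

def normalize_url(url):
    """Lowercase scheme and host, drop the default port, resolve dot segments."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""  # hostname is already lowercased by urlsplit
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    path = remove_dot_segments(parts.path)
    # Userinfo and fragment are intentionally dropped in this sketch.
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

With this sketch, `normalize_url('http://google.com')` and `normalize_url('http://google.com:80/a/../')` both return `'http://google.com/'`, which matches the example in the question.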

Answer 2:

This is what I use, and it has worked so far. You can get urlnorm from pip.

Note that I sort the query parameters. I have found this to be essential.

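# Python 2 imports; in Python 3 these functions live in urllib.parse.
from urlparse import urlsplit, urlunsplit, parse_qsl
from urllib import urlencode
import urlnorm

def canonizeurl(url):
    split = urlsplit(urlnorm.norm(url))
    # Keep only the part of the path before the first space.
    path = split[2].split(' ')[0]

    # Strip leading '/..' prefixes that would climb above the root.
    while path.startswith('/..'):
        path = path[3:]

    # Strip trailing encoded spaces.
    while path.endswith('%20'):
        path = path[:-3]

    # Sort the query parameters so their order does not matter; drop the fragment.
    qs = urlencode(sorted(parse_qsl(split.query)))
    return urlunsplit((split.scheme, split.netloc, path, qs, ''))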
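For Python 3, the query-sorting step on its own can be sketched with `urllib.parse` and no third-party package. The function name `sort_query` is hypothetical, and this is only an illustration of the sorting idea, not a replacement for everything urlnorm does. Like the code above, it drops the fragment.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def sort_query(url):
    """Return the URL with its query parameters in sorted order."""
    parts = urlsplit(url)
    # keep_blank_values=True keeps parameters such as "x=" instead of dropping them.
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(pairs), ""))
```

Two URLs that differ only in parameter order, such as `http://example.com/p?b=2&a=1` and `http://example.com/p?a=1&b=2`, then map to the same string.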

Answer 3:

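The urltools module normalizes multiple slashes and the . and .. components, without messing up the double slash in http://.

Once you pip install urltools, the usage is as follows: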

import urltools

print(urltools.normalize('http://domain.com:80/a////b/../c'))
# http://domain.com/a/c
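To illustrate why the double slash after the scheme is not affected, here is a small standard-library sketch that collapses repeated slashes in the path component only. The name `collapse_slashes` is hypothetical, and this is not how urltools is implemented; it only shows the idea of splitting the URL first and rewriting the path in isolation. Note that some servers treat repeated slashes as significant, so this step is a judgment call.

```python
import re
from urllib.parse import urlsplit, urlunsplit

def collapse_slashes(url):
    """Collapse runs of '/' in the path, leaving scheme, host and query untouched."""
    parts = urlsplit(url)
    path = re.sub(r"/{2,}", "/", parts.path)
    # SplitResult is a named tuple, so _replace returns a modified copy.
    return urlunsplit(parts._replace(path=path))
```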

Answer 4:

Following the good start above, I put together a method that fits most of the cases commonly found on the web.

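from urllib.parse import urljoin, urlparse, urlunsplit

def urlnorm(base, link=''):
  '''Normalizes a URL or a link relative to a base URL. URLs that point to the same resource will return the same string.'''
  # Note: lower() lowercases the whole URL, including the path and query.
  new = urlparse(urljoin(base, link).lower())
  return urlunsplit((
    new.scheme,
    # Make the default port explicit (assumes port 80, i.e. http).
    (new.hostname + ":80") if new.port is None else new.netloc,
    new.path,
    new.query,
    ''))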
