Reconstructing absolute urls from relative urls on

Given the absolute URL of a page and a relative link found within that page, is there a way to (a) definitively reconstruct or (b) make a best-effort reconstruction of the absolute URL of the relative link?

In my case, I'm reading an HTML file from a given URL using Beautiful Soup, extracting the src of every img tag, and trying to construct a list of absolute URLs for the page's images.

My Python function so far looks like:

def get_image_url(page_url, image_src):

    from urllib.parse import urlparse
    # parsed = urlparse('http://user:pass@NetLoc:80/path;parameters?query=argument#fragment')
    parsed = urlparse(page_url)
    url_base = parsed.netloc
    url_path = parsed.path

    if image_src.find('http') == 0:
        # It's an absolute URL, do nothing.
        pass
    elif image_src.find('/') == 0:
        # If it's a root-relative URL, append it to the base URL:
        image_src = 'http://' + url_base + image_src
    else:
        # If it's a relative URL, ?
        pass

    return image_src

NOTE: Don't need a Python answer, just the logic required.

Tags: python html url-parsing

2 Answers

神经病院院长

Post #2 · 2019-02-05 18:42

Very simple:

>>> from urllib.parse import urljoin
>>> urljoin('http://mysite.com/foo/bar/x.html', '../../images/img.png')
'http://mysite.com/images/img.png'
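
To tie this back to the three branches in the question, here is a minimal sketch showing that urljoin appears to cover all of them in one call. The function name mirrors the asker's get_image_url; the URLs other than mysite.com are made-up placeholders, and the sketch assumes Python 3.

```python
from urllib.parse import urljoin


def get_image_url(page_url, image_src):
    """Resolve image_src against page_url (no <base> handling here)."""
    return urljoin(page_url, image_src)


page_url = 'http://mysite.com/foo/bar/x.html'

# Already absolute: returned unchanged.
print(get_image_url(page_url, 'http://other.example/a.png'))
# Root-relative: keeps the scheme and host, replaces the path.
print(get_image_url(page_url, '/images/img.png'))
# Document-relative: resolved against the page's directory.
print(get_image_url(page_url, 'img.png'))
# Scheme-relative: inherits only the scheme from the page URL.
print(get_image_url(page_url, '//cdn.example/img.png'))
```

Letting urljoin decide avoids checks such as find('http') == 0, which would misclassify a relative path like 'httpdocs/img.png' as absolute.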


何必那么认真

Post #3 · 2019-02-05 19:04

Use urllib.parse.urljoin to resolve a (possibly relative) URL against a base URL.

However, the base URL of a web page isn't necessarily the same as the URL you fetched the document from, because HTML allows a page to specify its preferred base URL via the BASE element. The logic you need is as follows:

import urllib.parse

# `document` is the parsed DOM of the page fetched from page_url.
base_url = page_url
head = document.getElementsByTagName('head')[0]
for base in head.getElementsByTagName('base'):
    if base.hasAttribute('href'):
        base_url = urllib.parse.urljoin(base_url, base.getAttribute('href'))
        # HTML5 4.2.3 "if there are multiple base elements with href
        # attributes, all but the first are ignored."
        break

(If you are parsing XHTML then in theory you ought to take into account the rather hairy XML Base specification instead. But you can probably get away without worrying about that, since no-one really uses XHTML.)
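
As a rough end-to-end illustration of the BASE logic above, here is a sketch that uses only the standard library's html.parser instead of a DOM or Beautiful Soup. The names ImageCollector and absolute_image_urls are hypothetical helpers invented for this example, and for simplicity it accepts a base element anywhere in the document rather than only inside head.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Records the first <base href> and every <img src> it encounters."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.image_srcs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'base' and self.base_href is None and attrs.get('href'):
            # Only the first base element with an href counts.
            self.base_href = attrs['href']
        elif tag == 'img' and attrs.get('src'):
            self.image_srcs.append(attrs['src'])


def absolute_image_urls(page_url, html):
    """Return absolute URLs for all images in html, honouring <base href>."""
    collector = ImageCollector()
    collector.feed(html)
    collector.close()

    base_url = page_url
    if collector.base_href is not None:
        # The base href may itself be relative to the page URL.
        base_url = urljoin(page_url, collector.base_href)
    return [urljoin(base_url, src) for src in collector.image_srcs]


sample_html = (
    '<html><head><base href="/static/"></head><body>'
    '<img src="a.png"><img src="/b.png"><img src="http://x.example/c.png"/>'
    '</body></html>'
)
print(absolute_image_urls('http://mysite.com/foo/x.html', sample_html))
```

With Beautiful Soup the same idea should carry over: look up the first base tag that has an href, join it with the page URL, then join each img src against the result.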
