Python urlparse: small issue

I'm making an app that parses HTML and extracts the images from it. Parsing is easy with Beautiful Soup, and downloading the HTML and the images with urllib2 works too.

I do have a problem using urlparse to turn relative paths into absolute ones. The problem is best explained with an example:

>>> import urlparse
>>> urlparse.urljoin("http://www.example.com/", "../test.png")
'http://www.example.com/../test.png'

As you can see, urlparse doesn't remove the ../ from the result. This causes a problem when I try to download the image:

HTTPError: HTTP Error 400: Bad Request

Is there a way to fix this problem in urllib?

Tags: python urllib2 urlparse

4 Answers

小情绪 Triste *

#2 · 2019-07-17 21:35

".." takes you up one directory ("." is the current directory), so combining it with a URL that consists only of a domain name doesn't make much sense. Maybe what you need is:

>>> urlparse.urljoin("http://www.example.com","./test.png")
'http://www.example.com/test.png'
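
To illustrate the point about "..", here is a small sketch of how urljoin resolves it when the base URL does have a directory to climb out of. The base URL ending in /a/b/ is made up for illustration, and the import fallback is only there so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2, as used in the question

# Hypothetical base URL that sits two directories below the root.
base = "http://www.example.com/a/b/"

# ".." climbs from /a/b/ to /a/, so the image resolves inside /a/.
one_up = urljoin(base, "../test.png")

# "." stays in the current directory, /a/b/.
same_dir = urljoin(base, "./test.png")

print(one_up)
print(same_dir)
```

With a base URL that is already at the root, there is no such directory to climb out of, which is the situation in the question.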

[This account has been banned]

#3 · 2019-07-17 21:35

urlparse.urljoin("http://www.example.com/", "../test.png"[2:])

Is this what you need?

Rolldiameter

#4 · 2019-07-17 21:48

If you'd like /../test to mean the same as /test, as paths do in a file system, you could use normpath():

>>> import posixpath
>>> import urlparse
>>> url = urlparse.urljoin("http://example.com/", "../test")
>>> p = urlparse.urlparse(url)
>>> path = posixpath.normpath(p.path)
>>> urlparse.urlunparse((p.scheme, p.netloc, path, p.params, p.query, p.fragment))
'http://example.com/test'
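
The steps above can be wrapped in a small helper. This is only a sketch under a few assumptions: the name join_normalized is made up, the import fallback is there so it runs on Python 2 and 3, and the trailing-slash handling is an addition, since normpath() drops a trailing slash that may matter to a web server.

```python
import posixpath

try:
    from urllib.parse import urljoin, urlparse, urlunparse  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse, urlunparse      # Python 2


def join_normalized(base, relative):
    """Join two URLs, then collapse "." and ".." in the path (hypothetical helper)."""
    p = urlparse(urljoin(base, relative))
    # normpath("") would return ".", so leave an empty path untouched.
    path = posixpath.normpath(p.path) if p.path else p.path
    # normpath() strips a trailing slash; put it back if the joined URL had one.
    if p.path.endswith("/") and not path.endswith("/"):
        path += "/"
    return urlunparse((p.scheme, p.netloc, path, p.params, p.query, p.fragment))


print(join_normalized("http://www.example.com/", "../test.png"))
```

On Python 3.5 and later, urljoin() itself already drops the excess "..", so a helper like this mostly matters on older versions such as the Python 2 used in the question.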

劫难

#5 · 2019-07-17 21:58

I think the best you can do is to pre-parse the original URL and check its path component. A simple test is:

if len(urlparse.urlparse(baseurl).path) > 1:

Then you can combine it with the indexing suggested by demas. For example:

baseurl = "http://www.example.com/"
# Skip the leading ".." only when the base URL is already at the root.
start_offset = 2 if len(urlparse.urlparse(baseurl).path) <= 1 else 0
img_url = urlparse.urljoin(baseurl, "../test.png"[start_offset:])

This way, you will not attempt to go to the parent of the root URL.
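
As a rough sketch of that idea, the helper below drops every leading "../" from the relative path when the base URL is already at the root, instead of slicing off a fixed two characters. The function name and the example URLs are made up for illustration, and the import fallback is only there so the snippet runs on Python 2 and 3.

```python
try:
    from urllib.parse import urljoin, urlparse  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse      # Python 2


def join_from_root_safe(baseurl, relative):
    """Join URLs without climbing above the site root (hypothetical helper)."""
    # A path of "" or "/" means the base URL is already at the root.
    if len(urlparse(baseurl).path) <= 1:
        # Drop each leading "../" so there is no parent directory to ask for.
        while relative.startswith("../"):
            relative = relative[3:]
    return urljoin(baseurl, relative)


print(join_from_root_safe("http://www.example.com/", "../test.png"))
print(join_from_root_safe("http://www.example.com/a/b/", "../test.png"))
```

This only covers a base URL that sits at the root. A relative path that climbs more levels than a deeper base URL has would still need the joined path to be normalized, as in the normpath() answer.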
