Downloading PDF files using mechanize and urllib

Published 2019-05-23 15:35

I am new to Python, and my current task is to write a web crawler that looks for PDF files on certain web pages and downloads them. Here's my current approach (for just one sample URL):

import mechanize
import urllib
import sys

mech = mechanize.Browser()
mech.set_handle_robots(False)

url = "http://www.xyz.com"

try:
    mech.open(url, timeout = 30.0)
except mechanize.HTTPError as e:  # HTTPError must be qualified (or imported) to be caught here
    sys.exit("%d: %s" % (e.code, e.msg))

links = mech.links()

for l in links:
    # Some are relative links
    path = str(l.base_url[:-1]) + str(l.url)
    if path.find(".pdf") > 0:
        urllib.urlretrieve(path)

The program runs without any errors, but I am not seeing the PDF being saved anywhere. I am able to access the PDF and save it through my browser. Any ideas what's going on? I am using PyDev (Eclipse-based) as my development environment, if that makes any difference.

Another question: if I want to give the PDF a specific name while saving it, how can I do that? Is this approach correct? Do I have to create a file named 'filename' before I can save the PDF?

urllib.urlretrieve(path, filename) 

Thanks in advance.

2 Answers
\"骚年 ilove
2楼-- · 2019-05-23 15:46

I've never used mechanize, but from the documentation for urllib at http://docs.python.org/library/urllib.html:

urllib.urlretrieve(url[, filename[, reporthook[, data]]])

Copy a network object denoted by a URL to a local file, if necessary. If the URL points to a local file, or a valid cached copy of the object exists, the object is not copied. Return a tuple (filename, headers) where filename is the local file name under which the object can be found, and headers is whatever the info() method of the object returned by urlopen() returned (for a remote object, possibly cached). Exceptions are the same as for urlopen().

As you can see, the urlretrieve function saves to a temporary file if you don't specify one. So try specifying the filename as you suggested in your second snippet. Otherwise you could call urlretrieve like this:

    saved_filename, headers = urllib.urlretrieve(path)

and then use saved_filename later on.
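For example, here is a minimal sketch of how the question's loop could name each download after the last segment of its URL. It assumes Python 2 (to match the urllib.urlretrieve used in the thread); the download_pdf helper and the use of urlparse/os.path.basename are my own additions, not part of the original code:

import os
import urllib
import urlparse  # Python 2 module, matching the urllib.urlretrieve used above

def download_pdf(pdf_url, target_dir="."):
    # Derive a local filename from the last path segment of the URL.
    name = os.path.basename(urlparse.urlsplit(pdf_url).path) or "download.pdf"
    destination = os.path.join(target_dir, name)
    # urlretrieve returns (filename, headers); passing a filename forces the save location.
    saved_path, headers = urllib.urlretrieve(pdf_url, destination)
    return saved_path

Calling download_pdf(path) inside the loop would then leave each PDF in the current directory under its own name.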

可以哭但决不认输i · 2019-05-23 15:48

The documentation for urllib says this about the urlretrieve function:

The second argument, if present, specifies the file location to copy to (if absent, the location will be a tempfile with a generated name).

The function's return value has the location of the file:

Return a tuple (filename, headers) where filename is the local file name under which the object can be found, and headers is whatever the info() method of the object returned by urlopen() returned (for a remote object, possibly cached).

So, change this line:

urllib.urlretrieve(path)

to this:

(filename, headers) = urllib.urlretrieve(path)

and filename will hold the location of the downloaded file. Optionally, pass the filename argument to urlretrieve to specify the location yourself.
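As a rough sketch of both options, assuming Python 2 as in the rest of the thread, that path holds a PDF URL from the question's loop, and that "report.pdf" is just an illustrative name:

import urllib

# Let urlretrieve pick a temporary file and report where it ended up.
temp_path, headers = urllib.urlretrieve(path)
print "Saved to temporary file:", temp_path

# Or choose the destination explicitly.
urllib.urlretrieve(path, "report.pdf")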
