I am new to Python, and my current task is to write a web crawler that looks for PDF files on certain webpages and downloads them. Here's my current approach (just for one sample URL):
import mechanize
import urllib
import sys

mech = mechanize.Browser()
mech.set_handle_robots(False)
url = "http://www.xyz.com"
try:
    mech.open(url, timeout=30.0)
except mechanize.HTTPError, e:
    sys.exit("%d: %s" % (e.code, e.msg))

links = mech.links()
for l in links:
    # Some are relative links
    path = str(l.base_url[:-1]) + str(l.url)
    if path.find(".pdf") > 0:
        urllib.urlretrieve(path)
The program runs without any errors, but I am not seeing the PDF saved anywhere. I am able to access the PDF and save it through my browser. Any ideas what's going on? I am using PyDev (Eclipse based) as my development environment, if that makes any difference.
Another question: if I want to give the PDF a specific name while saving it, how can I do that? Is this approach correct? Do I have to create a file called 'filename' before I can save the PDF?
urllib.urlretrieve(path, filename)
Thanks in advance.
The documentation for urllib says this about the urlretrieve
function:
The second argument, if present, specifies the file location to copy
to (if absent, the location will be a tempfile with a generated name).
The function's return value has the location of the file:
Return a tuple (filename, headers) where filename is the local file
name under which the object can be found, and headers is whatever the
info() method of the object returned by urlopen() returned (for a
remote object, possibly cached).
So, change this line:
urllib.urlretrieve(path)
to this:
(filename, headers) = urllib.urlretrieve(path)
and filename will hold the path where the file was saved. Optionally, pass in the filename
argument to urlretrieve to specify the location yourself.
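For example (a minimal sketch; the URL and local name here are made up for illustration), you can derive the local name from the last segment of the URL and pass it in. Note that urlretrieve creates the file for you, so you don't have to create it beforehand:

import urllib

path = "http://www.xyz.com/docs/sample.pdf"  # illustrative URL

# Use the last segment of the URL as the local filename
local_name = path.split('/')[-1]  # "sample.pdf"

# urlretrieve creates (or overwrites) the file itself
urllib.urlretrieve(path, local_name)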
I've never used mechanize, but from the documentation for urllib at http://docs.python.org/library/urllib.html:
urllib.urlretrieve(url[, filename[, reporthook[, data]]])
Copy a network object denoted by a URL to a local file, if
necessary. If the URL points to a local file, or a valid cached copy
of the object exists, the object is not copied. Return a tuple
(filename, headers) where filename is the local file name under which
the object can be found, and headers is whatever the info() method of
the object returned by urlopen() returned (for a remote object,
possibly cached). Exceptions are the same as for urlopen().
As you can see, the urlretrieve function saves to a temporary file if you don't specify one. So try specifying the filename as you suggested in your second piece of code. Otherwise you could call urlretrieve like this:
saved_filename, headers = urllib.urlretrieve(path)
and then use saved_filename later on.
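For instance (a rough sketch; the URL and target name are made up), you could let urlretrieve pick a temporary location and then move the file to a name of your choosing with shutil:

import shutil
import urllib

path = "http://www.xyz.com/docs/sample.pdf"  # illustrative URL

# With no filename argument, urlretrieve saves to a generated temp file
saved_filename, headers = urllib.urlretrieve(path)

# Move the temporary copy to whatever name you want to keep
shutil.move(saved_filename, "sample.pdf")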