I am new to Python, and my current task is to write a web crawler that looks for PDF files in certain web pages and downloads them. Here's my current approach (just for one sample URL):
    import mechanize
    import urllib
    import sys
    from mechanize import HTTPError  # needed for the except clause below

    mech = mechanize.Browser()
    mech.set_handle_robots(False)  # ignore robots.txt
    url = "http://www.xyz.com"
    try:
        mech.open(url, timeout=30.0)
    except HTTPError, e:
        sys.exit("%d: %s" % (e.code, e.msg))
    links = mech.links()
    for l in links:
        # Some are relative links
        path = str(l.base_url[:-1]) + str(l.url)
        if path.find(".pdf") > 0:
            urllib.urlretrieve(path)
The program runs without any errors, but I am not seeing the PDF saved anywhere. I am able to access the PDF and save it through my browser. Any ideas what's going on? I am using PyDev (Eclipse-based) as my development environment, if that makes any difference.

Another question: if I want to give the PDF a specific name while saving it, how can I do that? Is this approach correct? Do I have to create a file with 'filename' before I can save the PDF?
    urllib.urlretrieve(path, filename)
Thanks in advance.
I've never used mechanize, but from the documentation for urllib at http://docs.python.org/library/urllib.html:

    urllib.urlretrieve(url[, filename[, reporthook[, data]]])
    Copy a network object denoted by a URL to a local file, if necessary. [...] Return a tuple (filename, headers) where filename is the local file name under which the object can be found. [...] The second argument, if present, specifies the file location to copy to (if absent, the location will be a tempfile with a generated name).

As you can see, the urlretrieve function saves to a temporary file if you don't specify one. So try specifying the filename as you suggested in your second piece of code. Otherwise you could call urlretrieve like this:

    saved_filename, headers = urllib.urlretrieve(path)
and then use saved_filename later on.
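For instance, a minimal sketch of both options (the URL and the target name downloaded.pdf are hypothetical, not from the question):

    import urllib

    path = "http://www.xyz.com/docs/sample.pdf"  # hypothetical PDF URL

    # Option 1: let urlretrieve pick a temporary file, then use its location
    saved_filename, headers = urllib.urlretrieve(path)
    print "PDF saved to:", saved_filename

    # Option 2: pass the destination filename yourself
    urllib.urlretrieve(path, "downloaded.pdf")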
The documentation for urllib says this about the urlretrieve function:

    The second argument, if present, specifies the file location to copy to (if absent, the location will be a tempfile with a generated name).

The function's return value has the location of the file:

    Return a tuple (filename, headers) where filename is the local file name under which the object can be found.

So, change this line:

    urllib.urlretrieve(path)

to this:

    (filename, headers) = urllib.urlretrieve(path)

and the path in filename will have the location. Optionally, pass in the filename argument to urlretrieve to specify the location yourself.
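Putting that together with the loop from your question, a minimal sketch (using os.path.basename to derive a local name from the URL is my own addition, and links is the list from your mechanize code above):

    import os
    import urllib

    for l in links:
        # Some are relative links
        path = str(l.base_url[:-1]) + str(l.url)
        if path.find(".pdf") > 0:
            # derive a local filename from the last segment of the URL
            local_name = os.path.basename(path)
            urllib.urlretrieve(path, local_name)
            print "Saved", path, "as", local_name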