I want to open many URLs (I open one URL, search for all links on that website, and then open them too or download images etc. from those links). So first I wanted to check whether a URL is correct, and I used an if
statement:
if not urlparse.urlparse(link).netloc:
    return 'broken url'
But I noticed that some broken values slipped past this check. I came across a page where a link looked like //b.thumbs.redditmedia.com/7pTYj4rOii6CkkEC.jpg and opening it gave this error:
ValueError: unknown url type: //b.thumbs.redditmedia.com/7pTYj4rOii6CkkEC.jpg
yet my if statement hadn't caught it.
How can I check more precisely whether a URL is valid?
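For reference, a minimal reproduction of what I'm seeing (the link is scheme-relative, so it has a netloc and my check passes, but urlopen still rejects it):
import urlparse
import urllib2

link = '//b.thumbs.redditmedia.com/7pTYj4rOii6CkkEC.jpg'
print urlparse.urlparse(link).netloc   # 'b.thumbs.redditmedia.com', so my check passes
urllib2.urlopen(link)                  # raises ValueError: unknown url type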
If you aren't particular about which library to use, you could do the following:
import urllib2
import re

def is_fully_alive(url, live_check = False):
    try:
        # The URL itself must at least have a netloc
        if not urllib2.urlparse.urlparse(url).netloc:
            return False

        website = urllib2.urlopen(url)
        html = website.read()

        if website.code != 200:
            return False

        # Get all the links and validate each one
        for link in re.findall('"((http|ftp)s?://.*?)"', html):
            url = link[0]
            if not urllib2.urlparse.urlparse(url).netloc:
                return False
            if live_check:
                website = urllib2.urlopen(url)
                if website.code != 200:
                    print "Failed link : ", url
                    return False
    except Exception, e:
        print "Errored while attempting to validate link : ", url
        print e
        return False

    return True
Check your URLs:
>>> is_fully_alive("http://www.google.com")
True
Check by opening every single link:
# Takes some time depending on your net speed and no. of links in the page
>>> is_fully_alive("http://www.google.com", True)
True
Check an invalid URL:
>>> is_fully_alive("//www.google.com")
Errored while attempting to validate link : //www.google.com
unknown url type: //www.google.com
False
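If you'd rather repair scheme-relative links like the one in your question than reject them, one option (just a sketch, separate from the function above) is to resolve each link against the page it was scraped from with urlparse.urljoin before validating it:
import urlparse

page_url = 'http://www.reddit.com/r/pics'   # assumption: the page the link came from
link = '//b.thumbs.redditmedia.com/7pTYj4rOii6CkkEC.jpg'

# urljoin copies the scheme from the base URL onto a scheme-relative link
full_url = urlparse.urljoin(page_url, link)
print full_url   # http://b.thumbs.redditmedia.com/7pTYj4rOii6CkkEC.jpg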
Pretty simple:
import urllib2

def valid_url(url):
    try:
        urllib2.urlopen(url)
        return True
    except Exception, e:
        return False
print valid_url('//b.thumbs.redditmedia.com/7pTYj4rOii6CkkEC.jpg') # False
print valid_url('http://stackoverflow.com/questions/25069947/check-if-the-url-link-is-correct') # True
You can also read the whole document with
urllib2.urlopen(url).read()
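For example, a small variation of valid_url (fetch_html is just a made-up name for this sketch) that returns the page body instead of True/False:
import urllib2

def fetch_html(url):
    # Returns the HTML of the page, or None if the URL can't be opened
    try:
        return urllib2.urlopen(url).read()
    except Exception:
        return None

html = fetch_html('http://stackoverflow.com')
print html is not None   # True when the request succeeded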
Generally, if you want to download all the images an HTML document links to, you can do something like this:
import os
import re
import urllib2

html = urllib2.urlopen(url).read()   # url is the page you are scraping
# Capture the full image URL and the file name in one pass
for link, img in re.findall(r'(https?://b\.thumbs\.redditmedia\.com/(\w+?\.(?:jpg|png|gif)))', html):
    if not os.path.exists(img):
        with open(img, 'wb') as f:
            f.write(urllib2.urlopen(link).read())
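Note that the files are opened in 'wb' (binary) mode since image bytes aren't text, and each image's content is fetched with urllib2.urlopen(link).read() before being written to disk.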