I'm building a broken link checker in Python, and it's becoming a chore to build logic that correctly identifies links which fail in my scraper but resolve fine when visited with a browser. I've found a set of links where I can consistently reproduce a redirect error with my scraper, even though they load perfectly in a browser, and I was hoping I could find some insight here.
import urllib.request
import urllib.error
import html.parser
import requests
from requests.exceptions import HTTPError
from socket import error as SocketError

# Browser-like headers so the server treats the scraper like a normal visitor
# (note: urllib will not decompress a gzipped body automatically).
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}

try:
    req = urllib.request.Request(url, None, headers)
    response = urllib.request.urlopen(req)
    raw_response = response.read().decode('utf8', errors='ignore')
    response.close()
except urllib.error.HTTPError as inst:
    output = format(inst)
    print(output)
An example of a URL that reliably triggers this error is 'http://forums.hostgator.com/want-see-your-sites-dns-propagating-t48838.html'. It loads perfectly in a browser, but the code above produces the following error:
HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently
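
To see what the server is actually doing between hops, here is a rough sketch of how I've been inspecting the redirect chain by hand with requests (automatic redirects disabled so each hop's Location and Set-Cookie headers are visible; the User-Agent string is just a placeholder):

import requests
from urllib.parse import urljoin

def show_redirect_chain(url, max_hops=10):
    """Follow redirects manually and print each hop's status, Location, and Set-Cookie."""
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'}
    for _ in range(max_hops):
        resp = requests.get(url, headers=headers, allow_redirects=False, timeout=10)
        print(resp.status_code, url)
        print('  Location:  ', resp.headers.get('Location'))
        print('  Set-Cookie:', resp.headers.get('Set-Cookie'))
        if resp.status_code not in (301, 302, 303, 307, 308):
            break
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, resp.headers.get('Location', ''))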
Any ideas how I can correctly identify these links as functional without blindly whitelisting links from that site (which could mask genuinely broken ones)?
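
For completeness, one direction I've been considering is below. It rests on an assumption I haven't confirmed: that the loop happens because urllib's default opener doesn't carry cookies across redirect hops, so the server keeps bouncing the request back.

import http.cookiejar
import urllib.request
import urllib.error

# Sketch only: build an opener that keeps cookies between redirects, in case the
# server sets a cookie on the first 301 and expects it back on the next hop.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36')]

try:
    with opener.open(url) as response:  # same `url` variable as in the checker above
        raw_response = response.read().decode('utf8', errors='ignore')
except urllib.error.HTTPError as inst:
    print(format(inst))

If the cookie assumption turns out to be wrong, though, I'd still need a reliable way to tell these links apart from genuinely dead ones.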