Repeated host lookups failing in urllib2

Posted 2019-02-21 04:58

Question:

I have code that issues many HTTP GET requests using Python's urllib2 from several threads, writing the responses to files (one file per thread).
During execution, it looks like many of the host lookups fail, causing a "name or service not known" error (see the appended error log for an example).

Is this due to a flaky DNS service? Is it bad practice to rely on DNS caching if the host name isn't changing? That is, should the result of a single lookup be passed into urlopen instead?

Exception in thread Thread-16:
Traceback (most recent call last):
  File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "/home/da/local/bin/ThreadedDownloader.py", line 61, in run
     page = urllib2.urlopen(url) # get the page
  File "/usr/lib/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
  File "/usr/lib/python2.6/urllib2.py", line 409, in _open
    '_open', req)
  File "/usr/lib/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.6/urllib2.py", line 1170, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.6/urllib2.py", line 1145, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno -2] Name or service not known>

UPDATE: my (extremely simple) code:

import os
import threading
import urllib2

class AsyncGet(threading.Thread):

    def __init__(self, outDir, baseUrl, item, method, numPages, numRows, semaphore):
        threading.Thread.__init__(self)
        self.outDir = outDir
        self.baseUrl = baseUrl
        self.method = method
        self.numPages = numPages
        self.numRows = numRows
        self.item = item
        self.semaphore = semaphore

    def run(self):
        with self.semaphore: # 'with' is awesome.
            with open(os.path.join(self.outDir, self.item + ".xml"), 'a') as f:
                for i in xrange(1, self.numPages + 1):
                    url = (self.baseUrl +
                           "method=" + self.method +
                           "&item=" + self.item +
                           "&page=" + str(i) +
                           "&rows=" + str(self.numRows) +
                           "&prettyXML")
                    page = urllib2.urlopen(url)
                    f.write(page.read())
                    page.close() # Must remember to close!

The semaphore is a BoundedSemaphore to constrain the total number of running threads.
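The effect of a BoundedSemaphore can be seen in a standalone sketch (this is an illustration, not the original downloader; `max_workers` and the sleep are arbitrary):

```python
import threading
import time

max_workers = 3
sem = threading.BoundedSemaphore(max_workers)
lock = threading.Lock()
active = []   # workers currently inside the semaphore block
peak = [0]    # highest concurrency observed

def worker():
    with sem:                     # at most max_workers enter at once
        with lock:
            active.append(1)
            peak[0] = max(peak[0], len(active))
        time.sleep(0.05)          # simulate a download
        with lock:
            active.pop()

threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the join, `peak[0]` never exceeds `max_workers`, even though ten threads were started.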

Answer 1:

This is not a Python problem. On Linux systems, make sure nscd (the Name Service Cache Daemon) is actually running.
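If you cannot rely on a local DNS cache, a pragmatic workaround is to retry transient lookup failures with a short backoff. A generic sketch follows; the helper name, attempt count, and delay are assumptions, and in this code base you would pass `urllib2.URLError` as the exception type:

```python
import time

def retry(func, exceptions, attempts=3, delay=0.5):
    """Call func(), retrying on the given exceptions with a fixed backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise          # out of attempts: propagate the last error
            time.sleep(delay)
```

Usage would look like `page = retry(lambda: urllib2.urlopen(url), (urllib2.URLError,))`, which papers over an occasional failed lookup without masking a persistent outage.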

UPDATE: Also, looking at your code, make sure page.close() is always reached: if f.write() (or page.read()) raises partway through an iteration, the response is never closed and you leak the socket.
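To guarantee the response is closed even when an exception occurs mid-loop, contextlib.closing wraps any object with a close() method. The FakeResponse class below is just a stand-in to show the pattern; in the real code you would wrap the urllib2.urlopen() result:

```python
from contextlib import closing

class FakeResponse(object):
    """Stand-in for a urllib2 response, purely to illustrate the pattern."""
    def __init__(self):
        self.closed = False
    def read(self):
        return "data"
    def close(self):
        self.closed = True

resp = FakeResponse()
with closing(resp):     # close() runs even if the body raises
    data = resp.read()
```

In the downloader, the loop body would become `with closing(urllib2.urlopen(url)) as page: f.write(page.read())`, which removes the need to remember page.close() at all.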