Python asyncio/aiohttp: ValueError: too many file descriptors in select()

Published 2019-04-11 23:22

Question:

Hello everyone, I'm having trouble understanding asyncio and aiohttp and getting the two to work together properly. Not only do I not fully understand what I'm doing, but at this point I've run into a problem that I have no idea how to solve.

I'm using 64-bit Windows 10, latest update.

The following code uses asyncio to collect the URLs whose Content-Type header does not contain "html".

import asyncio
import aiohttp

MAXitems = 30

async def getHeaders(url, session, sema):
    async with session:
        async with sema:
            try:
                async with session.head(url) as response:
                    try:
                        if "html" in response.headers["Content-Type"]:
                            return url, True
                        else:
                            return url, False
                    except:
                        return url, False
            except:
                return url, False


def checkUrlsWithoutHtml(listOfUrls):
    headersWithoutHtml = set()
    while(len(listOfUrls) != 0):
        blockurls = []
        print(len(listOfUrls))
        items = 0
        for num in range(0, len(listOfUrls)):
            if num < MAXitems:
                blockurls.append(listOfUrls[num - items])
                listOfUrls.remove(listOfUrls[num - items])
                items += 1
        loop = asyncio.get_event_loop()
        semaphoreHeaders = asyncio.Semaphore(50)
        session = aiohttp.ClientSession()
        data = loop.run_until_complete(asyncio.gather(*(getHeaders(url, session, semaphoreHeaders) for url in blockurls)))
        for header in data:
            if False == header[1]:
                headersWithoutHtml.add(header)
    return headersWithoutHtml


listOfUrls = ['http://www.google.com', 'http://www.reddit.com']
headersWithoutHtml = checkUrlsWithoutHtml(listOfUrls)

for header in headersWithoutHtml:
    print(header[0])

When I run it with, let's say, 2000 URLs, it sometimes returns something like:

data = loop.run_until_complete(asyncio.gather(*(getHeaders(url, session, semaphoreHeaders) for url in blockurls)))
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\asyncio\base_events.py", line 454, in run_until_complete
    self.run_forever()
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\asyncio\base_events.py", line 421, in run_forever
    self._run_once()
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\asyncio\base_events.py", line 1390, in _run_once
    event_list = self._selector.select(timeout)
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\selectors.py", line 323, in select
    r, w, _ = self._select(self._readers, self._writers, [], timeout)
  File "USER\AppData\Local\Programs\Python\Python36-32\lib\selectors.py", line 314, in _select
    r, w, x = select.select(r, w, w, timeout)
ValueError: too many file descriptors in select()

Note 1: I edited my username out of the paths, replacing it with USER.

Note 2: For whatever reason reddit.com is reported as not containing HTML; that is a completely separate problem that I will try to solve. However, if you notice some other inconsistency in my code that would fix it, please point it out.

Note 3: My code is badly structured because I've changed many things while trying to debug this problem, but I've had no luck.

I've heard somewhere that this is a Windows restriction and there is no way around it. The problem is that:

a) I simply don't understand what "too many file descriptors in select()" means.

b) What am I doing wrong that Windows can't handle? I've seen people push thousands of requests with asyncio and aiohttp, but even with my chunking I can't push 30-50 without getting a ValueError.

Edit: With MAXitems = 10 it hasn't crashed yet, but since I can't see a pattern, I have no idea what that tells me.

Edit 2: Never mind, it just needed more time to crash; it eventually did, even with MAXitems = 10.

Answer 1:

By default, the asyncio event loop on Windows can handle only 64 sockets. This is a limitation of the underlying select() API call.

To increase the limit, use ProactorEventLoop instead; the asyncio platform-support documentation explains how to install it as the default loop.
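A minimal sketch of that switch, set up before any sessions are created (the platform check and the trivial main coroutine are illustrative; on Python 3.8+ the proactor loop is already the Windows default):

```python
import asyncio
import sys

# Pick an event loop that is not bound by select()'s socket limit.
# ProactorEventLoop (IOCP-based) only exists on Windows; on other
# platforms we fall back to a default selector loop.
if sys.platform == "win32":
    loop = asyncio.ProactorEventLoop()
else:
    loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)

async def main():
    # The aiohttp requests would go here; run_until_complete drives the loop.
    return "done"

result = loop.run_until_complete(main())
loop.close()
```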



Answer 2:

I was having the same problem. I'm not 100% sure this is guaranteed to work, but try replacing this:

session = aiohttp.ClientSession()

with this:

connector = aiohttp.TCPConnector(limit=60)
session = aiohttp.ClientSession(connector=connector)

By default, limit is set to 100 (docs), meaning the client can have at most 100 simultaneous connections open. As Andrew mentioned, Windows can only have 64 sockets open at a time with the default loop, so we pass a number lower than 64 instead.
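Putting that together with the checker from the question, a sketch might look like this (check_urls is an illustrative helper, not part of aiohttp; a single shared session reuses one capped connector for all requests):

```python
import asyncio
import aiohttp

async def check_urls(urls):
    # Cap concurrent sockets below Windows' 64-socket select() ceiling.
    connector = aiohttp.TCPConnector(limit=60)
    async with aiohttp.ClientSession(connector=connector) as session:
        async def head(url):
            try:
                async with session.head(url) as resp:
                    # (url, True) if the Content-Type header mentions html.
                    return url, "html" in resp.headers.get("Content-Type", "")
            except aiohttp.ClientError:
                return url, False
        return await asyncio.gather(*(head(u) for u in urls))
```

Usage: `results = asyncio.run(check_urls(listOfUrls))`. Because the connector queues requests beyond the limit, the manual chunking loop from the question is no longer needed.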