Hello everyone, I'm having trouble understanding asyncio and aiohttp and making the two work together properly. Not only do I not properly understand what I'm doing, but at this point I've run into a problem that I have no idea how to solve.
I'm using Windows 10 64-bit, latest update.
The following code uses asyncio and aiohttp to build a set of URLs whose Content-Type response header does not contain "html":
import asyncio
import aiohttp

MAXitems = 30

async def getHeaders(url, session, sema):
    async with session:
        async with sema:
            try:
                async with session.head(url) as response:
                    try:
                        if "html" in response.headers["Content-Type"]:
                            return url, True
                        else:
                            return url, False
                    except:
                        return url, False
            except:
                return url, False
def checkUrlsWithoutHtml(listOfUrls):
    headersWithoutHtml = set()
    while(len(listOfUrls) != 0):
        blockurls = []
        print(len(listOfUrls))
        items = 0
        for num in range(0, len(listOfUrls)):
            if num < MAXitems:
                blockurls.append(listOfUrls[num - items])
                listOfUrls.remove(listOfUrls[num - items])
                items += 1
        loop = asyncio.get_event_loop()
        semaphoreHeaders = asyncio.Semaphore(50)
        session = aiohttp.ClientSession()
        data = loop.run_until_complete(asyncio.gather(*(getHeaders(url, session, semaphoreHeaders) for url in blockurls)))
        for header in data:
            if False == header[1]:
                headersWithoutHtml.add(header)
    return headersWithoutHtml
listOfUrls = ['http://www.google.com', 'http://www.reddit.com']
headersWithoutHtml = checkUrlsWithoutHtml(listOfUrls)
for header in headersWithoutHtml:
    print(header[0])
When I run it with, say, 2000 URLs, it sometimes returns something like:
data = loop.run_until_complete(asyncio.gather(*(getHeaders(url, session, semaphoreHeaders) for url in blockurls)))
File "USER\AppData\Local\Programs\Python\Python36-32\lib\asyncio\base_events.py", line 454, in run_until_complete
self.run_forever()
File "USER\AppData\Local\Programs\Python\Python36-32\lib\asyncio\base_events.py", line 421, in run_forever
self._run_once()
File "USER\AppData\Local\Programs\Python\Python36-32\lib\asyncio\base_events.py", line 1390, in _run_once
event_list = self._selector.select(timeout)
File "USER\AppData\Local\Programs\Python\Python36-32\lib\selectors.py", line 323, in select
r, w, _ = self._select(self._readers, self._writers, [], timeout)
File "USER\AppData\Local\Programs\Python\Python36-32\lib\selectors.py", line 314, in _select
r, w, x = select.select(r, w, w, timeout)
ValueError: too many file descriptors in select()
Note1: I edited my username out of the paths, replacing it with USER.
Note2: For whatever reason, reddit.com comes back as not containing HTML. That's a completely separate problem I will try to solve, but if you notice some other inconsistency in my code that would explain it, please point it out.
Note3: My code is badly structured because I've changed many things while trying to debug this problem, with no luck so far.
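Regarding Note2: re-reading my own header check, if the HEAD response has no Content-Type header at all, the KeyError path returns (url, False), so a site could be classified as "no HTML" without the body actually lacking it. Here is a tiny stub of just that check, isolated from the network code (the headers dicts below are made up for illustration):

```python
def has_html(headers):
    # Same logic as in getHeaders: a missing Content-Type header
    # is treated exactly like a non-HTML one.
    try:
        return "html" in headers["Content-Type"]
    except KeyError:
        return False

print(has_html({"Content-Type": "text/html; charset=utf-8"}))  # True
print(has_html({"Content-Type": "application/json"}))          # False
print(has_html({}))                                            # False: no header at all
```

So if reddit.com answers a HEAD request without a Content-Type header (or the bare except swallows some other error), it would end up in the "without HTML" set.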
I've heard somewhere that this is a Windows restriction with no way to bypass it. The problem is that:
a) I don't understand what "too many file descriptors in select()" actually means.
b) What am I doing wrong that Windows can't handle? I've seen people push thousands of requests with asyncio and aiohttp, but even with my chunking I can't push 30-50 without getting a ValueError?
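For part a), my current understanding (please correct me if I'm wrong) is that every socket the event loop is watching counts as one file descriptor handed to select(), and I've read that on Windows select() is capped at 512 of them. A minimal demonstration of the selector machinery from the traceback, using a local socket pair instead of real HTTP connections:

```python
import selectors
import socket

# The same SelectSelector that appears in the traceback.
sel = selectors.SelectSelector()

# A connected socket pair standing in for a real HTTP connection;
# every socket registered here is one "file descriptor" for select().
a, b = socket.socketpair()
sel.register(a, selectors.EVENT_READ)

b.send(b"x")                    # make `a` readable
events = sel.select(timeout=1)  # the call that raises ValueError in my traceback
print(len(events))              # 1: one monitored socket is ready

sel.unregister(a)
a.close()
b.close()
```

If that's right, then the error means the loop was monitoring more sockets at once than select() on Windows can accept, which is what confuses me given the semaphore and the chunking.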
Edit: Turns out that with MAXitems = 10 it hasn't crashed yet, but since I can't see a pattern I have no idea why, or what that tells me.
Edit2: Never mind, it just needed more time to crash; it did eventually, even with MAXitems = 10.