My script encounters an error when it is supposed to run asynchronously

Posted 2019-08-02 05:10

I've written a script in Python using asyncio in association with the aiohttp library to asynchronously parse the names out of the pop-up boxes that appear upon clicking the contact info buttons of the different agencies listed in a table on this website. The webpage displays the tabular content across 513 pages.

I encountered the error too many file descriptors in select() when I tried with asyncio.get_event_loop(), but when I came across this thread I saw a suggestion to use asyncio.ProactorEventLoop() to avoid that error. I used the latter, but noticed that even after complying with the suggestion, the script collects the names from only a few pages before it throws the following error. How can I fix this?

raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host www.tursab.org.tr:443 ssl:None [The semaphore timeout period has expired]

This is my attempt so far:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

links = ["https://www.tursab.org.tr/en/travel-agencies/search-travel-agency?sayfa={}".format(page) for page in range(1,514)]
lead_link = "https://www.tursab.org.tr/en/displayAcenta?AID={}"

async def get_links(url):
    async with asyncio.Semaphore(10):
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                text = await response.text()
                result = await process_docs(text)
            return result

async def process_docs(html):
    coros = []
    soup = BeautifulSoup(html,"lxml")
    items = [itemnum.get("data-id") for itemnum in soup.select("#acentaTbl tr[data-id]")]
    for item in items:
        coros.append(fetch_again(lead_link.format(item)))
    await asyncio.gather(*coros)

async def fetch_again(link):
    async with asyncio.Semaphore(10):
        async with aiohttp.ClientSession() as session:
            async with session.get(link) as response:
                text = await response.text()
                sauce = BeautifulSoup(text,"lxml")
                try:
                    name = sauce.select_one("p > b").text
                except Exception: name = ""
                print(name)

if __name__ == '__main__':
    loop = asyncio.ProactorEventLoop()
    asyncio.set_event_loop(loop)
    loop.run_until_complete(asyncio.gather(*(get_links(link) for link in links)))

In short, the process_docs() function collects the data-id numbers from each page so they can be reused as the AID value in this https://www.tursab.org.tr/en/displayAcenta?AID={} link, from which the names in the pop-up boxes are collected. One such id is 8757, and one such qualified link is therefore https://www.tursab.org.tr/en/displayAcenta?AID=8757.
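
For illustration only, here is the gist of that step in isolation (assuming the same #acentaTbl markup as above): it pulls the data-id attributes out of one page of HTML and builds the detail links from them.

from bs4 import BeautifulSoup

def build_detail_links(html):
    # Collect every data-id from the agencies table and turn it into a detail URL
    soup = BeautifulSoup(html, "lxml")
    ids = [row.get("data-id") for row in soup.select("#acentaTbl tr[data-id]")]
    return ["https://www.tursab.org.tr/en/displayAcenta?AID={}".format(i) for i in ids]

# e.g. a data-id of 8757 yields https://www.tursab.org.tr/en/displayAcenta?AID=8757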

By the way, if I change the highest number used in the links variable to 20 or 30 or so, it goes smoothly.

1 Answer

Answer by 我只想做你的唯一 · 2019-08-02 05:46
async def get_links(url):
    async with asyncio.Semaphore(10):

You can't do that: it means a new semaphore instance is created on every function call, whereas you need a single semaphore instance shared by all requests. Change your code this way:

sem = asyncio.Semaphore(10)  # module level

async def get_links(url):
    async with sem:
        # ...


async def fetch_again(link):
    async with sem:
        # ...

You can also return to the default event loop once you're using the semaphore correctly:

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(...)

Finally, you should alter both get_links(url) and fetch_again(link) to do the parsing outside of the semaphore, so that it is released as soon as possible, before the semaphore is needed again inside process_docs(text).

Final code:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

links = ["https://www.tursab.org.tr/en/travel-agencies/search-travel-agency?sayfa={}".format(page) for page in range(1,514)]
lead_link = "https://www.tursab.org.tr/en/displayAcenta?AID={}"

sem = asyncio.Semaphore(10)  # a single semaphore instance shared by all requests

async def get_links(url):
    async with sem:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                text = await response.text()
    # process_docs() is awaited outside the semaphore, so it is free for the nested requests
    result = await process_docs(text)
    return result

async def process_docs(html):
    coros = []
    soup = BeautifulSoup(html,"lxml")
    items = [itemnum.get("data-id") for itemnum in soup.select("#acentaTbl tr[data-id]")]
    for item in items:
        coros.append(fetch_again(lead_link.format(item)))
    await asyncio.gather(*coros)

async def fetch_again(link):
    async with sem:
        async with aiohttp.ClientSession() as session:
            async with session.get(link) as response:
                text = await response.text()
    # Parsing happens after the semaphore has been released
    sauce = BeautifulSoup(text,"lxml")
    try:
        name = sauce.select_one("p > b").text
    except Exception:
        name = "o"
    print(name)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*(get_links(link) for link in links)))
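
As a side note, and not part of the fix above: instead of opening a new ClientSession per request, you can share a single session and let aiohttp's connector cap the number of simultaneous connections. The structure and the limit value below are an illustrative sketch, not the original code:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

links = ["https://www.tursab.org.tr/en/travel-agencies/search-travel-agency?sayfa={}".format(page) for page in range(1, 514)]
lead_link = "https://www.tursab.org.tr/en/displayAcenta?AID={}"

async def fetch_name(session, link):
    # The connector limit on the shared session already throttles concurrency,
    # so no explicit semaphore is needed here.
    async with session.get(link) as response:
        text = await response.text()
    sauce = BeautifulSoup(text, "lxml")
    tag = sauce.select_one("p > b")
    print(tag.text if tag else "")

async def crawl_page(session, url):
    async with session.get(url) as response:
        text = await response.text()
    soup = BeautifulSoup(text, "lxml")
    items = [row.get("data-id") for row in soup.select("#acentaTbl tr[data-id]")]
    await asyncio.gather(*(fetch_name(session, lead_link.format(item)) for item in items))

async def main():
    # limit=10 caps the number of simultaneous TCP connections at the connector level
    connector = aiohttp.TCPConnector(limit=10)
    async with aiohttp.ClientSession(connector=connector) as session:
        await asyncio.gather(*(crawl_page(session, link) for link in links))

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())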