Script performs very slowly even though it runs asynchronously

Posted 2019-08-03 19:42

I've written a script with asyncio, together with the aiohttp library, to parse the content of a website asynchronously. I've tried to apply the logic within the following script the way it is usually applied in scrapy.

However, when I execute my script, it behaves the way synchronous libraries like requests or urllib.request do. Therefore, it is very slow and doesn't serve the purpose.

I know I can get around this by defining all the next-page links within the link variable. But am I not already doing the task the right way with my existing script?

Within the script, the processing_docs() function collects all the links of the different posts and passes the refined links to the fetch_again() function to fetch the title from its target page. There is also logic within processing_docs() that collects the next_page link and supplies it back to fetch() to repeat the process. This next_page call is what makes the script slow, whereas we usually do the same in scrapy and get the expected performance.

My question is: How can I achieve the same keeping the existing logic intact?

import aiohttp
import asyncio
from lxml.html import fromstring
from urllib.parse import urljoin

link = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            text = await response.text()
            result = await processing_docs(session, text)
        return result

async def processing_docs(session, html):
    tree = fromstring(html)
    titles = [urljoin(link, title.attrib['href']) for title in tree.cssselect(".summary .question-hyperlink")]
    for title in titles:
        await fetch_again(session, title)

    next_page = tree.cssselect("div.pager a[rel='next']")
    if next_page:
        page_link = urljoin(link, next_page[0].attrib['href'])
        await fetch(page_link)

async def fetch_again(session, url):
    async with session.get(url) as response:
        text = await response.text()
        tree = fromstring(text)
        title = tree.cssselect("h1[itemprop='name'] a")[0].text
        print(title)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*(fetch(url) for url in [link])))
    loop.close()

1 Answer
别忘想泡老子
#2 · 2019-08-03 20:14

The whole point of using asyncio is that you can run multiple fetches concurrently (in parallel with each other). Let's look at your code:

for title in titles:
    await fetch_again(session, title)

This part means that each new fetch_again will be started only after the previous one was awaited (finished). If you do things this way, then yes, there's no difference from using the synchronous approach.

To invoke the full power of asyncio, start multiple fetches concurrently using asyncio.gather:

await asyncio.gather(*[
    fetch_again(session, title)
    for title in titles
])

You'll see a significant speedup.
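
If you want to verify this in isolation, here is a minimal runnable sketch (with a hypothetical fake_fetch built on asyncio.sleep standing in for a real request) comparing the two approaches:

import asyncio
import time

async def fake_fetch(i):
    # Stand-in for a network request that takes about a second
    await asyncio.sleep(1)
    return i

async def sequential():
    # Each await blocks until the previous "request" has finished: ~5s total
    return [await fake_fetch(i) for i in range(5)]

async def concurrent():
    # All five "requests" run at the same time: ~1s total
    return await asyncio.gather(*(fake_fetch(i) for i in range(5)))

loop = asyncio.get_event_loop()
for coro in (sequential, concurrent):
    start = time.perf_counter()
    loop.run_until_complete(coro())
    print(coro.__name__, round(time.perf_counter() - start, 1))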


You can go even further and start the fetch for the next page concurrently with fetch_again for the titles:

async def processing_docs(session, html):
    coros = []

    tree = fromstring(html)

    # titles:
    titles = [
        urljoin(link, title.attrib['href'])
        for title in tree.cssselect(".summary .question-hyperlink")
    ]

    for title in titles:
        coros.append(
            fetch_again(session, title)
        )

    # next_page:
    next_page = tree.cssselect("div.pager a[rel='next']")
    if next_page:
        page_link = urljoin(link, next_page[0].attrib['href'])

        coros.append(
            fetch(page_link)
        )

    # await:
    await asyncio.gather(*coros)
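
One caveat with gather: by default, the first exception raised by any coroutine propagates out of the await, aborting processing_docs. If one failed page shouldn't abort the rest, gather accepts return_exceptions=True; a minimal sketch of the awaiting step under that assumption:

# failed fetches come back as exception objects in the results list
# instead of raising out of processing_docs
results = await asyncio.gather(*coros, return_exceptions=True)
for result in results:
    if isinstance(result, Exception):
        print("request failed:", result)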

Important note

While this approach lets you do things much faster, you may want to limit the number of concurrent requests at any given time to avoid heavy resource usage on both your machine and the server.

You can use asyncio.Semaphore for this purpose:

semaphore = asyncio.Semaphore(10)

async def fetch(url):
    async with semaphore:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                text = await response.text()
                result = await processing_docs(session, text)
            return result
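
Alternatively (or in addition), aiohttp can cap concurrency at the connection-pool level via TCPConnector(limit=...); its default limit is 100. A sketch, assuming the code is restructured to share one session instead of opening a new one per fetch call:

import aiohttp
import asyncio

async def main():
    # at most 10 simultaneous connections to the server
    connector = aiohttp.TCPConnector(limit=10)
    async with aiohttp.ClientSession(connector=connector) as session:
        await fetch(session, link)  # assumes fetch is reworked to accept the session

loop = asyncio.get_event_loop()
loop.run_until_complete(main())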