I've written a script with asyncio in combination with the aiohttp library to parse the content of a website asynchronously. I've tried to apply the logic within the following script the way it is usually applied in scrapy.

However, when I execute my script, it behaves the way synchronous libraries like requests or urllib.request do. Therefore, it is very slow and doesn't serve the purpose.
I know I can get around this by defining all the next-page links within the link variable. But am I not already doing the task the right way with my existing script?
Within the script, what the processing_docs() function does is collect all the links of the different posts and pass the refined links to the fetch_again() function to fetch the title from its target page. There is also logic within processing_docs() which collects the next-page link and supplies it to the fetch() function to repeat the same process. This next-page call is making the script slower, whereas we usually do the same in scrapy and get the expected performance.

My question is: how can I achieve the same thing while keeping the existing logic intact?
import aiohttp
import asyncio
from lxml.html import fromstring
from urllib.parse import urljoin

link = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            text = await response.text()
            result = await processing_docs(session, text)
            return result

async def processing_docs(session, html):
    tree = fromstring(html)
    titles = [urljoin(link, title.attrib['href']) for title in tree.cssselect(".summary .question-hyperlink")]
    for title in titles:
        await fetch_again(session, title)

    next_page = tree.cssselect("div.pager a[rel='next']")
    if next_page:
        page_link = urljoin(link, next_page[0].attrib['href'])
        await fetch(page_link)

async def fetch_again(session, url):
    async with session.get(url) as response:
        text = await response.text()
        tree = fromstring(text)
        title = tree.cssselect("h1[itemprop='name'] a")[0].text
        print(title)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*(fetch(url) for url in [link])))
    loop.close()
The whole point of using asyncio is that you can run multiple fetches concurrently (in parallel with each other). Let's look at your code:
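This is the loop inside your processing_docs():

    for title in titles:
        await fetch_again(session, title)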
This part means that each new fetch_again will be started only after the previous one was awaited (finished). If you do things this way, then yes, there's no difference from using a sync approach.

To invoke the full power of asyncio, start multiple fetches concurrently using asyncio.gather; a sketch of this change follows. You'll see a significant speedup.
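A minimal sketch, assuming the rest of your script stays as posted and only processing_docs() changes:

async def processing_docs(session, html):
    tree = fromstring(html)
    titles = [urljoin(link, title.attrib['href']) for title in tree.cssselect(".summary .question-hyperlink")]

    # Start all title fetches at once instead of awaiting them one by one
    await asyncio.gather(*(fetch_again(session, title) for title in titles))

    next_page = tree.cssselect("div.pager a[rel='next']")
    if next_page:
        page_link = urljoin(link, next_page[0].attrib['href'])
        await fetch(page_link)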
You can go even further and start the fetch for the next page concurrently with the fetch_again calls for the titles:
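Again only a sketch of processing_docs(), under the same assumptions as above:

async def processing_docs(session, html):
    tree = fromstring(html)
    titles = [urljoin(link, title.attrib['href']) for title in tree.cssselect(".summary .question-hyperlink")]

    # Collect the title fetches and, if present, the next-page fetch into one batch
    tasks = [fetch_again(session, title) for title in titles]

    next_page = tree.cssselect("div.pager a[rel='next']")
    if next_page:
        page_link = urljoin(link, next_page[0].attrib['href'])
        tasks.append(fetch(page_link))

    # The next page is downloaded while the titles are still being fetched
    await asyncio.gather(*tasks)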
Important note

While such an approach allows you to do things much faster, you may want to limit the number of concurrent requests at any given time to avoid significant resource usage on both your machine and the server.
You can use asyncio.Semaphore for this purpose:
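A sketch of one way to do it, shown here only for fetch_again; the limit of 10 is an arbitrary assumption:

sem = asyncio.Semaphore(10)  # allow at most 10 requests at a time

async def fetch_again(session, url):
    async with sem:  # the semaphore is held only while the request is in flight
        async with session.get(url) as response:
            text = await response.text()
    tree = fromstring(text)
    title = tree.cssselect("h1[itemprop='name'] a")[0].text
    print(title)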