I've written a script using asyncio together with the aiohttp library to parse the content of a website asynchronously. I've tried to apply the logic within the following script the way it is usually applied in scrapy.
However, when I execute my script, it behaves the way synchronous libraries like requests or urllib.request do. It is therefore very slow and doesn't serve the purpose.
I know I can get around this by defining all the next-page links upfront in the link variable. But am I not already doing the task the right way with my existing script?
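To make that workaround concrete, this is roughly what I mean. It reuses the fetch() coroutine from the script below, and the ?page= query parameter is my assumption about how the listing paginates:

# Workaround sketch: enumerate the listing pages upfront and fetch them
# all concurrently, instead of discovering each next page while crawling.
# The "?page=N" parameter is an assumed pagination scheme, and fetch()
# is the coroutine defined in the full script below.
import asyncio

base = "https://stackoverflow.com/questions/tagged/web-scraping"
links = ["{}?page={}".format(base, page) for page in range(1, 5)]

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*(fetch(url) for url in links)))
loop.close()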
Within the script, what the processing_docs() function does is collect all the links of the different posts and pass those refined links to the fetch_again() function, which fetches the title from each target page. There is also logic within processing_docs() that collects the next_page link and supplies it back to the fetch() function to repeat the same process. This next_page call is what makes the script slow, whereas we usually do the same thing in scrapy and get the expected performance.
My question is: how can I achieve that performance while keeping my existing logic intact?
import aiohttp
import asyncio
from lxml.html import fromstring
from urllib.parse import urljoin

link = "https://stackoverflow.com/questions/tagged/web-scraping"


async def fetch(url):
    # Fetch a listing page and pass its HTML on for processing.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            text = await response.text()
            result = await processing_docs(session, text)
        return result


async def processing_docs(session, html):
    tree = fromstring(html)
    # Absolute links to the individual posts on this listing page.
    titles = [urljoin(link, title.attrib['href'])
              for title in tree.cssselect(".summary .question-hyperlink")]
    for title in titles:
        await fetch_again(session, title)
    # Follow the pagination link, if there is one, and start over.
    next_page = tree.cssselect("div.pager a[rel='next']")
    if next_page:
        page_link = urljoin(link, next_page[0].attrib['href'])
        await fetch(page_link)


async def fetch_again(session, url):
    # Fetch an individual post and print its title.
    async with session.get(url) as response:
        text = await response.text()
        tree = fromstring(text)
        title = tree.cssselect("h1[itemprop='name'] a")[0].text
        print(title)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*(fetch(url) for url in [link])))
    loop.close()
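
For reference, here is the kind of concurrency I'm hoping for, sketched as a possible replacement for processing_docs() above. It uses asyncio.gather() to schedule the per-post requests all at once instead of awaiting them one by one; I'm not sure whether this still counts as keeping the existing logic intact, which is part of what I'm asking:

# Sketch of a concurrent variant of processing_docs() from the script
# above: the fetch_again() calls are scheduled together through
# asyncio.gather() rather than awaited one at a time, so the detail
# pages download concurrently. Pagination is still followed sequentially.
async def processing_docs(session, html):
    tree = fromstring(html)
    titles = [urljoin(link, title.attrib['href'])
              for title in tree.cssselect(".summary .question-hyperlink")]
    # All detail-page requests run concurrently on the shared session.
    await asyncio.gather(*(fetch_again(session, title) for title in titles))
    next_page = tree.cssselect("div.pager a[rel='next']")
    if next_page:
        page_link = urljoin(link, next_page[0].attrib['href'])
        await fetch(page_link)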