I've created a script in Python using pyppeteer to collect the names of different institutions across multiple pages of a website. I want the script to move through the pages by clicking the next-page button, parsing the names from each page along the way.
website address
What I've tried:
    import asyncio
    from pyppeteer import launch

    url = "https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx"

    async def fetch_table(link):
        browser = await launch(headless=False)
        [page] = await browser.pages()
        await page.goto(link)

        while True:
            await page.waitForSelector("h1.faqsno-heading", {'visible':True})
            for item in await page.querySelectorAll("h1.faqsno-heading"):
                name = await item.querySelectorEval("div[id^='arrowex']",'e => e.innerText')
                print(name)
            try:
                elem = await page.querySelector("[title='Next Page']")
                await elem.click()
            except Exception:
                break

    if __name__ == '__main__':
        loop = asyncio.get_event_loop()
        loop.run_until_complete(fetch_table(url))
The script above works fine until it hits an error somewhere between the 5th and 10th page; the exact page varies from run to run.
    Traceback (most recent call last):
      File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\demo.py", line 23, in <module>
        loop.run_until_complete(fetch_table(url))
      File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\asyncio\base_events.py", line 568, in run_until_complete
        return future.result()
      File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\demo.py", line 11, in fetch_table
        await page.waitForSelector("h1.faqsno-heading", {'visible':True})
      File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyppeteer\frame_manager.py", line 834, in __await__
        raise result
    pyppeteer.errors.TimeoutError: Waiting for selector "h1.faqsno-heading" failed: timeout 30000ms exceeds.
However, when I make a minor change and perform the click like this, the script again works for a while before failing:
            try:
                await page.click("[title='Next Page']")
            except Exception:
                break
I get the following error:
    Traceback (most recent call last):
      File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\demo.py", line 48, in <module>
        loop.run_until_complete(fetch_table(url))
      File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\asyncio\base_events.py", line 568, in run_until_complete
        return future.result()
      File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\demo.py", line 37, in fetch_table
        await page.waitForSelector("h1.faqsno-heading", {'visible':True})
      File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyppeteer\frame_manager.py", line 832, in __await__
        result = yield from self.promise
      File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyppeteer\frame_manager.py", line 859, in rerun
        *self._args,
      File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyppeteer\execution_context.py", line 109, in evaluateHandle
        _rewriteError(e)
      File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyppeteer\execution_context.py", line 239, in _rewriteError
        raise error
      File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyppeteer\execution_context.py", line 106, in evaluateHandle
        'userGesture': True,
    pyppeteer.errors.NetworkError: Protocol error Runtime.callFunctionOn: Target closed.
How can I let my script keep going until all the clicks are performed?
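One possible direction, offered as a rough sketch rather than a confirmed fix: stop on a missing button explicitly instead of catching every exception, and after each click wait until the first name on the page has actually changed before parsing again. This assumes that clicking "Next Page" replaces the list in place and that consecutive pages start with different names; neither has been verified against the site. The helper names `read_names`, `go_next`, and `collect_all`, the `max_pages` cap, and the 60-second timeout are hypothetical choices made for illustration.

```python
import asyncio

HEADING = "h1.faqsno-heading"               # selector from the script above
NAME = HEADING + " div[id^='arrowex']"      # name element inside each heading
NEXT = "[title='Next Page']"                # selector from the script above


async def read_names(page):
    """Return the institution names currently rendered on the page."""
    await page.waitForSelector(HEADING, {"visible": True})
    return await page.querySelectorAllEval(
        NAME, "els => els.map(e => e.innerText)"
    )


async def go_next(page, previous_first_name):
    """Click 'Next Page' and wait for new content; return False if no button."""
    button = await page.querySelector(NEXT)
    if button is None:  # querySelector returns None when nothing matches
        return False
    await button.click()
    # Assumption: the first name changes once the next page has loaded.
    await page.waitForFunction(
        """(sel, old) => {
            const el = document.querySelector(sel);
            return el && el.innerText !== old;
        }""",
        {"timeout": 60000},  # hypothetical, more generous than the 30 s default
        NAME,
        previous_first_name,
    )
    return True


async def collect_all(page, max_pages=500):
    """Walk the pages, gathering names until there is no next page."""
    names = []
    for _ in range(max_pages):  # hard cap so the loop cannot run forever
        current = await read_names(page)
        names.extend(current)
        if not current or not await go_next(page, current[0]):
            break
    return names
```

Inside `fetch_table`, the `while True` loop could then be replaced with something like `names = await collect_all(page)` followed by printing each name. If the click instead triggers a full page reload, waiting on a navigation event rather than on a content change may be the better fit.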