I want to get all the titles from this website:
http://www.shyan.gov.cn/zwhd/web/webindex.action
My code currently scrapes only one page. However, the site above has multiple pages, and I would like to scrape all of them.
For example, when I click the link to "page 2" at the URL above, the URL in the address bar does NOT change. I looked at the page source and saw that the links advance to the next page with JavaScript, like this: javascript:gotopage(2) or javascript:void(0). Here is my code, which gets page 1:
from bs4 import BeautifulSoup
import requests

url = 'http://www.shyan.gov.cn/zwhd/web/webindex.action'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
# Each title is a link inside a <td class="tit3"> cell
titles = soup.select('td.tit3 > a')
for title in titles:
    print(title.get_text())
How can my code be changed to scrape titles from all the available listed pages? Thank you very much!
Try using the following URL format:
http://www.shiyan.gov.cn/zwhd/web/webindex.action?keyWord=&searchType=3&page.currentpage=2&page.pagesize=15&page.pagecount=2357&docStatus=&sendOrg=
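A minimal sketch of how that URL format could be looped over, assuming the server accepts the same query parameters for every page number. The helper names (page_params, extract_titles, scrape_titles), the default page range and the one-second delay are hypothetical choices for illustration; the parameter values are copied from the URL above and the selector from the question.

```python
import time

import requests
from bs4 import BeautifulSoup

# Base of the URL above, without the query string.
BASE_URL = "http://www.shiyan.gov.cn/zwhd/web/webindex.action"


def page_params(page, page_size=15, page_count=2357):
    """Build the query parameters for one listing page.

    The values mirror the URL above; whether the server needs all of
    them (page.pagecount in particular) is an assumption.
    """
    return {
        "keyWord": "",
        "searchType": 3,
        "page.currentpage": page,
        "page.pagesize": page_size,
        "page.pagecount": page_count,
        "docStatus": "",
        "sendOrg": "",
    }


def extract_titles(html):
    """Return the text of every link matched by 'td.tit3 > a'."""
    # html.parser is used here only to avoid the extra lxml dependency.
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.select("td.tit3 > a")]


def scrape_titles(first_page=1, last_page=3, delay=1.0):
    """Yield titles page by page, stopping early if a page has none."""
    with requests.Session() as session:
        for page in range(first_page, last_page + 1):
            r = session.get(BASE_URL, params=page_params(page), timeout=10)
            r.raise_for_status()
            titles = extract_titles(r.content)
            if not titles:
                break  # assume an empty page means we ran past the end
            yield from titles
            time.sleep(delay)  # small pause between requests
```

Something like `for title in scrape_titles(1, 5): print(title)` would then print the titles from the first five pages, if the site responds as assumed.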
The site uses JavaScript to pass hidden page information to the server when requesting the next page. When you view the source you will find: