Some of the search queries entered under https://www.comparis.ch/carfinder/default yield more than 1'000 results (the count is shown dynamically on the search page). However, the results only show a maximum of 100 pages with 10 results each, so I'm trying to scrape the remaining data for a query that yields more than 1'000 results. The code to scrape the car IDs of the first 100 pages is below (it takes approx. 2 minutes to run through all 100 pages):
from bs4 import BeautifulSoup
import requests

# the search UI shows at most 100 pages of 10 results each
number_of_pages = 100
base_url = 'https://www.comparis.ch/carfinder/marktplatz/occasion'

# initiate an empty dict keyed by car ID
car_dict = {}

# parse every search results page and extract every car ID
for page in range(number_of_pages + 1):
    response = requests.get(base_url + '?page=' + str(page))
    soup = BeautifulSoup(response.content, "lxml")
    for car in soup.find('div', {'id': 'cf-result-list'}).find_all('h2'):
        # the car ID is the last path segment of the link inside each <h2>
        car_id = int(car.decode().split('href="')[1].split('">')[0].split('/')[-1])
        car_dict[car_id] = {}
So I obviously tried just passing a str(page) greater than 100, which does not yield additional results.
How could I access the remaining results, if at all?
It seems that the website loads data dynamically as the client browses. There are a number of ways to deal with this; one option is to use Scrapy Splash. Assuming you use Scrapy, you can do the following:
In settings.py, add:

SPLASH_URL = <splash-server-ip-address>
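If you don't have a Splash instance running yet, the scrapy-splash README suggests starting one via Docker (docker run -p 8050:8050 scrapinghub/splash), in which case SPLASH_URL would typically be http://localhost:8050.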
Also in settings.py, add the Splash middlewares.
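A minimal sketch following the configuration documented in the scrapy-splash README (the numbers are the library's recommended middleware priorities):

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'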
In your spider.py, add the import:

from scrapy_splash import SplashRequest

Then replace start_urls with a start_requests method that iterates over the pages and yields a SplashRequest for each one.
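E.g. like this (a minimal sketch, assuming the same ?page= parameter as in your code; the spider name, the page range beyond 100, and the wait time are placeholders to adapt):

import scrapy
from scrapy_splash import SplashRequest

class CarfinderSpider(scrapy.Spider):
    # hypothetical spider name, adapt to your project
    name = 'carfinder'

    def start_requests(self):
        base_url = 'https://www.comparis.ch/carfinder/marktplatz/occasion'
        # iterate past the 100 pages the plain search UI exposes
        for page in range(1, 201):
            yield SplashRequest(
                base_url + '?page=' + str(page),
                callback=self.parse,
                args={'wait': 2},  # seconds Splash waits so dynamic content can load
            )

    def parse(self, response):
        # same extraction idea as your BeautifulSoup code: the car ID is
        # the last path segment of the link inside each <h2>
        for href in response.css('#cf-result-list h2 a::attr(href)').getall():
            yield {'car_id': int(href.rstrip('/').split('/')[-1])}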
Let me know how that works out for you.