Some of the search queries entered under https://www.comparis.ch/carfinder/default yield more than 1'000 results (the count is shown dynamically on the search page). However, the results only show a maximum of 100 pages with 10 results each, so I'm trying to scrape the remaining data for a query that yields more than 1'000 results. The code to scrape the car IDs of the first 100 pages is below (it takes approx. 2 minutes to run through all 100 pages):
from bs4 import BeautifulSoup
import requests

# the search UI shows at most 100 pages of 10 results each
number_of_pages = 100
base_url = 'https://www.comparis.ch/carfinder/marktplatz/occasion'

# initiate an empty dict keyed by car ID
car_dict = {}

# parse every search results page and extract every car ID
for page in range(number_of_pages + 1):
    response = requests.get(base_url + '?page=' + str(page))
    soup = BeautifulSoup(response.content, "lxml")
    for car in soup.find('div', {'id': 'cf-result-list'}).find_all('h2'):
        # the car ID is the last path segment of the link inside each <h2>
        car_id = int(car.decode().split('href="')[1].split('">')[0].split('/')[-1])
        car_dict[car_id] = {}
So I obviously tried just passing a str(page) greater than 100, which does not yield additional results.
How could I access the remaining results, if at all?
It seems that the website loads data dynamically as the client browses. There are a number of ways to deal with this; one option is to use Scrapy Splash. Assuming you use Scrapy, you can do the following:
In settings.py, add:

SPLASH_URL = <splash-server-ip-address>
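If you don't have a Splash instance running yet, the scrapy-splash README suggests starting one via Docker (docker run -p 8050:8050 scrapinghub/splash), in which case SPLASH_URL would typically be http://localhost:8050.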
Also in settings.py, add the Splash middlewares.
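A minimal sketch following the configuration documented in the scrapy-splash README (the numbers are the library's recommended middleware priorities):

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'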
In your spider.py, add the import:

from scrapy_splash import SplashRequest

Then replace start_urls with a start_requests method that iterates over the pages and yields a SplashRequest for each one.
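E.g. like this (a minimal sketch, assuming the same ?page= parameter as in your code; the spider name, the page range beyond 100, and the wait time are placeholders to adapt):

import scrapy
from scrapy_splash import SplashRequest

class CarfinderSpider(scrapy.Spider):
    # hypothetical spider name, adapt to your project
    name = 'carfinder'

    def start_requests(self):
        base_url = 'https://www.comparis.ch/carfinder/marktplatz/occasion'
        # iterate past the 100 pages the plain search UI exposes
        for page in range(1, 201):
            yield SplashRequest(
                base_url + '?page=' + str(page),
                callback=self.parse,
                args={'wait': 2},  # seconds Splash waits so dynamic content can load
            )

    def parse(self, response):
        # same extraction idea as your BeautifulSoup code: the car ID is
        # the last path segment of the link inside each <h2>
        for href in response.css('#cf-result-list h2 a::attr(href)').getall():
            yield {'car_id': int(href.rstrip('/').split('/')[-1])}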
Let me know how that works out for you.