Hi, I don't have much experience with web scraping or with Scrapy and Selenium, so apologies in advance if there are bad practices in my code.
Brief background: I am trying to scrape product information from multiple websites using Scrapy, and I also use Selenium because I need to click the "view more" and "No thanks" buttons on the page. Since the website has hrefs for different categories, I also request those "sublinks" to make sure I don't miss any items that are not shown on the root page.
The problem is that, inside the loop

    for l in product_links:

Scrapy and Selenium seem to act strangely. For example, I would expect

    response.url == self.driver.current_url

to always be true. However, the two become different in the middle of this for loop. Furthermore, self.driver seems to capture elements that do not exist at the current URL in

    products = self.driver.find_elements_by_xpath('//div[@data-url]')

and then fails to retrieve them again in

    sub = self.driver.find_elements_by_xpath('//div[(@class="shelf-container") and (.//div/@data-url="' + l + '")]//h2')

Many thanks, I'm really confused.
from webScrape.items import ProductItem
from scrapy import Spider, Request
from selenium import webdriver


class MySpider(Spider):
    name = 'name'
    domain = 'https://uk.burberry.com'

    def __init__(self):
        super().__init__()
        self.driver = webdriver.Chrome('path to driver')
        self.start_urls = [self.domain + '/' + k for k in ('womens-clothing', 'womens-bags', 'womens-scarves',
                                                           'womens-accessories', 'womens-shoes', 'make-up', 'womens-fragrances')]
        # IDs of products already scraped, to avoid duplicate requests
        self.pool = set()

    def parse(self, response):
        # Follow the category sublinks so items missing from the root page are covered
        sub_links = response.xpath('//h2[starts-with(@class, "shelf1-section-title")]/a/@href').extract()
        if len(sub_links) > 0:
            for l in sub_links:
                yield Request(self.domain + l, callback=self.parse)
        self.driver.get(response.url)
        # Dismiss the newsletter pop-up ("No thanks") if it is shown
        email_reg = self.driver.find_element_by_xpath('//button[@class="dc-reset dc-actions-btn js-data-capture-newsletter-block-cancel"]')
        if email_reg.is_displayed():
            email_reg.click()
        # Make sure to click all the "load more" buttons
        load_more_buttons = self.driver.find_elements_by_xpath('//div[@class="load-assets-button js-load-assets-button ga-shelf-load-assets-button"]')
        for button in load_more_buttons:
            if button.is_displayed():
                button.click()
        products = self.driver.find_elements_by_xpath('//div[@data-url]')
        product_links = [item.get_attribute('data-url') for item in products
                         if item.get_attribute('data-url').split('-')[-1][1:] not in self.pool]
        for l in product_links:
            sub = self.driver.find_elements_by_xpath('//div[(@class="shelf-container") and (.//div/@data-url="' + l + '")]//h2')
            if len(sub) > 0:
                sub_category = ', '.join(set([s.get_attribute('data-ga-shelf-title') for s in sub]))
            else:
                sub_category = ''
            yield Request(self.domain + l, callback=self.parse_product, meta={'sub_category': sub_category})

    def parse_product(self, response):
        item = ProductItem()
        item['id'] = response.url.split('-')[-1][1:]
        item['sub_category'] = response.meta['sub_category']
        item['name'] = response.xpath('//h1[@class="product-title transaction-title ta-transaction-title"]/text()').extract()[0].strip()
        self.pool.add(item['id'])
        yield item
        # Related products referenced on the product page
        others = response.xpath('//input[@data-url]/@data-url').extract()
        for l in others:
            if l.split('-')[-1][1:] not in self.pool:
                yield Request(self.domain + l, callback=self.parse_product, meta=response.meta)
Scrapy is an asynchronous framework. The code in your parse*() methods does not always run linearly: wherever there is a yield, the execution of that method may stop for some time while other parts of the code run.

Because there is a yield inside that loop, this explains the unexpected behavior you are seeing. At a yield, some other part of your program resumes execution and may switch the Selenium driver to a different URL, so by the time the loop resumes, the current URL of the driver has changed.

To be honest, you don't really need Selenium in Scrapy for your use case, as far as I can see. In Scrapy, things like Splash or Selenium are only used in very specific scenarios, for things like avoiding bot detection.
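If you do keep Selenium, one way around this is to finish all of the Selenium work before the first yield in that loop, so that a suspension can no longer change what the loop reads from the driver. A minimal sketch of the tail of your parse() method (same XPaths as your code; only the order of operations changes):

    # Read everything we need from the driver BEFORE yielding anything,
    # so the suspension at each yield cannot affect what the loop sees.
    pairs = []
    for l in product_links:
        sub = self.driver.find_elements_by_xpath(
            '//div[(@class="shelf-container") and (.//div/@data-url="' + l + '")]//h2')
        # ', '.join over an empty set gives '', matching your else branch
        sub_category = ', '.join({s.get_attribute('data-ga-shelf-title') for s in sub})
        pairs.append((l, sub_category))
    # Only now hand control back to Scrapy; the driver is no longer needed here.
    for l, sub_category in pairs:
        yield Request(self.domain + l, callback=self.parse_product,
                      meta={'sub_category': sub_category})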
It is usually a better approach to figure out the structure of the page HTML and the parameters used in the underlying requests with your web browser's developer tools (Inspect, Network) and then reproduce those requests in Scrapy.
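For example, if the Network tab shows that the "view more" button simply fetches another page of shelf HTML, the whole spider can stay in plain Scrapy with no browser at all. The snippet below is only a sketch: the data-next-url attribute is made up for illustration, and the real endpoint and parameters have to be read from your browser's developer tools.

    from scrapy import Spider, Request

    class ProductSpider(Spider):
        name = 'products'
        domain = 'https://uk.burberry.com'
        start_urls = [domain + '/womens-clothing']

        def parse(self, response):
            # The product tiles carry their detail-page path in @data-url,
            # exactly as in the Selenium version.
            for link in response.xpath('//div[@data-url]/@data-url').extract():
                yield Request(self.domain + link, callback=self.parse_product)
            # Reproduce the "view more" request instead of clicking the button.
            # @data-next-url is a HYPOTHETICAL attribute: find the real request
            # in the Network tab and copy its URL and parameters here.
            next_page = response.xpath('//div[@data-next-url]/@data-next-url').extract_first()
            if next_page:
                yield Request(self.domain + next_page, callback=self.parse)

        def parse_product(self, response):
            yield {'url': response.url}

Because every page of results is just another Request, the asynchronous scheduling that confused the Selenium version is no longer a problem: each callback only reads from its own response.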