Hi, I don't have much experience with web scraping or with Scrapy and Selenium, so apologies in advance if there are bad practices in my code.
Brief background: I am trying to scrape product information from multiple websites using Scrapy, and I also use Selenium because I need to click the "view more" and "No thanks" buttons on the page. Since the website has hrefs for different categories, I also request those "sublinks" to make sure I don't miss any items that are not shown on the root page.
The problem is that, inside the loop

    for l in product_links:

Scrapy and Selenium seem to act strangely. For example, I would expect

    response.url == self.driver.current_url

to always be true. However, the two become different in the middle of this for loop. Furthermore, self.driver seems to capture elements that do not exist at the current URL in

    products = self.driver.find_elements_by_xpath('//div[@data-url]')

and then fails to retrieve them again in

    sub = self.driver.find_elements_by_xpath('//div[(@class="shelf-container") and (.//div/@data-url="' + l + '")]//h2')

Many thanks, I'm really confused.
from webScrape.items import ProductItem
from scrapy import Spider, Request
from selenium import webdriver


class MySpider(Spider):
    name = 'name'
    domain = 'https://uk.burberry.com'

    def __init__(self):
        super().__init__()
        self.driver = webdriver.Chrome('path to driver')
        self.start_urls = [self.domain + '/' + k for k in ('womens-clothing', 'womens-bags', 'womens-scarves',
                                                           'womens-accessories', 'womens-shoes', 'make-up', 'womens-fragrances')]
        # IDs of products already scraped, to avoid duplicate requests
        self.pool = set()

    def parse(self, response):
        # Follow the category sublinks so items missing from the root page are covered
        sub_links = response.xpath('//h2[starts-with(@class, "shelf1-section-title")]/a/@href').extract()
        if len(sub_links) > 0:
            for l in sub_links:
                yield Request(self.domain + l, callback=self.parse)
        self.driver.get(response.url)
        # Dismiss the newsletter pop-up ("No thanks") if it is shown
        email_reg = self.driver.find_element_by_xpath('//button[@class="dc-reset dc-actions-btn js-data-capture-newsletter-block-cancel"]')
        if email_reg.is_displayed():
            email_reg.click()
        # Make sure to click all the "load more" buttons
        load_more_buttons = self.driver.find_elements_by_xpath('//div[@class="load-assets-button js-load-assets-button ga-shelf-load-assets-button"]')
        for button in load_more_buttons:
            if button.is_displayed():
                button.click()
        products = self.driver.find_elements_by_xpath('//div[@data-url]')
        product_links = [item.get_attribute('data-url') for item in products
                         if item.get_attribute('data-url').split('-')[-1][1:] not in self.pool]
        for l in product_links:
            sub = self.driver.find_elements_by_xpath('//div[(@class="shelf-container") and (.//div/@data-url="' + l + '")]//h2')
            if len(sub) > 0:
                sub_category = ', '.join(set([s.get_attribute('data-ga-shelf-title') for s in sub]))
            else:
                sub_category = ''
            yield Request(self.domain + l, callback=self.parse_product, meta={'sub_category': sub_category})

    def parse_product(self, response):
        item = ProductItem()
        item['id'] = response.url.split('-')[-1][1:]
        item['sub_category'] = response.meta['sub_category']
        item['name'] = response.xpath('//h1[@class="product-title transaction-title ta-transaction-title"]/text()').extract()[0].strip()
        self.pool.add(item['id'])
        yield item
        # Related products referenced on the product page
        others = response.xpath('//input[@data-url]/@data-url').extract()
        for l in others:
            if l.split('-')[-1][1:] not in self.pool:
                yield Request(self.domain + l, callback=self.parse_product, meta=response.meta)
Scrapy is an asynchronous framework. The code in your parse*() methods does not always run linearly: wherever there is a yield, the execution of that method may stop for some time while other parts of the code run.

Because there is a yield inside that loop, this explains the unexpected behavior you are seeing. At a yield, some other part of your program resumes execution and may switch the Selenium driver to a different URL, so by the time the loop resumes, the current URL of the driver has changed.

To be honest, you don't really need Selenium in Scrapy for your use case, as far as I can see. In Scrapy, things like Splash or Selenium are only used in very specific scenarios, for things like avoiding bot detection.
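If you do keep Selenium, one way around this is to finish all of the Selenium work before the first yield in that loop, so that a suspension can no longer change what the loop reads from the driver. A minimal sketch of the tail of your parse() method (same XPaths as your code; only the order of operations changes):

    # Read everything we need from the driver BEFORE yielding anything,
    # so the suspension at each yield cannot affect what the loop sees.
    pairs = []
    for l in product_links:
        sub = self.driver.find_elements_by_xpath(
            '//div[(@class="shelf-container") and (.//div/@data-url="' + l + '")]//h2')
        # ', '.join over an empty set gives '', matching your else branch
        sub_category = ', '.join({s.get_attribute('data-ga-shelf-title') for s in sub})
        pairs.append((l, sub_category))
    # Only now hand control back to Scrapy; the driver is no longer needed here.
    for l, sub_category in pairs:
        yield Request(self.domain + l, callback=self.parse_product,
                      meta={'sub_category': sub_category})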
It is usually a better approach to figure out the structure of the page HTML and the parameters used in the underlying requests with your web browser's developer tools (Inspect, Network) and then reproduce those requests in Scrapy.
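For example, if the Network tab shows that the "view more" button simply fetches another page of shelf HTML, the whole spider can stay in plain Scrapy with no browser at all. The snippet below is only a sketch: the data-next-url attribute is made up for illustration, and the real endpoint and parameters have to be read from your browser's developer tools.

    from scrapy import Spider, Request

    class ProductSpider(Spider):
        name = 'products'
        domain = 'https://uk.burberry.com'
        start_urls = [domain + '/womens-clothing']

        def parse(self, response):
            # The product tiles carry their detail-page path in @data-url,
            # exactly as in the Selenium version.
            for link in response.xpath('//div[@data-url]/@data-url').extract():
                yield Request(self.domain + link, callback=self.parse_product)
            # Reproduce the "view more" request instead of clicking the button.
            # @data-next-url is a HYPOTHETICAL attribute: find the real request
            # in the Network tab and copy its URL and parameters here.
            next_page = response.xpath('//div[@data-next-url]/@data-next-url').extract_first()
            if next_page:
                yield Request(self.domain + next_page, callback=self.parse)

        def parse_product(self, response):
            yield {'url': response.url}

Because every page of results is just another Request, the asynchronous scheduling that confused the Selenium version is no longer a problem: each callback only reads from its own response.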