Empty List From Scrapy When Using Xpath to Extract

2019-07-24 16:59发布

Really need the help from this community.

My question is that when I used the code in python

response.xpath("//div[contains(@class,'check-prices-widget-not-sponsored')]/a/div[contains(@class,'check-prices-widget-not-sponsored-link')]").extract()

to extract the vendor name in scrapy shell, the output is empty. I really did not know why that happened, and it seems to me that the problem might be the website info is updating dynamically?

The url for this web scrapping is: https://cruiseline.com/cruise/7-night-bahamas-florida-new-york-roundtrip-32860, and what I need is the Vendor name and Price for each vendor. Besides the attached pic is the screenshot of "the inspect". enter image description here

However, the similar code works to extract price in the following page url ('https://cruiseline.com/destination/caribbean/cruise/best?sort=rank,ship_status&&direction=desc&page=1&per_page=10&sailing_counts=0')

Prices = response.xpath(
        "//div[contains(@class,'featured-cruise-price-inner-price')]/span/descendant::text()").extract()

Really appreciate the help!

1条回答
地球回转人心会变
2楼-- · 2019-07-24 17:25

I tried this url in scrapy shell:https://cruiseline.com/cruise/7-night-bahamas-florida-new-york-roundtrip-32860, and i also got nothing with

response.xpath("//div[contains(@class,'check-prices-widget-not-sponsored')]/a/div[contains(@class,'check-prices-widget-not-sponsored-link')]").extract()

Then I used view(response) command to figure out what the spider sees, and found out that the site is dynamic, which means if you want to scrape info on that website, you need to execute the js codes that show the info.

Here are the screenshots:what your spider sees_1 what your spider sees_2

As you can see, the info you need doesn't show. However, this one https://cruiseline.com/destination/caribbean/cruise/best?sort=rank,ship_status&&direction=desc&page=1&per_page=10&sailing_counts=0 is static, so that's why you can scrape what you need.

I got two ways for you to scrape dynamic website(of course, there are more):

1.Splash(Official Doc): In your Spider, yield your url with SplashRequest instead of scrapy.Request.

2.Selenium + PhantomJS(Official Doc)

Good luck with your scraping! :)

查看更多
登录 后发表回答