Amazon web scraping

I'm trying to scrape Amazon prices with PhantomJS and Python. I want to parse the page with Beautiful Soup to get the new and used prices for books. The problem is that when I pass in the page source from the PhantomJS request, every price is just 0,00. The code below is a simple test.

I'm new to web scraping, and I can't tell whether Amazon has measures in place to prevent price scraping or whether I'm doing something wrong. With other, simpler pages I can get the data I want.

P.S. I'm in a country where the Amazon API is not supported, which is why the scraper is necessary.

import re
import urlparse

from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep

link = 'http://www.amazon.com/gp/offer-listing/1119998956/ref=dp_olp_new?ie=UTF8&condition=new'#'http://www.amazon.com/gp/product/1119998956'

class AmzonScraper(object):
    def __init__(self):
        self.driver = webdriver.PhantomJS()
        self.driver.set_window_size(1120, 550)

    def scrape_prices(self):
        self.driver.get(link)
        s = BeautifulSoup(self.driver.page_source)
        return s

    def scrape(self):
        source = self.scrape_prices()
        print source
        self.driver.quit()

if __name__ == '__main__':
    scraper = AmzonScraper()  # was TaleoJobScraper(), which is not defined in this script
    scraper.scrape()

Tags: python web-scraping beautifulsoup phantomjs amazon

2 Answers

欢心

Reply #2 · 2019-04-02 05:16

First of all, to follow @Nick Bailey's comment, study the Terms of Use and make sure there are no violations on your side.

To solve it, you need to tweak PhantomJS desired capabilities:

caps = webdriver.DesiredCapabilities.PHANTOMJS
caps["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 (KHTML, like Gecko) Chrome/15.0.87"

self.driver = webdriver.PhantomJS(desired_capabilities=caps)
self.driver.maximize_window()
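
One caveat with the snippet above: `webdriver.DesiredCapabilities.PHANTOMJS` is a shared, class-level dict, so assigning into it changes the defaults for every driver created later in the same process. A minimal sketch of setting the user agent on a copy instead follows. The helper name `with_user_agent` is hypothetical, and `default_caps` is only a stand-in for the real Selenium dict so the example runs without Selenium installed.

```python
import copy

USER_AGENT_KEY = "phantomjs.page.settings.userAgent"


def with_user_agent(base_caps, user_agent):
    """Return a copy of base_caps with the PhantomJS user agent set."""
    caps = copy.deepcopy(base_caps)  # leave the shared defaults untouched
    caps[USER_AGENT_KEY] = user_agent
    return caps


# Stand-in for webdriver.DesiredCapabilities.PHANTOMJS (assumption, for illustration only)
default_caps = {"browserName": "phantomjs", "javascriptEnabled": True}

caps = with_user_agent(
    default_caps,
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 (KHTML, like Gecko) Chrome/15.0.87",
)
```

In the scraper, the real dict would take the place of `default_caps`, and the result would be passed as `desired_capabilities=caps`.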

And, to make it more robust, you can write a custom Expected Condition and wait for the price to become non-zero:

from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class wait_for_price(object):
    def __init__(self, locator):
        self.locator = locator

    def __call__(self, driver):
        try:
            # Use the public find_element API rather than the private EC._find_element helper
            element_text = driver.find_element(*self.locator).text.strip()
            return element_text != "0,00"
        except StaleElementReferenceException:
            return False

Usage:

def scrape_prices(self):
    self.driver.get(link)

    WebDriverWait(self.driver, 200).until(wait_for_price((By.CLASS_NAME, "olpOfferPrice")))
    s = BeautifulSoup(self.driver.page_source)

    return s
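
The comparison against the literal string "0,00" only works for that one locale and format. As a rough sketch, the check could parse the number instead, so that "0,00", "$0.00" and similar placeholders are all treated as "not loaded yet". The helper names `parse_price` and `is_real_price` are hypothetical, and the rule "a separator followed by exactly two trailing digits is the decimal mark" is an assumption about how prices are displayed.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Return the first price found in text as a Decimal, or None if there is no number."""
    match = re.search(r"\d[\d.,]*", text)
    if match is None:
        return None
    digits = match.group(0).rstrip(".,")
    # Assumption: a separator followed by exactly two final digits is the decimal mark
    if len(digits) > 3 and digits[-3] in ".,":
        whole = re.sub(r"[.,]", "", digits[:-3])
        return Decimal(whole + "." + digits[-2:])
    # Otherwise every separator is treated as a thousands separator
    return Decimal(re.sub(r"[.,]", "", digits))


def is_real_price(text):
    """True when text contains a price greater than zero."""
    price = parse_price(text)
    return price is not None and price > 0
```

Inside the expected condition, `return is_real_price(element_text)` could then stand in for the string comparison.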


SAY GOODBYE

Reply #3 · 2019-04-02 05:16

Good answer on setting the PhantomJS user agent to that of a normal browser. Since you mentioned that Amazon does not support your country, I imagine you may also need to set a proxy.

Here is an example of how to start PhantomJS in Python with a Firefox user agent and a proxy.

from selenium.webdriver import PhantomJS
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Replace 1.1.1.1:port and username:pass with your proxy's details
service_args = ['--proxy=1.1.1.1:port', '--proxy-auth=username:pass']
# Copy the defaults so the shared capabilities dict is not modified
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:36.0) Gecko/20100101 Firefox/36.0"
driver = PhantomJS(desired_capabilities=dcap, service_args=service_args)

Here, 1.1.1.1 is your proxy IP and port is the proxy port. The username and password are only necessary if your proxy requires authentication.
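
Since the authentication flag is optional, the argument list can be built conditionally. The following is a minimal sketch: the helper name `phantomjs_proxy_args` and the port 8080 are made up for illustration, while `--proxy`, `--proxy-type` and `--proxy-auth` are PhantomJS command-line options.

```python
def phantomjs_proxy_args(host, port, username=None, password=None, proxy_type="http"):
    """Build a service_args list for running PhantomJS through a proxy."""
    args = [
        "--proxy={}:{}".format(host, port),
        "--proxy-type={}".format(proxy_type),
    ]
    if username is not None:
        # Credentials are only added when the proxy requires authentication
        args.append("--proxy-auth={}:{}".format(username, password or ""))
    return args


# 8080 is a placeholder port (assumption); use your proxy's real port
service_args = phantomjs_proxy_args("1.1.1.1", 8080, "username", "pass")
```

The resulting list can be passed to the driver as `service_args=service_args`.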
