I'm trying to scrape Amazon prices with PhantomJS and Python. I want to parse the page with Beautiful Soup to get the new and used prices for books. The problem is that when I pass along the page source from the request I make with PhantomJS, the prices are all 0,00. The code is the simple test below.
I'm new to web scraping, and I can't tell whether Amazon has measures in place to prevent price scraping or whether I'm doing something wrong, because with other, simpler pages I can get the data I want.
P.S. I'm in a country where the Amazon API is not supported, which is why the scraper is necessary.
import re
import urlparse

from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep

link = 'http://www.amazon.com/gp/offer-listing/1119998956/ref=dp_olp_new?ie=UTF8&condition=new'  # 'http://www.amazon.com/gp/product/1119998956'


class AmzonScraper(object):
    def __init__(self):
        # Headless browser used to render the page
        self.driver = webdriver.PhantomJS()
        self.driver.set_window_size(1120, 550)

    def scrape_prices(self):
        # Load the offer listing and parse the rendered source
        self.driver.get(link)
        s = BeautifulSoup(self.driver.page_source)
        return s

    def scrape(self):
        source = self.scrape_prices()
        print source
        self.driver.quit()


if __name__ == '__main__':
    scraper = AmzonScraper()
    scraper.scrape()
First of all, to follow @Nick Bailey's comment, study the Terms of Use and make sure there are no violations on your side.
To solve it, you need to tweak the PhantomJS desired capabilities so that it sends a normal browser's user agent.
And, to make it bullet-proof, you can make a custom Expected Condition and wait for the price to become non-zero:
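A minimal sketch of both steps, assuming Selenium 3.x or earlier (newer Selenium releases removed PhantomJS support). The user agent string, the `make_driver` helper and the `price_is_nonzero` class are illustrative names and values, not part of any library; `0,00` is the placeholder price reported in the question.

```python
# Illustrative desktop-browser user agent (an assumption; any current one should do).
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36"
)


def make_driver(user_agent=USER_AGENT):
    """Start PhantomJS so that it reports a normal browser's user agent."""
    # Imported here so the helpers in this sketch can be defined without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = user_agent
    return webdriver.PhantomJS(desired_capabilities=dcap)


class price_is_nonzero(object):
    """Custom expected condition: truthy once the located element shows a real price."""

    def __init__(self, locator, placeholder="0,00"):
        self.locator = locator          # e.g. ("class name", "some-price-class")
        self.placeholder = placeholder  # text shown before the real price is filled in

    def __call__(self, driver):
        # WebDriverWait ignores NoSuchElementException by default, so a missing
        # element simply means "not ready yet" when this is used with until().
        text = driver.find_element(*self.locator).text.strip()
        return bool(text) and text != self.placeholder
```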
Usage:
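Roughly, the wait goes between loading the page and reading its source. In this sketch `scrape_prices` is a standalone version of the method from the question, `condition` is any callable that takes the driver and returns a truthy value once the price is ready, and the 30-second timeout is an arbitrary choice. The class name in the commented example call is an assumption about Amazon's markup.

```python
def scrape_prices(driver, link, condition, timeout=30):
    """Load `link`, block until `condition(driver)` is truthy, then parse the page."""
    # Imported here so defining the helper does not itself require Selenium or bs4.
    from bs4 import BeautifulSoup
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(link)
    # until() raises selenium.common.exceptions.TimeoutException if the
    # condition never becomes truthy within `timeout` seconds.
    WebDriverWait(driver, timeout).until(condition)
    return BeautifulSoup(driver.page_source, "html.parser")


# Example call (hypothetical locator):
# soup = scrape_prices(
#     driver, link,
#     lambda d: d.find_element("class name", "olpOfferPrice").text.strip() != "0,00",
# )
```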
Good answer on setting the PhantomJS user agent to that of a normal browser. Since you said that your country is being blocked by Amazon, I would imagine that you also need to set a proxy.
Here is an example of how to start PhantomJS in Python with a Firefox user agent and a proxy.
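A sketch of that setup, assuming Selenium 3.x or earlier (where `webdriver.PhantomJS` still exists). `--proxy` and `--proxy-auth` are PhantomJS command-line options; the helper names and the exact user agent string are illustrative, and `1.1.1.1:port` and `username:password` are placeholders to replace with real values.

```python
# Illustrative Firefox user agent string (an assumption, any recent one should work).
FIREFOX_UA = (
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:36.0) "
    "Gecko/20100101 Firefox/36.0"
)


def proxy_service_args(proxy="1.1.1.1:port", auth="username:password"):
    """Build the PhantomJS command-line flags for the proxy."""
    args = ["--proxy=" + proxy]
    if auth:  # pass auth=None when the proxy needs no authentication
        args.append("--proxy-auth=" + auth)
    return args


def start_phantomjs(user_agent=FIREFOX_UA, proxy="1.1.1.1:port", auth="username:password"):
    """Start PhantomJS with a custom user agent, routed through the proxy."""
    # Imported here so the flag-building helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = user_agent
    return webdriver.PhantomJS(
        desired_capabilities=dcap,
        service_args=proxy_service_args(proxy, auth),
    )
```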
In that example, 1.1.1.1 is your proxy IP and port is the proxy port. The username and password are only necessary if your proxy requires authentication.