Why can't I scrape Amazon with BeautifulSoup?

Posted 2020-04-01 08:10

Here is my Python code:

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen("http://www.amazon.com/")
soup = BeautifulSoup(page)
print soup

It works for google.com and many other websites, but it doesn't work for amazon.com.

I can open amazon.com in my browser, but the resulting "soup" is still None.

I also find that it cannot scrape appannie.com; however, rather than returning None, the code raises an error:

HTTPError: HTTP Error 503: Service Temporarily Unavailable 

So I suspect that Amazon and App Annie block scraping.

Please try it yourself before voting the question down :(

Thanks

3 Answers
孤傲高冷的网名
#2 · 2020-04-01 08:53

You can try this:

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen("http://www.amazon.com/")
soup = BeautifulSoup(page)
print soup

In Python, arbitrary text is called a string, and it must be enclosed in quotes (" ").

Rolldiameter
#3 · 2020-04-01 08:53

Add a header, and actually pass it with the request (the original snippet defined headers but never used them):

import urllib2
from bs4 import BeautifulSoup

# A browser-like User-Agent; Amazon rejects the default urllib2 one
headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'}

request = urllib2.Request("http://www.amazon.com/", headers=headers)
page = urllib2.urlopen(request)
soup = BeautifulSoup(page)
print soup
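In Python 3, urllib2 has become urllib.request, and the same header trick applies. A minimal sketch (the User-Agent string is just an example browser value, not anything special):

```python
from urllib.request import Request, urlopen

url = "http://www.amazon.com/"
# Example browser-like User-Agent; any realistic value works
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Attach the header to the request object before opening it
req = Request(url, headers=headers)
print(req.get_header("User-agent"))  # urllib stores header names capitalized

# page = urlopen(req)  # then pass page.read() to BeautifulSoup as above
```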
The star
#4 · 2020-04-01 08:59

Add a header, then it will work.

from bs4 import BeautifulSoup
import requests

url = "http://www.amazon.com/"

# add a browser-like header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "lxml")
print(soup)
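As for whether a site blocks scraping at all: sites publish their crawling rules in robots.txt, and the standard library can evaluate them. A sketch using a made-up rules snippet (for illustration only — not Amazon's actual robots.txt):

```python
from urllib import robotparser

# Hypothetical robots.txt content, for illustration only
rules = """User-agent: *
Disallow: /gp/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "http://www.amazon.com/gp/video"))  # False: /gp/ is disallowed
print(rp.can_fetch("*", "http://www.amazon.com/"))          # True: no rule matches "/"
```

Against a live site you would instead call rp.set_url("http://www.amazon.com/robots.txt") followed by rp.read() before querying can_fetch.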