Why can't I scrape Amazon with BeautifulSoup?

Posted 2020-04-01 08:10

Here is my Python code:

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen("http://www.amazon.com/")
soup = BeautifulSoup(page)
print soup

It works for google.com and many other websites, but it doesn't work for amazon.com.

I can open amazon.com in my browser, but the resulting "soup" is still None.

I also find that it cannot scrape appannie.com; however, rather than returning None, the code raises an error:

HTTPError: HTTP Error 503: Service Temporarily Unavailable 

So I suspect that Amazon and App Annie block scraping.

Please try it yourself before voting the question down :(

Thanks

3 Answers
孤傲高冷的网名
#2 · 2020-04-01 08:53

You can try this:

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen("http://www.amazon.com/")
soup = BeautifulSoup(page)
print soup

In Python, arbitrary text is called a string, and it must be enclosed in quotes (" ").

Rolldiameter
#3 · 2020-04-01 08:53

Add a header, and actually pass it with the request (the original snippet defined headers but never used them):

import urllib2
from bs4 import BeautifulSoup

# A browser-like User-Agent; Amazon rejects the default urllib2 one
headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'}

request = urllib2.Request("http://www.amazon.com/", headers=headers)
page = urllib2.urlopen(request)
soup = BeautifulSoup(page)
print soup
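In Python 3, urllib2 has become urllib.request, and the same header trick applies. A minimal sketch (the User-Agent string is just an example browser value, not anything special):

```python
from urllib.request import Request, urlopen

url = "http://www.amazon.com/"
# Example browser-like User-Agent; any realistic value works
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Attach the header to the request object before opening it
req = Request(url, headers=headers)
print(req.get_header("User-agent"))  # urllib stores header names capitalized

# page = urlopen(req)  # then pass page.read() to BeautifulSoup as above
```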
The star
#4 · 2020-04-01 08:59

Add a header, then it will work.

from bs4 import BeautifulSoup
import requests

url = "http://www.amazon.com/"

# add a browser-like header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "lxml")
print(soup)
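As for whether a site blocks scraping at all: sites publish their crawling rules in robots.txt, and the standard library can evaluate them. A sketch using a made-up rules snippet (for illustration only — not Amazon's actual robots.txt):

```python
from urllib import robotparser

# Hypothetical robots.txt content, for illustration only
rules = """User-agent: *
Disallow: /gp/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "http://www.amazon.com/gp/video"))  # False: /gp/ is disallowed
print(rp.can_fetch("*", "http://www.amazon.com/"))          # True: no rule matches "/"
```

Against a live site you would instead call rp.set_url("http://www.amazon.com/robots.txt") followed by rp.read() before querying can_fetch.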