Using Tor + Privoxy to scrape google shopping resu

Posted 2019-04-10 00:48

I have installed Tor + Privoxy on my server and both are working fine (tested). But when I use urllib2 (Python) to scrape Google Shopping results through the proxy, Google always blocks me (sometimes with a 503 error, sometimes with a 403). Does anyone have a solution that would help me avoid this? It would be much appreciated!

The source code that I am using:

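  import urllib2

  _HEADERS = {
      'User-Agent': 'Mozilla/5.0',
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Encoding': 'deflate',
      'Connection': 'close',
      'DNT': '1'
  }

  # Put the search terms in the query string; a "#..." fragment is never sent to the server.
  request = urllib2.Request("https://www.google.com/search?q=iphone+5&tbm=shop", headers=_HEADERS)

  # Map both schemes to Privoxy. With only an "http" entry, urllib2 fetches
  # https URLs directly and the proxy is not used for them.
  proxy_support = urllib2.ProxyHandler({"http": "127.0.0.1:8118", "https": "127.0.0.1:8118"})
  opener = urllib2.build_opener(proxy_support)
  urllib2.install_opener(opener)

  try:
      response = urllib2.urlopen(request)
      html = response.read()
      print html
  except urllib2.HTTPError as e:
      # This is where the 403 / 503 responses show up
      print e.code
      print e.reason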

Note: when I don't use the proxy, it works fine!

Tags: python scrape tor

2 answers

Lonely孤独者°

#2 · 2019-04-10 01:36

Have you installed stem, the controller library for Tor? In just a few lines of code you can request a new identity from Tor. See:

https://stem.torproject.org/faq.html#how-do-i-request-a-new-identity-from-tor

Simply use exceptions to catch your 403 and 503 errors and handle them by requesting a new identity, as shown in the link above. Best of luck.
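The approach described above could be sketched roughly as follows. This is only an illustration under several assumptions: it uses Python 3's urllib.request (the successor to urllib2), it assumes Tor's ControlPort is enabled on 9051 and that stem is installed, and the helper names (fetch_via_privoxy, renew_tor_identity, fetch_with_new_identity) and the 10-second pause are made up for the example.

```python
import time
import urllib.error
import urllib.request

RETRY_CODES = (403, 503)  # the status codes reported in the question


def fetch_via_privoxy(url, proxy="127.0.0.1:8118"):
    """Fetch a URL through Privoxy (address taken from the question)."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with opener.open(request, timeout=30) as response:
        return response.read()


def renew_tor_identity(control_port=9051, password=None):
    """Ask Tor for a new identity via stem.

    Assumes ControlPort is enabled in torrc; port and auth are placeholders.
    """
    # Imported here so the rest of the sketch loads without stem installed.
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=control_port) as controller:
        controller.authenticate(password=password)
        controller.signal(Signal.NEWNYM)


def fetch_with_new_identity(fetch, renew_identity, max_attempts=5, wait_seconds=10):
    """Call fetch(); on a 403/503, request a new identity and try again."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except urllib.error.HTTPError as e:
            if e.code not in RETRY_CODES:
                raise  # unrelated HTTP error: do not hide it
            if attempt == max_attempts - 1:
                break  # out of attempts, no point in renewing again
            renew_identity()
            # Tor rate-limits NEWNYM, so pause before the next attempt.
            time.sleep(wait_seconds)
    raise RuntimeError("still blocked after %d attempts" % max_attempts)
```

It might be used as fetch_with_new_identity(lambda: fetch_via_privoxy(url), renew_tor_identity). Passing the fetch and renew steps in as callables keeps the retry loop independent of Tor, so it can be tried out with stand-in functions first.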

兄弟一词,经得起流年.

#3 · 2019-04-10 01:52

Google blocks many Tor exit nodes because it receives so many requests from them. So this error is a matter of probability: keep changing your Tor exit node until you find one that Google has not blocked.

https://www.torproject.org/docs/faq.html.en#GoogleCAPTCHA
