Following the Scrapy documentation, I want to crawl and scrape data from several sites. My code works correctly with ordinary websites, but when I try to crawl a website protected by Sucuri I don't get any data; it seems the Sucuri firewall prevents me from accessing the site's markup.
The target website is http://www.dwarozh.net/ and this is my spider snippet:
from scrapy import Spider
from scrapy.selector import Selector
import scrapy
from Stack.items import StackItem
from bs4 import BeautifulSoup
from scrapy import log
from scrapy.utils.response import open_in_browser
class StackSpider(Spider):
    name = "stack"
    start_urls = [
        "http://www.dwarozh.net/sport/",
    ]

    def parse(self, response):
        mItems = Selector(response).xpath('//div[@class="news-more-img"]/ul/li')
        for mItem in mItems:
            item = StackItem()
            item['title'] = mItem.xpath('a/h2/text()').extract_first()
            item['url'] = mItem.xpath('viewa/@href').extract_first()
            yield item
And this is the result I get in the response:
<html><title>You are being redirected...</title>
<noscript>Javascript is required. Please enable javascript before you are allowed to see this page.</noscript>
<script>var s={},u,c,U,r,i,l=0,a,e=eval,w=String.fromCharCode,sucuri_cloudproxy_js='',S='cz0iMHNlYyIuc3Vic3RyKDAsMSkgKyAnNXlCMicuc3Vic3RyKDMsIDEpICsgJycgKycnKyIxIi5zbGljZSgwLDEpICsgJ2pQYycuY2hhckF0KDIpKyJmIiArICIiICsnbz1jJy5jaGFyQXQoMikrICcnICsgCiI0Ii5zbGljZSgwLDEpICsgJ0FvPzcnLnN1YnN0cigzLCAxKSArIjUiICsgU3RyaW5nLmZyb21DaGFyQ29kZSgxMDIpICsgIiIgKycxJyArICAgJycgKyAKIjFzZWMiLnN1YnN0cigwLDEpICsgICcnICsnJysnMycgKyAgImUiLnNsaWNlKDAsMSkgKyAiIiArImZzdSIuc2xpY2UoMCwxKSArICIiICsiMnN1Y3VyIi5jaGFyQXQoMCkrICcnICtTdHJpbmcuZnJvbUNoYXJDb2RlKDEwMCkgKyAgJycgKyI5c3UiLnNsaWNlKDAsMSkgKyAgJycgKycnKyI2IiArICdDYycuc2xpY2UoMSwyKSsiNnN1Ii5zbGljZSgwLDEpICsgJ2YnICsgICAnJyArIAonYScgKyAgIjAiICsgJ2YnICsgICI0IiArICI2c2VjIi5zdWJzdHIoMCwxKSArICAnJyArIAonWnBFMScuc3Vic3RyKDMsIDEpICsiMSIgKyBTdHJpbmcuZnJvbUNoYXJDb2RlKDB4MzgpICsgIiIgKyI1c3VjdXIiLmNoYXJBdCgwKSsiZnN1Ii5zbGljZSgwLDEpICsgJyc7ZG9jdW1lbnQuY29va2llPSdzc3VjJy5jaGFyQXQoMCkrICd1JysnJysnYycuY2hhckF0KDApKyd1c3VjdXInLmNoYXJBdCgwKSsgJ3JzdWMnLmNoYXJBdCgwKSsgJ3N1Y3VyaScuY2hhckF0KDUpICsgJ19zdScuY2hhckF0KDApICsnY3N1Y3VyJy5jaGFyQXQoMCkrICdsJysnbycrJ3UnLmNoYXJBdCgwKSsnZCcrJ3AnKycnKydyc3VjdScuY2hhckF0KDApICArJ3NvJy5jaGFyQXQoMSkrJ3gnKyd5JysnX3N1Y3VyaScuY2hhckF0KDApICsgJ3UnKyd1JysnaXN1Y3VyaScuY2hhckF0KDApICsgJ3N1Y3VkJy5jaGFyQXQoNCkrICdzXycuY2hhckF0KDEpKycxJysnOCcrJzEnKydzdWN1cmQnLmNoYXJBdCg1KSArICdlJy5jaGFyQXQoMCkrJzEnKydzdWN1cjEnLmNoYXJBdCg1KSArICcxc3VjdXJpJy5jaGFyQXQoMCkgKyAnMicrIj0iICsgcyArICc7cGF0aD0vO21heC1hZ2U9ODY0MDAnOyBsb2NhdGlvbi5yZWxvYWQoKTs=';L=S.length;U=0;r='';var A='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';for(u=0;u<64;u++){s[A.charAt(u)]=u;}for(i=0;i<L;i++){c=s[S.charAt(i)];U=(U<<6)+c;l+=6;while(l>=8){((a=(U>>>(l-=8))&0xff)||(i<(L-2)))&&(r+=w(a));}}e(r);</script></html>
How can I bypass Sucuri with Scrapy?
The site uses cookie- and User-Agent-based protection. You can check it like this: open DevTools in Chrome, navigate to the target page http://www.dwarozh.net/sport/, then in the Network tab right-click the request to the page and choose "Copy as cURL". Open a console and run that cURL command: you will see the normal HTML. If you remove the cookie or the User-Agent from the request, you get the protection page instead.
Let's check it in Scrapy:
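Here is a hedged sketch of that check in a scrapy shell session; the cookie name suffix, its value, and the User-Agent string are placeholders you have to copy from your own browser:

# Inside `scrapy shell` -- fetch() and response are provided by the shell.
from scrapy import Request

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0 Safari/537.36",
}
cookies = {"sucuri_cloudproxy_uuid_xxxxxxxxx": "<value copied from the browser>"}

fetch(Request("http://www.dwarozh.net/sport/", headers=headers, cookies=cookies))
response.xpath("//title/text()").extract_first()   # should no longer say "You are being redirected..."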
Excellent! Let's make a spider:
I've modified yours a bit, because I don't have the source code of some of your components.
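A hedged version of that modified spider: the cookie name/value and the User-Agent are placeholders, items are plain dicts because StackItem isn't available here, and the XPaths are copied from your spider.

from scrapy import Spider, Request


class StackSpider(Spider):
    name = "stack"
    start_urls = ["http://www.dwarozh.net/sport/"]

    # Placeholders: copy a working Sucuri cookie and the matching
    # User-Agent from your own browser session.
    cookies = {"sucuri_cloudproxy_uuid_xxxxxxxxx": "<value copied from the browser>"}
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0 Safari/537.36",
    }

    def start_requests(self):
        # Attach the cookie and User-Agent to every initial request so the
        # firewall serves the real page instead of the JS redirect.
        for url in self.start_urls:
            yield Request(url, headers=self.headers, cookies=self.cookies)

    def parse(self, response):
        for m_item in response.xpath('//div[@class="news-more-img"]/ul/li'):
            yield {
                'title': m_item.xpath('a/h2/text()').extract_first(),
                'url': m_item.xpath('viewa/@href').extract_first(),
            }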
Let's run it with scrapy crawl stack.
You will probably have to refresh the cookies from time to time. You can use PhantomJS for this.
UPDATE:
How to get the cookies using PhantomJS:

1. Install PhantomJS.
2. Make a script like dwarosh.js that loads the page and dumps the cookies it ends up with (one way to do this from Python is sketched after these steps).
3. Run the script: phantomjs dwarosh.js
4. Take the cookie sucuri_cloudproxy_uuid_3e07984e4 from the output and try to get the page with curl and the same User-Agent.
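If you would rather stay in Python than write the PhantomJS script by hand, here is a minimal sketch using Selenium's old PhantomJS driver (deprecated, so it needs an old Selenium such as 3.x); the cookie-name prefix is an assumption based on the cookie above.

# Minimal sketch: render the page in PhantomJS via Selenium and pull out the
# Sucuri cookie. Requires phantomjs on PATH and a Selenium version that still
# ships webdriver.PhantomJS.
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("http://www.dwarozh.net/sport/")   # Sucuri's JS runs here and sets the cookie

# Assumption: the cookie name starts with sucuri_cloudproxy_uuid
# (the suffix changes between sessions, e.g. sucuri_cloudproxy_uuid_3e07984e4).
sucuri = {c["name"]: c["value"]
          for c in driver.get_cookies()
          if c["name"].startswith("sucuri_cloudproxy_uuid")}
print(sucuri)
driver.quit()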
The general solution for parsing dynamic content is to first get the rendered DOM/HTML with something that can run JavaScript (for example http://phantomjs.org/), then save the HTML and feed it to a parser.
This will also help bypass some JS-based protections.
phantomjs is a single executable file that will load a URI like a real browser, with all JS evaluated. You can run it from Python by subprocess.call([phantomJsPath, jsProgramPath, url, htmlFileToSave]).
For an example jsProgram you can check https://github.com/ariya/phantomjs/blob/master/examples/rasterize.js
To save html from the js program, use
fs.write(htmlFileToSave, page.content, "w");
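Putting that together on the Python side, a rough sketch; the paths and output filename are assumptions, and render.js stands for your own rasterize.js-style script that writes page.content to the file given as its second argument.

from subprocess import call

# Assumptions: phantomjs is on PATH and render.js loads the URL passed as its
# first argument, then writes page.content to the second argument.
phantomjs_path = "phantomjs"
js_program_path = "render.js"
url = "http://www.dwarozh.net/sport/"
html_file_to_save = "dwarozh.html"

call([phantomjs_path, js_program_path, url, html_file_to_save])

# Read the fully rendered HTML and feed it to whatever parser you use.
with open(html_file_to_save, encoding="utf-8") as f:
    html = f.read()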
I've tested this method on dwarozh.net and it worked, though you will have to figure out how to plug it into your scrapy pipeline.

Specifically for your example, you can try to "manually" parse the provided JavaScript to extract the cookie it sets, which is required to load the actual page. Keep in mind, though, that Sucuri's algorithm may change at any moment, and any solution based on cookies or JS decoding will then break.
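If you go the "manual" route, a hedged first step in Python, assuming the payload is still embedded as S='<base64>' exactly as in the response shown above:

import base64
import re

def decode_sucuri_payload(protection_page_html):
    """Pull the base64 blob out of the Sucuri page and decode it.

    The decoded text is the JavaScript that concatenates the cookie value and
    sets document.cookie; evaluating those string concatenations (by hand or
    with a JS engine) gives you the cookie to send with your requests.
    """
    match = re.search(r"S='([A-Za-z0-9+/=]+)'", protection_page_html)
    if not match:
        return None
    payload = match.group(1)
    payload += "=" * (-len(payload) % 4)   # repair base64 padding if needed
    return base64.b64decode(payload).decode("utf-8", "replace")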