Following the Scrapy documentation, I want to crawl and scrape data from several sites. My code works correctly with ordinary websites, but when I try to crawl a website protected by Sucuri I don't get any data; it seems the Sucuri firewall prevents me from accessing the site's markup.
The target website is http://www.dwarozh.net/ and this is my spider snippet:
from scrapy import Spider
from scrapy.selector import Selector
import scrapy
from Stack.items import StackItem
from bs4 import BeautifulSoup
from scrapy import log
from scrapy.utils.response import open_in_browser
class StackSpider(Spider):
    name = "stack"
    start_urls = [
        "http://www.dwarozh.net/sport/",
    ]

    def parse(self, response):
        mItems = Selector(response).xpath('//div[@class="news-more-img"]/ul/li')
        for mItem in mItems:
            item = StackItem()
            item['title'] = mItem.xpath('a/h2/text()').extract_first()
            item['url'] = mItem.xpath('viewa/@href').extract_first()
            yield item
And this is the result I get in the response:
<html><title>You are being redirected...</title>
<noscript>Javascript is required. Please enable javascript before you are allowed to see this page.</noscript>
<script>var s={},u,c,U,r,i,l=0,a,e=eval,w=String.fromCharCode,sucuri_cloudproxy_js='',S='cz0iMHNlYyIuc3Vic3RyKDAsMSkgKyAnNXlCMicuc3Vic3RyKDMsIDEpICsgJycgKycnKyIxIi5zbGljZSgwLDEpICsgJ2pQYycuY2hhckF0KDIpKyJmIiArICIiICsnbz1jJy5jaGFyQXQoMikrICcnICsgCiI0Ii5zbGljZSgwLDEpICsgJ0FvPzcnLnN1YnN0cigzLCAxKSArIjUiICsgU3RyaW5nLmZyb21DaGFyQ29kZSgxMDIpICsgIiIgKycxJyArICAgJycgKyAKIjFzZWMiLnN1YnN0cigwLDEpICsgICcnICsnJysnMycgKyAgImUiLnNsaWNlKDAsMSkgKyAiIiArImZzdSIuc2xpY2UoMCwxKSArICIiICsiMnN1Y3VyIi5jaGFyQXQoMCkrICcnICtTdHJpbmcuZnJvbUNoYXJDb2RlKDEwMCkgKyAgJycgKyI5c3UiLnNsaWNlKDAsMSkgKyAgJycgKycnKyI2IiArICdDYycuc2xpY2UoMSwyKSsiNnN1Ii5zbGljZSgwLDEpICsgJ2YnICsgICAnJyArIAonYScgKyAgIjAiICsgJ2YnICsgICI0IiArICI2c2VjIi5zdWJzdHIoMCwxKSArICAnJyArIAonWnBFMScuc3Vic3RyKDMsIDEpICsiMSIgKyBTdHJpbmcuZnJvbUNoYXJDb2RlKDB4MzgpICsgIiIgKyI1c3VjdXIiLmNoYXJBdCgwKSsiZnN1Ii5zbGljZSgwLDEpICsgJyc7ZG9jdW1lbnQuY29va2llPSdzc3VjJy5jaGFyQXQoMCkrICd1JysnJysnYycuY2hhckF0KDApKyd1c3VjdXInLmNoYXJBdCgwKSsgJ3JzdWMnLmNoYXJBdCgwKSsgJ3N1Y3VyaScuY2hhckF0KDUpICsgJ19zdScuY2hhckF0KDApICsnY3N1Y3VyJy5jaGFyQXQoMCkrICdsJysnbycrJ3UnLmNoYXJBdCgwKSsnZCcrJ3AnKycnKydyc3VjdScuY2hhckF0KDApICArJ3NvJy5jaGFyQXQoMSkrJ3gnKyd5JysnX3N1Y3VyaScuY2hhckF0KDApICsgJ3UnKyd1JysnaXN1Y3VyaScuY2hhckF0KDApICsgJ3N1Y3VkJy5jaGFyQXQoNCkrICdzXycuY2hhckF0KDEpKycxJysnOCcrJzEnKydzdWN1cmQnLmNoYXJBdCg1KSArICdlJy5jaGFyQXQoMCkrJzEnKydzdWN1cjEnLmNoYXJBdCg1KSArICcxc3VjdXJpJy5jaGFyQXQoMCkgKyAnMicrIj0iICsgcyArICc7cGF0aD0vO21heC1hZ2U9ODY0MDAnOyBsb2NhdGlvbi5yZWxvYWQoKTs=';L=S.length;U=0;r='';var A='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';for(u=0;u<64;u++){s[A.charAt(u)]=u;}for(i=0;i<L;i++){c=s[S.charAt(i)];U=(U<<6)+c;l+=6;while(l>=8){((a=(U>>>(l-=8))&0xff)||(i<(L-2)))&&(r+=w(a));}}e(r);</script></html>
How can I bypass Sucuri with Scrapy?
The site uses cookie- and User-Agent-based protection. You can check it like this: open DevTools in Chrome, navigate to the target page http://www.dwarozh.net/sport/, then in the Network tab right-click the request to the page and choose "Copy as cURL". Open a console and run that cURL command: you will see the normal HTML. If you remove the cookie or the User-Agent from the request, you get the protection page instead.
Let's check it in Scrapy:
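Here is a hedged sketch of that check in a scrapy shell session; the cookie name suffix, its value, and the User-Agent string are placeholders you have to copy from your own browser:

# Inside `scrapy shell` -- fetch() and response are provided by the shell.
from scrapy import Request

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0 Safari/537.36",
}
cookies = {"sucuri_cloudproxy_uuid_xxxxxxxxx": "<value copied from the browser>"}

fetch(Request("http://www.dwarozh.net/sport/", headers=headers, cookies=cookies))
response.xpath("//title/text()").extract_first()   # should no longer say "You are being redirected..."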
Excellent! Let's make a spider:
I've modified yours a bit, because I don't have the source code of some of your components.
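A hedged version of that modified spider: the cookie name/value and the User-Agent are placeholders, items are plain dicts because StackItem isn't available here, and the XPaths are copied from your spider.

from scrapy import Spider, Request


class StackSpider(Spider):
    name = "stack"
    start_urls = ["http://www.dwarozh.net/sport/"]

    # Placeholders: copy a working Sucuri cookie and the matching
    # User-Agent from your own browser session.
    cookies = {"sucuri_cloudproxy_uuid_xxxxxxxxx": "<value copied from the browser>"}
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0 Safari/537.36",
    }

    def start_requests(self):
        # Attach the cookie and User-Agent to every initial request so the
        # firewall serves the real page instead of the JS redirect.
        for url in self.start_urls:
            yield Request(url, headers=self.headers, cookies=self.cookies)

    def parse(self, response):
        for m_item in response.xpath('//div[@class="news-more-img"]/ul/li'):
            yield {
                'title': m_item.xpath('a/h2/text()').extract_first(),
                'url': m_item.xpath('viewa/@href').extract_first(),
            }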
Let's run it with scrapy crawl stack.
You will probably have to refresh the cookies from time to time. You can use PhantomJS for this.
UPDATE:
How to get the cookies using PhantomJS:

1. Install PhantomJS.
2. Make a script like dwarosh.js that loads the page and dumps the cookies it ends up with (one way to do this from Python is sketched after these steps).
3. Run the script: phantomjs dwarosh.js
4. Take the cookie sucuri_cloudproxy_uuid_3e07984e4 from the output and try to get the page with curl and the same User-Agent.
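If you would rather stay in Python than write the PhantomJS script by hand, here is a minimal sketch using Selenium's old PhantomJS driver (deprecated, so it needs an old Selenium such as 3.x); the cookie-name prefix is an assumption based on the cookie above.

# Minimal sketch: render the page in PhantomJS via Selenium and pull out the
# Sucuri cookie. Requires phantomjs on PATH and a Selenium version that still
# ships webdriver.PhantomJS.
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("http://www.dwarozh.net/sport/")   # Sucuri's JS runs here and sets the cookie

# Assumption: the cookie name starts with sucuri_cloudproxy_uuid
# (the suffix changes between sessions, e.g. sucuri_cloudproxy_uuid_3e07984e4).
sucuri = {c["name"]: c["value"]
          for c in driver.get_cookies()
          if c["name"].startswith("sucuri_cloudproxy_uuid")}
print(sucuri)
driver.quit()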
The general solution for parsing dynamic content is to first get the rendered DOM/HTML with something that can run JavaScript (for example http://phantomjs.org/), then save the HTML and feed it to a parser.
This will also help bypass some JS-based protections.
phantomjs is a single executable file that will load a URI like a real browser, with all JS evaluated. You can run it from Python by subprocess.call([phantomJsPath, jsProgramPath, url, htmlFileToSave]).
For an example jsProgram you can check https://github.com/ariya/phantomjs/blob/master/examples/rasterize.js
To save html from the js program, use
fs.write(htmlFileToSave, page.content, "w");
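Putting that together on the Python side, a rough sketch; the paths and output filename are assumptions, and render.js stands for your own rasterize.js-style script that writes page.content to the file given as its second argument.

from subprocess import call

# Assumptions: phantomjs is on PATH and render.js loads the URL passed as its
# first argument, then writes page.content to the second argument.
phantomjs_path = "phantomjs"
js_program_path = "render.js"
url = "http://www.dwarozh.net/sport/"
html_file_to_save = "dwarozh.html"

call([phantomjs_path, js_program_path, url, html_file_to_save])

# Read the fully rendered HTML and feed it to whatever parser you use.
with open(html_file_to_save, encoding="utf-8") as f:
    html = f.read()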
I've tested this method on dwarozh.net and it worked, though you will have to figure out how to plug it into your scrapy pipeline.

Specifically for your example, you can try to "manually" parse the provided JavaScript to extract the cookie it sets, which is required to load the actual page. Keep in mind, though, that Sucuri's algorithm may change at any moment, and any solution based on cookies or JS decoding will then break.
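If you go the "manual" route, a hedged first step in Python, assuming the payload is still embedded as S='<base64>' exactly as in the response shown above:

import base64
import re

def decode_sucuri_payload(protection_page_html):
    """Pull the base64 blob out of the Sucuri page and decode it.

    The decoded text is the JavaScript that concatenates the cookie value and
    sets document.cookie; evaluating those string concatenations (by hand or
    with a JS engine) gives you the cookie to send with your requests.
    """
    match = re.search(r"S='([A-Za-z0-9+/=]+)'", protection_page_html)
    if not match:
        return None
    payload = match.group(1)
    payload += "=" * (-len(payload) % 4)   # repair base64 padding if needed
    return base64.b64decode(payload).decode("utf-8", "replace")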