I am scraping a website using Scrapy which requires cookies and JavaScript to be enabled. I don't think I will have to actually process JavaScript; all I need is to pretend that JavaScript is enabled.
Here is what I have tried: 1) Enabling cookies through the following in settings:
COOKIES_ENABLED = True
COOKIES_DEBUG = True
2) Using downloader middleware for cookies:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
}
3) Sending the 'X-JAVASCRIPT-ENABLED': 'True' header:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'X-JAVASCRIPT-ENABLED': 'True',
}
But none of these is working for me. Can you please suggest an idea or give me some direction?
Thanks in advance for your replies.
You should try the Splash JS rendering engine with scrapyjs (now known as scrapy-splash). Here is an example of how to set it up in your spider project:
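For instance, something along these lines in settings.py (a sketch following the scrapy-splash README; the Splash URL assumes a local instance on port 8050):

# settings.py -- Splash is a separate service; start it first, e.g.:
#   docker run -p 8050:8050 scrapinghub/splash
SPLASH_URL = 'http://localhost:8050'  # assumes a local Splash instance

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'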
Scrapinghub, the company behind Scrapy, offers hosted instances to run your spiders with Splash enabled.
Then yield a SplashRequest instead of a Request in your spider, for example:
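A minimal sketch (the spider name and URL are placeholders; args={'wait': 0.5} just gives the page's JavaScript a moment to run):

import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = 'myspider'  # placeholder

    def start_requests(self):
        # render through Splash instead of fetching the raw HTML
        yield SplashRequest('http://example.com', self.parse,
                            args={'wait': 0.5})

    def parse(self, response):
        # response.body is the HTML after JavaScript has run in Splash
        self.logger.info('Title: %s', response.css('title::text').get())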
Scrapy doesn't support JavaScript on its own, but you can use some other library together with Scrapy for executing JS, like WebKit or Selenium; a rough Selenium-based sketch follows.
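One common pattern is a downloader middleware that fetches pages with Selenium and hands the rendered HTML back to Scrapy. This is only a sketch (the class name, settings path, and browser driver are placeholders, not a drop-in solution):

from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware(object):
    """Downloader middleware: render every request in a real browser."""

    def __init__(self):
        # requires a matching geckodriver (or use webdriver.Chrome()) on PATH
        self.driver = webdriver.Firefox()

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # Returning a Response here tells Scrapy to skip its own download
        return HtmlResponse(self.driver.current_url,
                            body=self.driver.page_source,
                            encoding='utf-8', request=request)

# Enable it in settings.py (module path is a placeholder):
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.SeleniumMiddleware': 543}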
Also, you don't need to enable cookies (COOKIES_ENABLED = True), and you aren't required to add those DOWNLOADER_MIDDLEWARES entries in your settings.py, because they are already enabled in Scrapy's default settings.

AFAIK, there is no universal solution. You have to debug the site to see how it determines that JavaScript is not supported/enabled by your client.
I don't think the server looks at the X-JAVASCRIPT-ENABLED header. Maybe there is a cookie set by JavaScript when the page loads in a real JavaScript-enabled browser? Maybe the server looks at the User-Agent header? You could, for instance, try sending a real browser's User-Agent and replaying such a cookie, as sketched below.
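A sketch of both ideas; the User-Agent string, cookie name, and cookie value are placeholders you would have to copy from a real browser session (developer tools):

# settings.py -- masquerade as a regular desktop browser (example UA string)
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36')

# spider: replay a cookie the site's JavaScript would normally set
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        # 'js_enabled': '1' is a placeholder -- inspect a real browser
        # session to find the actual cookie the site checks
        yield scrapy.Request('http://example.com',
                             cookies={'js_enabled': '1'},
                             callback=self.parse)

    def parse(self, response):
        self.logger.info('Got %s (%d bytes)', response.url, len(response.body))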
See also this response.