Question:

I am scraping a website with Scrapy that requires cookies and JavaScript to be enabled. I don't think I will actually have to process JavaScript; all I need is to pretend that JavaScript is enabled.

Here is what I have tried:

1) Enabling cookies with the following settings:

COOKIES_ENABLED = True
COOKIES_DEBUG = True

2) Using the downloader middleware for cookies:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware':700
}

3) Sending the header 'X-JAVASCRIPT-ENABLED': 'True':

DEFAULT_REQUEST_HEADERS={
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'X-JAVASCRIPT-ENABLED': 'True'
}

But none of these works for me. Can you suggest any ideas or give me some direction?

Thanks in advance for your replies.

Answer 1:

You should try the Splash JavaScript rendering engine with scrapy-splash (formerly scrapyjs). Here is an example of how to set it up in your spider project:

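# Address of a running Splash instance
SPLASH_URL = 'http://192.168.59.103:8050'
# scrapyjs has been renamed scrapy-splash, which is the package imported below
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}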

Scrapinghub, the company behind Scrapy, offers special instances for running your spiders with Splash enabled.

Then yield SplashRequest instead of Request in your spider like this:

import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = "myspider"  # placeholder; Scrapy requires every spider to have a name
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                endpoint='render.html',
                args={'wait': 0.5},
            )

    def parse(self, response):
        # response.body is the result of the render.html call; it
        # contains HTML processed by a browser.
        # …
        pass

Answer 2:

AFAIK, there is no universal solution. You have to debug the site to see how it determines that JavaScript is not supported or enabled by your client.

I don't think the server looks at the X-JAVASCRIPT-ENABLED header. Maybe there is a cookie set by JavaScript when the page loads in a real JavaScript-enabled browser? Maybe the server looks at the User-Agent header?
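
One way to check the cookie hypothesis is to compare the cookies a real browser sends with the cookies the server sets itself. The sketch below is only an illustration using the standard library: the helper names (cookie_names, js_only_cookies) and the sample cookie js_enabled are made up, and it assumes you have copied the browser's Cookie header from its developer tools and the server's Set-Cookie headers from the response (COOKIES_DEBUG = True logs them in Scrapy).

```python
from http.cookies import SimpleCookie


def cookie_names(header_value):
    """Return the set of cookie names in a Cookie or Set-Cookie header value."""
    jar = SimpleCookie()
    jar.load(header_value)
    # Attributes such as Path or HttpOnly are attached to the cookie,
    # so only real cookie names show up as keys.
    return set(jar.keys())


def js_only_cookies(browser_cookie_header, server_set_cookie_headers):
    """Names of cookies the browser sends that the server never set.

    Such cookies were probably created by JavaScript running in the page.
    """
    sent_by_browser = cookie_names(browser_cookie_header)
    set_by_server = set()
    for header in server_set_cookie_headers:
        set_by_server |= cookie_names(header)
    return sent_by_browser - set_by_server


# Hypothetical values, standing in for what you would copy from a real session.
browser_header = "sessionid=abc; js_enabled=1"
server_headers = ["sessionid=abc; Path=/; HttpOnly"]
print(js_only_cookies(browser_header, server_headers))  # {'js_enabled'}
```

If a cookie turns up this way, you could try sending it yourself through the cookies argument of scrapy.Request. Whether the site accepts it depends on how the script computes the value, so treat this as a starting point for debugging rather than a guaranteed fix.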

Answer 3:

Scrapy doesn't support JavaScript.

However, you can use another library alongside Scrapy to execute JavaScript, such as WebKit or Selenium.

Also, you don't need to enable cookies (COOKIES_ENABLED = True) or add DOWNLOADER_MIDDLEWARES to your settings.py, because both are already enabled in Scrapy's default settings.