I purchased a VPS — my first ever — which is running CentOS 7 64-bit. I had absolutely zero experience with CentOS 7 until I began using this VPS today, so please go easy on me.
When trying to scrape some dynamically generated content with Scrapy and Selenium, the script ultimately fails and the log throws an error that reads: DevToolsActivePort file doesn't exist
On the very next line, the log pulls up info about the Chrome WebDriver: (Driver info: chromedriver=2.40.565383 ...
Hence I suspect the issue has nothing to do with locating the chromedriver executable.
I've included part of the log below. The script always hangs for an extended period when Selenium is first invoked, then ultimately fails, which is why I haven't included the purely Scrapy part of the log.
The second-to-top answer with 4 votes on this thread reads, "This error message implies that the ChromeDriver was unable to initiate/spawn a new WebBrowser i.e. Chrome Browser session."
I've installed the Chrome Browser according to these instructions from the official repository.
The Chrome binary lives at /usr/bin/google-chrome, while chromedriver is located in the /usr/local/bin/ directory. Both locations have been added to PATH.
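For what it's worth, a quick way to sanity-check that both are actually resolvable from PATH (rather than just sitting in those directories) is Python's shutil.which; the expected paths in the comments are my assumption:

from shutil import which

print(which('google-chrome'))  # expect /usr/bin/google-chrome
print(which('chromedriver'))   # expect /usr/local/bin/chromedriver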
I've tried searching section 7.1 Exceptions in this non-official Selenium documentation for anything related to this error, but came up empty-handed.
When I try to launch Google Chrome on the VPS via SSH, I get an error that reads:
[83526:83526:0622/212649.156252:ERROR:zygote_host_impl_linux.cc(88)] Running as root without --no-sandbox is not supported. See https://crbug.com/638180.
and it links to a page that is no longer available. When I then open Chrome with the --no-sandbox flag, I instead get:
(google-chrome-stable:85573): Gtk-WARNING **: cannot open display:
[0622/221013.556327:ERROR:nacl_helper_linux.cc(310)] NaCl helper process running without a sandbox! Most likely you need to configure your SUID sandbox correctly
Nothing should be wrong with the code itself, though I'll include it below anyway; the script works fine locally on my own computer.
What's going on? I'm at a loss right now. Any help would be appreciated big time!
The above are just a few of the things I've tried to remedy the situation. I have yet to try messing with the option parameters of webdriver.Chrome(...), but I plan on trying that right after I finish posting this question, along the lines of the sketch below.
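Based on the two errors above, my plan is to run Chrome headless (there's no X display on the VPS) and pass --no-sandbox (since I'm running as root). This is only a sketch of what I intend to try, not a confirmed fix, and --disable-dev-shm-usage is an extra guess aimed at the small /dev/shm some VPSes ship with:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')               # no X display available over SSH
options.add_argument('--no-sandbox')             # Chrome refuses to run as root otherwise
options.add_argument('--disable-gpu')            # commonly recommended alongside --headless
options.add_argument('--disable-dev-shm-usage')  # guess: /dev/shm may be too small on a VPS

driver = webdriver.Chrome(
    executable_path='/usr/local/bin/chromedriver',
    chrome_options=options,
)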
Part of the Log When the Issues Begin
2018-06-22 20:31:22 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:41533/session {"capabilities": {"firstMatch": [{}], "alwaysMatch": {"browserName": "chrome", "platformName": "any", "goog:chromeOptions": {"extensions": [], "args": []}}}, "desiredCapabilities": {"browserName": "chrome", "version": "", "platform": "ANY", "goog:chromeOptions": {"extensions": [], "args": []}}}
2018-06-22 20:32:22 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2018-06-22 20:32:22 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.amazon.ca/b/ref=sr_aj?node=2055586011> (referer: None)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/local/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
for x in result:
File "/usr/local/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/local/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/bldsprt/public_html/spiders/selen.py", line 53, in parse
self.driver = webdriver.Chrome('/usr/local/bin/chromedriver')
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 75, in __init__
desired_capabilities=desired_capabilities)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 156, in __init__
self.start_session(capabilities, browser_profile)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 245, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 314, in execute
self.error_handler.check_response(response)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: DevToolsActivePort file doesn't exist
(Driver info: chromedriver=2.40.565383 (76257d1ab79276b2d53ee976b2c3e3b9f335cde7),platform=Linux 3.10.0-862.3.3.el7.x86_64 x86_64)
2018-06-22 20:32:23 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2018-06-22 20:32:23 [scrapy.core.engine] INFO: Closing spider (finished)
2018-06-22 20:32:23 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 310,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 111488,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 6, 23, 0, 32, 23, 3101),
'log_count/DEBUG': 4,
'log_count/ERROR': 1,
'log_count/INFO': 8,
'memusage/max': 54161408,
'memusage/startup': 46567424,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/WebDriverException': 1,
'start_time': datetime.datetime(2018, 6, 23, 0, 31, 21, 19959)}
2018-06-22 20:32:23 [scrapy.core.engine] INFO: Spider closed (finished)
[root@host spiders]#
Part of the Script
# Imports used by this excerpt (at the top of selen.py):
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

# Inside parse():
self.driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver')
self.driver.get(response.url)
self.driver.set_window_size(960, 540)
self.driver.wait = WebDriverWait(self.driver, 10)
# Grab the "next page" link before shutting the browser down
next_link = self.driver.find_element_by_xpath('//a[@id="pagnNextLink"]')
href = next_link.get_attribute('href')
self.driver.quit()
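As an aside, self.driver.wait is assigned above but never actually used in this excerpt. If the next-page link were injected late by JavaScript, I'd expect to use it with an expected condition along these lines (same XPath as above; this is a sketch, not code from the spider as it stands):

from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

next_link = self.driver.wait.until(
    EC.presence_of_element_located((By.XPATH, '//a[@id="pagnNextLink"]'))
)
href = next_link.get_attribute('href')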