During the scraping process using scrapy one error appears in my logs from time to time. It doesnt seem to be anywhere in my code, and looks like it something inside twisted\openssl. Any ideas what caused this and how to get rid of it?
Stacktrace here:
[Launcher,27487/stderr] Error during info_callback
Traceback (most recent call last):
File "/opt/webapps/link_crawler/lib/python2.7/site-packages/twisted/protocols/tls.py", line 415, in dataReceived
self._write(bytes)
File "/opt/webapps/link_crawler/lib/python2.7/site-packages/twisted/protocols/tls.py", line 554, in _write
sent = self._tlsConnection.send(toSend)
File "/opt/webapps/link_crawler/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1270, in send
result = _lib.SSL_write(self._ssl, buf, len(buf))
File "/opt/webapps/link_crawler/lib/python2.7/site-packages/OpenSSL/SSL.py", line 926, in wrapper
callback(Connection._reverse_mapping[ssl], where, return_code)
--- <exception caught here> ---
File "/opt/webapps/link_crawler/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1055, in infoCallback
return wrapped(connection, where, ret)
File "/opt/webapps/link_crawler/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1157, in _identityVerifyingInfoCallback
transport = connection.get_app_data()
File "/opt/webapps/link_crawler/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1589, in get_app_data
return self._app_data
File "/opt/webapps/link_crawler/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1148, in __getattr__
return getattr(self._socket, name)
exceptions.AttributeError: 'NoneType' object has no attribute '_app_data'
I was able to solve this problem by installing the
service_identity
package:pip install service_identity
At first glance, it appears as though this is due to a bug in scrapy. Scrapy defines its own Twisted "context factory": https://github.com/scrapy/scrapy/blob/ad36de4e6278cf635509a1ade30cca9a506da682/scrapy/core/downloader/contextfactory.py#L21-L28
This code instantiates
ClientTLSOptions
with the context it intends to return. A side-effect of instantiating this class is that an "info callback" is installed on the context factory. The info callback requires that the Twisted TLS implementation has been set as "app data" on the connection. However, since nothing ever uses theClientTLSOptions
instance (it is discarded immediately), the app data is never set.When the info callback comes back around to get the Twisted TLS implementation (necessary to do part of its job) it instead finds there is no app data and fails with the exception you've reported.
The side-effect of
ClientTLSOptions
is a little bit unpleasant but I think this is clearly a scrapy bug caused by mis-use/abuse ofClientTLSOptions
. I don't think this code could ever have been very well tested since this error will happen every single time a certificate fails to verify.I suggest reporting the bug to Scrapy. Hopefully they can fix their use of
ClientTLSOptions
and eliminate this error for you.I was able to solve the issue by replacing https:// with http:// in all image links before sending them to the imagepipeline