I am using tweepy to handle a large Twitter stream (following 4,000+ accounts). The more accounts I add to the stream, the more likely I am to get this error:
    Traceback (most recent call last):
      File "myscript.py", line 2103, in <module>
        main()
      File "myscript.py", line 2091, in main
        twitter_stream.filter(follow=USERS_TO_FOLLOW_STRING_LIST, stall_warnings=True)
      File "C:\Python27\lib\site-packages\tweepy\streaming.py", line 445, in filter
        self._start(async)
      File "C:\Python27\lib\site-packages\tweepy\streaming.py", line 361, in _start
        self._run()
      File "C:\Python27\lib\site-packages\tweepy\streaming.py", line 294, in _run
        raise exception
    requests.packages.urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read, 2000 more expected)', IncompleteRead(0 bytes read, 2000 more expected))
Obviously that is a thick firehose, and empirically it is too thick for my script to handle. Based on researching this error on Stack Overflow, and on the trend that the more accounts I follow, the faster this exception occurs, my hypothesis is that this is 'my fault': my processing of each tweet takes too long, and/or my firehose is simply too thick. I get that.
But notwithstanding that setup, I still have two questions that I can't seem to find solid answers for.
1. Is there a way to simply 'handle' this exception, accept that I will miss some tweets, and keep the script running? I figure it might miss a tweet (or many tweets), but if I can live without catching 100% of the tweets I want, then the script/stream can still go on, ready to catch the next tweet whenever it can.
I've tried this exception handling, which was recommended for exactly that in a similar question on Stack Overflow:

    from urllib3.exceptions import ProtocolError

    while True:
        try:
            twitter_stream.filter(follow=USERS_TO_FOLLOW_STRING_LIST, stall_warnings=True)
        except ProtocolError:
            continue
But unfortunately for me (perhaps I implemented it incorrectly, but I don't think I did), that did not work: I get the exact same error I was previously getting, with or without that recommended exception-handling code in place.
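For reference, here is roughly the full context in which I placed that loop. This is only a stripped-down sketch of my script: the credential names, MyListener, and do_my_slow_processing are placeholders standing in for my real OAuth setup and per-tweet processing.

    import tweepy
    from urllib3.exceptions import ProtocolError

    # Placeholders for my real credentials
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

    class MyListener(tweepy.StreamListener):
        def on_status(self, status):
            do_my_slow_processing(status)  # stand-in for my real per-tweet work
            return True

    twitter_stream = tweepy.Stream(auth, MyListener())

    # The retry loop recommended on Stack Overflow, wrapped around filter()
    while True:
        try:
            twitter_stream.filter(follow=USERS_TO_FOLLOW_STRING_LIST,
                                  stall_warnings=True)
        except ProtocolError:
            continue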
2. I have never implemented queues and/or threading in my Python code. Would this be a good time for me to try? I don't know everything about queues/threading, but here is what I am imagining...
Could I have the tweets written, raw and unprocessed, to memory or a database or something, on one thread? And then have a second thread ready to do the processing of those tweets as soon as it can? I figure that way, at least, my post-processing of each tweet is taken out of the equation as a limiting factor on the bandwidth of the firehose I am reading. Then, if I still get the error, I can cut back on who I am following, etc.
I have watched some threading tutorials, but figured it was worth asking whether that approach even 'works' with this tweepy/Twitter setup. I am not confident in my understanding of the problem I have, or of how threading might help, so I figured I would ask whether it would actually help me here.
If this idea is valid, could someone point me in the right direction with a simple piece of example code? To be concrete, something like the sketch below is the sort of thing I have in mind.
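This is only my rough attempt at the producer/consumer idea, assuming a standard-library Queue and a single worker thread; handle_tweet, auth, and USERS_TO_FOLLOW_STRING_LIST stand in for my existing processing function, OAuth setup, and account list. Is this roughly the right pattern, or am I off base?

    import threading
    import tweepy

    try:
        from Queue import Queue   # Python 2.7, which is what I am running
    except ImportError:
        from queue import Queue   # Python 3

    tweet_queue = Queue()

    class QueueListener(tweepy.StreamListener):
        # Runs on the stream's thread: do as little as possible here --
        # just park the raw JSON on the queue and return immediately.
        def on_data(self, raw_data):
            tweet_queue.put(raw_data)
            return True

        def on_error(self, status_code):
            return True  # keep the stream alive on HTTP errors

    def process_tweets():
        # Runs on a worker thread: pull tweets off the queue and do the
        # slow per-tweet processing at whatever pace it takes.
        while True:
            raw_data = tweet_queue.get()
            handle_tweet(raw_data)   # stand-in for my existing processing
            tweet_queue.task_done()

    worker = threading.Thread(target=process_tweets)
    worker.daemon = True
    worker.start()

    twitter_stream = tweepy.Stream(auth, QueueListener())
    twitter_stream.filter(follow=USERS_TO_FOLLOW_STRING_LIST, stall_warnings=True)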