I installed and tried out tweepy, I am using the following function right now:
from API Reference
API.public_timeline()
Returns the 20 most recent statuses from non-protected users who have set a custom user icon. The public timeline is cached for 60 seconds so requesting it more often than that is a waste of resources.
However, I want to do extract all tweets that match a certain regular expression from the complete live stream. I could put public_timeline()
inside a while True
loop but that would probably run into problems with rate limiting. Either way, I don't really think it can cover all current tweets.
How could that be done? If not all tweets, then I want to extract as many tweets that match a certain keyword.
The streaming API is what you want. I use a library called tweetstream. Here's my basic listening function:
I haven't looked in a while, but I'm pretty sure that this library is just accessing the sample stream (as opposed to the firehose). HTH.
Edit to add: you say you want the "complete live stream", aka the firehose. That's fiscally and technically expensive and only very large companies are allowed to have it. Look at the docs and you'll see that the sample is basically representative.
Take a look at the streaming API. You can even subscribe to a list of words that you define, and only tweets that match those words are returned.
The streaming API rate limiting works differently: you get 1 connection per IP, and a maximum number of events per second. If more events occur than that, then you only get the maximum anyways, with a notification regarding how many events you missed because of rate limiting.
My understanding is that the streaming API is most suitable for servers that will redistribute the content to your users as needed, instead of being accessed directly by your users - the standing connections are expensive and Twitter starts blacklisting IPs after too many failed connections and re-connections, and possibly your API key afterwards.