I'm trying to find out the number of replies to each tweet posted by a given user. This is not available directly from Twitter's API. I've decided to only go after replies from the user's followers, both to cut down the amount of data generated and because it should be a good approximation (I believe most of the replies to a tweet come directly from that user's followers).
I believe I've come a long way already; I just need help with the final section. I'm struggling to make the function I've created run over all the followers.
I'd rather this solution be in R than Python, although I know a Python option exists. I've also put in the Twitter handle for Donald Trump; I'm not actually trying to do this for him, and I know his huge following would make that a challenge. I want a generic version usable for whichever user is input.
library(rtweet)
library(plyr)
library(dplyr)
##set name of tweeter to look at (this can be changed)
targettwittername <- "realDonaldTrump"
##get this tweeter's timeline
tmls <- get_timeline(targettwittername, n=3200, retryonratelimit=TRUE)
##get their user id
targettwitteruserid <- lookup_users(targettwittername)$user_id
##get ids of their tweets (keep status_id as a character column: tweet ids are
##64-bit integers, so converting them with as.numeric() silently loses precision)
tweetids <- select(tmls, status_id)
##get list of followers (who are most likely to reply)
targetfollowers <- data.frame(get_followers(targettwittername))
##clean up follower list to exclude those that have never tweeted and restricted access
user_lookup <- lookup_users(targetfollowers$user_id)
users_with_tweets_and_unprotected <- filter(user_lookup, statuses_count != 0)
users_with_tweets_and_unprotected <- select(filter(users_with_tweets_and_unprotected, protected != "TRUE"), user_id)
targetfollowers <- filter(targetfollowers, user_id %in% users_with_tweets_and_unprotected$user_id)
##custom function to search all followers timelines one by one
getfollowersreplies <- function(x){
  ##keep the follower id as a character string (again, to preserve precision)
  follower <- as.character(x[1])
  ##pull up to 3,200 of this follower's most recent tweets
  followertl <- data.frame(get_timeline(follower, n=3200, retryonratelimit=TRUE))
  ##keep only tweets that are replies to the target user
  followertl <- filter(followertl, in_reply_to_status_user_id == targettwitteruserid)
  ##match those replies to the target's own tweets by status id (character join)
  join <- inner_join(followertl, tweetids, by=c("in_reply_to_status_status_id"="status_id"))
  ##count replies per tweet of the target user
  replycounts <- data.frame(
    join %>%
      group_by(user_id, in_reply_to_status_status_id) %>%
      summarise(n=n())
  )
  return(replycounts)
}
tweet_replies <- do.call("rbind", lapply(targetfollowers$user_id, getfollowersreplies))
The biggest obstacle would be the time it takes to collect up to 3,200 of the most recent tweets from each of @realDonaldTrump's more than 42 million followers.
Twitter limits the number of follower user IDs collected to 75,000 every 15 minutes.
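If you want to confirm that against your own token, rtweet's rate_limit() helper reports the per-endpoint limits. This is a minimal sketch, assuming the pre-1.0 rtweet interface used in the question; the followers/ids endpoint allows 15 requests of 5,000 IDs per 15-minute window:
##check the followers/ids endpoint: 15 requests x 5,000 IDs = 75,000 per window
fl_limit <- rate_limit(query = "followers/ids")
fl_limit$remaining  ##requests left in the current 15-minute window
fl_limit$reset      ##time remaining until the window resets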
Assuming you have a reliable internet connection and plenty of time, you can use the following code to get all 42 million follower IDs.
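Something like this should do it (a sketch only, reusing targettwittername from the question; I haven't run it against an account of this size). With retryonratelimit = TRUE, get_followers() sleeps through each 15-minute window, and at 75,000 IDs per window, 42 million IDs works out to roughly six days of waiting:
##get all ~42 million follower ids, sleeping through rate limits as needed
targetfollowers <- data.frame(
  get_followers(targettwittername, n = 42000000, retryonratelimit = TRUE)
)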
Then you'd probably want to construct a for loop that uses get_timeline() and handles the API rate limits.
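Here's a rough sketch of that loop (untested at this scale), reusing the getfollowersreplies() function and the targetfollowers data frame from the question. The pause interval is my assumption based on the user-timeline limit of 900 requests per 15 minutes: at n = 3200, each follower costs up to 16 requests, so roughly 56 followers fit in one window:
##container for the per-follower reply counts
reply_list <- vector("list", nrow(targetfollowers))

for (i in seq_len(nrow(targetfollowers))) {
  ##collect and count this follower's replies to the target user;
  ##skip followers whose timelines error out (e.g. suspended accounts)
  reply_list[[i]] <- tryCatch(
    getfollowersreplies(targetfollowers$user_id[i]),
    error = function(e) NULL
  )
  ##after every 56 followers (~900 timeline requests), pause until the
  ##15-minute rate-limit window resets; rate_limit() could give the exact wait
  if (i %% 56 == 0) {
    Sys.sleep(15 * 60)
  }
}

tweet_replies <- do.call("rbind", reply_list)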
In the example code above, I've made the loop sleep until the rate limit resets after every 56 calls. As you can see, this would take a really long time. You'd be better off trying to collect all the replies from the past 6-9 days instead. The code below gets up to 5 million replies to Trump's tweets from the past 9 days. Warning: if there actually are that many replies available from the past 9 days (I honestly have no idea), the search would take just under three days to finish.
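That search is a single call to search_tweets() with the to: operator. This is again a sketch, reusing targettwittername and targettwitteruserid from the question, and assuming the reply column names of the rtweet version used there. The standard search API only reaches back about 6-9 days, and with retryonratelimit = TRUE rtweet sleeps through the limit of 18,000 tweets per 15 minutes, which is how 5 million tweets adds up to just under three days:
##get up to 5 million tweets directed at the target account (past ~9 days)
replies <- search_tweets(paste0("to:", targettwittername), n = 5000000,
                         retryonratelimit = TRUE)

##keep only true replies to the target's own tweets, then count per tweet
reply_counts <- replies %>%
  filter(in_reply_to_status_user_id == targettwitteruserid) %>%
  group_by(in_reply_to_status_status_id) %>%
  summarise(replies = n())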