I created a loop function that extracts tweets using the search API at a fixed interval (let's say every 5 minutes). The function does what it is supposed to do: it connects to Twitter, extracts tweets that contain a certain keyword, and saves them to a CSV file. However, occasionally (2-3 times a day) the loop stops because of one of these two errors:
Error in htmlTreeParse(URL, useInternal = TRUE) : error in creating parser for http://search.twitter.com/search.atom?q= 6.95322e-310tst&rpp=100&page=10
Error in UseMethod("xmlNamespaceDefinitions") : no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"
I hope you can help me deal with these errors by answering some of my questions:
- What causes these errors to occur?
- How can I adjust my code to avoid these errors?
- How can I 'force' the loop to keep running if it experiences an error (e.g. by using the try function)? A sketch of this is shown after the code below.
My function (based on several scripts found online) is as follows:
library(XML) # htmlTreeParse

twitter.search <- "Keyword"
QUERY <- URLencode(twitter.search)

# Set the time between runs (in seconds) and the number of runs
d_time <- 300
number_of_times <- 3000

for (i in 1:number_of_times) {
  tweets <- NULL
  tweet.count <- 0
  page <- 1
  read.more <- TRUE

  while (read.more) {
    # construct Twitter search URL
    URL <- paste('http://search.twitter.com/search.atom?q=', QUERY, '&rpp=100&page=', page, sep = '')
    # fetch remote URL and parse
    XML <- htmlTreeParse(URL, useInternal = TRUE, error = function(...){})

    # extract list of "entry" nodes
    entry <- getNodeSet(XML, "//entry")
    read.more <- (length(entry) > 0)

    if (read.more) {
      for (j in 1:length(entry)) {
        subdoc <- xmlDoc(entry[[j]]) # put entry in a separate object to manipulate
        published <- unlist(xpathApply(subdoc, "//published", xmlValue))
        published <- gsub("Z", " ", gsub("T", " ", published))

        # convert from GMT to local (Amsterdam) time
        time.gmt <- as.POSIXct(published, "GMT")
        local.time <- format(time.gmt, tz = "Europe/Amsterdam")

        title <- unlist(xpathApply(subdoc, "//title", xmlValue))
        author <- unlist(xpathApply(subdoc, "//author/name", xmlValue))

        tweet <- paste(local.time, " @", author, ": ", title, sep = "")
        entry.frame <- data.frame(tweet, author, local.time, stringsAsFactors = FALSE)
        tweet.count <- tweet.count + 1
        rownames(entry.frame) <- tweet.count
        tweets <- rbind(tweets, entry.frame)
      }
      page <- page + 1
      read.more <- (page <= 15) # there seems to be a 15-page limit
    }
  }

  names(tweets)
  # top 15 tweeters
  # sort(table(tweets$author), decreasing = TRUE)[1:15]

  write.table(tweets, file = paste("Twitts - ", format(Sys.time(), "%a %b %d %H_%M_%S %Y"), ".csv"), sep = ";")
  Sys.sleep(d_time)
} # end for
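To keep the loop running after such an error (the third question above), here is a minimal sketch of wrapping the fetch-and-parse step inside the while loop in tryCatch; it reuses URL from above and simply abandons the current page on error:

# wrap the fetch-and-parse step so a single bad request does not stop the run
XML <- tryCatch(
  htmlTreeParse(URL, useInternal = TRUE),
  error = function(e) NULL  # on any error, return NULL instead of stopping
)

if (is.null(XML)) {
  read.more <- FALSE  # give up on this page; the next run starts after Sys.sleep()
} else {
  entry <- getNodeSet(XML, "//entry")
  read.more <- (length(entry) > 0)
  # ... process the entries as in the loop above ...
}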
My guess is that your problems correspond to Twitter (or your connection to the web) being down or slow or whatever, and so getting a bad result. Have you tried setting options(error = recover)? Then the next time you get an error, a nice browser environment will be set up for you to have a poke around.
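For example (recover() is part of R's standard debugging tools; the comments describe how you would use it with the loop above):

# on any uncaught error, drop into recover(): it lists the active call
# frames and lets you inspect the one you pick with browser()
options(error = recover)

# re-run the scraping loop; when htmlTreeParse() or getNodeSet() fails you can
# inspect URL, XML, entry, etc. in the frame where the error happened

# restore the default behaviour afterwards
options(error = NULL)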
Here's my solution using try for a similar problem with the Twitter API. I was asking the Twitter API for the number of followers for each person in a long list of Twitter users. When a user had their account protected I would get an error, and the loop would break before I put in the try function. Using try allowed the loop to keep working by skipping on to the next person on the list.

Here's the setup:
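A minimal sketch of such a setup, with illustrative screen names standing in for the real list:

# an illustrative vector of screen names standing in for the real one
user.names <- c("user1", "user2", "user3")

# data frame to hold the results; followers starts as NA and is filled in below
users <- data.frame(name = user.names, followers = NA, stringsAsFactors = FALSE)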
And here's the loop, with try to handle errors while populating users$followers with follower counts obtained from the Twitter API:
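A sketch of such a loop, assuming the setup above (the users/show endpoint from the old v1 REST API is used purely for illustration):

library(XML) # xmlTreeParse, xpathSApply

for (i in 1:nrow(users)) {
  # protected accounts (or a dead endpoint) make this request fail
  url <- paste("http://api.twitter.com/1/users/show.xml?screen_name=",
               users$name[i], sep = "")

  # try() returns an object of class "try-error" instead of stopping the loop
  result <- try({
    doc <- xmlTreeParse(url, useInternalNodes = TRUE)
    as.numeric(xpathSApply(doc, "//followers_count", xmlValue))
  }, silent = TRUE)

  # on error, leave NA and skip on to the next user
  if (!inherits(result, "try-error")) {
    users$followers[i] <- result
  }
}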