Circumvent errors in loop function (used to extrac

Published 2020-07-18 05:18


I wrote a loop that extracts tweets using the search API at a fixed interval (say, every 5 minutes). It does what it is supposed to do: it connects to Twitter, extracts tweets that contain a certain keyword, and saves them in a CSV file. However, occasionally (2-3 times a day) the loop stops because of one of these two errors:

  • Error in htmlTreeParse(URL, useInternal = TRUE) : error in creating parser for 6.95322e-310tst&rpp=100&page=10

  • Error in UseMethod("xmlNamespaceDefinitions") : no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"

I hope you can help me deal with these errors by answering some of my questions:

  • What causes these errors to occur?
  • How can I adjust my code to avoid these errors?
  • How can I 'force' the loop to keep running if it hits an error (e.g. by using the try function)?

My function (based on several scripts found online) is as follows:

    library(XML)   # htmlTreeParse

    # URL-encode the search keyword
    QUERY <- URLencode("Keyword")

    # Set time loop (in seconds)
    d_time = 300
    number_of_times = 3000

    for (i in 1:number_of_times) {

      tweets <- NULL
      tweet.count <- 0
      page <- 1
      read.more <- TRUE

      while (read.more) {
        # construct Twitter search URL
        URL <- paste('', QUERY, '&rpp=100&page=', page, sep='')
        # fetch remote URL and parse
        XML <- htmlTreeParse(URL, useInternal=TRUE, error = function(...){})

        # Extract list of "entry" nodes
        entry <- getNodeSet(XML, "//entry")

        read.more <- (length(entry) > 0)
        if (read.more) {
          for (j in seq_along(entry)) {
            subdoc <- xmlDoc(entry[[j]])   # put entry in separate object to manipulate

            published <- unlist(xpathApply(subdoc, "//published", xmlValue))
            published <- gsub("Z", " ", gsub("T", " ", published))

            # Convert from GMT to local (Amsterdam) time
            time.gmt   <- as.POSIXct(published, "GMT")
            local.time <- format(time.gmt, tz="Europe/Amsterdam")

            title  <- unlist(xpathApply(subdoc, "//title", xmlValue))
            author <- unlist(xpathApply(subdoc, "//author/name", xmlValue))
            tweet  <- paste(local.time, " @", author, ":  ", title, sep="")

            entry.frame <- data.frame(tweet, author, local.time, stringsAsFactors=FALSE)
            tweet.count <- tweet.count + 1
            rownames(entry.frame) <- tweet.count
            tweets <- rbind(tweets, entry.frame)
          } # end for (entries)
          page <- page + 1
          read.more <- (page <= 15)   # Seems to be 15 page limit
        } # end if
      } # end while

      write.table(tweets, file=paste("Twitts - ", format(Sys.time(), "%a %b %d %H_%M_%S %Y"), ".csv"), sep = ";")

      # wait until the next round
      Sys.sleep(d_time)
    } # end for (time loop)


Here's how I used try to solve a similar problem with the Twitter API.

I was asking the Twitter API for the number of followers of each person in a long list of Twitter users. Whenever a user's account was protected, I got an error and the loop stopped. Wrapping the call in try let the loop keep going by skipping to the next person on the list.

Here's the setup:

    # load library
    library(twitteR)
    # Search Twitter for your term
    s <- searchTwitter('#rstats', n=1500)
    # convert search results to a data frame
    df <- do.call("rbind", lapply(s, as.data.frame))
    # extract the usernames
    users <- unique(df$screenName)
    users <- sapply(users, as.character)
    # make a data frame for the loop to work with
    users.df <- data.frame(users = users,
                           followers = "", stringsAsFactors = FALSE)

And here's the loop, with try handling errors while users.df$followers is filled with the follower counts obtained from the Twitter API:

    for (i in 1:nrow(users.df)) {
        # tell the loop to skip a user if their account is protected
        # or some other error occurs
        result <- try(getUser(users.df$users[i])$followersCount, silent = TRUE)
        if (inherits(result, "try-error")) next
        # store the number of followers for this user
        users.df$followers[i] <- result
        # tell the loop to pause for 60 s between iterations to
        # avoid exceeding the Twitter API request limit
        print('Sleeping for 60 seconds...')
        Sys.sleep(60)
    }
    # Now inspect users.df to see the follower data
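
The same skip-on-error pattern can also be written with tryCatch, which lets the failed entry be recorded as NA instead of being left untouched. This is only a minimal sketch that runs without a Twitter connection: flaky_lookup is a hypothetical stand-in for a call such as getUser, and the user names and "follower counts" are made up for illustration.

```r
# Hypothetical stand-in for an API call such as getUser();
# it fails for one input to simulate a protected account.
flaky_lookup <- function(user) {
  if (user == "protected_user") stop("account is protected")
  nchar(user)  # dummy value standing in for a follower count
}

users <- c("alice", "protected_user", "bob")
followers <- rep(NA_integer_, length(users))

for (i in seq_along(users)) {
  # On error, report the problem and store NA; the loop then continues.
  followers[i] <- tryCatch(
    flaky_lookup(users[i]),
    error = function(e) {
      message("Skipping ", users[i], ": ", conditionMessage(e))
      NA_integer_
    }
  )
}

print(data.frame(users, followers))
```

Compared with try plus next, the handler here returns a value, so every row ends up with either a result or an explicit NA.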


My guess is that your problems come from Twitter (or your connection to the web) being down or slow, so that you get a bad result back. Have you tried setting

    options(error = recover)

Then the next time you get an error, a browser environment will be set up for you to poke around in.
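
If the cause really is an occasional bad response, one option for an unattended loop is to retry the fetch a few times and skip the round when nothing usable comes back. A plausible (unconfirmed) reading of the second error is that the parse step returned NULL and the node extraction was then called on it, so the sketch below checks for NULL before going further. The names fetch_with_retry and unreliable are hypothetical, and unreliable only simulates a download that fails twice; in the real loop the fetch function would wrap the htmlTreeParse call.

```r
# Call `fetch` up to `attempts` times; return NULL if every attempt fails.
# `fetch` is a hypothetical zero-argument function wrapping the download/parse step.
fetch_with_retry <- function(fetch, attempts = 3, wait = 0) {
  for (k in seq_len(attempts)) {
    result <- tryCatch(fetch(), error = function(e) NULL)
    if (!is.null(result)) return(result)
    if (k < attempts) Sys.sleep(wait)  # pause before the next attempt
  }
  NULL
}

# Demo: a simulated fetcher that fails twice and then succeeds.
calls <- 0
unreliable <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("connection failed")
  "parsed document"
}

doc <- fetch_with_retry(unreliable)
if (is.null(doc)) {
  message("No data this round; moving on to the next interval")
} else {
  print(doc)
}
```

In a long-running job, wait would normally be set to a few seconds so that a temporary outage has time to clear.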