Circumvent errors in loop function (used to extract tweets)

Posted 2020-07-18 05:18

Question:

I created a loop function that extracts tweets via the search API at a set interval (let's say every 5 minutes). The function does what it is supposed to do: it connects to Twitter, extracts tweets that contain a certain keyword, and saves them to a CSV file. However, occasionally (2-3 times a day) the loop stops because of one of these two errors:

  • Error in htmlTreeParse(URL, useInternal = TRUE) : error in creating parser for http://search.twitter.com/search.atom?q= 6.95322e-310tst&rpp=100&page=10

  • Error in UseMethod("xmlNamespaceDefinitions") : no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"

I hope you can help me deal with these errors by answering some of my questions:

  • What causes these errors to occur?
  • How can I adjust my code to avoid these errors?
  • How can I 'force' the loop to keep running when it hits an error (e.g. by using the try() function)?

My function (based on several scripts found online) is as follows:

    library(XML)   # provides htmlTreeParse, getNodeSet, xpathApply

    twitter.search <- "Keyword"

    QUERY <- URLencode(twitter.search)

    # Set loop interval (in seconds) and number of iterations
    d_time <- 300
    number_of_times <- 3000

    for (i in 1:number_of_times) {

        tweets      <- NULL
        tweet.count <- 0
        page        <- 1
        read.more   <- TRUE

        while (read.more) {
            # construct Twitter search URL
            URL <- paste('http://search.twitter.com/search.atom?q=', QUERY,
                         '&rpp=100&page=', page, sep = '')
            # fetch remote URL and parse, suppressing parser error messages
            XML <- htmlTreeParse(URL, useInternal = TRUE, error = function(...) {})

            # extract the list of "entry" nodes
            entry <- getNodeSet(XML, "//entry")

            read.more <- (length(entry) > 0)
            if (read.more) {
                for (j in 1:length(entry)) {       # 'j', so the outer loop's 'i' is not shadowed
                    subdoc <- xmlDoc(entry[[j]])   # put entry in a separate object to manipulate

                    published <- unlist(xpathApply(subdoc, "//published", xmlValue))
                    published <- gsub("Z", " ", gsub("T", " ", published))

                    # convert from GMT to local (Amsterdam) time
                    time.gmt   <- as.POSIXct(published, tz = "GMT")
                    local.time <- format(time.gmt, tz = "Europe/Amsterdam")

                    title  <- unlist(xpathApply(subdoc, "//title", xmlValue))
                    author <- unlist(xpathApply(subdoc, "//author/name", xmlValue))

                    tweet <- paste(local.time, " @", author, ":  ", title, sep = "")

                    entry.frame <- data.frame(tweet, author, local.time, stringsAsFactors = FALSE)
                    tweet.count <- tweet.count + 1
                    rownames(entry.frame) <- tweet.count
                    tweets <- rbind(tweets, entry.frame)
                }
                page      <- page + 1
                read.more <- (page <= 15)   # there seems to be a 15-page limit
            }
        }

        # top 15 tweeters
        # sort(table(tweets$author), decreasing = TRUE)[1:15]

        # only write a file if this iteration actually collected tweets
        if (!is.null(tweets)) {
            write.table(tweets,
                        file = paste("Twitts - ", format(Sys.time(), "%a %b %d %H_%M_%S %Y"), ".csv"),
                        sep = ";")
        }

        Sys.sleep(d_time)

    }   # end for

Answer 1:

Here's my solution using try for a similar problem with the Twitter API.

I was asking the Twitter API for the number of followers of each person in a long list of Twitter users. Before I put in the try function, the loop would break with an error whenever it reached a user whose account was protected. Using try allowed the loop to keep working by skipping on to the next person on the list.

Here's the setup:

# load the twitteR package
library(twitteR)

# Search Twitter for your term
s <- searchTwitter('#rstats', n = 1500)
# convert search results to a data frame
df <- do.call("rbind", lapply(s, as.data.frame))
# extract the unique usernames
users <- unique(df$screenName)
users <- sapply(users, as.character)
# make a data frame for the loop to work with
users.df <- data.frame(users = users,
                       followers = "", stringsAsFactors = FALSE)

And here's the loop, which uses try to handle errors while populating users.df$followers with follower counts obtained from the Twitter API:

for (i in 1:nrow(users.df)) {
    # try the request; if the user's account is protected (or some
    # other error occurs), skip to the next user instead of breaking
    result <- try(getUser(users.df$users[i])$followersCount, silent = TRUE)
    if (inherits(result, "try-error")) next
    # store the number of followers for this user (reusing the value
    # from the try call avoids a second, unguarded API request)
    users.df$followers[i] <- result
    # pause for 60 s between iterations to avoid exceeding
    # the Twitter API request limit
    print('Sleeping for 60 seconds...')
    Sys.sleep(60)
}

# Now inspect users.df to see the follower data
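
The same pattern should carry over to the loop in the question: wrap the fetch-and-parse step in try and react when it errors, instead of letting it kill the loop. Here's a minimal sketch of that idea; the URL construction and the 15-page limit come from the question, while the rest is illustrative and untested against the old search API:

library(XML)

QUERY <- URLencode("Keyword")

for (page in 1:15) {   # the question notes an apparent 15-page limit
    URL <- paste('http://search.twitter.com/search.atom?q=', QUERY,
                 '&rpp=100&page=', page, sep = '')
    # wrap both the fetch and the node extraction in try, so that a
    # bad response skips this page instead of stopping the whole loop
    entry <- try({
        doc <- htmlTreeParse(URL, useInternal = TRUE)
        getNodeSet(doc, "//entry")
    }, silent = TRUE)
    if (inherits(entry, "try-error")) next   # fetch/parse failed; try the next page
    if (length(entry) == 0) break            # no more results for this search
    # ... process the 'entry' nodes as in the question's inner loop ...
}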


Answer 2:

My guess is that your problems come from Twitter (or your connection to the web) being down or slow, so you occasionally get a bad result back. Have you tried setting

options(error = recover)

Then the next time you get an error, a nice browser environment will be set up for you to have a poke around in.
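
If a flaky connection is indeed the cause, another option is to retry the fetch a few times before giving up. A rough sketch follows; the helper name fetch.entries, the three attempts, and the 5-second pause are all arbitrary choices, not anything from the original code:

library(XML)

# Retry a flaky fetch/parse a few times before giving up.
fetch.entries <- function(URL, max.tries = 3) {
    for (attempt in 1:max.tries) {
        result <- try({
            doc <- htmlTreeParse(URL, useInternal = TRUE)
            getNodeSet(doc, "//entry")
        }, silent = TRUE)
        if (!inherits(result, "try-error")) return(result)
        Sys.sleep(5)   # brief pause before retrying
    }
    NULL   # every attempt failed; let the caller skip this page
}

Because length(NULL) is 0, dropping this helper into the question's loop means the existing read.more <- (length(entry) > 0) check would simply end the page loop on a persistent failure instead of crashing.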