Improving a function to get stock news data from g

2019-03-09 14:50发布

问题:

I've written a function to grab and parse news data from Google for a given stock symbol, but I'm sure there are ways it could be improved. For starters, my function returns an object in the GMT timezone, rather than the user's current timezone, and it fails if passed a number greater than 299 (probably because google only returns 300 stories per stock). This is somewhat in response to my own question on stack overflow, and relies heavily on this blog post.

tl;dr: how can I improve this function?

 getNews <- function(symbol, number){

    # Warn about length
    if (number>300) {
        warning("May only get 300 stories from google")
    }

    # load libraries
    require(XML); require(plyr); require(stringr); require(lubridate);
    require(xts); require(RDSTK)

    # construct url to news feed rss and encode it correctly
    url.b1 = 'http://www.google.com/finance/company_news?q='
    url    = paste(url.b1, symbol, '&output=rss', "&start=", 1,
               "&num=", number, sep = '')
    url    = URLencode(url)

    # parse xml tree, get item nodes, extract data and return data frame
    doc   = xmlTreeParse(url, useInternalNodes = TRUE)
    nodes = getNodeSet(doc, "//item")
    mydf  = ldply(nodes, as.data.frame(xmlToList))

    # clean up names of data frame
    names(mydf) = str_replace_all(names(mydf), "value\\.", "")

    # convert pubDate to date-time object and convert time zone
    pubDate = strptime(mydf$pubDate, 
                     format = '%a, %d %b %Y %H:%M:%S', tz = 'GMT')
    pubDate = with_tz(pubDate, tz = 'America/New_york')
    mydf$pubDate = NULL

    #Parse the description field
    mydf$description <- as.character(mydf$description)
    parseDescription <- function(x) {
        out <- html2text(x)$text
        out <- strsplit(out,'\n|--')[[1]]

        #Find Lead
        TextLength <- sapply(out,nchar)
        Lead <- out[TextLength==max(TextLength)]

        #Find Site
        Site <- out[3]

        #Return cleaned fields
        out <- c(Site,Lead)
        names(out) <- c('Site','Lead')
        out
    }
    description <- lapply(mydf$description,parseDescription)
    description <- do.call(rbind,description)
    mydf <- cbind(mydf,description)

    #Format as XTS object
    mydf = xts(mydf,order.by=pubDate)

    # drop Extra attributes that we don't use yet
    mydf$guid.text = mydf$guid..attrs = mydf$description = mydf$link = NULL
    return(mydf) 

}

回答1:

Here is a shorter (and probably more efficient) version of your getNews function

  getNews2 <- function(symbol, number){

    # load libraries
    require(XML); require(plyr); require(stringr); require(lubridate);  

    # construct url to news feed rss and encode it correctly
    url.b1 = 'http://www.google.com/finance/company_news?q='
    url    = paste(url.b1, symbol, '&output=rss', "&start=", 1,
               "&num=", number, sep = '')
    url    = URLencode(url)

    # parse xml tree, get item nodes, extract data and return data frame
    doc   = xmlTreeParse(url, useInternalNodes = T);
    nodes = getNodeSet(doc, "//item");
    mydf  = ldply(nodes, as.data.frame(xmlToList))

    # clean up names of data frame
    names(mydf) = str_replace_all(names(mydf), "value\\.", "")

    # convert pubDate to date-time object and convert time zone
    mydf$pubDate = strptime(mydf$pubDate, 
                     format = '%a, %d %b %Y %H:%M:%S', tz = 'GMT')
    mydf$pubDate = with_tz(mydf$pubDate, tz = 'America/New_york')

    # drop guid.text and guid..attrs
    mydf$guid.text = mydf$guid..attrs = NULL

    return(mydf)    
}

Moreover, there might be a bug in your code, as I tried using it for symbol = 'WMT' and it returned an error. I think getNews2 works fine for WMT too. Check it out and let me know if it works for you.

PS. The description column still contains html code. But it should be easy to extract the text from it. I will post an update when I find time