Rvest scraping errors

Published 2019-04-16 06:39

Question:

Here's the code I'm running

library(rvest)

rootUri <- "https://github.com/rails/rails/pull/"
PR <- as.list(c(100, 200, 300))
list <- paste0(rootUri, PR)
messages <- lapply(list, function(l) {
  html(l)
})

Up until this point it seems to work fine, but when I try to extract the text:

html_text(messages)

I get:

Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) : 
  Unknown input of class: list

Trying to extract a specific element:

html_text(messages[1])

Can't do that either...

Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) : 
  Unknown input of class: list

So I try a different way:

html_text(messages[[1]])

This seems to at least get at the data, but is still not successful:

Error in UseMethod("xmlValue") : 
  no applicable method for 'xmlValue' applied to an object of class "c('HTMLInternalDocument',     'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument')"

How can I extract the text material from each of the elements of my list?

Answer 1:

There are two problems with your code. See the package documentation for examples of how to use it.

1. Each function has a specific job; you cannot apply any function to any object.

  • html() downloads the content of a page
  • html_node() selects node(s) from the downloaded content of a page
  • html_text() extracts the text from a previously selected node

Therefore, to download one of your pages and extract the text of the html-node, use this:

library(rvest)

old-school style:

url          <- "https://github.com/rails/rails/pull/100"
url_content  <- html(url)
url_mainnode <- html_node(url_content, "*")
url_mainnode_text <- html_text(url_mainnode)
url_mainnode_text

... or this ...

hard to read old-school style:

url_mainnode_text  <- html_text(html_node(html("https://github.com/rails/rails/pull/100"), "*"))
url_mainnode_text

... or this ...

magrittr piping style:

url_mainnode_text  <- 
  html("https://github.com/rails/rails/pull/100") %>%
  html_node("*") %>%
  html_text()
url_mainnode_text

2. When working with a list, you have to apply the function to each element, e.g. with lapply()

If you want to kind of batch-process several URLs you can try something like this:

url_list <- c("https://github.com/rails/rails/pull/100", 
              "https://github.com/rails/rails/pull/200", 
              "https://github.com/rails/rails/pull/300")

get_html_text <- function(url, css_or_xpath = "*"){
  html_text(
    html_node(
      html(url), css_or_xpath
    )
  )
}

lapply(url_list, get_html_text, css_or_xpath = "a[class=message]")
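As a side note: in newer versions of rvest (0.3.0 and later), html() is deprecated in favour of read_html(). A piped sketch of the same batch extraction, assuming a recent rvest, would look like this:

```r
library(rvest)

url_list <- c("https://github.com/rails/rails/pull/100",
              "https://github.com/rails/rails/pull/200",
              "https://github.com/rails/rails/pull/300")

get_html_text <- function(url, css_or_xpath = "*") {
  url %>%
    read_html() %>%           # read_html() replaces the deprecated html()
    html_node(css_or_xpath) %>%
    html_text()
}

# one character vector element per URL
lapply(url_list, get_html_text, css_or_xpath = "a[class=message]")
```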


Answer 2:

You need to use html_nodes() and identify which CSS selectors relate to the data you're interested in. For example, if we want to extract the usernames of the people discussing pull request 200:

rootUri <- "https://github.com/rails/rails/pull/200"
page <- html(rootUri)
page %>% html_nodes('#discussion_bucket strong a') %>% html_text()

[1] "jaw6"      "jaw6"      "josevalim"