Rvest scraping errors

Published 2019-04-16 06:39

Question:

Here's the code I'm running

library(rvest)

rootUri <- "https://github.com/rails/rails/pull/"
PR <- as.list(c(100, 200, 300))
list <- paste0(rootUri, PR)
messages <- lapply(list, function(l) {
  html(l)
})

Up until this point it seems to work fine, but when I try to extract the text:

html_text(messages)

I get:

Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) : 
  Unknown input of class: list

Trying to extract a specific element:

html_text(messages[1])

Can't do that either...

Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) : 
  Unknown input of class: list

So I try a different way:

html_text(messages[[1]])

This seems to at least get at the data, but is still not successful:

Error in UseMethod("xmlValue") : 
  no applicable method for 'xmlValue' applied to an object of class "c('HTMLInternalDocument',     'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument')"

How can I extract the text material from each of the elements of my list?

Answer 1:

There are two problems with your code. See the package documentation for examples of how to use it.

1. Each function has a specific job; you cannot apply any function to any object.

  • html() downloads the content of a page
  • html_node() selects node(s) from the downloaded content of a page
  • html_text() extracts the text from a previously selected node

Therefore, to download one of your pages and extract the text of the html-node, use this:

library(rvest)

old-school style:

url          <- "https://github.com/rails/rails/pull/100"
url_content  <- html(url)
url_mainnode <- html_node(url_content, "*")
url_mainnode_text <- html_text(url_mainnode)
url_mainnode_text

... or this ...

hard to read old-school style:

url_mainnode_text  <- html_text(html_node(html("https://github.com/rails/rails/pull/100"), "*"))
url_mainnode_text

... or this ...

magrittr piping style:

url_mainnode_text  <- 
  html("https://github.com/rails/rails/pull/100") %>%
  html_node("*") %>%
  html_text()
url_mainnode_text

2. When working with a list, you have to apply the function to each element, e.g. with lapply()

If you want to kind of batch-process several URLs you can try something like this:

url_list <- c("https://github.com/rails/rails/pull/100", 
              "https://github.com/rails/rails/pull/200", 
              "https://github.com/rails/rails/pull/300")

get_html_text <- function(url, css_or_xpath = "*"){
  html_text(
    html_node(
      html(url), css_or_xpath
    )
  )
}

lapply(url_list, get_html_text, css_or_xpath = "a[class=message]")
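As a side note: in newer versions of rvest (0.3.0 and later), html() is deprecated in favour of read_html(). A piped sketch of the same batch extraction, assuming a recent rvest, would look like this:

```r
library(rvest)

url_list <- c("https://github.com/rails/rails/pull/100",
              "https://github.com/rails/rails/pull/200",
              "https://github.com/rails/rails/pull/300")

get_html_text <- function(url, css_or_xpath = "*") {
  url %>%
    read_html() %>%           # read_html() replaces the deprecated html()
    html_node(css_or_xpath) %>%
    html_text()
}

# one character vector element per URL
lapply(url_list, get_html_text, css_or_xpath = "a[class=message]")
```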


Answer 2:

You need to use html_nodes() and identify which CSS selectors relate to the data you're interested in. For example, if we want to extract the usernames of the people discussing pull request 200:

rootUri <- "https://github.com/rails/rails/pull/200"
page <- html(rootUri)
page %>% html_nodes('#discussion_bucket strong a') %>% html_text()

[1] "jaw6"      "jaw6"      "josevalim"