Here's the code I'm running:
library(rvest)
rootUri <- "https://github.com/rails/rails/pull/"
PR <- as.list(c(100, 200, 300))
list <- paste0(rootUri, PR)
messages <- lapply(list, function(l) {
  html(l)
})
Up until this point it seems to work fine, but when I try to extract the text:
html_text(messages)
I get:
Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) :
Unknown input of class: list
Trying to extract a specific element:
html_text(messages[1])
Can't do that either...
Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) :
Unknown input of class: list
So I try a different way:
html_text(messages[[1]])
This seems to at least get at the data, but is still not successful:
Error in UseMethod("xmlValue") :
no applicable method for 'xmlValue' applied to an object of class "c('HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument')"
How can I extract the text material from each of the elements of my list?
There are two problems with your code. Look here for examples on how to use the package.
1. You cannot pass just any object to any function; each function expects a particular kind of input.
html() is for downloading (and parsing) the content of a page
html_node() is for selecting node(s) from the downloaded content of a page
html_text() is for extracting text from a previously selected node
Therefore, to download one of your pages and extract the text of the html-node, use this:
library(rvest)
# old-school style:
url <- "https://github.com/rails/rails/pull/100"
url_content <- html(url)
url_mainnode <- html_node(url_content, "*")
url_mainnode_text <- html_text(url_mainnode)
url_mainnode_text
... or this ...
# hard-to-read old-school style:
url_mainnode_text <- html_text(html_node(html("https://github.com/rails/rails/pull/100"), "*"))
url_mainnode_text
... or this ...
# magrittr piping style:
url_mainnode_text <-
  html("https://github.com/rails/rails/pull/100") %>%
  html_node("*") %>%
  html_text()
url_mainnode_text
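Note that newer rvest releases deprecate html() in favour of read_html(). Here is a minimal offline sketch of the same download, select, extract chain. It assumes a made-up inline page and a hypothetical selector p.message rather than GitHub's real markup:

```r
library(rvest)

# Hypothetical inline page standing in for a downloaded pull-request page
page <- read_html(
  "<html><body><h1>Title</h1><p class='message'>First comment</p></body></html>"
)

# html_node() selects the first matching node; html_text() extracts its text
msg <- page %>%
  html_node("p.message") %>%
  html_text()
msg
```

Calling html_text() directly on page would skip the selection step, which is the step the question's code leaves out.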
2. When working with a list, you have to apply the function to each element of the list, e.g. with lapply().
If you want to batch-process several URLs, you can try something like this:
url_list <- c("https://github.com/rails/rails/pull/100",
              "https://github.com/rails/rails/pull/200",
              "https://github.com/rails/rails/pull/300")
# Download one page and return the text of the first node matching the selector
get_html_text <- function(url, css_or_xpath = "*") {
  html_text(
    html_node(
      html(url), css_or_xpath  # use the url argument, not a hard-coded address
    )
  )
}
lapply(url_list, get_html_text, css_or_xpath="a[class=message]")
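The same idea works on a list of pages that has already been parsed, like messages in the question. A small sketch, assuming two made-up inline pages and read_html() (the name newer rvest versions use for html()):

```r
library(rvest)

# Hypothetical inline pages standing in for the downloaded pull requests
pages <- list(
  read_html("<html><body><p>PR one</p></body></html>"),
  read_html("<html><body><p>PR two</p></body></html>")
)

# html_text() does not accept a plain list, so select and extract
# element by element with lapply()
texts <- lapply(pages, function(doc) {
  html_text(html_node(doc, "p"))
})
texts
```

The result is a list with one character string per page; wrap it in unlist() if a plain character vector is more convenient.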
You need to use html_nodes() and identify which CSS selectors relate to the data you're interested in. For example, if we want to extract the usernames of the people discussing pull 200:
rootUri <- "https://github.com/rails/rails/pull/200"
page <- html(rootUri)
page %>% html_nodes('#discussion_bucket strong a') %>% html_text()
[1] "jaw6" "jaw6" "josevalim"
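To see the difference between html_node() and html_nodes() without hitting the network, here is a sketch on a made-up fragment. The usernames alice and bob are invented, and read_html() is assumed as the newer name for html():

```r
library(rvest)

# Hypothetical fragment that mimics the structure targeted by the selector above
page <- read_html(paste0(
  "<html><body><div id='discussion_bucket'>",
  "<strong><a>alice</a></strong>",
  "<strong><a>bob</a></strong>",
  "</div></body></html>"
))

selector <- "#discussion_bucket strong a"

# html_node() returns only the first match
first_user <- page %>% html_node(selector) %>% html_text()

# html_nodes() returns every match, so html_text() yields a character vector
all_users <- page %>% html_nodes(selector) %>% html_text()
```

Use html_nodes() whenever a selector is expected to match several elements on the page.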