R loop with html_nodes (rvest) isn't catching al

Posted 2019-06-05 20:28

Question:

I would like to write a loop that uses html_nodes to capture the nodes themselves (the node markup, not their text). I start from a vector of values:

library(rvest)
country <- c("Canada", "US", "Japan", "China")

For each of those values ("Canada", "US", ...), my loop builds a URL by pasting the value onto "https://en.wikipedia.org/wiki/". It then calls read_html on that URL and runs a sequence of steps that ends with html_nodes('a.page-link') (yes, a node, not text), converts the result with as.character, and saves it in a data.frame (a list would also be fine).

dff <- NULL
for (i in country) {
  url <- paste0("https://en.wikipedia.org/wiki/", i)
  page <- read_html(url)
  b <- page %>%
    html_nodes('h2.flow-title') %>%
    html_nodes('a.page-link') %>%
    as.character()
  dff <- data.frame(b)
}

The problem is that this code only keeps the data from the last country. It processes the first country and saves its html_nodes result, but when it moves on to the next country that data is erased and replaced by the new result, and so on, so the final result contains only the data from the last country. I would be grateful for your help!

Answer 1:

As the comment mentioned, the line dff <- data.frame(b) overwrites dff on each loop iteration. One solution is to create an empty list and append each iteration's data to it.
In this example, the list items are named after the country queried.

library(rvest)
country <- c("Canada", "US", "Japan", "China")

#initialize the empty list
dff<- list()

for ( i in country ) {
  url<-paste0("https://en.wikipedia.org/wiki/",i)
  page<- read_html(url) 
  b <- page%>%
    html_nodes ('h2.flow-title') %>%
    html_nodes ('a.page-link') %>%
    as.character()
  # append new data onto the list
  dff[[i]]<- data.frame(b)
}

To access the data, one can use dff$Canada, or lapply to process the entire list.
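As a rough sketch of what processing the entire list could look like, the snippet below counts the rows captured per country and then stacks everything into a single data frame. The contents of dff here are hypothetical placeholder strings standing in for the loop's real output, and the names n_rows, tagged, and combined are made up for illustration.

```r
# Hypothetical stand-in for the list built by the loop:
# one data frame per country, each with a character column `b`.
dff <- list(
  Canada = data.frame(b = c("<a>x</a>", "<a>y</a>"), stringsAsFactors = FALSE),
  US     = data.frame(b = character(0), stringsAsFactors = FALSE),
  Japan  = data.frame(b = "<a>z</a>", stringsAsFactors = FALSE)
)

# Count the rows captured for each country (a named vector).
n_rows <- sapply(dff, nrow)

# Add a `country` column to each data frame so the origin of every
# row is still known after the pieces are stacked together.
tagged <- lapply(names(dff), function(nm) {
  d <- dff[[nm]]
  d$country <- rep(nm, nrow(d))
  d
})

# Stack all per-country data frames into one.
combined <- do.call(rbind, tagged)
```

Keeping a country column is a design choice: once the pieces are bound together, the list names are no longer available to tell the rows apart.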

Note: I ran your example and it returned no results, so double-check the CSS selectors (h2.flow-title and a.page-link).
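One way to check the selectors is to look at how many nodes each step matches, since html_nodes returns an empty node set rather than an error when nothing matches. The sketch below uses a small inline HTML string as a hypothetical stand-in for a downloaded page, so it runs without network access. The markup and the variable names are assumptions for illustration, not the structure of the real Wikipedia pages.

```r
library(rvest)

# Hypothetical inline HTML standing in for a downloaded page.
page <- read_html(
  '<html><body>
     <h2 class="flow-title"><a class="page-link" href="/a">A</a></h2>
     <h2><a href="/b">B</a></h2>
   </body></html>'
)

# Selectors that exist in this markup: one node is matched.
matched <- page %>%
  html_nodes("h2.flow-title") %>%
  html_nodes("a.page-link")
n_matched <- length(matched)

# A selector that matches nothing: the result is an empty node set,
# which as.character() turns into a zero-length character vector.
missing <- page %>% html_nodes("h2.no-such-class")
n_missing <- length(missing)
```

If the count is zero at the first html_nodes step on the real page, the class name in that selector is the first thing to inspect.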