Harvest (rvest) multiple HTML pages from a list of

Posted 2020-01-31 12:22

I have a dataframe that looks like this:

country <- c("US", "Canada", "Japan", "China")
url <- c("http://en.wikipedia.org/wiki/United_States", "http://en.wikipedia.org/wiki/Canada",
          "http://en.wikipedia.org/wiki/Japan", "http://en.wikipedia.org/wiki/China")
df <- data.frame(country, url)

    country url
1   US      http://en.wikipedia.org/wiki/United_States
2   Canada  http://en.wikipedia.org/wiki/Canada
3   Japan   http://en.wikipedia.org/wiki/Japan
4   China   http://en.wikipedia.org/wiki/China

Using rvest, I'd like to scrape the table of contents for each URL and bind the results into a single output.

This code extracts the table of contents for one url:

library(rvest)
# read_html() replaces the deprecated html(); url[1] selects a single page
toc <- read_html(url[1]) %>%
  html_nodes(".toctext") %>%
  html_text()

Desired Output:

country toc
US      Etymology
        History
        Native American and European contact
        Settlements
        ...  
Canada  Etymology
        History
        Aboriginal peoples
        European colonization
        ...etc

Tags: r rvest
1 answer
Root(大扎)
#2 · 2020-01-31 12:37

This will scrape them into a full data frame (one row per TOC entry). Tedious-but-straightforward "print/output" code left to the OP:

library(rvest)
library(dplyr)

country <- c("US", "Canada", "Japan", "China")
url <- c("http://en.wikipedia.org/wiki/United_States", 
         "http://en.wikipedia.org/wiki/Canada",
         "http://en.wikipedia.org/wiki/Japan", 
         "http://en.wikipedia.org/wiki/China")
df <- data.frame(country, url, stringsAsFactors = FALSE)

# Scrape each page in turn and stack the results, one row per TOC entry
bind_rows(lapply(url, function(x) {

  # x is the current URL; read_html() replaces the deprecated html()
  data.frame(url = x,
             toc_entry = read_html(x) %>%
               html_nodes(".toctext") %>%
               html_text(),
             stringsAsFactors = FALSE)

})) -> toc_entries

# Attach the country name to each TOC entry via the shared url column
df <- toc_entries %>% left_join(df, by = "url")

df[sample(nrow(df), 10),]

## Source: local data frame [10 x 3]
## 
##                                           url                            toc_entry country
## 1          http://en.wikipedia.org/wiki/Japan                   Government finance   Japan
## 2         http://en.wikipedia.org/wiki/Canada        Cold War and civil rights era  Canada
## 3  http://en.wikipedia.org/wiki/United_States                                 Food      US
## 4          http://en.wikipedia.org/wiki/Japan                               Sports   Japan
## 5         http://en.wikipedia.org/wiki/Canada                             Religion  Canada
## 6          http://en.wikipedia.org/wiki/China        Cold War and civil rights era   China
## 7          http://en.wikipedia.org/wiki/Japan Literature, philosophy, and the arts   Japan
## 8  http://en.wikipedia.org/wiki/United_States                           Population      US
## 9          http://en.wikipedia.org/wiki/Japan                          Settlements   Japan
## 10        http://en.wikipedia.org/wiki/Canada                             Military  Canada
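
For the "print/output" step left to the OP, here is one possible sketch in base R. It assumes the rows are already grouped by country (as they would be when pages are scraped in order), and it uses a small hand-built data frame, toc_df, as a stand-in for the scraped result. The names toc_df and format_toc are hypothetical and not part of rvest or dplyr.

```r
# Hypothetical stand-in for the scraped result: one row per TOC entry.
# The entries are taken from the desired output shown in the question.
toc_df <- data.frame(
  country   = c("US", "US", "Canada", "Canada"),
  toc_entry = c("Etymology", "History", "Etymology", "History"),
  stringsAsFactors = FALSE
)

# Keep the country label only on the first row of each country and blank
# it on the rest, which mimics the grouped layout asked for in the question.
# Assumes rows belonging to the same country are adjacent.
format_toc <- function(d) {
  first_in_group <- !duplicated(d$country)
  data.frame(
    country = ifelse(first_in_group, d$country, ""),
    toc     = d$toc_entry,
    stringsAsFactors = FALSE
  )
}

out <- format_toc(toc_df)

# Left-align the text and drop the row numbers when printing.
print(out, right = FALSE, row.names = FALSE)
```

Passing the joined data frame from above (with its country and toc_entry columns) to format_toc() should give the same layout for the real data.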