I have a dataframe that looks like this:
country <- c("US", "Canada", "Japan", "China")
url <- c("http://en.wikipedia.org/wiki/United_States", "http://en.wikipedia.org/wiki/Canada",
"http://en.wikipedia.org/wiki/Japan", "http://en.wikipedia.org/wiki/China")
df <- data.frame(country, url)
country url
1 US http://en.wikipedia.org/wiki/United_States
2 Canada http://en.wikipedia.org/wiki/Canada
3 Japan http://en.wikipedia.org/wiki/Japan
4 China http://en.wikipedia.org/wiki/China
Using rvest, I'd like to scrape the table of contents for each URL and bind the results into a single output.
This code extracts the table of contents for one url:
library(rvest)
# read_html() replaces the deprecated html(); it reads a single page, so pass one URL
toc <- read_html(url[1]) %>%
html_nodes(".toctext") %>%
html_text()
Desired Output:
country toc
US Etymology
History
Native American and European contact
Settlements
...
Canada Etymology
History
Aboriginal peoples
European colonization
...etc
This will scrape them into one long data frame (one row per TOC entry). The tedious-but-straightforward "print/output" code is left to the OP:
library(rvest)
library(dplyr)
# Country names listed in the same order as their URLs
country <- c("US", "Canada", "Japan", "China")
url <- c("http://en.wikipedia.org/wiki/United_States",
"http://en.wikipedia.org/wiki/Canada",
"http://en.wikipedia.org/wiki/Japan",
"http://en.wikipedia.org/wiki/China")
df <- data.frame(country, url, stringsAsFactors=FALSE)
# Fetch each page in turn (x is the current URL) and keep one row per TOC entry;
# read_html() replaces the deprecated html()
bind_rows(lapply(url, function(x) {
data.frame(url=x, toc_entry=read_html(x) %>%
html_nodes(".toctext") %>%
html_text(), stringsAsFactors=FALSE)
})) -> toc_entries
# Attach the country name to every entry by joining on the shared url column
df <- toc_entries %>% left_join(df, by="url")
# Show 10 randomly sampled rows (columns: url, toc_entry, country);
# the rows differ between runs and as the pages change
df[sample(nrow(df), 10),]
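For the "print/output" step, here is a minimal sketch of one way to get the layout from the question, where the country shows only on the first row of its block. It works on a small hand-typed stand-in (`toc_demo`, a hypothetical name) rather than the scraped data, with rows copied from the question's desired output, and it assumes each country's rows sit next to each other:

```r
# Hand-typed stand-in for the scraped result (rows copied from the question)
toc_demo <- data.frame(
  country = c("US", "US", "US", "US",
              "Canada", "Canada", "Canada", "Canada"),
  toc = c("Etymology", "History",
          "Native American and European contact", "Settlements",
          "Etymology", "History",
          "Aboriginal peoples", "European colonization"),
  stringsAsFactors = FALSE
)

# Blank out every repeat of a country so only its first row carries the label.
# duplicated() flags all occurrences after the first, which gives the wanted
# layout as long as each country's rows are contiguous.
display <- toc_demo
display$country[duplicated(display$country)] <- ""

# Print without row numbers and with left-aligned text
print(display, row.names = FALSE, right = FALSE)
```

With the real data, the same two lines should apply to the joined data frame after selecting the `country` and `toc_entry` columns, provided the rows are still grouped by country.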