I have a dataframe that looks like this:
country <- c("US", "Canada", "Japan", "China")
url <- c("http://en.wikipedia.org/wiki/United_States", "http://en.wikipedia.org/wiki/Canada",
"http://en.wikipedia.org/wiki/Japan", "http://en.wikipedia.org/wiki/China")
df <- data.frame(country, url)
country url
1 US http://en.wikipedia.org/wiki/United_States
2 Canada http://en.wikipedia.org/wiki/Canada
3 Japan http://en.wikipedia.org/wiki/Japan
4 China http://en.wikipedia.org/wiki/China
Using rvest, I'd like to scrape the table of contents for each URL and bind the results into a single output.
This code extracts the table of contents for a single URL:
library(rvest)
# read_html() replaces the deprecated html(); url[1] selects one page from the vector
toc <- read_html(url[1]) %>%
  html_nodes(".toctext") %>%
  html_text()
Desired Output:
country   toc
US        Etymology
          History
          Native American and European contact
          Settlements
          ...
Canada    Etymology
          History
          Aboriginal peoples
          European colonization
          ...etc
This will scrape them into a full data frame (one row per TOC entry). Tedious-but-straightforward "print/output" code left to the OP:
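A minimal sketch of that approach, under a few assumptions: `get_toc()` and `scrape_tocs()` are hypothetical helper names introduced here, the `.toctext` selector is taken from the question and may not match current Wikipedia markup, and `read_html()` is used instead of the older `html()`.

```r
library(rvest)

# Hypothetical helper: TOC entries of one page as a character vector.
# The ".toctext" selector comes from the question; adjust it if the markup differs.
get_toc <- function(page) {
  read_html(page) %>%
    html_nodes(".toctext") %>%
    html_text()
}

# Hypothetical helper: one small data frame per country, stacked into
# a single data frame with one row per TOC entry.
# `fetch` is a parameter so the scraping step can be swapped out or stubbed.
scrape_tocs <- function(country, url, fetch = get_toc) {
  country <- as.character(country)  # guard against factor columns
  url <- as.character(url)
  pieces <- lapply(seq_along(url), function(i) {
    entries <- fetch(url[i])
    data.frame(country = rep(country[i], length(entries)),
               toc = entries,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pieces)
}

# Usage with the data frame from the question (needs network access):
# toc_df <- scrape_tocs(df$country, df$url)
```

The country name is repeated on every row here rather than shown once per group as in the desired output; that keeps the result a regular data frame and leaves the grouped display to the printing step.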