I am trying to scrape some IMDB data by looping through a list of URLs. Unfortunately my output isn't exactly what I hoped for, never mind storing it in a data frame.
I get the URLs with
library(rvest)

# parse the Top 250 chart page once
topmovies <- read_html("http://www.imdb.com/chart/top")

# relative links to the individual title pages
links <- topmovies %>%
  html_nodes(".titleColumn") %>%
  html_nodes("a") %>%
  html_attr("href")

links_full <- paste("http://imdb.com", links, sep = "")
links_full_test <- links_full[1:10]
and then I could get content with
lapply(links_full_test, . %>% read_html() %>% html_nodes("h1") %>% html_text())
but it is a nested list and I don't know how to get it into a proper data.frame in R. Similarly, if I wanted to get another attribute, say
%>% read_html() %>% html_nodes("strong span") %>% html_text()
to retrieve the IMDB rating, I get the same nested-list output and, more importantly, I have to call read_html() twice ... which takes a lot of time. Is there a better way to do this? I guess for-loops, but I can't get it to work that way :(
Edit: now with rating as well
Here's one approach using purrr and rvest. The key idea is to save the parsed page, and then extract the bits you're interested in.
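A minimal sketch of that idea, reusing the question's selectors; the helper name scrape_movie and the use of map_df are my own assumptions, not necessarily this answer's exact code:

library(rvest)
library(purrr)

# parse each page once, then pull every field from the saved parse
scrape_movie <- function(url) {
  page <- read_html(url)   # the expensive step runs only once per URL
  data.frame(
    title  = page %>% html_node("h1") %>% html_text(trim = TRUE),
    rating = page %>% html_node("strong span") %>% html_text(),
    stringsAsFactors = FALSE
  )
}

# map_df row-binds the one-row data.frames (it needs dplyr installed)
movies <- map_df(links_full_test, scrape_movie)

Because each URL is parsed exactly once, you can add as many fields as you like to scrape_movie without paying for extra read_html() calls.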
Another approach would be to use sapply, as in the sketch below.
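A minimal sketch, again assuming the question's selectors: sapply collects one named character vector per URL into a matrix, which transposes into a data.frame.

# one read_html() per URL; sapply returns a 2 x n character matrix
out <- sapply(links_full_test, function(url) {
  page <- read_html(url)
  c(title  = page %>% html_node("h1") %>% html_text(trim = TRUE),
    rating = page %>% html_node("strong span") %>% html_text())
})

# transpose so each movie is a row; the URLs end up as row names
movies <- as.data.frame(t(out), stringsAsFactors = FALSE)

If you don't want the URLs as row names, rownames(movies) <- NULL drops them.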