I'm new to web scraping, and I'm currently trying to download subtitle files for over 100,000 films for a research project. Each film has a unique IMDb ID (e.g., the ID for Inception is 1375666). I have a list in R containing the 102,524 IDs, and I want to download the corresponding subtitles from opensubtitles.org.
Each film has its own page on the site, for example, Inception has:
https://www.opensubtitles.org/en/search/sublanguageid-eng/imdbid-1375666
The subtitle download link is reached by clicking the first link in the "Movie name" column of the results table, which takes you to a new page, and then clicking the "Download" button on that page.
I'm using rvest to scrape the pages, and I've written this code:
library(rvest)

for (i in 1:102524) {
  subtitle.url <- paste0("https://www.opensubtitles.org/en/search/sublanguageid-eng/imdbid-", movie.ids[i])
  read_html(subtitle.url) %>%
    html_nodes(".head+ .expandable .bnone")
  # Not sure where to go from here
}
Any help on how to do this will be greatly appreciated.
EDIT: I know I'm asking something pretty complicated, but any pointers on where to start would be great.
Following the link and the download button, we can see that the actual subtitle file is downloaded from https://www.opensubtitles.org/en/download/vrf-108d030f/sub/6961922 (for your example). I found this out by inspecting the Network tab in Firefox's Developer Tools while performing a download.
We can download directly from that address using:
download.file('https://www.opensubtitles.org/en/download/vrf-108d030f/sub/6961922',
              destfile = 'subtitle-6961922.zip')
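Note that what comes down is a ZIP archive rather than a bare .srt, so it still needs unpacking. A minimal sketch using base R's unzip() (the exact contents of the archive may vary):

unzip('subtitle-6961922.zip', exdir = 'subtitle-6961922')  # extracts the archive's files into their own folder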
The base URL (https://www.opensubtitles.org/en/download/vrf-108d030f/sub/) appears to be fixed for all downloads, as far as I can see, so we only need the subtitle's id on the site. That id can be extracted from the search page with:
id <- read_html(subtitle.url) %>%
  html_node('.bnone') %>%          # the first movie-name link in the results table
  html_attr('href') %>%            # its href contains the subtitle id
  stringr::str_extract('\\d+')     # extract the first run of digits, i.e. the id
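As a quick sanity check, running this against the Inception page from the question should recover the id seen in the download URL above (assuming the page structure hasn't changed):

id <- read_html('https://www.opensubtitles.org/en/search/sublanguageid-eng/imdbid-1375666') %>%
  html_node('.bnone') %>%
  html_attr('href') %>%
  stringr::str_extract('\\d+')
id
# "6961922"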
So, putting it all together:
library(rvest)

search_url   <- 'https://www.opensubtitles.org/en/search/sublanguageid-eng/imdbid-'
download_url <- 'https://www.opensubtitles.org/en/download/vrf-108d030f/sub/'

for (i in 1:102524) {
  subtitle.url <- paste0(search_url, movie.ids[i])
  id <- read_html(subtitle.url) %>%
    html_node('.bnone') %>%
    html_attr('href') %>%
    stringr::str_extract('\\d+')
  download.file(paste0(download_url, id),
                destfile = paste0('subtitle-', movie.ids[i], '.zip'))
  # Wait somewhere between 1 and 4 seconds before the next download,
  # as a courtesy to the site
  Sys.sleep(runif(1, 1, 4))
}
Keep in mind this will take a long time: with 102,524 films and an average pause of about 2.5 seconds, the sleeps alone add up to roughly three days, before counting the requests themselves!
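One more suggestion: with a run this size, some requests will inevitably fail (timeouts, films with no English subtitles, and so on), and you don't want a single error at iteration 50,000 to kill the job. A sketch of a restartable version of the loop, which skips films already on disk and logs failures instead of stopping (same naming scheme as above):

library(rvest)

for (i in seq_along(movie.ids)) {
  dest <- paste0('subtitle-', movie.ids[i], '.zip')
  if (file.exists(dest)) next   # already downloaded; makes the loop safe to restart

  tryCatch({
    id <- read_html(paste0(search_url, movie.ids[i])) %>%
      html_node('.bnone') %>%
      html_attr('href') %>%
      stringr::str_extract('\\d+')
    download.file(paste0(download_url, id), destfile = dest)
  }, error = function(e) {
    # log the failure and carry on instead of aborting the whole run
    message('Failed for IMDb id ', movie.ids[i], ': ', conditionMessage(e))
  })

  Sys.sleep(runif(1, 1, 4))   # still pausing as a courtesy to the site
}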