Web scrape subtitles from opensubtitles.org in R

Posted 2020-07-25 01:43

Question:

I'm new to web scraping, and I'm currently trying to download subtitle files for over 100,000 films for a research project. Each film has a unique IMDb ID (e.g., the ID for Inception is 1375666). I have a list in R containing all 102,524 IDs, and I want to download the corresponding subtitles from opensubtitles.org.

Each film has its own page on the site, for example, Inception has:

https://www.opensubtitles.org/en/search/sublanguageid-eng/imdbid-1375666

The link to download the subtitles is reached by clicking the first link in the "Movie name" column of the results table, which takes you to a new page, and then clicking the "Download" button on that page.
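
In rvest terms, the manual path corresponds to something like this untested sketch (the .bnone selector comes from inspecting the search results page, and the href shape is my assumption):

library(rvest)

search.page <- read_html("https://www.opensubtitles.org/en/search/sublanguageid-eng/imdbid-1375666")

# Relative href of the first "Movie name" link,
# something like "/en/subtitles/<id>/<slug>" (shape assumed)
movie.href <- search.page %>%
  html_node(".bnone") %>%
  html_attr("href")

# The page holding the "Download" button
download.page <- read_html(paste0("https://www.opensubtitles.org", movie.href))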

I'm using rvest to scrape the pages, and I've written this code:

library(rvest)

for(i in 1:102524) {
  subtitle.url = paste0("https://www.opensubtitles.org/en/search/sublanguageid-eng/imdbid-", movie.ids[i])

  read_html(subtitle.url) %>%
    html_nodes(".head+ .expandable .bnone")
  # Not sure where to go from here
}

Any help on how to do this will be greatly appreciated.

EDIT: I know I'm asking something pretty complicated, but any pointers on where to start would be great.

Answer 1:

Following the link and the Download button, we can see that the actual subtitle file is downloaded from https://www.opensubtitles.org/en/download/vrf-108d030f/sub/6961922 (for your example). I found this out by inspecting the Network tab in Mozilla's Developer Tools while performing a download.

We can download directly from that address using:

download.file('https://www.opensubtitles.org/en/download/vrf-108d030f/sub/6961922',
              destfile = 'subtitle-6961922.zip')
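
One caveat: because this URL doesn't end in .zip, download.file() won't switch to binary mode automatically on Windows, and the archive can arrive corrupted. It's safer to pass mode = 'wb' explicitly and unpack with unzip(); a small sketch:

zip.file <- 'subtitle-6961922.zip'

download.file('https://www.opensubtitles.org/en/download/vrf-108d030f/sub/6961922',
              destfile = zip.file,
              mode = 'wb')   # force binary transfer

# Extract the subtitle file(s) into their own folder
unzip(zip.file, exdir = 'subtitle-6961922')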

The base url (https://www.opensubtitles.org/en/download/vrf-108d030f/sub/) appears to be fixed for all downloads, as far as I can see, so we only need the subtitle id used by the site.
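
If that vrf- prefix ever turns out to be session-specific rather than fixed, a fallback is to scrape the download link from the subtitle page itself instead of hard-coding it. A sketch, assuming the "Download" button is an <a> element whose href contains /en/download/ (verify that selector against the live page):

library(rvest)

search.page <- read_html('https://www.opensubtitles.org/en/search/sublanguageid-eng/imdbid-1375666')

# Relative link to the subtitle page for the first result
movie.href <- search.page %>%
  html_node('.bnone') %>%
  html_attr('href')

# Pull the Download button's target off the subtitle page
dl.href <- read_html(paste0('https://www.opensubtitles.org', movie.href)) %>%
  html_node('a[href*="/en/download/"]') %>%   # selector is an assumption
  html_attr('href')

download.file(paste0('https://www.opensubtitles.org', dl.href),
              destfile = 'subtitle-6961922.zip', mode = 'wb')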

The id can be extracted from the search page like this:

id <- read_html(subtitle.url) %>%
    html_node('.bnone') %>% 
    html_attr('href') %>% 
    stringr::str_extract('\\d+')
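
For the Inception example, the href on that .bnone anchor carries the subtitle id as its first run of digits, which is exactly what str_extract('\\d+') pulls out (the slug in this illustration is assumed):

# Hypothetical href of the shape the search page returns
stringr::str_extract('/en/subtitles/6961922/inception-en', '\\d+')
#> [1] "6961922"

Note that if a film has no English subtitles at all, html_node() matches nothing and the pipeline returns NA, so it's worth checking id before building the download URL (see the hardened loop at the end).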

So, putting it all together:

search_url <- 'https://www.opensubtitles.org/en/search/sublanguageid-eng/imdbid-'
download_url <- 'https://www.opensubtitles.org/en/download/vrf-108d030f/sub/'

for(i in 1:102524) {
    subtitle.url = paste0(search_url, movie.ids[i])

    id <- read_html(subtitle.url) %>%
        html_node('.bnone') %>% 
        html_attr('href') %>% 
        stringr::str_extract('\\d+')

    download.file(paste0(download_url, id),
                  destfile = paste0('subtitle-', movie.ids[i], '.zip'),
                  mode = 'wb')  # binary mode, as noted above

    # Wait somewhere between 1 and 4 seconds before the next download
    # as a courtesy to the site
    Sys.sleep(runif(1, 1, 4))
}

Keep in mind this will take a long time!
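
Over 100,000 iterations some requests will inevitably fail (network hiccups, films with no English subtitles), and as written a single error stops the whole loop. A minimal hardening sketch, assuming you want to skip and log failures and be able to resume:

library(rvest)

search_url   <- 'https://www.opensubtitles.org/en/search/sublanguageid-eng/imdbid-'
download_url <- 'https://www.opensubtitles.org/en/download/vrf-108d030f/sub/'

for (i in seq_along(movie.ids)) {
    dest <- paste0('subtitle-', movie.ids[i], '.zip')
    if (file.exists(dest)) next  # already downloaded; makes the loop resumable

    tryCatch({
        id <- read_html(paste0(search_url, movie.ids[i])) %>%
            html_node('.bnone') %>%
            html_attr('href') %>%
            stringr::str_extract('\\d+')

        if (is.na(id)) stop('no subtitle link on the search page')

        download.file(paste0(download_url, id), destfile = dest, mode = 'wb')
    }, error = function(e) {
        message('Failed for ', movie.ids[i], ': ', conditionMessage(e))
    })

    # Still be polite between requests
    Sys.sleep(runif(1, 1, 4))
}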