Handling error response to empty webpage from read_html

Posted 2020-07-23 04:16

Question:

I'm trying to scrape a web page title, but I'm running into a problem with a website called "tweg.com":

library(httr)
library(rvest)
page.url <- "tweg.com"
page.get <- GET(page.url) # from httr
pg <- read_html(page.get) # from rvest
page.title <- html_nodes(pg, "title") %>% 
  html_text() # from rvest

read_html stops with an error message: "Error: Failed to parse text". Looking into page.get$content, I find that it is empty (raw(0)).

Certainly, I can write a simple check to take this into account and avoid parsing with read_html (roughly sketched below). However, I feel a more elegant solution would be to get something back from read_html and then, based on that, return an empty page title (i.e., ""). I tried passing options to read_html, such as RECOVER, NOERROR and NOBLANKS, but had no success. Any ideas how to get an "empty page" response back from read_html?
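For reference, the simple check I have in mind looks roughly like this (a sketch reusing page.get from the code above):

if (length(page.get$content) == 0) {
  page.title <- ""                       # empty response body: skip parsing
} else {
  pg <- read_html(page.get)
  page.title <- html_nodes(pg, "title") %>% 
    html_text()
}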

Answer 1:

You can use tryCatch to catch the error and return something in particular (just try(read_html('http://tweg.com'), silent = TRUE) will work if you only want to capture the error and continue; a sketch of that variant follows). You'll need to pass tryCatch a function specifying what to return when an error is caught, which you can structure as you like.
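The try() variant mentioned above would look roughly like this (a sketch; checking for the "try-error" class is the standard way to detect the failure afterwards):

library(rvest)

pg <- try(read_html('http://tweg.com'), silent = TRUE)
if (inherits(pg, "try-error")) {
  pg <- 'empty page'    # parsing failed, fall back to a placeholder
}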

library(rvest)

tryCatch(read_html('http://tweg.com'), 
         error = function(e){'empty page'})    # just return "empty page"
#> [1] "empty page"

tryCatch(read_html('http://tweg.com'), 
         error = function(e){list(result = 'empty page', 
                                  error = e)})    # return error too
#> $result
#> [1] "empty page"
#> 
#> $error
#> <Rcpp::exception in eval(substitute(expr), envir, enclos): Failed to parse text>
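
Applied to your original problem, the same pattern gives you the empty title you were after; here is a rough sketch (assuming you still fetch the page with httr first):

library(httr)
library(rvest)

page.get <- GET("http://tweg.com")
page.title <- tryCatch(
  read_html(page.get) %>% html_nodes("title") %>% html_text(),
  error = function(e) ""    # unparseable or empty page: return an empty title
)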

The purrr package also contains two functions, possibly and safely, that do the same thing but accept more flexible function definitions. Note that they are adverbs: they return a function that still must be called, which is why the URL is in parentheses after the call.

library(purrr)

possibly(read_html, 'empty page')('http://tweg.com')
#> [1] "empty page"

safely(read_html, 'empty page')('http://tweg.com')
#> $result
#> [1] "empty page"
#> 
#> $error
#> <Rcpp::exception in eval(substitute(expr), envir, enclos): Failed to parse text>

A typical usage would be to map the resulting function across a vector of URLs:

c('http://tweg.com', 'http://wikipedia.org') %>% 
    map(safely(read_html, 'empty page'))
#> [[1]]
#> [[1]]$result
#> [1] "empty page"
#> 
#> [[1]]$error
#> <Rcpp::exception in eval(substitute(expr), envir, enclos): Failed to parse text>
#> 
#> 
#> [[2]]
#> [[2]]$result
#> {xml_document}
#> <html lang="mul" dir="ltr" class="no-js">
#> [1] <head>\n  <meta charset="utf-8"/>\n  <title>Wikipedia</title>\n  <me ...
#> [2] <body id="www-wikipedia-org">\n<h1 class="central-textlogo" style="f ...
#> 
#> [[2]]$error
#> NULL
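
If you go this route, pulling the pieces back out is straightforward with purrr, e.g. with transpose() or by extracting elements by name (a quick sketch, continuing from the chunk above):

results <- c('http://tweg.com', 'http://wikipedia.org') %>% 
    map(safely(read_html, 'empty page'))

results %>% map("result")    # just the parsed pages (or the 'empty page' placeholder)
results %>% transpose()      # one list of all results, one list of all errors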


Tags: r rvest httr