I'm trying to scrape a web page title but am running into a problem with a website called "tweg.com".
library(httr)
library(rvest)
page.url <- "tweg.com"
page.get <- GET(page.url) # from httr
pg <- read_html(page.get) # from rvest
page.title <- html_nodes(pg, "title") %>%
  html_text() # from rvest
read_html stops with an error message: "Error: Failed to parse text".
Looking into page.get$content, I find that it is empty (raw(0)).
Certainly, I can write a simple check to take this into account and avoid parsing with read_html. However, I feel that a more elegant solution would be to get something back from read_html and then, based on it, return an empty page title (i.e., ""). I tried passing "options" to read_html, such as RECOVER, NOERROR and NOBLANKS, but without success. Any ideas on how to get an "empty page" response back from read_html?
You can use tryCatch to catch errors and return something in particular (a plain try(read_html('http://tweg.com'), silent = TRUE) will work if you just want to return the error and continue). You'll need to pass tryCatch a function specifying what to return when an error is caught, which you can structure as you like.
library(rvest)
tryCatch(read_html('http://tweg.com'),
error = function(e){'empty page'}) # just return "empty page"
#> [1] "empty page"
tryCatch(read_html('http://tweg.com'),
error = function(e){list(result = 'empty page',
error = e)}) # return error too
#> $result
#> [1] "empty page"
#>
#> $error
#> <Rcpp::exception in eval(substitute(expr), envir, enclos): Failed to parse text>
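To tie this back to the original goal of returning an empty title, the same tryCatch pattern can be wrapped around the whole title extraction. The following is only a minimal sketch: get_page_title is a hypothetical helper name, not part of rvest, and it assumes that any failure to read or parse the page should simply yield "".

```r
library(rvest)  # provides read_html(), html_nodes() and html_text()

# Hypothetical helper (not part of rvest): returns the page title,
# or "" when the page cannot be read/parsed or has no <title> node.
get_page_title <- function(url) {
  tryCatch({
    pg <- read_html(url)
    titles <- html_text(html_nodes(pg, "title"))
    # html_nodes() returns an empty set when there is no <title>
    if (length(titles) == 0) "" else titles[[1]]
  },
  error = function(e) "")  # any error is mapped to an empty title
}
```

With this, get_page_title('http://tweg.com') should return "" instead of stopping, as long as the failure surfaces as an R error the way it does in the question.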
The purrr package also contains two functions, possibly and safely, that do the same thing but accept more flexible function definitions. Note that they are adverbs: each returns a function that must still be called, which is why the URL is in parentheses after the call.
library(purrr)
possibly(read_html, 'empty page')('http://tweg.com')
#> [1] "empty page"
safely(read_html, 'empty page')('http://tweg.com')
#> $result
#> [1] "empty page"
#>
#> $error
#> <Rcpp::exception in eval(substitute(expr), envir, enclos): Failed to parse text>
A typical usage would be to map the resulting function across a vector of URLs:
c('http://tweg.com', 'http://wikipedia.org') %>%
  map(safely(read_html, 'empty page'))
#> [[1]]
#> [[1]]$result
#> [1] "empty page"
#>
#> [[1]]$error
#> <Rcpp::exception in eval(substitute(expr), envir, enclos): Failed to parse text>
#>
#>
#> [[2]]
#> [[2]]$result
#> {xml_document}
#> <html lang="mul" dir="ltr" class="no-js">
#> [1] <head>\n <meta charset="utf-8"/>\n <title>Wikipedia</title>\n <me ...
#> [2] <body id="www-wikipedia-org">\n<h1 class="central-textlogo" style="f ...
#>
#> [[2]]$error
#> NULL
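If what you ultimately want is a character vector of titles, one option is to wrap the whole extraction step in possibly and use map_chr. This is a rough sketch rather than a definitive implementation: extract_title and safe_title are names made up for illustration, and using "" as the fallback is an assumption taken from the question.

```r
library(rvest)
library(purrr)

# Hypothetical helper: may throw an error if the page cannot be parsed.
extract_title <- function(url) {
  titles <- html_text(html_nodes(read_html(url), "title"))
  if (length(titles) == 0) "" else titles[[1]]
}

# possibly() is an adverb: it returns a new function that yields
# `otherwise` ("" here) instead of raising an error.
safe_title <- possibly(extract_title, otherwise = "")

# Map the wrapped function over any vector of inputs;
# the result is always one string per input.
get_titles <- function(urls) map_chr(urls, safe_title)
```

Calling get_titles(c('http://tweg.com', 'http://wikipedia.org')) would then give one string per URL, with "" standing in for any page that fails to parse.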