I am trying to scrape a web page title, but I am running into a problem with a website called "tweg.com":
library(httr)
library(rvest)
page.url <- "tweg.com"
page.get <- GET(page.url) # from httr
pg <- read_html(page.get) # from rvest
page.title <- html_nodes(pg, "title") %>%
  html_text() # from rvest
read_html stops with an error message: "Error: Failed to parse text". Looking into page.get$content, I find that it is empty (raw(0)).
Certainly, I can write a simple check (sketched below) to take this into account and avoid parsing with read_html. However, I feel that a more elegant solution would be to get something back from read_html and then, based on it, return an empty page title (i.e., ""). I tried passing "options" to read_html, such as RECOVER, NOERROR and NOBLANKS, but without success. Any ideas how to get an "empty page" response back from read_html?
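For reference, the simple check mentioned above would look something like this (a rough sketch, reusing the objects defined earlier):

page.get <- GET(page.url)
# Parse only when the response body is non-empty
if (length(page.get$content) == 0) {
  page.title <- "" # empty body (raw(0)): skip parsing entirely
} else {
  pg <- read_html(page.get)
  page.title <- html_nodes(pg, "title") %>%
    html_text()
}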
You can use tryCatch to catch errors and return something in particular (just try(read_html('http://tweg.com'), silent = TRUE) will work if you just want to return the error and continue). You'll need to pass tryCatch a function for what to return when an error is caught, which you can structure as you like.
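For instance, here is a minimal sketch (the read_html_or_empty helper name and its fallback markup are illustrative, not anything rvest prescribes) that returns an empty document on failure, so html_text() on its title yields "":

library(httr)
library(rvest)

# Fall back to a minimal empty document when fetching or parsing fails
read_html_or_empty <- function(url) {
  tryCatch(
    read_html(GET(url)),
    error = function(e) read_html("<html><head><title></title></head></html>")
  )
}

pg <- read_html_or_empty("http://tweg.com")
page.title <- html_nodes(pg, "title") %>%
  html_text() # "" when the page could not be fetched or parsed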
The purrr package also contains two functions, possibly and safely, that do the same thing, but accept more flexible function definitions. Note that they are adverbs, and thus return a function that still must be called, which is why the URL is in parentheses after the call. A typical usage would be to map the resulting function across a vector of URLs.
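For example, a sketch along these lines (the safe_read name and the second URL are my own illustrations):

library(rvest)
library(purrr)

# possibly() wraps read_html and returns `otherwise` on error;
# it returns a new function, hence the second pair of parentheses
safe_read <- possibly(read_html, otherwise = NULL)
safe_read("http://tweg.com") # NULL instead of an error

# safely() instead returns a list with $result and $error components
safely(read_html)("http://tweg.com")

# Map across a vector of URLs, yielding "" for pages that fail
urls <- c("http://tweg.com", "https://www.r-project.org")
titles <- map_chr(urls, function(u) {
  pg <- safe_read(u)
  if (is.null(pg)) "" else html_text(html_node(pg, "title"))
})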