How to submit login form in Rvest package w/o butt

Posted 2019-01-22 21:27

Question:

I am trying to scrape a web page that requires authentication, using html_session() and html_form() from the rvest package. I found this example provided by Hadley Wickham, but I am not able to adapt it to my case.

united <- html_session("http://www.united.com/")
account <- united %>% follow_link("Account")
login <- account %>%
         html_nodes("form") %>%
         extract2(1) %>%
         html_form() %>%
         set_values(
                `ctl00$ContentInfo$SignIn$onepass$txtField` = "GY797363",
                `ctl00$ContentInfo$SignIn$password$txtPassword` = password)
account <- account %>% 
submit_form(login, "ctl00$ContentInfo$SignInSecure")

In my case, I can't find the field names to set in the form, so I am trying to pass the username and password directly: set_values("email","password")

I also don't know how to refer to the submit button, so I tried: submit_form(account,login)

The error I got for the submit_form function is: Error in names(submits)[[1]] : subscript out of bounds

Any idea on how to go about this is appreciated. Thank you

Answer 1:

This is the same problem as open issue #159 in the rvest package: submission fails when not every field in a form has a type value. This bug may be fixed in a future release.

However, we can work around the issue by monkey patching the underlying function rvest:::submit_request.

The core problem is the helper function is_submit. Initially, it's defined like this:

is_submit <- function(x) tolower(x$type) %in% c("submit", 
        "image", "button")

Reasonable as this looks, it fails in two scenarios:

  1. There is no type element.
  2. The type element is NULL.
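
Both scenarios can be reproduced without rvest. The sketch below copies the helper under a different name (is_submit_original, a name chosen here) and applies it to hand-built fields; the field contents are hypothetical and only mimic the shape of a parsed form field. In both failing cases the helper returns a zero-length logical vector instead of FALSE.

```r
# Copy of the helper quoted above, renamed for this illustration.
is_submit_original <- function(x) {
  tolower(x$type) %in% c("submit", "image", "button")
}

# Hypothetical fields, not taken from any real form.
no_type   <- list(name = "token", value = "abc")              # scenario 1: no type element
null_type <- list(name = "user", type = NULL, value = "")     # scenario 2: type is NULL
button    <- list(name = "go", type = "submit", value = "Go") # a well-formed submit field

# The result should be exactly one TRUE or FALSE per field.
length(is_submit_original(button))     # 1
length(is_submit_original(no_type))    # 0
length(is_submit_original(null_type))  # 0
```

A predicate that returns nothing for some fields is unsafe to hand to Filter(), which expects one logical value per element.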

Both of these happen to occur on the United login form. We can resolve this by adding two checks inside the function.

custom.submit_request <- function (form, submit = NULL) 
{
  is_submit <- function(x) {
    # A field whose type is missing or NULL cannot be a submit target.
    if (!exists("type", x) || is.null(x$type)) {
      return(FALSE)
    }
    tolower(x$type) %in% c("submit", "image", "button")
  } 
  submits <- Filter(is_submit, form$fields)
  if (length(submits) == 0) {
    stop("Could not find possible submission target.", call. = FALSE)
  }
  if (is.null(submit)) {
    submit <- names(submits)[[1]]
    message("Submitting with '", submit, "'")
  }
  if (!(submit %in% names(submits))) {
    stop("Unknown submission name '", submit, "'.\n", "Possible values: ", 
         paste0(names(submits), collapse = ", "), call. = FALSE)
  }
  other_submits <- setdiff(names(submits), submit)
  method <- form$method
  if (!(method %in% c("POST", "GET"))) {
    warning("Invalid method (", method, "), defaulting to GET", 
            call. = FALSE)
    method <- "GET"
  }
  url <- form$url
  fields <- form$fields
  fields <- Filter(function(x) length(x$value) > 0, fields)
  fields <- fields[setdiff(names(fields), other_submits)]
  values <- pluck(fields, "value")
  names(values) <- names(fields)
  list(method = method, encode = form$enctype, url = url, values = values)
}

To monkey patch, we need to use the R.utils package (install via install.packages("R.utils") if you don't have it).

library(R.utils)

reassignInPackage("submit_request", "rvest", custom.submit_request)

From there, we can issue our own request.

account <- account %>% 
     submit_form(login, "ctl00$ContentInfo$SignInSecure")

And that works!

(Well, "works" is an overstatement. Because United enforces more aggressive authentication requirements, including checks for known browsers, the request comes back as 401 Unauthorized. It does, however, get past the original error.)

A full reproducible example involved a couple of other minor code changes:

library(magrittr)
library(rvest)

url <- "https://www.united.com/web/en-US/apps/account/account.aspx"
account <- html_session(url)
login <- account %>%
  html_nodes("form") %>%
  extract2(1) %>%
  html_form() %>%
  set_values(
    `ctl00$ContentInfo$SignIn$onepass$txtField` = "USER",
    `ctl00$ContentInfo$SignIn$password$txtPassword` = "PASS")
account <- account %>% 
  submit_form(login, "ctl00$ContentInfo$SignInSecure")
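
To adapt this to another site, note that set_values() takes name = value pairs, and that both those names and the submit name passed to submit_form() are names of the form's fields. A minimal sketch for listing them, assuming a small inline form (hypothetical, not United's page) in place of a live session:

```r
library(rvest)

# Hypothetical login form, parsed from a string so no network access is needed.
page <- read_html('<form action="/login" method="post">
  <input type="text" name="email">
  <input type="password" name="password">
  <input type="submit" name="signin" value="Sign in">
</form>')

form <- html_form(page)[[1]]

# Field names are what set_values() and submit_form() expect.
field_names <- names(form$fields)

# Field types show which entry is the submit button; NA marks a missing type.
field_types <- vapply(
  form$fields,
  function(f) if (is.null(f$type)) NA_character_ else as.character(f$type),
  character(1)
)

print(data.frame(name = field_names, type = unname(field_types)))
```

With a real page, the same inspection can be run on the form extracted from the session before calling set_values().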