I am trying to scrape a web page that requires authentication, using html_session() and html_form() from the rvest package. I found this example provided by Hadley Wickham, but I am not able to adapt it to my case.
united <- html_session("http://www.united.com/")
account <- united %>% follow_link("Account")
login <- account %>%
  html_nodes("form") %>%
  extract2(1) %>%
  html_form() %>%
  set_values(
    `ctl00$ContentInfo$SignIn$onepass$txtField` = "GY797363",
    `ctl00$ContentInfo$SignIn$password$txtPassword` = password
  )
account <- account %>%
  submit_form(login, "ctl00$ContentInfo$SignInSecure")
In my case, I can't find the field names to set in the form, so I am trying to pass the user name and password directly: set_values("email","password")
I also don't know how to refer to the submit button, so I tried: submit_form(account,login)
The error I got from the submit_form function is: Error in names(submits)[[1]] : subscript out of bounds
Any ideas on how to go about this are appreciated. Thank you.
This is the same problem as open issue #159 in the rvest package, which arises when not all fields in a form have a type value. This bug may be fixed in a future release.
However, we can work around the issue by monkey patching the underlying function rvest:::submit_request.
The core problem is the helper function is_submit. As logical as its initial definition is, it fails in two scenarios: when a field has no type element, and when the type element is NULL.
Both of these happen to occur on the United login form. We can resolve this by adding two checks inside the function.
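As a rough illustration of those two checks, here is a minimal sketch in base R. The fields list is a made-up stand-in for the form fields, is_submit_safe is a hypothetical name, and the set of accepted types ("submit", "image", "button") is an assumption about what the original predicate tests for, not a copy of the rvest source.

```r
# Hypothetical stand-in for a form's fields; each field is a plain list.
fields <- list(
  email     = list(name = "email", type = "text", value = ""),
  no_type   = list(name = "no_type", value = "x"),                # no type element
  null_type = list(name = "null_type", type = NULL, value = "y"), # type is NULL
  go        = list(name = "go", type = "Submit", value = "Sign in")
)

# An unguarded check returns a zero-length logical when type is missing,
# which is what confuses the filtering step.
unguarded <- tolower(fields$no_type$type) %in% c("submit", "image", "button")
length(unguarded)

# Guarded predicate (hypothetical name): always returns a single TRUE or FALSE.
is_submit_safe <- function(x) {
  if (!("type" %in% names(x))) return(FALSE)  # check 1: no type element
  if (is.null(x$type)) return(FALSE)          # check 2: type element is NULL
  tolower(x$type) %in% c("submit", "image", "button")  # assumed type list
}

submits <- Filter(is_submit_safe, fields)
names(submits)
```

With the guards in place, every field yields exactly one logical value, so Filter() keeps only the genuine submit button.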
To monkey patch, we need to use the R.utils package (install via install.packages("R.utils") if you don't have it).
From there, we can issue our own request.
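A hedged sketch of what such a patch could look like follows. It assumes an older rvest (0.3.x) that still has the internal rvest:::submit_request. The body of patched_submit_request (a hypothetical name) is a simplified guess at what that function returns, not the package's actual code, so compare it with the version installed on your machine before relying on it. The mock form exists only so the replacement can be tried without a network connection.

```r
# Simplified, assumed replacement for rvest:::submit_request (rvest 0.3.x).
patched_submit_request <- function(form, submit = NULL) {
  # Guarded predicate: tolerate fields with a missing or NULL type.
  is_submit <- function(x) {
    if (!("type" %in% names(x)) || is.null(x$type)) return(FALSE)
    tolower(x$type) %in% c("submit", "image", "button")
  }
  submits <- Filter(is_submit, form$fields)
  if (length(submits) == 0) {
    stop("Could not find possible submission target.", call. = FALSE)
  }
  if (is.null(submit)) submit <- names(submits)[[1]]
  if (!(submit %in% names(submits))) {
    stop("Unknown submission name '", submit, "'.", call. = FALSE)
  }
  # Drop empty fields and every submit button except the chosen one.
  other_submits <- setdiff(names(submits), submit)
  fields <- Filter(function(x) length(x$value) > 0, form$fields)
  fields <- fields[setdiff(names(fields), other_submits)]
  values <- lapply(fields, function(x) x$value)
  list(method = form$method, encode = form$enctype, url = form$url, values = values)
}

# Mock form (made up) to exercise the function offline.
form <- list(
  method = "POST", enctype = "form", url = "/login",
  fields = list(
    email  = list(name = "email", type = "text", value = "me@example.com"),
    token  = list(name = "token", value = "abc"),  # no type element
    go     = list(name = "go", type = "submit", value = "Sign in"),
    cancel = list(name = "cancel", type = "submit", value = "Cancel")
  )
)
req <- patched_submit_request(form)
names(req$values)

# Only patch when both packages are installed and the internal function exists.
can_patch <- requireNamespace("R.utils", quietly = TRUE) &&
  requireNamespace("rvest", quietly = TRUE) &&
  exists("submit_request", envir = asNamespace("rvest"), inherits = FALSE)
if (can_patch) {
  R.utils::reassignInPackage("submit_request", pkgName = "rvest",
                             value = patched_submit_request)
}
```

The guard around reassignInPackage() is a design choice for safety: on newer rvest releases, where the internal function may be gone, the sketch simply skips the patch instead of failing.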
And that works!
(Well, "works" is a misnomer. Due to United employing more aggressive authentication requirements, including known browsers, this results in a 301 Unauthorized. However, it fixes the error.)
A full reproducible example involved a couple of other minor code changes: