Web scraping a password-protected website using R

Posted 2019-06-10 10:27

Question:

I would like to scrape Yammer data using R, but to do so I first have to log in to this page (the authentication page for an app that I created):

https://www.yammer.com/dialog/authenticate?client_id=iVGCK1tOhbZGS7zC8dPjg

Once I log in to this page I can get the Yammer data, but only in the browser, through the standard Yammer URLs (https://www.yammer.com/api/v1/messages/received.json).

I have read through similar questions and tried the suggestions, but I still can't get past this issue.

I have tried httr, RSelenium, and rvest + SelectorGadget.

The end goal is to do everything in R: getting the data, cleaning, sentiment analysis. The cleaning and sentiment analysis parts are done, but getting the data is still manual, and I would like to automate that from R.

1. Trial using httr:

library(httr)
usinghttr <- GET("https://www.yammer.com/dialog/authenticate?client_id=iVGCK1tOhbZGS7zC8dPjg",
     authenticate("Username", "Password"))

Corresponding result:

Response [https://www.yammer.com/dialog/authenticate?client_id=iVGCK1tOhbZGS7zC8dPjg]
  Date: 2015-04-27 12:25
  Status: 200
  Content-Type: text/html; charset=utf-8
  Size: 15.7 kB

The content of this response showed that the request had reached the login page but had not authenticated.
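A likely reason for the result above: by default httr's authenticate() sends HTTP basic-auth credentials, and a form-based login page simply ignores them and returns its HTML with status 200. As far as I know, the Yammer REST API expects an OAuth 2 access token in an Authorization header instead. The following is only a minimal sketch under that assumption. The helper bearer_header() and the environment variable YAMMER_TOKEN are hypothetical names, and obtaining the token for your app is assumed to have been done already.

```r
library(httr)

# Hypothetical helper: wraps an OAuth 2 access token in an Authorization header.
bearer_header <- function(token) {
  add_headers(Authorization = paste("Bearer", token))
}

# YAMMER_TOKEN is an assumed environment variable holding a token for your app.
token <- Sys.getenv("YAMMER_TOKEN")

if (nzchar(token)) {
  resp <- GET("https://www.yammer.com/api/v1/messages/received.json",
              bearer_header(token))
  stop_for_status(resp)                     # raise an R error on any 4xx/5xx response
  messages <- content(resp, as = "parsed")  # JSON body parsed into an R list
}
```

If the token is missing, the request is skipped rather than sent unauthenticated, so the script never ends up parsing the login page by mistake.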

2. Trial using SelectorGadget + rvest

I managed to scrape Wikipedia with this method but could not apply it to Yammer, because authentication is required before I can request the HTML nodes that SelectorGadget identifies.

3. Trial using RSelenium

I tried this with the standard browsers and with PhantomJS, but got errors:

> startServer()
> remDr <- remoteDriver$new()
> remDr$open()
[1] "Connecting to remote server"
Undefined error in RCurl call.
Error in queryRD(paste0(serverURL, "/session"), "POST", qdata = toJSON(serverOpts)) :

> pJS <- phantom()
Error in phantom() : PhantomJS binary not located.

Answer 1:

I also spent a very long time trying to access password-protected sites from inside R. I finally managed it by submitting the credentials as an HTML form. I had a quick look at the Yammer login page, and it seems similar to the case where I got access.

Here is the code that I used; you need to adapt it to your context. You first start a session on the login page, then locate the form that collects the ID and the password, and finally submit the form. I think in your case, the code below would work:

library(rvest)
library(magrittr)  # provides extract2()

session <- html_session("https://www.yammer.com/dialog/authenticate?client_id=iVGCK1tOhbZGS7zC8dPjg")
login_form <- session %>%
    html_nodes("form") %>%
    extract2(1) %>%  # selects the first form on the page; change the index if the login form is a different one
    html_form() %>%
    set_values(`login` = YourId, `password` = YourPasswd)  # field names must match the form's inputs
logged_in <- session %>% submit_form(login_form)

logged_in should contain the session information after logging in.
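Note that rvest 1.0.0 renamed the functions used above: html_session() became session(), set_values() became html_form_set(), and submit_form() became session_submit(). The form-filling step can be tried offline with the newer names, as sketched below. The stand-in page, its example.com action URL, and the field names login and password are assumptions for illustration; the real Yammer form may differ.

```r
library(rvest)

# Stand-in login page (an assumption; inspect the real page for its field names).
page <- minimal_html('
  <form action="https://example.com/session" method="post">
    <input type="text" name="login">
    <input type="password" name="password">
  </form>')

form   <- html_form(page)[[1]]   # first form found on the page
filled <- html_form_set(form, login = "YourId", password = "YourPasswd")

# Against the live site, the same steps would start from a session:
# s         <- session("https://www.yammer.com/dialog/authenticate?client_id=iVGCK1tOhbZGS7zC8dPjg")
# form      <- html_form(s)[[1]]
# logged_in <- session_submit(s, html_form_set(form, login = YourId, password = YourPasswd))
```

Printing form before filling it shows the field names the page actually uses, which is the quickest way to find out what to pass to html_form_set().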

BR



Answer 2:

What are you trying to achieve with this? If you are just looking to collect data, you can use the Data Export API to download the network data for analysis instead. This requires an Enterprise network.