note: ipums international and ipums usa probably use the same system. ipums usa allows quicker signup. if you would like to test out your code, try https://usa.ipums.org/usa-action/users/request_access to sign up!
i am trying to programmatically download a file from https://international.ipums.org/ with the R language and httr. i need to use httr and not RCurl because, after authenticating, i need to download large files directly to disk rather than into RAM. as far as i know, this is currently only possible with httr.
the reproducible code below documents my best effort at getting from the login page (https://international.ipums.org/international-action/users/login) to the main post-authentication page. any tips or hints would be appreciated! thanks!
my_email <- "email@address.com"
my_password <- "password"
tf <- tempfile()
# use httr, because i need to download a large file after authentication
# and only httr supports that with its `write_disk()` option
library(httr)
# turn off ssl verify, otherwise the subsequent GET command will fail
set_config( config( ssl_verifypeer = 0L ) )
GET( "https://international.ipums.org/Shibboleth.sso/Login?target=https%3A%2F%2Finternational.ipums.org%2Finternational-action%2Fmenu" )
# connect to the starting login page of the website
( a <- GET( "https://international.ipums.org/international-action/users/login" , verbose( info = TRUE ) ) )
# which takes me through to a lot of websites, but ultimately (in my browser) lands at
shibboleth_url <- "https://live.identity.popdata.org:443/idp/Authn/UserPassword"
# construct authentication information?
base_values <- list( "j_username" = my_email , "j_password" = my_password )
idp_values <- list( "j_username" = my_email , "j_password" = my_password , "_idp_authn_lc_key"=subset( a$cookies , domain == "live.identity.popdata.org" )$value , "JSESSIONID" = subset( a$cookies , domain == "#HttpOnly_live.identity.popdata.org" )$value )
ipums_values <- list( "j_username" = my_email , "j_password" = my_password , "_idp_authn_lc_key"=subset( a$cookies , domain == "live.identity.popdata.org" )$value , "JSESSIONID" = subset( a$cookies , domain == "international.ipums.org" )$value)
# i believe this is where the main login should happen, but it looks like it's failing
GET( shibboleth_url , query = idp_values )
POST( shibboleth_url , body = base_values )
writeBin( GET( shibboleth_url , query = idp_values )$content , tf )
readLines( tf )
# "The MPC account authentication system has encountered an error
# This error can sometimes occur if you did not close your browser after logging out of an application previously. It may also occur for other reasons. Please close your browser and try your action again."
writeBin( GET( "https://live.identity.popdata.org/idp/profile/SAML2/Redirect/SSO" , query = idp_values )$content , tf )
POST( "https://live.identity.popdata.org/idp/profile/SAML2/Redirect/SSO" , body = idp_values )
readLines( tf )
# same error as above
# return to the main login page..
writeBin( GET( "https://international.ipums.org/international-action/menu" , query = ipums_values )$content , tf )
readLines( tf )
# ..not logged in
You have to use `set_cookies()` to send your cookies to the server:
Since the result is
I think you're logged in...
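A minimal sketch of that idea, assuming `a` is the response from the question's login GET. `cookie_vector()` and `get_with_cookies()` are hypothetical helper names, not part of httr; `set_cookies()` and its `.cookies` argument are. The helper filters by domain because the same cookie name (`JSESSIONID`) shows up for more than one host, and it strips the `#HttpOnly_` prefix that appears in the domain column in the question.

```r
# Hypothetical helper: turn the cookie table of an httr response into the
# named character vector that httr::set_cookies(.cookies = ) expects.
cookie_vector <- function(cookie_df, host) {
  # HttpOnly cookies are listed with a "#HttpOnly_" prefix on the domain
  domain <- sub("^#HttpOnly_", "", as.character(cookie_df$domain))
  keep <- domain == host
  stats::setNames(as.character(cookie_df$value[keep]),
                  as.character(cookie_df$name[keep]))
}

# Hypothetical wrapper: repeat a GET while sending those cookies back.
get_with_cookies <- function(url, cookie_df, host) {
  httr::GET(url, httr::set_cookies(.cookies = cookie_vector(cookie_df, host)))
}

# Usage (needs network access and the earlier response `a`):
# get_with_cookies("https://international.ipums.org/international-action/menu",
#                  a$cookies, "international.ipums.org")
```

Whether the server accepts these cookies as a logged-in session still depends on the Shibboleth steps having completed, so this only covers the "send them back" part.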
@HubertL has taken many steps in the right direction; however, I think his answer is not complete.
First of all, the main thing to look at when you're implementing automatic web authorization is the cookies being used during the 'normal' manual workflow. You can easily spy on them with the dev tools in any modern browser:
Here, we see the `JSESSIONID` and `_shibsession*` cookies. The first one holds the JSP session id of the website; the second is most likely used solely for Shibboleth authorization. The server probably has them bound together somehow, but `JSESSIONID` doesn't require authorization, and you get it right away after opening the website. So, we must get a `_shibsession*` cookie for our `JSESSIONID` to be authorized. That's what Shibboleth's authorization process, with its many redirects, is about. See the comments in the code.
After the call to `login_ipums` we'll have the following cookies:
Here, we have both `JSESSIONID` and `_shibsession_*`, which are used for site-wide authorization. `_idp_authn_lc_key` is probably not needed, but leaving it won't hurt.
Now, you can easily download files like that:
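A hedged sketch of such a download, assuming `session_cookies` is a named character vector holding the `JSESSIONID` and `_shibsession_*` values obtained by the login step, and `url` points at a file your account may fetch. `download_to_disk()` is a hypothetical name; `set_cookies()`, `write_disk()`, `progress()` and `stop_for_status()` are httr functions.

```r
# Hypothetical helper: stream an authenticated download straight to disk,
# so the file is never held in RAM.
download_to_disk <- function(url, dest, session_cookies) {
  resp <- httr::GET(
    url,
    httr::set_cookies(.cookies = session_cookies),  # send the session cookies
    httr::write_disk(dest, overwrite = TRUE),       # write the body to `dest`
    httr::progress()                                # show a progress bar
  )
  httr::stop_for_status(resp)  # raise an error on HTTP 4xx/5xx
  invisible(dest)
}

# Usage (needs network access and valid cookies; `my_file_url` is a placeholder):
# download_to_disk(my_file_url, tempfile(), session_cookies)
```

If the saved file turns out to be an HTML login page rather than data, the session cookies were most likely not accepted.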
IMPORTANT NOTE: As you can see, I used IPUMS USA, not International. To check that code with your account, replace `usa` with `international` everywhere, including `*-action` in URLs.
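For reference, the redirect-and-post flow described above could be sketched roughly as follows. This is not the original `login_ipums`; it is an illustration that assumes the usual Shibboleth pattern (credentials are posted to the page the redirects end on, and the identity provider replies with an auto-submit form carrying `SAMLResponse` and `RelayState`). All function names here are hypothetical, the `users/login` path for USA is assumed by analogy with the International URL, and the regex-based HTML parsing is deliberately simplistic. The `project` argument puts the `usa`/`international` swap in one place.

```r
# Build a URL for one IPUMS project; "usa" and "international" differ only
# in the host name and the "*-action" path prefix.
ipums_url <- function(project, path) {
  sprintf("https://%s.ipums.org/%s-action/%s", project, project, path)
}

# Decode hex entities (e.g. "&#x3a;") plus &quot; and &amp; in attribute values.
decode_entities <- function(x) {
  m <- gregexpr("&#x[0-9A-Fa-f]+;", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(e) {
    vapply(e, function(s) intToUtf8(strtoi(substr(s, 4, nchar(s) - 1), 16L)),
           character(1), USE.NAMES = FALSE)
  })
  gsub("&amp;", "&", gsub("&quot;", "\"", x, fixed = TRUE), fixed = TRUE)
}

# Return the first capture group of `pattern` in `text`, or NA if no match.
first_group <- function(pattern, text) {
  hit <- regmatches(text, regexec(pattern, text, ignore.case = TRUE))[[1]]
  if (length(hit) == 2) decode_entities(hit[2]) else NA_character_
}

# Target URL of the first <form> in the page.
form_action <- function(html) first_group("<form[^>]*action=\"([^\"]*)\"", html)

# Named list of all <input> fields that carry a name attribute.
hidden_inputs <- function(html) {
  tags <- regmatches(html, gregexpr("<input[^>]*>", html, ignore.case = TRUE))[[1]]
  nm <- vapply(tags, function(tag) first_group("name=\"([^\"]*)\"", tag),
               character(1), USE.NAMES = FALSE)
  val <- vapply(tags, function(tag) first_group("value=\"([^\"]*)\"", tag),
                character(1), USE.NAMES = FALSE)
  as.list(stats::setNames(val[!is.na(nm)], nm[!is.na(nm)]))
}

# Sketch of the whole flow; returns the cookie table after the last step.
login_sketch <- function(project, email, password) {
  # cookies of `resp` that belong to the host of `url`
  host_cookies <- function(resp, url) {
    ck <- httr::cookies(resp)
    host <- httr::parse_url(url)$hostname
    keep <- sub("^#HttpOnly_", "", as.character(ck$domain)) == host
    stats::setNames(as.character(ck$value[keep]), as.character(ck$name[keep]))
  }
  # 1. open the login page; httr follows the redirects to the IdP login form
  start <- httr::GET(ipums_url(project, "users/login"))
  # 2. post the credentials to the URL the redirects ended on (assumption)
  auth <- httr::POST(start$url, encode = "form",
                     body = list(j_username = email, j_password = password),
                     httr::set_cookies(.cookies = host_cookies(start, start$url)))
  # 3. read the auto-submit form returned by the identity provider
  html <- httr::content(auth, as = "text", encoding = "UTF-8")
  action <- form_action(html)
  # 4. post SAMLResponse/RelayState back to the site to obtain _shibsession_*
  done <- httr::POST(action, encode = "form", body = hidden_inputs(html),
                     httr::set_cookies(.cookies = host_cookies(start, action)))
  httr::cookies(done)
}

# Usage (needs network access and a real account):
# login_sketch("usa", "email@address.com", "password")
```

Note `encode = "form"`: without it httr sends a list body as multipart, which a login form expecting URL-encoded fields may reject. If the server's pages differ from the assumptions above, steps 2 to 4 will need adjusting to match what the browser's dev tools show.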