I want to use a Python script to log in to a website and retrieve some data. This has to work from behind my company's proxy.
I know that this question looks like a duplicate of others you can find by searching, but it isn't.
I already tried the solutions proposed in the answers to those questions, but they didn't work. I don't only need a piece of code to log in and get a specific web page; I also need some of the "concepts" behind how this whole mechanism works.
Here is a description of what I want to be able to do:
Log into a website > Get to page X > Insert data in some form of page X and push "Calculate" button > Capture the results of my query
Once I have the results, I'll see how to sort out the data.
How can I achieve this behind a proxy? Every time I try to use the "requests" library to log in, it doesn't work: it says I am unable to get page X since I did not authenticate. Or worse, I am unable to even get to the site because I didn't set up the proxy first.
Clarification of Requirements
First, make sure you understand the context in which the results of your calculation are obtained.
(Pressing F12 shows DevTools in Chrome or Firebug in Firefox, where you can learn most of the details discussed below.)
Simple: HTTP based scenario
It is very likely that your situation allows simple HTTP communication. I will assume the following situation:
Complex: Browser emulation scenario
There is some chance that part of the interaction needed to get your result depends on JavaScript code performing something on the page. Often this can be converted into the HTTP scenario by investigating what the final HTTP requests are, but here I will assume this is not feasible or possible, and we will emulate the interaction using a real browser.
For this scenario I will assume:
Resolving HTTP based scenario
Python provides the excellent requests package, which should serve our needs.
Proxy
Assume a proxy at http://10.10.1.10:3128, with username user and password pass.
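A minimal sketch of the proxy setup, assuming the proxy accepts credentials embedded in its URL. The `fetch` helper and the 30-second timeout are illustrative choices, not part of any existing script:

```python
import requests

# Proxy address from above, with the proxy credentials embedded in the URL.
# The same proxy is used for both plain HTTP and HTTPS targets.
proxies = {
    "http": "http://user:pass@10.10.1.10:3128",
    "https": "http://user:pass@10.10.1.10:3128",
}


def fetch(url):
    """Fetch url through the proxy configured above (hypothetical helper)."""
    return requests.get(url, proxies=proxies, timeout=30)
```

Passing `proxies=` on each call keeps the setting explicit. requests can also pick the proxy up from the `HTTP_PROXY` and `HTTPS_PROXY` environment variables if you prefer to configure it outside the script.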
Basic Authentication
Assume the web app allows access for user appuser with password apppass.
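A minimal sketch, assuming the site uses HTTP Basic Authentication. The example.com URL is a placeholder; the request is only prepared here, not sent, so you can see what the credentials turn into:

```python
import requests

# A (username, password) tuple is shorthand for HTTP Basic Authentication.
auth = ("appuser", "apppass")

# Build the request without sending it, to inspect the header requests adds.
# The URL is a placeholder for the real application.
prepared = requests.Request("GET", "http://example.com/", auth=auth).prepare()
print(prepared.headers["Authorization"])  # a "Basic ..." header


def fetch(url):
    """Send an authenticated GET request (hypothetical helper)."""
    return requests.get(url, auth=auth, timeout=30)
```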
Alternatively, use the HTTPBasicAuth class explicitly.
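The explicit form could look like the sketch below. It behaves the same as the tuple shorthand; the `fetch` helper is again only illustrative:

```python
import requests
from requests.auth import HTTPBasicAuth

# Explicit Basic Authentication object for user "appuser".
auth = HTTPBasicAuth("appuser", "apppass")


def fetch(url):
    """Send an authenticated GET request (hypothetical helper)."""
    return requests.get(url, auth=auth, timeout=30)
```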
Digest authentication differs only in the class name, which is HTTPDigestAuth.
Other authentication methods are described in the requests documentation.
HTTP POST for a HTML Form
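A minimal sketch of submitting form data with POST. The `url` value and the field names in `data` are assumptions; take the real ones from the form's "action" attribute and its input names, which you can see in DevTools:

```python
import requests

# Hypothetical "action" URL and form fields; replace with the real ones.
url = "http://example.com/calculate"
data = {"a": "1", "b": "2"}

# Prepare the request without sending it, to see how the form is encoded.
prepared = requests.Request("POST", url, data=data).prepare()
print(prepared.headers["Content-Type"])
print(prepared.body)


def submit():
    """Send the form data as a POST request (hypothetical helper)."""
    return requests.post(url, data=data, timeout=30)
```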
Note that this url is not the URL of the form itself, but of the "action" taken when you press the submit button.
All together
Users often reach the final HTML form in two steps: first they log in, then they navigate to the form.
However, web applications typically allow direct access if you know the form URL. Authentication is then performed in the same step, and this is the approach taken here.
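Putting the pieces together might look like the sketch below. It assumes Basic Authentication, the proxy from above, and a hypothetical action URL and field names; `run_query` is an illustrative helper name:

```python
import requests
from requests.auth import HTTPBasicAuth

# Company proxy, with proxy credentials embedded in the URL.
proxies = {
    "http": "http://user:pass@10.10.1.10:3128",
    "https": "http://user:pass@10.10.1.10:3128",
}

# Credentials for the web application itself (not for the proxy).
auth = HTTPBasicAuth("appuser", "apppass")

# Hypothetical "action" URL and form fields; replace with the real ones.
url = "http://example.com/calculate"
data = {"a": "1", "b": "2"}


def run_query():
    """POST the form through the proxy, authenticating in the same request."""
    req = requests.post(url, data=data, auth=auth, proxies=proxies, timeout=30)
    req.raise_for_status()  # fail loudly on 401, 407 and other HTTP errors
    return req
```

The returned object carries the page content in `req.text`, ready for whatever parsing you choose.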
Note: If this does not work, you will have to use sessions with requests, which is possible, but I will not elaborate on that here.
By now, you should have your result available via req, and you are done.
Resolving Browser emulation scenario
Proxy
The Selenium documentation on configuring a proxy recommends configuring the proxy in your web browser. The same page provides details on how to set up the proxy from your script, but here I will assume that you use Firefox and have already (during manual testing) succeeded in configuring the proxy.
Basic or Digest Authentication
The following modified snippet originates from an SO answer by Mimi and uses Basic Authentication:
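A minimal sketch of the credentials-in-URL technique, assuming Selenium 4 with Firefox. The helper names and the example.com host are hypothetical, and whether the browser accepts credentials embedded in the URL depends on its version and settings:

```python
from urllib.parse import quote, urlsplit, urlunsplit


def url_with_credentials(url, user, password):
    """Return url with user:password@ placed in front of the host."""
    parts = urlsplit(url)
    netloc = f"{quote(user, safe='')}:{quote(password, safe='')}@{parts.netloc}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def open_with_basic_auth(url, user, password):
    """Start Firefox and load url with the credentials embedded (assumed setup)."""
    from selenium import webdriver  # imported here so the helper above works alone

    options = webdriver.FirefoxOptions()
    # Firefox preference that relaxes the warning about credentials in URLs.
    options.set_preference("network.http.phishy-userpass-length", 255)
    driver = webdriver.Firefox(options=options)
    driver.get(url_with_credentials(url, user, password))
    return driver
```

For example, `open_with_basic_auth("https://example.com/page", "appuser", "apppass")` would ask Firefox to load `https://appuser:apppass@example.com/page`.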
Note that Selenium does not seem to provide a complete solution for Basic/Digest authentication. The sample above is likely to work, but if it does not, you may check this Selenium Developer Activity Google Group thread and see that you are not alone. Some of the solutions there might work for you.
The situation with Digest Authentication seems even worse than with Basic. Some people report success with AutoIT or with blindly sending keys; the discussion referenced above shows some attempts.
Authentication via Login Form
If the web site allows logging in by entering credentials into a form, you are in luck, as this is a rather easy task with Selenium. For more, see the next section on filling in forms.
Fill in a Form and Submit
In contrast to authentication, filling data into forms, clicking buttons, and similar activities are where Selenium works very well.
Conclusions
The information provided in the question describes what is to be done in a rather general manner, but lacks the specific details that would allow a tailored solution. That is why this answer focuses on proposing a general approach.
There are two scenarios: one is HTTP based, the other uses an emulated browser.
The HTTP solution is preferable, despite the fact that it requires a bit more preparation in finding out which HTTP requests are to be used. Its big advantage is that in production it is much faster, requires much less memory, and should be more robust.
In the rare cases where there is some essential JavaScript activity in the browser, we may use the browser emulation solution. However, this is much more complex to set up and has major problems at the authentication step.