Scraping a site that requires login username and p

Posted 2019-04-12 05:54

Question:

I'm trying to scrape information from my company's intranet so that I can display it on our office wall board via a Dashing dashboard. I'm working from an example in an online resource. My problem, other than being a beginner, is that the information I want sits behind a two-step login: I enter my username on one page, submit it, and then enter my password on a second page. Once I'm logged in, I can follow links and scrape my data.

Here is some source code from my login username page:

<form action='loginauthpwd.asp?PassedURL=' method='post' style='margin: 0px;'><table border='0' cellspacing='1' width='999' height='350'><tr><td width='100'>&nbsp;</td><td valign='center' width='100'><table style='width: 350px; background-color: #EEEEEE; border: 1px solid gray;'><tr><td class='fontBlack' style='padding: 10px; vertical-align: top;'><span style='font-weight: bold;'>Username:</span><br><input type='text' class='normal' autocomplete='off' id='LoginUser' name='LoginUser' style='border: 1px solid gray; height: 16px; font-family: arial; font-size: 11; width: 180px;' maxlength='30'><input class='normal_button' type='button' value='Go' style='border: 1px solid gray; font-weight: bold; width: 80px; margin-left: 10px;' onclick="var username=document.getElementById('LoginUser').value; if (username.length > 2) { submit(); } else { alert('Enter your Username.'); }"></form>

Here is some source from my login password page:

<form action='loginauthprocess.asp?UserName=******&Page=&PassedURL=' target='_top' method='post' onsubmit='checkMyBrowser();' style='margin: 0px;'><table border='0' cellspacing='1' width='999' height='350'><tr><td width='100'>&nbsp;</td><td valign='center' width='100'><table style='width: 350px; background-color: #EEEEEE; border: 1px solid gray;'><tr><td class='fontBlack' style='padding: 10px; vertical-align: top;'><span style='font-weight: bold;'>Password:</span><br><input class='normal' type='password' autocomplete='off' id='LoginPassword' name='LoginPassword' style='border: 1px solid gray; height: 16px; font-family: arial; font-size: 11; width: 180px;' maxlength='30'><input class='normal_button' type='submit' value='Log In' style='border: 1px solid gray; font-weight: bold; width: 80px; margin-left: 10px;' onclick="var password=document.getElementById('LoginPassword').value; if (password.length > 2) { submit(); } else { alert('Enter your Password.'); }"></form>

Following that resource's example, this is what I think should work, but it doesn't:

require 'mechanize'
@agent = Mechanize.new
@agent.verify_mode = OpenSSL::SSL::VERIFY_NONE

##Login Page:
page = @agent.get 'http://www.website_here.com/intranet/login.asp'

##Username Page:
form = page.forms[0]
form['USER NAME HERE'] = LoginUser
##Submit User:
page = form.submit

##Password Page:
form = page.forms[0]
form['USER PASSWORD HERE'] = LoginPassword
##Submit Password:
page = form.submit

When I test my code I get the following output:

test.rb:10:in `<main>': uninitialized constant LoginUser (NameError)

Can anyone point out what I'm doing wrong?

Thanks

EDIT 3/27/15:

Using the resource @seoyoochan linked, I tried writing my code like this:

require 'rubygems'
require 'mechanize'
login_page  = agent.get "http://www.website_here.com/intranet/loginauthusr.asp?Page="
login_form = login_page.form_with(action: '/sessions') 
user_field = login_form.field_with(name: "session[user]") 
user.value = 'My User Name'

login_form.submit

When I try to run my code I'm now getting this output:

test.rb:4:in `<main>': undefined local variable or method `agent' for main:Object (NameError)

I need an example of how to refer to the right field names/classes so that my code works with the forms I posted above.

EDIT 4/4/15:

Okay, now using @tylermauthe's example, I'm trying to test the following code:

require 'mechanize'
require 'io/console'

agent = Mechanize.new
page = agent.get('http://www.website_here.com/intranet/loginauthusr.asp?Page=')

form = page.forms.find{|form| form.action.include?("loginauthpwd.asp?PassedURL=")}

puts "Login:"
form.login = gets.chomp
page = agent.submit(form)
pp page

My thinking is that this code should let me enter and submit my username, taking me to the next page, which asks for my password. But when I run it and enter my username, I get the following output:

/var/lib/gems/1.9.1/gems/mechanize-2.7.3/lib/mechanize/form.rb:217:in `method_missing': undefined method `loginUser=' for # (NoMethodError)
    from scraper.rb:10:in `<main>'

What am I missing, or what have I entered incorrectly? Please refer to my first edit to see how my form is coded. Also, to be clear, I did not code the forms this way; I'm only trying to learn how to code and scrape the data I need for my Dashing dashboard project.

Answer 1:

I was able to log in with the following example. Thanks to everyone who helped me with resources and examples to learn from!

require 'nokogiri'
require 'mechanize'

agent = Mechanize.new

# Open the page that asks for the username, fill in the first form, and submit it.
# Use the name attribute of your own form's input here (LoginUser in the question's form).

login = agent.get('http://www.website_here.com')
login_form = login.forms.first
username_field = login_form.field_with(:name => "user_session[username]")
username_field.value = "YOUR USERNAME HERE"  # set the field's value; plain assignment would only rebind the variable
page = agent.submit login_form

# The response to that submit is the page that asks for the password.
# Fill in its first form and submit it (LoginPassword in the question's form).

login_form = page.forms.first
password_field = login_form.field_with(:name => "user_session[password]")
password_field.value = "YOUR PASSWORD HERE"
page = agent.submit login_form

# Print the resulting page to confirm that you are logged in.

pp page

I adapted this from an example by user Senthess. I'm still not 100% sure what each line is doing, so if anyone would like to take the time to break it down, please do. It would help me and others understand it better.

Thanks!
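One detail in the code above that is easy to miss: a form field is an object, and the text it submits lives in its value attribute. Assigning a string to the variable that holds the field only re-points the variable, so the form never sees the string. Here is a minimal sketch of the difference. It uses a plain Struct as a stand-in for the field object so that it runs without the gem or a network connection; FakeField and both helper methods are hypothetical names, not part of Mechanize.

```ruby
# FakeField is a hypothetical stand-in for a form field object.
# It is not part of Mechanize; it only carries a name and a value.
FakeField = Struct.new(:name, :value)

# Finds the field, then rebinds the local variable to a String.
# The field object stored in `fields` is left untouched.
def fill_by_rebinding(fields, name, text)
  field = fields.find { |f| f.name == name }
  field = text
  field
end

# Finds the field and sets its value attribute, which changes the object
# that `fields` still refers to.
def fill_by_value(fields, name, text)
  field = fields.find { |f| f.name == name }
  field.value = text
end

fields = [FakeField.new('LoginUser', nil)]

fill_by_rebinding(fields, 'LoginUser', 'jdoe')
p fields.first.value   # => nil

fill_by_value(fields, 'LoginUser', 'jdoe')
p fields.first.value   # => "jdoe"
```

With Mechanize itself, the second form corresponds to calling value= on the object returned by field_with.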



Answer 2:

I just looked into the Mechanize gem and found a relevant solution. You must refer to each input field by its proper 'name' attribute; otherwise you can't set a value on it. Follow this article:

http://crabonature.pl/posts/23-automation-with-mechanize-and-ruby
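The reason the name attribute matters: when a urlencoded form is submitted, each input is sent as a name=value pair, and the server reads the value back by that name. The id, class, and visible label are not sent. Below is a small sketch using only Ruby's standard library. The field names LoginUser and LoginPassword are taken from the forms in the question; the credentials and the form_body helper are made up for illustration.

```ruby
require 'uri'

# Build the body of a urlencoded form POST: one name=value pair per input,
# keyed by the input's name attribute. form_body is a hypothetical helper.
def form_body(fields)
  URI.encode_www_form(fields)
end

# Field names come from the HTML in the question; the values are placeholders.
puts form_body('LoginUser' => 'jdoe')          # LoginUser=jdoe
puts form_body('LoginPassword' => 'my secret') # LoginPassword=my+secret
```

This is why a key such as 'USER NAME HERE' or a method such as form.login cannot work against that form: neither matches the name of an input that actually exists in it.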



Answer 3:

Not sure if you've found these, but Mechanize has excellent docs: http://docs.seattlerb.org/mechanize/GUIDE_rdoc.html

From these, I played around in the irb REPL to create this simple scraper that logs into GitHub: https://gist.github.com/tylermauthe/781f68add24819e207c4
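Adapted to the two-page form in the question, the same approach might look like the sketch below. It is untested against the real intranet and rests on a few assumptions: the field names LoginUser and LoginPassword and the form actions are copied from the HTML in the question, the URL is the question's placeholder, and two_step_login is a made-up helper name. The agent is passed in rather than created inside the method so that the flow can be exercised with stand-in objects, without the gem or a network connection.

```ruby
# Sketch of the two-page login from the question. `agent` is expected to be
# a Mechanize instance (Mechanize.new). The calls used on it and on the
# objects it returns are get, forms, action, []= and submit.
def two_step_login(agent, login_url, username, password)
  # Step 1: fetch the username page and pick the form that posts to the
  # password page, then fill the input whose name attribute is LoginUser.
  page = agent.get(login_url)
  user_form = page.forms.find { |f| f.action.include?('loginauthpwd.asp') }
  raise 'username form not found' unless user_form
  user_form['LoginUser'] = username

  # Step 2: submitting returns the password page. Fill the input whose
  # name attribute is LoginPassword and submit that form as well.
  page = user_form.submit
  pass_form = page.forms.find { |f| f.action.include?('loginauthprocess.asp') }
  raise 'password form not found' unless pass_form
  pass_form['LoginPassword'] = password

  # The returned page is whatever the site shows after login.
  pass_form.submit
end

# Usage with the real gem (not run here; the URL is the question's placeholder):
#   require 'mechanize'
#   page = two_step_login(Mechanize.new,
#                         'http://www.website_here.com/intranet/loginauthusr.asp?Page=',
#                         'my_username', 'my_password')
#   puts page.title
```

Mechanize does not run JavaScript, so the onclick length checks and the checkMyBrowser() handler in the question's forms are skipped. If the site depends on what that handler does, this sketch may not be enough on its own.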