Element not found in the cache - perhaps the page

Posted 2019-07-16 10:04

Question:

I am trying to write a crawler that follows every link on a loaded page and logs all request and response headers, along with the response body, to a file (XML or plain text, say). I open each link from the first loaded page in a new browser window so that I won't get this error:

Element not found in the cache - perhaps the page has changed since it was looked up

I want to know whether there is an alternative way to make requests to, and receive responses from, all of the links, and then locate the input elements and submit buttons in all of the opened windows. I can do this to some extent, except when an opened window has a site-wide search box like the one in the upper right corner of http://www.testfire.net. I want to skip such common boxes so that I can fill the other inputs with values using WebDriver's i.send_keys "value" method without getting this error: ERROR: Element not found in the cache - perhaps the page has changed since it was looked up.

How can I detect and distinguish the input tags in each opened window so that a value does not get filled in repeatedly for common input tags that appear on most pages of the website? My code is as follows:

require 'rubygems'
require 'selenium-webdriver'

class Clicker
  def open_new_window(url)
    @driver = Selenium::WebDriver.for :firefox
    @driver.get url
    @link = @driver.find_elements(:tag_name, "a")
    @link.each do |link|
      # Append a copy of the link with target=_blank and click it so the page
      # opens in a new window (an anchor element stringifies to its href).
      new_link = @driver.execute_script("var d=document,a=d.createElement('a');a.target='_blank';a.href=arguments[0];a.innerHTML='.';d.body.appendChild(a);return a", link)
      new_link.click
    end
    handles = @driver.window_handles
    handles.each do |handle|
      @driver.switch_to.window(handle)
      puts @driver.current_url
      inputs = @driver.find_elements(:tag_name, 'input')
      forms = @driver.find_elements(:tag_name, 'form')
      inputs.each do |input|
        begin
          input.send_keys "value"
          puts input.class
          input.submit
        rescue StandardError => exc
          # Covers timeouts as well as WebDriver errors such as
          # "Element not found in the cache".
          puts "ERROR: #{exc.message}"
        end
      end
      forms.each do |form|
        begin
          form.send_keys "value"
          form.submit
        rescue StandardError => exc
          puts "ERROR: #{exc.message}"
        end
      end
    end
    # Switch back to the original window
    @driver.switch_to.window(handles[0])
  end
end

ol = Clicker.new
url = "http://test.acunetix.com"
ol.open_new_window(url)

Please guide me: how can I get all request and response headers, along with the response body, using Selenium WebDriver or the set_debug_output method of Ruby's net/http?

Answer 1:

Selenium is not the best choice for building a web crawler. It can be flaky at times, especially when it runs into unexpected scenarios. Selenium WebDriver is a great tool for automating and testing expected behavior and user interactions. For crawling, good old-fashioned curl would probably be a better option. I am also fairly sure there are Ruby gems that can help you crawl the web; a quick search should turn them up.

But to answer the actual question, if you were to use Selenium WebDriver:

I'd work out a filtering algorithm in which you add the HTML of each element you interact with to an array. Then, when you move on to the next window, tab, or link, you check each element against that array and skip it if its HTML matches an entry.
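The filtering idea above might look something like the following minimal sketch. The class name SeenElements and the method first_visit? are hypothetical; the only thing assumed about an element is that it responds to attribute("outerHTML"), as Selenium's Ruby elements do. A Set is used in place of a plain array so the lookup stays cheap. Note the limitation: two genuinely different elements with identical markup would be treated as the same one.

```ruby
require 'set'

# Remembers the markup of every element already handled, so that widgets
# repeated across pages (such as a site-wide search box) are filled only once.
# SeenElements is a hypothetical helper, not part of Selenium.
class SeenElements
  def initialize
    @seen = Set.new
  end

  # Returns true the first time a given outerHTML is offered, false afterwards.
  def first_visit?(element)
    html = element.attribute('outerHTML').to_s.strip
    # Set#add? returns nil when the value was already present.
    !@seen.add?(html).nil?
  end
end

# Usage inside the crawl loop (one SeenElements instance shared by all windows):
#   seen = SeenElements.new
#   inputs.each do |input|
#     next unless seen.first_visit?(input)
#     input.send_keys "value"
#   end
```

Create the SeenElements instance once, before switching between windows, so that it carries its memory from one page to the next.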

Unfortunately, Selenium WebDriver (SWD) does not expose request headers or responses through its API. The common workaround is to use a third-party proxy to intercept the requests.
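If driving a real browser is not essential, the headers and body can also be captured outside Selenium with the set_debug_output method the question mentions. Below is a minimal sketch that assumes nothing beyond Ruby's standard net/http library; the method name log_exchange is hypothetical. Ruby's own documentation warns against using set_debug_output in production, because it dumps everything that goes over the wire, so treat this as a debugging aid only.

```ruby
require 'net/http'
require 'uri'

# Fetches +url+ with a GET request and writes the raw request and response
# (headers and body) to +log+, which can be any object that supports <<,
# for example $stderr, a File, or a StringIO.
# log_exchange is a hypothetical helper name.
def log_exchange(url, log = $stderr)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == 'https')
  http.set_debug_output(log) # must be set before the connection is opened
  response = http.get(uri.request_uri)
  # The response headers are also available programmatically.
  response.each_header { |name, value| log << "#{name}: #{value}\n" }
  response
end

# Usage:
#   File.open('exchange.txt', 'w') do |file|
#     log_exchange('http://test.acunetix.com/', file)
#   end
```

This only sees requests that your Ruby process makes itself; traffic generated by the browser under WebDriver still needs the proxy approach.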

============

Now I'd like to address a few issues with your code.

Before iterating over the links, I'd suggest adding @default_current_window = @driver.window_handle. This lets you always return to the correct window at the end of your script by calling @driver.switch_to.window(@default_current_window).

In your @link iterator, instead of iterating over every window that could be open, use @driver.switch_to.window(@driver.window_handles.last). This switches to the most recently opened window (and it only needs to happen once per link click!).
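Put together, the two suggestions above might look like the following sketch. The method name each_link_in_new_window is hypothetical, and driver is assumed to be a Selenium::WebDriver driver instance; only window_handle, window_handles, switch_to.window, execute_script, and close are called on it. It also assumes, as the advice above does, that the last entry of window_handles is the newest window, and that the browser does not block window.open as a pop-up.

```ruby
# Opens each href in a new window, yields while that window has focus,
# then closes it and returns to the window the crawl started from.
# +driver+ is assumed to be a Selenium::WebDriver driver instance.
def each_link_in_new_window(driver, hrefs)
  original = driver.window_handle # remember where we started
  hrefs.each do |href|
    driver.execute_script('window.open(arguments[0]);', href)
    # Assumption: the most recently opened window is listed last.
    driver.switch_to.window(driver.window_handles.last)
    begin
      yield driver
    ensure
      driver.close                      # close the new window
      driver.switch_to.window(original) # always go back to the start
    end
  end
end

# Usage:
#   hrefs = driver.find_elements(:tag_name, 'a').map { |a| a.attribute('href') }.compact
#   each_link_in_new_window(driver, hrefs) { |d| puts d.current_url }
```

Collecting the href strings before opening any window means no element reference has to survive a page change, which is what triggers the "Element not found in the cache" error.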

You can DRY up your inputs and form code by doing something like this:

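inputs = []
inputs << @driver.find_elements(:tag_name => "input")
inputs << @driver.find_elements(:tag_name => "form")
# flatten! (with the bang) modifies the array in place; plain flatten
# would return a new array and leave inputs nested.
inputs.flatten!
inputs.each do |i|
  begin
    i.send_keys "value"
    i.submit
  rescue => e
    puts "ERROR: #{e.message}"
  end
end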

Note how I added all of the elements you want SWD to find to a single array that you iterate over. Then, when something goes wrong, a single rescue is enough (I assume you don't want the script to quit at that point, which is why you just print the message to the screen).

Learning to DRY up your code and use external gems will help you achieve a lot of what you are trying to do, and at a faster pace.