Adjusting timeouts for Nokogiri connections

Posted 2019-05-18 11:33

Question:

Why does Nokogiri wait a couple of seconds (3-5) when the server is busy and I request pages one by one, but when the same requests run in a loop it does not wait and raises the timeout error right away? I wrap the request in a timeout block, but Nokogiri does not wait for that long at all. Is there a recommended way to handle this?

# this is a method from the eng class
# (needs require 'open-uri', 'timeout' and 'nokogiri')
def get_page(url, page_type)
  Timeout.timeout(10) do
    # Get a Nokogiri::HTML::Document for the page we're interested in...
    @@doc = Nokogiri::HTML(URI.open(url))
  end
rescue Timeout::Error
  puts "Time out connection request"
  raise
end

# this is a snippet from the main app calling eng class
# receives a hash with urls and goes through them one by one
def retrieve_in_loop(links)
  # exclusive range: 0..links.length would run one index past the end
  (0...links.length).each do |idx|
    url = links[idx]
    puts "Visiting link #{idx} of #{links.length}"
    puts "link: #{url}"
    begin
      @@eng.get_page(url, product)
    rescue StandardError => e
      puts "Error getting url: #{idx} #{url}"
      puts "This link will be skipped. Continuing with next one"
    end
  end
end

Answer 1:

The timeout block only sets the maximum time the code inside it may run before Timeout::Error is raised. It does not change any timeout inside Nokogiri or OpenURI.

You can set the timeout to a year, but OpenURI can still time out whenever it likes.

So your problem is most likely that OpenURI is timing out on the connection attempt itself. Nokogiri has no timeouts; it's just a parser.
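A minimal sketch of that point, with no network involved. The method simulated_fetch is a hypothetical stand-in for open(url) that fails the way a timed-out connection attempt does in Net::HTTP, which OpenURI uses for HTTP requests. Because Net::OpenTimeout is a subclass of Timeout::Error, the rescue clause from the question catches it almost immediately, which can look as if the 10-second block had fired early:

```ruby
require "net/http" # defines Net::OpenTimeout
require "timeout"

# Hypothetical stand-in for open(url): raises the error Net::HTTP raises
# when a connection attempt times out, without touching the network.
def simulated_fetch
  raise Net::OpenTimeout, "simulated connection timeout"
end

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
outcome =
  begin
    Timeout.timeout(10) { simulated_fetch }
    :completed
  rescue Timeout::Error => e
    # Net::OpenTimeout inherits from Timeout::Error, so it lands here too,
    # long before the 10-second limit of the outer block.
    e.class
  end
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

puts "rescued #{outcome} after #{elapsed.round(2)}s"
```

Printing e.class in the real rescue block is a cheap way to tell whether the outer block expired or the HTTP layer gave up first.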

Adjusting read timeout

With OpenURI as it stood when this was written, the only timeout you can adjust is the read timeout; the connection timeout could not be changed through this method (later open-uri releases also document an :open_timeout option):

URI.open(url, :read_timeout => 10)

Adjusting connection timeout

To adjust the connection timeout, use Net::HTTP directly, which exposes both the open and the read timeout:

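require 'net/http'
require 'uri'
require 'nokogiri'

uri = URI.parse(url)

http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = (uri.scheme == 'https')  # needed for https URLs
http.open_timeout = 10  # seconds allowed for establishing the connection
http.read_timeout = 10  # seconds allowed for each read from the socket

# request_uri keeps the query string and falls back to "/" for an empty path
response = http.get(uri.request_uri)

Nokogiri.parse(response.body)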
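For the loop in the question, the same settings can be wrapped in a helper that returns nil instead of raising, so a busy server only costs one skipped link. This is a rough sketch rather than a definitive implementation: fetch_body and fetch_all are hypothetical names, the 10-second defaults are arbitrary, and parsing the returned body with Nokogiri is left to the caller.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: fetch one URL with explicit connect and read timeouts.
# Returns the response body, or nil when the request fails or times out.
def fetch_body(url, open_timeout: 10, read_timeout: 10)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == "https")
  http.open_timeout = open_timeout
  http.read_timeout = read_timeout
  response = http.get(uri.request_uri)
  response.is_a?(Net::HTTPSuccess) ? response.body : nil
rescue Net::OpenTimeout, Net::ReadTimeout, SocketError, SystemCallError => e
  warn "Skipping #{url}: #{e.class}"
  nil
end

# Hypothetical loop: visit each link once and keep only the pages that loaded.
def fetch_all(links)
  pages = {}
  links.each_with_index do |url, idx|
    puts "Visiting link #{idx + 1} of #{links.length}"
    body = fetch_body(url)
    pages[url] = body if body
  end
  pages
end
```

Calling fetch_all(links) returns a hash from URL to body for the links that answered in time; each body can then be passed to Nokogiri::HTML as in the question.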

You can also take a look at some additional discussion here:

Ruby Net::HTTP time out
Increase timeout for Net::HTTP