Here's what I'm trying to accomplish. Let's say I have 100,000 URLs stored in a database, and I want to check each of them for HTTP status and store that status. I want to be able to do this concurrently in a fairly small amount of time.
I was wondering what the best way(s) to do this would be. I thought about using some sort of queue with workers/consumers or some sort of evented model, but I don't really have enough experience to know what would work best in this scenario.
Ideas?
Take a look at the very capable Typhoeus and Hydra combo. The two make it very easy to concurrently process multiple URLs.
The "Times" example should get you up and running quickly. In the on_complete
block put your code to write your statuses to the DB. You could use a thread to build and maintain the queued requests at a healthy level, or queue a set number, let them all run to completion, then loop for another group. It's up to you.
Paul Dix, the original author, talked about his design goals on his blog.
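For the status-check case in the question, a rough sketch of the "queue a batch, run it to completion, then loop" approach might look something like this; `urls` and `save_status` are hypothetical stand-ins for your own database read and write:

require 'typhoeus'

hydra = Typhoeus::Hydra.new(:max_concurrency => 20)

urls.each_slice(1000) do |batch|
  batch.each do |url|
    request = Typhoeus::Request.new(url, :method => :head)
    request.on_complete do |response|
      save_status(url, response.code)  # hypothetical helper that writes the status to the DB
    end
    hydra.queue(request)
  end
  hydra.run  # let this batch finish before queuing the next group
end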
This is some sample code I wrote to download archived mailing lists so I could do local searches. I deliberately removed the URL to keep from subjecting the site to DoS attacks if people start running the code:
#!/usr/bin/env ruby

require 'nokogiri'
require 'addressable/uri'
require 'typhoeus'

# The URL is deliberately blank; see the note above.
BASE_URL = ''

# Fetch the index page and pull out the links to the gzipped archives.
url  = Addressable::URI.parse(BASE_URL)
resp = Typhoeus::Request.get(url.to_s)
doc  = Nokogiri::HTML(resp.body)

hydra = Typhoeus::Hydra.new(:max_concurrency => 10)

doc.css('a').map{ |n| n['href'] }.select{ |href| href[/\.gz$/] }.each do |gzip|
  gzip_url = url.join(gzip)
  request  = Typhoeus::Request.new(gzip_url.to_s)

  # Each file is written to disk as its request completes.
  request.on_complete do |resp|
    gzip_filename = resp.request.url.split('/').last
    puts "writing #{gzip_filename}"
    File.open("gz/#{gzip_filename}", 'w') do |fo|
      fo.write resp.body
    end
  end

  puts "queuing #{ gzip }"
  hydra.queue(request)
end

hydra.run
Running the code on my several-year-old MacBook Pro pulled in 76 files totaling 11MB in just under 20 seconds, over wireless to DSL. If you're only doing HEAD requests, your throughput will be better. You'll want to mess with the concurrency setting, because there is a point where having more concurrent sessions only slows you down and needlessly uses resources.
I give it an 8 out of 10; it's got a great beat and I can dance to it.
EDIT:
When checking the remote URLs you can use a HEAD request, or a GET with an If-Modified-Since header. Either gives you a response you can use to determine the freshness of your URLs.
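For example, a conditional GET might look roughly like this; `url`, `hydra`, and `last_checked_at` are hypothetical placeholders, with the timestamp coming from whatever you stored on the previous check:

require 'time'
require 'typhoeus'

hydra = Typhoeus::Hydra.new(:max_concurrency => 20)

# A 304 response means the resource hasn't changed since the timestamp you sent.
request = Typhoeus::Request.new(
  url,
  :headers => { 'If-Modified-Since' => last_checked_at.httpdate }
)
request.on_complete do |response|
  puts "#{url} not modified" if response.code == 304
end

hydra.queue(request)
hydra.run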
I haven't done anything multithreaded in Ruby, only in Java, but it seems pretty straightforward: http://www.tutorialspoint.com/ruby/ruby_multithreading.htm
From what you described, you don't need any queue and workers (well, I'm sure you can do it that way too, but I doubt you'll get much benefit). Just partition your URLs among several threads, let each thread process its chunk, and update the database with the results. E.g., create 100 threads, and give each thread a range of 1000 database rows to process.
You could even just create 100 separate processes and give them rows as arguments, if you'd rather deal with processes than threads.
To get the URL status, I think you'd do an HTTP HEAD request, which I believe is http://apidock.com/ruby/Net/HTTP/request_head in Ruby.
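Putting those pieces together, a rough sketch of the partitioning approach might look like this; `urls` and `update_status` are placeholders for your own database read and write:

require 'net/http'
require 'uri'

THREAD_COUNT = 100
chunk_size   = [(urls.size / THREAD_COUNT.to_f).ceil, 1].max

threads = urls.each_slice(chunk_size).map do |chunk|
  Thread.new do
    chunk.each do |url|
      uri = URI(url)
      Net::HTTP.start(uri.host, uri.port, :use_ssl => uri.scheme == 'https') do |http|
        response = http.head(uri.request_uri)
        update_status(url, response.code)  # placeholder for the DB write
      end
    end
  end
end

threads.each(&:join)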
The work_queue gem is the easiest way to perform tasks asynchronously and concurrently in your application.
require 'net/http'
require 'uri'
require 'work_queue'

wq = WorkQueue.new 10

urls.each do |url|
  wq.enqueue_b do
    response = Net::HTTP.get_response(URI(url))
    puts response.code
  end
end

wq.join