I'm trying to GET about seven dozen URLs in parallel with two scripts: the first is below, using HTTP::Async, and the second one is on pastebin, using Net::Async::HTTP.
The problem is that I'm getting pretty much the same timings with both: about 8..14 seconds for the whole list of URLs. That is unacceptably slow compared to curl+xargs started from the shell, which gets everything in less than 3 seconds with 10-20 "threads".
For example, Devel::Timer in the first script shows that the max queue length is even less than 6 ($queue->in_progress_count <= 5, and $queue->to_send_count = 0 always). So it looks like the foreach with $queue->add is executing too slowly, and I don't know why.
I got pretty much the same situation with Net::Async::HTTP (second script on pastebin), which is even slower than the first.
So, please, does anybody know what I'm doing wrong? How can I get concurrent download speed at least comparable to curl+xargs started from the shell?
#!/usr/bin/perl -w
use utf8;
use strict;
use POSIX qw(ceil);
use XML::Simple;
use Data::Dumper;
use HTTP::Request;
use HTTP::Async;
use Time::HiRes qw(usleep time);
use Devel::Timer;
#settings
use constant passwd => 'ultramegahypapassword';
use constant agent => 'supa agent dev.alpha';
use constant timeout => 10;
use constant slots => 10;
use constant debug => 1;
my @qids;
my @xmlz;
my $queue = HTTP::Async->new(slots => slots,max_request_time => 10, timeout => timeout, poll_interval => 0.0001);
my %responses;
my @urlz = (
'http://testpodarki.afghanet/api/products/4577',
'http://testpodarki.afghanet/api/products/4653',
'http://testpodarki.afghanet/api/products/4652',
'http://testpodarki.afghanet/api/products/4571',
'http://testpodarki.afghanet/api/products/4572',
'http://testpodarki.afghanet/api/products/4666',
'http://testpodarki.afghanet/api/products/4576',
'http://testpodarki.afghanet/api/products/4574',
'http://testpodarki.afghanet/api/products/4651',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[3294]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[3294]',
'http://testpodarki.afghanet/api/combinations/?display=full&filter[id_product]=[4577]',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4577]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4577]',
'http://testpodarki.afghanet/api/product_option_values/188',
'http://testpodarki.afghanet/api/product_option_values/191',
'http://testpodarki.afghanet/api/product_option_values/187',
'http://testpodarki.afghanet/api/product_option_values/190',
'http://testpodarki.afghanet/api/product_option_values/189',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4653]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4653]',
'http://testpodarki.afghanet/api/images/products/4577/12176',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4652]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4652]',
'http://testpodarki.afghanet/api/images/products/4653/12390',
'http://testpodarki.afghanet/api/combinations/?display=full&filter[id_product]=[4571]',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4571]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4571]',
'http://testpodarki.afghanet/api/images/products/4652/12388',
'http://testpodarki.afghanet/api/product_option_values/175',
'http://testpodarki.afghanet/api/product_option_values/178',
'http://testpodarki.afghanet/api/product_option_values/179',
'http://testpodarki.afghanet/api/product_option_values/180',
'http://testpodarki.afghanet/api/product_option_values/181',
'http://testpodarki.afghanet/api/images/products/3294/8965',
'http://testpodarki.afghanet/api/product_option_values/176',
'http://testpodarki.afghanet/api/product_option_values/177',
'http://testpodarki.afghanet/api/combinations/?display=full&filter[id_product]=[4572]',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4572]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4572]',
'http://testpodarki.afghanet/api/product_option_values/176',
'http://testpodarki.afghanet/api/product_option_values/181',
'http://testpodarki.afghanet/api/product_option_values/180',
'http://testpodarki.afghanet/api/images/products/4571/12159',
'http://testpodarki.afghanet/api/product_option_values/177',
'http://testpodarki.afghanet/api/product_option_values/179',
'http://testpodarki.afghanet/api/product_option_values/175',
'http://testpodarki.afghanet/api/product_option_values/178',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4666]',
'http://testpodarki.afghanet/api/combinations/?display=full&filter[id_product]=[4576]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4666]',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4576]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4576]',
'http://testpodarki.afghanet/api/images/products/4572/12168',
'http://testpodarki.afghanet/api/product_option_values/185',
'http://testpodarki.afghanet/api/product_option_values/182',
'http://testpodarki.afghanet/api/product_option_values/184',
'http://testpodarki.afghanet/api/product_option_values/183',
'http://testpodarki.afghanet/api/product_option_values/186',
'http://testpodarki.afghanet/api/images/products/4666/12413',
'http://testpodarki.afghanet/api/combinations/?display=full&filter[id_product]=[4574]',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4574]',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4574]',
'http://testpodarki.afghanet/api/product_option_values/177',
'http://testpodarki.afghanet/api/product_option_values/181',
'http://testpodarki.afghanet/api/images/products/4576/12174',
'http://testpodarki.afghanet/api/product_option_values/176',
'http://testpodarki.afghanet/api/product_option_values/180',
'http://testpodarki.afghanet/api/product_option_values/179',
'http://testpodarki.afghanet/api/product_option_values/175',
'http://testpodarki.afghanet/api/product_option_values/178',
'http://testpodarki.afghanet/api/specific_prices/?display=full&filter[id_product]=[4651]',
'http://testpodarki.afghanet/api/images/products/4574/12171',
'http://testpodarki.afghanet/api/stock_availables/?display=full&filter[id_product]=[4651]',
'http://testpodarki.afghanet/api/images/products/4651/12387'
);
my $timer = Devel::Timer->new();
foreach my $el (@urlz) {
    my $request = HTTP::Request->new(GET => $el);
    $request->header(User_Agent => agent);
    $request->authorization_basic(passwd,'');
    push @qids,$queue->add($request);
    $timer->mark("pushed [$el], to_send=".$queue->to_send_count().", to_return=".$queue->to_return_count().", in_progress=".$queue->in_progress_count());
}
$timer->mark('requestz pushed');
# poll until nothing is left in flight
while ($queue->in_progress_count) {
    usleep(2000);
    $queue->poke();
}
$timer->mark('requestz complited');
process_responses();
$timer->mark('responzez processed');
foreach my $q (@xmlz) {
    # print ">>>>>>".Dumper($q)."<<<<<<<<\n";
}
$timer->report();
print "\n\n";
My best results with HTTP::Async are well over 4 and up to over 5 seconds. As I understand it, that approach isn't required, and a simple forking approach takes a little over 2 and at most just under 3 seconds.
It uses Parallel::ForkManager and LWP::UserAgent for downloads.
The files are written using Path::Tiny. Its path builds an object and its spew routines write the file.
For reference, the sequential downloads take around 26 seconds.
With the maximum number of processes set to 30 this takes over 4 seconds, and with 60 it is a little over 2 seconds, about the same as with (up to) 90. There are 70 urls in this test.
Tested on a 4-core laptop with a decent network connection. (Here the CPU isn't all that important.) The tests were run repeatedly, at multiple times and on multiple days.
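The forking approach described in this answer can be sketched roughly as follows. This is not the code that was timed above, only a minimal illustration assuming Parallel::ForkManager, LWP::UserAgent and Path::Tiny are installed; the sub name fetch_all, the output directory and the <index>.out file naming are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Parallel::ForkManager;
use LWP::UserAgent;
use Path::Tiny qw(path);
use Time::HiRes qw(time);

# Fetch each URL in its own child process, at most $max_procs at a time.
# Each body is written to $dir/<index>.out (hypothetical naming scheme).
sub fetch_all {
    my ($urls, $dir, $max_procs) = @_;
    path($dir)->mkpath;

    my $pm = Parallel::ForkManager->new($max_procs);
    my $ua = LWP::UserAgent->new(timeout => 10);

    my $n = 0;
    for my $url (@$urls) {
        my $file = path($dir, ++$n . '.out');
        $pm->start and next;    # parent: move on to the next URL

        # child: download one URL, write it out, exit
        my $res = $ua->get($url);
        if ($res->is_success) {
            $file->spew_raw($res->content);
        }
        else {
            warn "$url: ", $res->status_line, "\n";
        }
        $pm->finish;
    }
    $pm->wait_all_children;
    return $n;
}

# Run only when URLs are given on the command line.
if (@ARGV) {
    my $t0 = time;
    my $n  = fetch_all(\@ARGV, 'downloads', 30);
    printf "%d urls in %.2f s\n", $n, time - $t0;
}
```

The process limit (30 here) is the knob varied in the timings above; the child never returns data to the parent, it only writes its file, which keeps the example short.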
A comparison with the approach from the question
The best HTTP::Async results are slower than the above by around a factor of two. They are obtained with 30-40 "slots"; for higher numbers the time goes up, which puzzles me. The module uses select to multiplex, via Net::HTTP::NB (a non-blocking version of Net::HTTP). While select "does not scale well", this concerns hundreds of sockets, and I'd expect to be able to use more than 40 on this network-bound problem. The simple forked approach does.
Also, select is considered to be a slow method to monitor sockets, while forks don't even need that, as each process has its own url. (This may result in the module's overhead when there are many connections?) Fork's inherent overhead is fixed and dwarfed by network access. If we were after (many) hundreds of downloads the system may get strained by processes, but select wouldn't fare well either.
Finally, select based methods download strictly one file at a time, and the effect is seen by printing as requests are added: we can see the delay. The forked downloads go in parallel (in this case all 70 at the same time without a problem). Then there'll be a network or disk bottleneck but that is tiny in comparison to the gain.
Update: I pushed this to double the number of sites and processes, saw no signs of OS/CPU strain, and retained the average speed.
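For the HTTP::Async side of this comparison, a loop of the following shape is a reasonable baseline. It is only a sketch under assumptions, not the exact code behind the timings: the sub names build_request and fetch_async and the agent string are invented for illustration, while add, not_empty and wait_for_next_response are the module's documented methods.

```perl
use strict;
use warnings;
use HTTP::Async;
use HTTP::Request;

# Build a GET request; the agent string is a placeholder.
sub build_request {
    my ($url) = @_;
    my $req = HTTP::Request->new(GET => $url);
    $req->header(User_Agent => 'example-agent/0.1');
    return $req;
}

# Queue all URLs, then collect responses as they complete.
# Returns a hashref of url => body (undef for failed requests).
sub fetch_async {
    my ($urls, $slots) = @_;
    my $async = HTTP::Async->new(slots => $slots, timeout => 10);

    my %url_for;    # request id => url
    for my $url (@$urls) {
        my $id = $async->add(build_request($url));
        $url_for{$id} = $url;
    }

    my %body_for;
    while ($async->not_empty) {
        # Blocks until a response is ready; list context gives (response, id).
        my ($res, $id) = $async->wait_for_next_response;
        next unless $res;
        $body_for{ $url_for{$id} } =
            $res->is_success ? $res->decoded_content : undef;
    }
    return \%body_for;
}

# Run only when URLs are given on the command line.
if (@ARGV) {
    my $bodies = fetch_async(\@ARGV, 30);
    printf "%d responses\n", scalar keys %$bodies;
}
```

Blocking in wait_for_next_response avoids the usleep/poke polling loop from the question; the slots value (30 here) is the one to experiment with.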
So I'd say, if you need to shave off every second use forks. But if this is not critical and there are other benefits of HTTP::Async (or such) then be content with (just a bit) longer downloads.
The HTTP::Async code that performs well ended up being simple.
I have also tried to tweak headers and timings. (This included dropping keep-alive as suggested, by $request->header(Connection => 'close')
, to no effect.)
To explain my comment: I was curious and, having never used Net::Async::HTTP before, wanted to try your script locally. So I created a minimalist Plack app.psgi.
The server understands URLs of the form GET /sleep_time/reqID, where sleep_time is passed to usleep (so it is in microseconds), and the server sleeps for the given time before responding, i.e. it fakes some "processing time".
E.g. when requesting GET /1000000/1, the server will sleep 1 second before responding. The response includes the PID of the responding process.
In one terminal window, run the above using the Starman preforking server with default 20 workers.
And in another window, the results using xargs: sending 20 requests, where each response took 1s of processing time.
So, 20 requests = 4 seconds. It is visible that the responding PIDs are different, i.e. the responses are sent by different workers.
Now using your script async.pl (slightly shortened/modified):
Same 20 requests = 20 seconds, and each request is served by the same PID. Like pure sequential processing. :(
This is probably because the requests are reusing the connection (i.e. keep-alive).
Finally, unfortunately, as I said, I don't have any experience with the module, so I have no idea how to force it not to reuse the opened connection.
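The test server described in this answer can be approximated with a few lines of PSGI. This is a reconstruction from the description, not the original app.psgi: the exact response text and the 400 handling are assumptions.

```perl
# app.psgi
use strict;
use warnings;
use Time::HiRes qw(usleep);

my $app = sub {
    my ($env) = @_;

    # Expect /<sleep_time in microseconds>/<request id>
    my ($sleep, $id) = $env->{PATH_INFO} =~ m{^/(\d+)/(\w+)$}
        or return [ 400, [ 'Content-Type' => 'text/plain' ], ["bad request\n"] ];

    usleep($sleep);    # fake some "processing time"

    return [
        200,
        [ 'Content-Type' => 'text/plain' ],
        ["req $id slept $sleep us, served by PID $$\n"],
    ];
};

$app;
```

It can be served with something like plackup -s Starman app.psgi. Because the PID is in the body, it is easy to see from the client side whether requests were spread over several workers or all funneled through one connection.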
Async will be slower than parallel downloading: the async code yields to other calls only while waiting for a response, but the downloading itself happens sequentially in a single process, whereas curl+xargs works 100% (well, almost 100%, as long as you don't saturate the cores) in parallel, the same as forked workers.
Please search for "concurrency is not parallelism".
So, I finally got a working sample (full script). It uses Furl and fork_call from AnyEvent::Util. This example returns in ~3 secs, which is good enough. If you need to use basic HTTP auth, just use a URI with creds like this: username:password@hostdomain.org/path?param1=val1&param2=val2. You'd better add use EV; before using AnyEvent because EV is the fastest.
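The full script is not reproduced here, but the combination it describes might look like the sketch below. The sub name fetch_forked, the timeout and the MAX_FORKS value of 20 are assumptions chosen for illustration; fork_call and $AnyEvent::Util::MAX_FORKS come from AnyEvent::Util, and get, code and content from Furl.

```perl
use strict;
use warnings;
use AnyEvent;
use AnyEvent::Util qw(fork_call);
use Furl;

# Cap on simultaneously running child processes (value is an assumption).
$AnyEvent::Util::MAX_FORKS = 20;

# Download every URL in a forked child; returns url => { code, content }.
sub fetch_forked {
    my ($urls) = @_;
    my %result;
    my $cv = AnyEvent->condvar;

    $cv->begin;    # guard so $cv cannot fire while requests are being queued
    for my $url (@$urls) {
        $cv->begin;
        fork_call {
            # Runs in a child; the returned list is passed back to the parent.
            my $res = Furl->new(timeout => 10)->get($url);
            return ($res->code, $res->content);
        } sub {
            my ($code, $content) = @_;
            # If the child died, the callback gets no arguments and $@ is set.
            warn "$url failed: $@\n" unless defined $code;
            $result{$url} = { code => $code, content => $content };
            $cv->end;
        };
    }
    $cv->end;
    $cv->recv;     # wait until every callback has run
    return \%result;
}

# Run only when URLs are given on the command line.
if (@ARGV) {
    my $res = fetch_forked(\@ARGV);
    printf "%s => %s\n", $_, $res->{$_}{code} // 'failed' for sort keys %$res;
}
```

Each download still runs in its own process, as in the other forking answer; AnyEvent only coordinates the children and collects their results.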