How to stop forking in this code

Posted 2020-02-15 04:12

Question:

So I have a Perl script that goes out and wgets pieces of a stream (I don't know up front how many pieces there are),

but I can't think of a good way to know when to STOP wget'ing. Right now, if wget returns unsuccessfully, the child creates a file called "END", and once the main program sees it, it stops the loop. Is there a better way to go about doing this?

Obviously it would be easy if it were done sequentially instead of concurrently, but I was trying to make the download as fast as possible.

my $link = $ARGV[0];
my ($url) = $link =~ m/(.+-)\d+\.ts/i;   # escape the dot so it matches a literal "."

my $num = 0;

#while the file END doesn't exist
my @pids;
while (! -e "END") {
        #create the URL, increment by 1
        my $video=$url.++$num.".ts";
        die "could not fork" unless defined (my $pid = fork());

        #child process goes until wget returns invalid, create END
        if (not $pid) {
                system ("wget", "-T", 5, "-t", 5, $video);  # list form: URL is not parsed by a shell
                `touch END` if $? != 0;
                exit;
        }
        push @pids, $pid;
}

#parent process still running, waiting for the same END file.
for my $pid (@pids) { waitpid $pid,0; }

print "pids finished\n";

sleep 1;
`rm END`;

Answer 1:

You don't indicate how many processes there may be, but no resource is unlimited and you should limit the number or you'll see a rapid degradation of performance as you reach saturation.

This is even more so when going out on the network since you may be annoying a server (and things will also stop being faster quite soon). Perhaps run up to a few tens of processes at a time?

Then one option is to limit the number of parallel downloads using Parallel::ForkManager. It has a way to return data to the parent, so a child can report failure. The callback registered with its run_on_finish method can then check for such a failure flag and set a variable that controls the forking.

use warnings;
use strict;
use Parallel::ForkManager;    

my $pm = Parallel::ForkManager->new(2);  # only 2 for a manageable demo
my $stop_forking;

# The callback receives 6 arguments; the first is the child's pid.
# The last is whatever data the child passed (undef if it passed none)
$pm->run_on_finish(
    sub { $stop_forking = 1 if defined $_[-1] }
);

for my $i (0..9)
{
    last if $stop_forking;

    $pm->start and next;    # forks
    my $ret = run_job($i);  # child process

    # Pass data to parent under a condition
    if ($ret eq 'FAIL') {  $pm->finish(0, \$ret) }  # child exits 
    else                {  $pm->finish }
}
$pm->wait_all_children;

sub run_job { 
    my ($i) = $_[0];
    sleep 2;
    print "Child: job $i exiting\n";
    return ($i == 3 ? 'FAIL' : 1);
}

This stops forking after the batch of jobs within which $i == 3. Add prints for diagnostics.

The run_on_finish "callback" runs only when the parent reaps children, which here happens as a batch completes. The anonymous sub in it always receives 6 arguments, starting with the child pid. The last one holds data passed by the child and is defined only if the child passed some; when that happens we set the flag. A child returns data by passing a reference to the finish method. To merely indicate a condition we can pass anything; I use \$ret as an example of passing actual data.

See the documentation for more, but this does what you ask. For far more, see Forks::Super.


If you wish to fork as you do, I'd first put a little sleep in the loop, so you don't bombard the server with too many requests. Your children can talk with the parent using socketpair. The failed child can write, while all the others can simply close their socket. The parent keeps checking, for example with can_read from IO::Select. There is an example in perlipc. Since you only need the children to write to the parent, a pipe would suffice as well.
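As a rough illustration of the pipe variant, here is a minimal sketch using only core modules. The fetch_piece sub is a hypothetical stand-in for the wget call (it pretends pieces 1..3 exist), and the cap of 20 forks and the 0.05 s timeout are assumptions chosen for the demo, not values from the question.

```perl
use strict;
use warnings;
use IO::Select;
use POSIX ();

# Hypothetical stand-in for the wget call: pieces 1..3 "exist", later ones fail.
sub fetch_piece { my ($n) = @_; return $n <= 3 }

pipe(my $reader, my $writer) or die "pipe failed: $!";
my $sel = IO::Select->new($reader);
my ($stop, $report, @pids) = (0, '');

# True if a child's failure report is readable within $timeout seconds.
sub failure_reported {
    my ($timeout) = @_;
    my @ready = $sel->can_read($timeout);
    return 0 unless @ready;
    sysread($reader, $report, 64, length $report);
    return 1;
}

for my $n (1 .. 20) {                       # assumed hard cap on forks
    # Waiting up to 0.05 s for a report also throttles the requests.
    if (failure_reported(0.05)) { $stop = 1; last }

    defined(my $pid = fork()) or die "could not fork: $!";
    if ($pid == 0) {                        # child
        close $reader;
        syswrite($writer, "$n\n") unless fetch_piece($n);
        POSIX::_exit(0);
    }
    push @pids, $pid;
}

waitpid $_, 0 for @pids;
$stop ||= failure_reported(0);              # a report may arrive after the last check
my ($first_failed) = $report =~ /^(\d+)/;
print "stopped: first reported failure was piece $first_failed\n" if $stop;
```

Because children run concurrently, a few pieces past the first failure may already have been requested by the time the parent sees the report.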

You can also do it with a signal. The child that fails sends (say) SIGUSR1 to the parent, which the parent traps, setting a global variable that controls further forks. This is simpler, as the parent only traps that one signal and doesn't care where it comes from. See perlipc and the sigtrap pragma.
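The signal approach might look roughly like this. Again fetch_piece is a hypothetical placeholder for the wget call, and the cap of 20 forks and the 0.05 s pause are assumed demo values.

```perl
use strict;
use warnings;
use POSIX ();
use Time::HiRes ();

# Hypothetical stand-in for the wget call: pieces 1..3 "exist", later ones fail.
sub fetch_piece { my ($n) = @_; return $n <= 3 }

my $stop = 0;
$SIG{USR1} = sub { $stop = 1 };             # parent traps the signal and sets the flag

my $parent_pid = $$;                        # remember it: $$ differs in the child
my @pids;
for my $n (1 .. 20) {                       # assumed hard cap on forks
    last if $stop;
    defined(my $pid = fork()) or die "could not fork: $!";
    if ($pid == 0) {                        # child
        kill 'USR1', $parent_pid unless fetch_piece($n);
        POSIX::_exit(0);
    }
    push @pids, $pid;
    Time::HiRes::sleep(0.05);               # throttle; an arriving signal cuts it short
}

for my $pid (@pids) {
    # A late USR1 can interrupt waitpid; retry in that case.
    1 while waitpid($pid, 0) == -1 && $!{EINTR};
}
print "stopped after forking ", scalar(@pids), " children\n" if $stop;
```

The handler is installed before the first fork, so no failure signal can arrive while the parent is still unprepared for it.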

You can also use a file, much like you do, which is probably simplest since here you don't care about race conditions (whether the children's writes overlap), but only about an empty file showing up.

However, in all these you'd also want to limit the number of parallel processes.
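For example, the flag-file approach can be combined with a cap on live children using only fork and wait. This is a sketch under assumptions: fetch_piece is a hypothetical stand-in for wget, the cap of 3 children is arbitrary, and the flag lives in a temporary directory rather than the current one.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use POSIX ();

# Hypothetical stand-in for the wget call: pieces 1..3 "exist", later ones fail.
sub fetch_piece { my ($n) = @_; return $n <= 3 }

my $max_children = 3;                       # assumed cap on parallel downloads
my $flag = tempdir(CLEANUP => 1) . "/END";  # flag file, removed when the parent exits
my ($running, $started) = (0, 0);

for my $n (1 .. 20) {                       # assumed hard cap on forks
    # At the cap: block until one child exits before forking again.
    if ($running >= $max_children) {
        wait();
        $running--;
    }
    last if -e $flag;

    defined(my $pid = fork()) or die "could not fork: $!";
    if ($pid == 0) {                        # child
        unless (fetch_piece($n)) {
            open my $fh, '>', $flag or die "cannot create $flag: $!";
            close $fh;
        }
        POSIX::_exit(0);                    # skip END blocks and temp cleanup in the child
    }
    $running++;
    $started++;
}

1 while wait() != -1;                       # reap whatever is still running
print "started $started children before stopping\n";
```

With the cap in place the overshoot is bounded: at most $max_children pieces beyond the last good one are ever requested.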

Finally, there are also modules that help with external commands, for example IPC::Run.


To run the callback right as each child exits, use reap_finished_children. See this post.



Tags: perl fork