Javascript (node.js) capped number of child processes

Posted 2019-06-07 03:30

Question:

Hopefully I can describe what I'm looking for clearly enough. I'm working with Node and Python.

I'm trying to run a number of child processes (.py scripts, using child_process.exec()) in parallel, but no more than a specified number at a time (say, 2). I receive an unknown number of requests in batches (say this batch has 3 requests). Once the cap is reached, I'd like to stop spawning processes until one of the current ones finishes.

for (var i = 0; i < requests.length; i++) {

    //code that would ideally block execution for a moment
    while (active_pids.length == max_threads){
        console.log("Waiting for more threads...");
        sleep(100)
        continue
    };

    //code that needs to run if threads are available
    active_pids.push(i);

    cp.exec('python python-test.py '+ requests[i],function(err, stdout){
        console.log("Data processed for: " + stdout);

        active_pids.shift();

          if (err != null){
              console.log(err);
          }
    });
}

I know that while loop doesn't work; it was just a first attempt.

I'm guessing there's a way to do this with

setTimeout(function someSpawningFunction(){

    if (active_pids.length == max_threads){
        return;
    } else {
        //spawn process?
    }

}, 100);

But I can't quite wrap my head around it.
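For what it's worth, the polling idea can be made to work along these lines. This is only a sketch of that approach (the variable names are mine, and the child process is simulated with a timer instead of cp.exec so the snippet is self-contained):

```javascript
// Sketch of the setTimeout-polling idea: launch up to max_threads tasks,
// and when the cap is hit, check again later instead of busy-waiting.
const max_threads = 2;
const requests = ['a', 'b', 'c'];
let active = 0;       // how many "children" are currently running
let next = 0;         // index of the next request to launch
const finished = [];

function trySpawn() {
  // Launch as many tasks as the cap allows.
  while (active < max_threads && next < requests.length) {
    const req = requests[next++];
    active++;
    // Stand-in for cp.exec('python python-test.py ' + req, callback):
    setTimeout(() => {
      finished.push(req);
      active--;
      trySpawn(); // a slot freed up: launch the next waiting request
    }, 10);
  }
  // If work remains but all slots are busy, poll again shortly.
  if (next < requests.length && active >= max_threads) {
    setTimeout(trySpawn, 100);
  }
}

trySpawn();
```

Because each completion callback calls trySpawn() itself, the 100 ms poll is really just a safety net; the event loop is never blocked.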

Or maybe

waitpid(-1)

inserted into the for loop above, in an if statement replacing the while loop? However, I can't get the waitpid() module to install at the moment.

And yes, I understand that blocking execution is considered very bad in JS, but in my case, I need it to happen. I'd rather avoid external cluster manager-type libraries if possible.

Thanks for any help.

EDIT/Partial Solution

An ugly hack would be to use the answer from this SO question (execSync()). But that would block the loop until the LAST child finished. That's my plan so far, but it's not ideal.

Answer 1:

async.timesLimit from the async library is the perfect tool to use here. It allows you to asynchronously run a function n times, but run a maximum of k of those function calls in parallel at any given time.

async.timesLimit(requests.length, max_threads, function(i, next){
    cp.exec('python python-test.py '+ requests[i], function(err, stdout){
        console.log("Data processed for: " + stdout);

        if (err != null){
            console.log(err);
        }

        // this task is resolved
        next(null, stdout);
    });
}, function(err, stdoutArray) {
  // this runs after all processes have run; what's next?
});

Or, if you want errors to be fatal and stop the loop, call next(err, stdout).
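If you'd rather not pull in the async library just for this, the same cap can be expressed with plain async/await: start max_threads "workers" that each keep pulling the next request index until none remain. This is a sketch of that alternative, with the child process simulated by a timer so it runs anywhere (in real code runOne would wrap cp.exec, e.g. via util.promisify):

```javascript
// Capped parallelism with no external library: max_threads workers
// cooperatively drain a shared index into the requests array.
const requests = ['r1', 'r2', 'r3'];
const max_threads = 2;
let nextIndex = 0;
const results = [];

function runOne(req) {
  // Stand-in for: util.promisify(cp.exec)('python python-test.py ' + req)
  return new Promise(resolve => setTimeout(() => resolve(req), 10));
}

async function worker() {
  while (nextIndex < requests.length) {
    const i = nextIndex++;            // claim the next request
    results[i] = await runOne(requests[i]);
  }
}

// Start max_threads workers; at most that many exec()s are in flight.
Promise.all(Array.from({ length: max_threads }, worker))
  .then(() => console.log('all done:', results));
```

Each worker awaits one child at a time, so at most max_threads children exist at once, and a new one starts the moment a slot frees up.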



Answer 2:

You can just maintain a queue of external processes waiting to run and a counter of how many are currently running. The queue would contain one object per pending process, with properties holding whatever data you need to launch it. A plain array of those objects works fine as the queue.

Whenever you get a new request to run an external process, you add it to the queue, then start external processes (incrementing the counter each time) until the counter hits your maximum.

Then, while monitoring those external processes, whenever one finishes, you decrement the counter and if your queue of tasks waiting to run is not empty, you start up another one and increase the counter again.

The async library has this type of functionality built in (running a specific number of operations at a time), though it isn't very difficult to implement yourself with a queue and a counter. The key is that you just have to hook into the completion event for your external process so you can maintain the counter and start up any new tasks that are waiting.
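The queue-and-counter scheme described above can be sketched as follows. The names (enqueue, startNext, MAX_RUNNING) are mine, and the external process is again simulated with a timer; in real code the body of startNext would call cp.exec('python python-test.py ' + task):

```javascript
// Queue + counter: new work goes into the queue; startNext() launches
// tasks while there is both capacity and pending work.
const MAX_RUNNING = 2;
const queue = [];     // tasks waiting to run
let running = 0;      // how many are currently running
const results = [];

function enqueue(task) {
  queue.push(task);
  startNext();
}

function startNext() {
  while (running < MAX_RUNNING && queue.length > 0) {
    const task = queue.shift();
    running++;
    // Simulated child process; replace with cp.exec(...) in real code.
    setTimeout(() => {
      results.push(task);
      running--;      // a slot is free again
      startNext();    // pull the next waiting task, if any
    }, 10);
  }
}

// A "batch" of requests arriving:
['req1', 'req2', 'req3'].forEach(enqueue);
```

Because startNext() is called both when work arrives and when a task completes, the pool refills itself automatically and nothing ever blocks the event loop.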

There is no reason to need to use synchronous or serial execution or to block in order to achieve your goal here.