Hi.
I'm using Node.js with child_process to spawn external processes. I'm trying to understand whether my workload is I/O bound, CPU bound, or both.
I'm using pdftotext to extract the text from 10k+ files. To control concurrency, I'm using async.
Code:
const { spawn } = require('child_process');
const async = require('async');

const files = [
  {
    path: 'path_for_file'
    // ...
  },
  // ...
];

// Maximum number of pdftotext processes running at the same time.
const maxNumber = 5;

async.mapLimit(files, maxNumber, (file, callback) => {
  // Spawn pdftotext directly (no shell) and read its output from stdout.
  const child = spawn('pdftotext', [
    '-layout',
    '-enc',
    'UTF-8',
    file.path,
    '-'
  ]);

  let result = '';
  let error = '';

  child.stdout.on('data', (chunk) => {
    result += chunk.toString();
  });

  // stderr emits 'data' events, not 'error' events.
  child.stderr.on('data', (chunk) => {
    error += chunk.toString();
  });

  child.on('close', (code) => {
    if (error) {
      return callback(error, null);
    }
    callback(null, result);
  });
}, (error, results) => {
  if (error) {
    throw new Error(error);
  }
  console.log(results);
});
I'm monitoring my Ubuntu system while the program runs, and CPU and memory usage are very high. I also sometimes see only one file being processed at a time. Is this normal? What could be the problem?
I'm trying to understand the concept of child_process. Is pdftotext a child process of Node.js? Do all the child processes run on only one core? And how can I make processing the files lighter on my computer?
Screenshot from glances while the program is running:
Is this Node.js CPU usage caused by the child processes?
Thanks.
If your jobs are CPU-hungry, the optimal number of jobs to run in parallel is typically the number of cores (or double that if the CPUs have hyperthreading). So on a 4-core machine you will typically see the best speed by running 4 jobs in parallel.
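In the Node.js code above, that limit is the maxNumber you pass to async.mapLimit. As a rough sketch (assuming the rest of the code stays as it is), you could derive it from the core count that Node.js reports:
const os = require('os');

// os.cpus() returns one entry per logical core (hyperthreaded cores count twice).
const coreCount = os.cpus().length;

// Hypothetical starting point: one pdftotext job per logical core instead of a fixed 5.
const maxNumber = coreCount;

console.log(`Will run up to ${maxNumber} jobs in parallel`);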
However, modern CPUs are heavily dependent on caches. This makes it hard to predict the optimal number of jobs to run in parallel. Throw in the latency from disks and it will make it even harder.
I have even seen jobs on systems in which the cores shared the CPU cache, and where it was faster to run a single job at a time - simply because it could then use the full CPU cache.
Due to that experience my advice has always been: Measure.
So if you have 10k jobs to run, try running 100 random jobs with different numbers of jobs in parallel to see what the optimal number is for you. It is important to choose the files at random, so you also get to measure the disk I/O. If the files differ greatly in size, run the test a few times.
find pdfdir -type f > files
mytest() {
  shuf files | head -n 100 |
    parallel -j "$1" pdftotext -layout -enc UTF-8 {} - > out
}
export -f mytest
# Test with 1..10 parallel jobs. Sort by JobRuntime.
seq 10 | parallel -j1 --joblog - mytest | sort -nk 4
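If you prefer to do the measurement from Node.js instead of the shell, here is a rough sketch of the same idea. It is only an illustration and assumes a files array shaped like the one in the question and pdftotext on the PATH:
const { spawn } = require('child_process');
const async = require('async');

const files = [/* same array of { path: ... } objects as in the question */];

// Quick-and-dirty shuffle, then take a random sample of 100 files.
const sample = files
  .slice()
  .sort(() => Math.random() - 0.5)
  .slice(0, 100);

// Convert the sample with a given concurrency limit and report the wall-clock time.
function timeRun(limit, done) {
  const started = Date.now();
  async.mapLimit(sample, limit, (file, callback) => {
    const child = spawn('pdftotext', ['-layout', '-enc', 'UTF-8', file.path, '-']);
    child.stdout.resume();   // discard the output; only the timing matters here
    child.stderr.resume();
    child.on('close', () => callback(null));
  }, () => {
    console.log(`limit=${limit} took ${(Date.now() - started) / 1000}s`);
    done();
  });
}

// Try concurrency limits 1..10, one run after the other.
async.eachSeries([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], timeRun, () => {
  console.log('done');
});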
Do not worry about your CPUs running at 100%. That just means you are getting a return on all the money you spent at the computer store.
Your RAM is only a problem if the disk cache gets low (in your screenshot 754M is not low; when it drops below 100M it is low), because that may cause your computer to start swapping - which can slow it to a crawl.
Your Node.js code is I/O bound. It is doing almost none of the CPU work: it only starts the external tasks and moves their output around. There are no long-running loops or heavy math calculations. You are seeing high CPU numbers for the Node.js process because the pdftotext processes are its child processes, so their CPU usage is being shown aggregated with it.
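If you want to verify that, process.cpuUsage() reports the CPU time consumed by the Node.js process itself (child processes such as pdftotext are not included), so a minimal sketch like the following should print very small numbers even while many conversions are running:
// Sample the CPU time of the Node.js process every 5 seconds.
const startUsage = process.cpuUsage();

setInterval(() => {
  const { user, system } = process.cpuUsage(startUsage);
  // Values are in microseconds; convert to seconds for readability.
  console.log(`node CPU: user=${(user / 1e6).toFixed(2)}s system=${(system / 1e6).toFixed(2)}s`);
}, 5000);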