I'm trying to run multiple mongodump's on 26 servers in a bash script.
I can run 3 commands like
mongodump -h staging .... &
mongodump -h production .... &
mongodump -h web ... &
at the same time, and when one finishes I want to start another mongodump.
I can't run all 26 mongodump commands at the same time, because the server would run out of CPU. At most 3 mongodumps should run at the same time.
You can use xargs's -P option to run a specifiable number of invocations in parallel; note that -P is not mandated by POSIX, but both GNU xargs and BSD/macOS xargs support it:
xargs -P 3 -n 1 mongodump -h <<<'staging production web more stuff and so on'
This starts mongodump -h staging, mongodump -h production, and mongodump -h web in parallel; as soon as any one of them finishes, xargs starts the next invocation (mongodump -h more, then mongodump -h stuff, mongodump -h and, and so on), so that at most 3 invocations run at any given time.
-n 1 grabs a single argument from the input stream and calls mongodump; adjust as needed, single- or double-quoting arguments in the input if necessary.
Note: GNU xargs - but not BSD xargs - supports -P 0, where 0 means: "run as many processes as possible simultaneously."
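For a longer host list, such as the 26 servers in the question, the hosts can be kept in a bash array and piped to xargs. The following is a minimal sketch, not part of the original answer: the host names and the helper function dump_all are hypothetical, and echo turns it into a dry run that only prints the commands.

```shell
#!/usr/bin/env bash

# Hypothetical host names; replace them with your own servers.
hosts=(staging production web db1 db2)

# dump_all HOST...: run at most 3 invocations at a time, one host each.
# 'echo' makes this a dry run; remove it to really invoke mongodump.
dump_all() {
  printf '%s\n' "$@" | xargs -P 3 -n 1 echo mongodump -h
}

dump_all "${hosts[@]}"
```

The order of the printed lines is not guaranteed, because the invocations run in parallel.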
By default, the arguments supplied via stdin are appended to the specified command.
If you need to control where the respective arguments are placed in the resulting commands:
- provide the arguments line by line
- use -I {} to indicate that, and to define {} as the placeholder for each input line.
xargs -P 3 -I {} mongodump -h {} after <<<$'staging\nproduction\nweb\nmore\nstuff'
Now each input argument is substituted for {}, allowing the argument after to come after it.
Note, however, that each input line is invariably passed as a single argument.
BSD/macOS xargs would allow you to combine -n with -J {}, without needing to provide line-based input, but GNU xargs doesn't support -J.
In short: only BSD/macOS allows you to combine placement of the input arguments with reading multiple arguments at once.
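One workaround that should behave the same with both GNU and BSD/macOS xargs is to let xargs call a small sh -c wrapper, which receives the arguments as positional parameters and can place them anywhere. This is a sketch under assumptions, not part of the original answer: the function name place_args and the host/database pairs are hypothetical, and echo again makes it a dry run.

```shell
#!/usr/bin/env bash

# place_args: read whitespace-separated host/database pairs from stdin.
# -n 2 passes 2 arguments per invocation; inside the wrapper they are
# available as $1 and $2. The '_' only fills $0, which is not used.
# 'echo' makes this a dry run; remove it to really invoke mongodump.
place_args() {
  xargs -P 3 -n 2 sh -c 'echo mongodump -h "$1" --db "$2"' _
}

place_args <<<'staging db1 production db2 web db3'
```

The cost is one extra shell process per invocation, which is negligible next to a database dump.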
Note that xargs does not serialize stdout output from commands run in parallel, so output from parallel processes can arrive interleaved.
Use GNU parallel to avoid this problem - see below.
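If installing GNU parallel is not an option, interleaving can also be sidestepped by sending each invocation's output to its own file. The sketch below is an illustration rather than part of the original answer: the function name run_logged and the log-file naming are assumptions, and the echo inside the wrapper stands in for the real mongodump call.

```shell
#!/usr/bin/env bash

# run_logged LOGDIR: read one host per line from stdin and run up to
# 3 jobs at a time, each writing stdout and stderr to LOGDIR/HOST.log.
# Inside the wrapper, $1 is the host and $2 is the log directory.
# For real use, replace the echo command with: mongodump -h "$1"
run_logged() {
  local dir=$1
  xargs -P 3 -I {} sh -c 'echo "dumping $1" >"$2/$1.log" 2>&1' _ {} "$dir"
}

logdir=$(mktemp -d)
printf '%s\n' staging production web more | run_logged "$logdir"
cat "$logdir"/*.log
```

Each log file can then be inspected on its own after all jobs have finished.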
Alternative: parallel
xargs has the advantage of being a standard utility, so on platforms where it supports -P, there are no prerequisites.
In the Linux world (though also on macOS via Homebrew) there are two purpose-built utilities for running commands in parallel, which, unfortunately, share the same name; typically, you must install them on demand:
- parallel (a binary) from the moreutils package - see its home page.
- The much more powerful GNU parallel (a Perl script) from the parallel package (thanks, twalberg) - see its home page.
If you already have a parallel utility, parallel --version will tell you which one it is (GNU parallel reports a version number and copyright information, "moreutils" parallel complains about an invalid option and shows a syntax summary).
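In a script, that check could be automated roughly as follows. This is only a sketch: the function name classify_parallel is hypothetical, and it assumes that the output of GNU parallel's --version contains the text "GNU parallel" and that anything else is the "moreutils" implementation.

```shell
#!/usr/bin/env bash

# classify_parallel TEXT: classify an implementation from the text that
# 'parallel --version' printed (stdout and stderr combined).
classify_parallel() {
  case $1 in
    *'GNU parallel'*) echo gnu ;;
    *) echo other ;;   # assumption: most likely the moreutils binary
  esac
}

if command -v parallel >/dev/null 2>&1; then
  classify_parallel "$(parallel --version 2>&1 </dev/null)"
else
  echo none
fi
```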
Using the "moreutils" parallel:
parallel -j 3 -n 1 mongodump -h -- staging production web more stuff and so on
# Using -i to control placement of the argument, via {}
# Only *1* argument at a time is supported in that case.
parallel -j 3 -i mongodump -h {} after -- staging production web more stuff and so on
Unlike xargs, this parallel implementation doesn't take the arguments to pass through from stdin; all pass-through arguments must be passed on the command line, following --.
From what I can tell, the only features this parallel implementation offers beyond what xargs can do are:
- The -l option allows delaying further invocations until the system load average is below the specified threshold.
- Possibly this (from the man page): "stdout and stderr is serialised through a corresponding internal pipe, in order to prevent annoying concurrent output behaviour.", though I've found this not to be the case in the version whose man page is dated 2009-07-02 - see last section.
Using GNU parallel:
Tip of the hat to Ole Tange for his help.
parallel -P 3 -n 1 mongodump -h <<<$'staging\nproduction\nweb\nmore\nstuff\nand\nso\non'
# Alternative, using ::: followed by the target-command arguments.
parallel -P 3 -n 1 mongodump -h ::: staging production web more stuff and so on
# Using -n 1 and {} to control placement of the argument.
# Note that using -N rather than -n would allow per-argument placement control
# with {1}, {2}, ...
parallel -P 3 -n 1 mongodump -h {} after <<<$'staging\nproduction\nweb\nmore\nstuff\nand'
As with xargs, pass-through arguments are supplied via stdin, but GNU parallel also supports placing them on the command line, after a configurable separator (::: by default).
Unlike with xargs, each input line is considered a single argument.
Caveat: If your command involves quoted strings, you must use -q to pass them through as distinct arguments; e.g., parallel -q sh -c 'echo hi, $0' ::: there only works with -q.
As with GNU xargs, you can use -P 0 to run as many invocations as possible at once, taking full advantage of the machine's capabilities, meaning, according to Ole, "until GNU Parallel hits a limit (file handles and processes)".
- Conveniently, omitting -P doesn't just run one process at a time, as the other utilities do, but runs one process per CPU core.
Output from commands being executed in parallel is by default automatically serialized (grouped) on a per-process basis, to avoid interleaved output.
- This is generally desirable, but note that it means that you'll only start to see the other commands' output once the first one that has created output has terminated.
- Use option --line-buffer (--lb in more recent versions) to switch to line-level serialization instead, or -u (--ungroup) to allow even a single output line to mix output from different processes; see the manual for details.
GNU parallel, which is designed to be a better successor to xargs, offers many more features: a notable example is the ability to perform sophisticated transformations on the pass-through arguments, optionally based on Perl regular expressions; see also: man parallel and man parallel_tutorial.
Optional reading: testing output serialization behavior
The following commands test how xargs and the two parallel implementations deal with interleaved output from commands being run in parallel - whether they show output as it arrives, or try to serialize it.
There are 2 levels of serialization, both of which introduce overhead:
- Line-level serialization: Prevent partial lines from different processes from being mixed on a single output line.
- Process-level serialization: Ensure that all output lines from a given process are grouped together. This is the most user-friendly method, but note that it means that you'll only start to see the other commands' output (in sequence) once the first one that has created output has terminated.
From what I can tell, only GNU parallel offers any serialization (despite what the "moreutils" parallel man page dated 2009-07-02 says [1]), and it supports both methods.
The commands below assume the existence of executable script ./tst with the following content:
#!/usr/bin/env bash
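# Pass the text as an argument rather than as the format string, so that
# a % character in the arguments is printed literally.
printf '%s' "$$: [1/2] entering with arg(s): $*"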
sleep $(( $RANDOM / 16384 ))
printf " $$: [2/2] finished entering\n"
echo " $$: stderr line" >&2
echo "$$: stdout line"
sleep $(( $RANDOM / 8192 ))
echo " $$: exiting"
xargs (both the GNU and BSD/macOS implementations, as found on Ubuntu 16.04 and macOS 10.12):
No serialization happens: a single output line can contain output from multiple processes.
$ xargs -P 3 -n 1 ./tst <<<'one two three'
2593: [1/2] entering with arg(s): one2594: [1/2] entering with arg(s): two 2593: [2/2] finished entering
2593: stderr line
2593: stdout line
2596: [1/2] entering with arg(s): three 2593: exiting
2594: [2/2] finished entering
2594: stderr line
2594: stdout line
2596: [2/2] finished entering
2596: stderr line
2596: stdout line
2594: exiting
2596: exiting
"moreutils" parallel (version whose man page is dated 2009-07-02):
No serialization happens: a single output line can contain output from multiple processes.
$ parallel -j 3 ./tst -- one two three
3940: [1/2] entering with arg(s): one3941: [1/2] entering with arg(s): two3942: [1/2] entering with arg(s): three 3941: [2/2] finished entering
3941: stderr line
3941: stdout line
3942: [2/2] finished entering
3942: stderr line
3942: stdout line
3940: [2/2] finished entering
3940: stderr line
3940: stdout line
3941: exiting
3942: exiting
GNU parallel (version 20170122):
Process-level serialization (grouping) happens by default.
Use --line-buffer (--lb in newer versions) to choose line-level serialization instead, or opt out of any kind of serialization with -u (--ungroup).
Note how, in each group, stderr output comes after stdout output (whereas the man page that comes with version 20170122 claims that stderr output comes first).
$ parallel -P 3 ./tst ::: one two three
2544: [1/2] entering with arg(s): one 2544: [2/2] finished entering
2544: stdout line
2544: exiting
2544: stderr line
2549: [1/2] entering with arg(s): three 2549: [2/2] finished entering
2549: stdout line
2549: exiting
2549: stderr line
2546: [1/2] entering with arg(s): two 2546: [2/2] finished entering
2546: stdout line
2546: exiting
2546: stderr line
[1] "stdout and stderr is serialised through a corresponding internal pipe, in order to prevent annoying concurrent output behaviour."
Do tell me if I'm missing something.