Context
I need to optimize deduplication with 'sort -u', and my Linux machine has an old implementation of the 'sort' command (version 5.97) that lacks the '--parallel' option. Although 'sort' implements parallelizable algorithms (e.g. merge sort), I need to make the parallelization explicit. I therefore do it by hand via the 'xargs' command, which is ~2.5X faster than a single 'sort -u' ... when it works.
Here is the intuition behind what I am doing.
I am running a bash script that splits an input file (e.g. file.txt) into several parts (e.g. file.txt.part1, file.txt.part2, file.txt.part3, file.txt.part4). The resulting parts are passed to the 'xargs' command in order to perform parallel deduplication via the sortu.sh script (details at the end). sortu.sh wraps the invocation of 'sort -u' and outputs the resulting file name (e.g. "sortu.sh file.txt.part1" outputs "file.txt.part1.sorted"). Then the resulting sorted parts are passed to a 'sort --merge -u' that merges/deduplicates the input parts assuming that such parts are already sorted.
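A minimal sketch of the split and merge steps described above, under stated assumptions: the helper names split_into_parts and merge_sorted_parts are hypothetical, the parts are cut on line boundaries with 'split -l', and they get split's default alphabetic suffixes (file.txt.partaa, file.txt.partab, ...) rather than the numbered names used in this post.

```shell
#!/bin/bash
# Sketch only: the split and merge steps around the parallel sort.
# split_into_parts and merge_sorted_parts are illustrative names.

# Split FILE into at most N line-aligned parts named FILE.partaa, FILE.partab, ...
split_into_parts() {
    local file=$1 n=$2 lines per_part
    lines=$(wc -l <"$file") || return 1
    per_part=$(( (lines + n - 1) / n ))      # ceiling division
    [ "$per_part" -gt 0 ] || per_part=1      # split rejects a line count of 0
    split -l "$per_part" "$file" "$file.part"
}

# Merge already-sorted parts into one deduplicated stream on stdout.
merge_sorted_parts() {
    sort --merge -u -- "$@"
}
```

For example, 'split_into_parts file.txt 4' followed, once every part has been sorted, by 'merge_sorted_parts file.txt.part*.sorted > file.txt.dedup'. The merge is only correct if each part was sorted under the same locale as the merge.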
The problem I am experiencing is with the parallelization via 'xargs'. Here is a simplified version of my code:
AVAILABLE_CORES=4
PARTS="file.txt.part1
file.txt.part2
file.txt.part3
file.txt.part4"
SORTED_PARTS=$(echo "$PARTS" | xargs --max-args=1 \
--max-procs=$AVAILABLE_CORES \
bash sortu.sh \
)
...
#More code for merging the resulting parts $SORTED_PARTS
...
The expected result is a list of sorted parts in the variable SORTED_PARTS:
echo "$SORTED_PARTS"
file.txt.part1.sorted
file.txt.part2.sorted
file.txt.part3.sorted
file.txt.part4.sorted
Symptom
Nevertheless, a sorted part is sometimes missing. For instance, file.txt.part2.sorted:
echo "$SORTED_PARTS"
file.txt.part1.sorted
file.txt.part3.sorted
file.txt.part4.sorted
This symptom is non-deterministic both in its occurrence (i.e. one execution on a given file.txt succeeds and another on the same file fails) and in the missing file (i.e. the missing sorted part is not always the same one).
Problem
I have a race condition where all the sortu.sh instances finish and 'xargs' sends EOF before the stdout is flushed.
Question
Is there a way to ensure stdout is flushed before 'xargs' sends EOF?
Constraints
I can use neither the 'parallel' command nor the '--parallel' option of the 'sort' command.
sortu.sh code
#!/bin/bash
# Sort and deduplicate the part given as $1, then print the result's file name.
SORTED="$1.sorted"
sort -u -- "$1" > "$SORTED"
echo "$SORTED"
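Whatever the cause of the lost line, one way to stop depending on the captured stdout is to derive the sorted names from the known part names once 'xargs' has returned, since 'xargs' only returns after all of its children have exited. A minimal sketch of that idea follows; sort_parts is a hypothetical helper, and, like the original, it assumes part names without whitespace or quote characters.

```shell
#!/bin/bash
# Sketch only: run the per-part sorts in parallel, then build the list of
# sorted parts from the known input names instead of from captured stdout.
# sort_parts is an illustrative name, not part of the original script.

sort_parts() {
    local cores=$1; shift      # number of parallel jobs, then the part names
    local part

    # xargs returns only after every child has exited; -n and -P are the
    # short forms of --max-args and --max-procs.
    printf '%s\n' "$@" |
        xargs -n 1 -P "$cores" sh -c 'sort -u -- "$1" >"$1.sorted"' sh || return 1

    # Emit the names in input order, failing loudly if a result is missing.
    for part in "$@"; do
        [ -f "$part.sorted" ] || { echo "missing: $part.sorted" >&2; return 1; }
        printf '%s\n' "$part.sorted"
    done
}
```

It could be called as 'SORTED_PARTS=$(sort_parts "$AVAILABLE_CORES" $PARTS)', leaving $PARTS unquoted on purpose so that it splits into one argument per part.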
The approach below doesn't write intermediate contents out to disk at all, and it parallelizes the split, the sorts, and the merge, running all of them at once.
This version has been backported to bash 3.2; a version built for newer releases of bash wouldn't need eval.
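A minimal sketch of a pipeline of this kind, under stated assumptions: the names psort_u, nprocs and fd_min are hypothetical, the only filesystem objects created are FIFOs in a temporary directory, and lines are dealt round-robin to the workers by awk through /dev/fd. The eval calls stand in for the automatic descriptor allocation ('exec {fd}>file') available from bash 4.1 on.

```shell
#!/bin/bash
# Sketch only: deduplicate stdin with several concurrent "sort -u" workers
# connected by FIFOs, so that no intermediate data files are written.
# psort_u, nprocs and fd_min are illustrative names, not from the original.

psort_u() {
    local nprocs=${1:-4}   # number of concurrent sort workers (assumed default)
    local fd_min=20        # first descriptor for worker input; assumed free
                           # (bash may park saved descriptors at 10 and just above)
    local tempdir i merge_pid status
    local -a merge_cmd

    tempdir=$(mktemp -d "${TMPDIR:-/tmp}/psort.XXXXXX") || return 1
    merge_cmd=(sort --merge -u)

    # Start the workers before any write descriptor exists, so that no worker
    # inherits a write end of another worker's input and delays its EOF.
    for ((i = 0; i < nprocs; i++)); do
        mkfifo "$tempdir/in.$i" "$tempdir/out.$i" || return 1
        merge_cmd+=("$tempdir/out.$i")
        # "cat" is the one that opens the output FIFO, so sort can consume its
        # input even before the merge has opened that FIFO for reading.
        sort -u <"$tempdir/in.$i" | cat >"$tempdir/out.$i" &
    done

    # The merge runs concurrently and writes to the caller's stdout.
    "${merge_cmd[@]}" &
    merge_pid=$!

    # Open one write descriptor per worker; eval is needed on bash 3.2
    # because the descriptor number is computed at run time.
    for ((i = 0; i < nprocs; i++)); do
        eval "exec $((fd_min + i))>\"\$tempdir/in.$i\""
    done

    # Distribute stdin round-robin over the worker descriptors.
    awk -v nprocs="$nprocs" -v fd_min="$fd_min" \
        '{ print > ("/dev/fd/" (fd_min + NR % nprocs)) }'

    # Closing the descriptors delivers EOF to the workers.
    for ((i = 0; i < nprocs; i++)); do
        eval "exec $((fd_min + i))>&-"
    done

    wait "$merge_pid"
    status=$?
    rm -rf "$tempdir"
    return "$status"
}
```

It would be used as 'psort_u 4 < file.txt > file.txt.dedup'. Because duplicates of a line can land in different workers, the final 'sort --merge -u' is what guarantees global uniqueness; each worker only removes duplicates within its own share.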