In a bash script I am using a many-producer, single-consumer pattern: producers are background processes (run via GNU Parallel) writing lines into a FIFO, and the consumer reads all lines from the FIFO, then sorts, filters, and prints the formatted result to stdout.
However, it can take a long time until the full result is available: producers are usually fast for the first few results but then slow down. I would rather see chunks of data every few seconds, each sorted and filtered individually.
mkfifo "$fifo"
parallel ... >"$fifo" &
while chunk=$(read with timeout 5s and at most 10s <"$fifo"); do
    process "$chunk"
done
The loop would run until all producers are done and all input is read. Each chunk is read until there has been no new data for 5s, or until 10s have passed since the chunk was started. A chunk may also be empty if there was no new data for 10s.
I tried to make it work like this:
output=$(mktemp)
while true; do
    wasTimeout=0
    interruptAt=$(( $(date '+%s') + 10 ))
    while true; do
        IFS= read -r -t5 <>"${fifo}"
        rc="$?"
        if [[ "${rc}" -gt 0 ]]; then
            [[ "${rc}" -gt 128 ]] && wasTimeout=1
            break
        fi
        echo "$REPLY" >>"${output}"
        if [[ $(date '+%s') -ge "${interruptAt}" ]]; then
            wasTimeout=1
            break
        fi
    done
    echo '---' >>"${output}"
    [[ "${wasTimeout}" -eq 0 ]] && break
done
I tried some variations of this. In the form above it reads the first chunk but then loops forever. If I use <"${fifo}" instead (a plain read-only open, not the read-write open above), it blocks after the first chunk. Maybe all of this could be simplified with buffer and/or stdbuf? But both of those define blocks by size, not by time.
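The looping-forever behaviour can be reproduced in isolation. This is a sketch (not from the question, and the one-shot producer is hypothetical): opening the FIFO with <> gives the reading process its own write end, so even after every real producer has exited, read can never see EOF; it only ever times out.

```shell
fifo=$(mktemp -u) && mkfifo "$fifo"

echo hello >"$fifo" &                      # hypothetical one-shot producer

# The <> open holds a write end in this very process, so EOF is impossible.
IFS= read -r -t 1 line <>"$fifo"; rc1=$?   # gets "hello", rc1=0
IFS= read -r -t 1 line <>"$fifo"; rc2=$?   # no data left: times out (rc>128), never EOF

echo "rc1=$rc1 rc2=$rc2"
wait
rm -f "$fifo"
```

The second read demonstrates the problem: even though the producer has long since exited, read reports a timeout (status greater than 128), not end-of-file, because the reader itself keeps the FIFO open for writing.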
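For contrast, here is a sketch of a version that does terminate, assuming GNU bash: open the FIFO once on a dedicated file descriptor before the loop, so read -u sees EOF as soon as the last producer closes its end. The background producers and the per-chunk sort step are placeholders for the real pipeline.

```shell
fifo=$(mktemp -u) && mkfifo "$fifo"

# Placeholder producers standing in for "parallel ... >$fifo &".
for i in 1 2 3; do
    ( for n in 1 2 3; do echo "producer-$i line-$n"; sleep 0.2; done ) >"$fifo" &
done

exec 3<"$fifo"        # open ONCE; EOF arrives when all producers are done
eof=0
while (( ! eof )); do
    chunk=()
    interruptAt=$(( $(date '+%s') + 10 ))  # hard 10 s cap per chunk
    while true; do
        IFS= read -r -t 5 -u 3 line        # 5 s idle timeout
        rc=$?
        if (( rc > 0 )); then
            (( rc <= 128 )) && eof=1       # status 1..128 = EOF; >128 = timeout
            break
        fi
        chunk+=("$line")
        (( $(date '+%s') >= interruptAt )) && break
    done
    # Placeholder for the real per-chunk sort/filter/format step.
    (( ${#chunk[@]} )) && printf '%s\n' "${chunk[@]}" | sort
    echo '---'
done
exec 3<&-
wait
rm -f "$fifo"
```

Because fd 3 stays open across chunks, no data is lost between iterations, and the outer loop ends cleanly when read returns a status of 1 (EOF) rather than one greater than 128 (timeout). These fast placeholder producers finish inside one 5-second window, so everything lands in a single chunk; slower producers would spread across several.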
This is not a trivial problem to resolve. As I hinted, a C program (or a program in some programming language other than the shell) is probably the best solution. Some of the complicating factors are:
- alarm() is likely available everywhere, but has only 1-second resolution, which is liable to accumulated rounding errors. (Compile this version with make UFLAGS=-DUSE_ALARM; on macOS, use make UFLAGS=-DUSE_ALARM LDLIB2=.)
- setitimer() uses microsecond timing and the struct timeval type. (Compile this version with make UFLAGS=-DUSE_SETITIMER; on macOS, compile with make UFLAGS=-DUSE_SETITIMER LDLIB2=.)
- timer_create() and timer_settime() etc. use the modern nanosecond type struct timespec. This is available on Linux; it is not available on macOS 10.14.5 Mojave or earlier. (Compile this version with make; it won't work on macOS.)

The program usage message is:
This code is available in my SOQ (Stack Overflow Questions) repository on GitHub as file chunker79.c in the src/so-5631-4784 sub-directory. You will need some of the support code from the src/libsoq directory too.

My SOQ repository also has a script gen-data.sh which makes use of some custom programs to generate a data stream such as this (the seed value is written to standard error, not standard output):

When fed into chunker79 with default options, I get output like:

If you analyze the time intervals (look at the first two fields in the output lines), that output meets the specification. A still more detailed analysis is shown by:
There is a noticeable pause in this setup between when the output from chunker79 appears and when gen-data.sh completes. That's due to Bash waiting on all processes in the pipeline to complete, and gen-data.sh doesn't complete until the next time it writes to the pipe after the message that finishes chunker79. This is an artefact of this test setup; it wouldn't be a factor in the shell script outlined in the question.

I would consider writing a safe multi-threaded program with queues.
I know Java best, but more modern languages such as Go or Kotlin might also be suitable.