Most elegant unix shell one-liner to sum a list of numbers

Posted 2019-07-25 14:44

Question:

As regards integer-adding one-liners, several shell scripting solutions have been proposed;
however, a closer look at each of the chosen solutions reveals inherent limitations:

  • the awk ones choke on arbitrary precision and large integer sizes (awk arithmetic is C-like, after all); a quick demonstration follows this list
  • the bc ones are rather unhappy with arbitrarily long inputs: (sed 's/$/+\\/g';echo 0)|bc
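For example, on a typical awk without arbitrary-precision support (i.e. not gawk -M), integers beyond 2^53 silently lose precision because values are stored as doubles; the exact sum below would be 9007199254740993:

$ printf '%s\n' 9007199254740993 0 | awk '{s+=$0} END{printf "%.0f\n", s}'
9007199254740992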

Understanding that there may also be portability issues across platforms (see [1] and [2]), which is undesirable, is there a generic solution that is a winner on both practicality and brevity?

Hint: SunOS and Mac OS X are examples where portability would be an issue.
For instance, could the dc command handle arbitrarily large inputs such as 2^n, integer or otherwise?
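(A quick check of the hint itself: dc works with arbitrary-precision integers out of the box; the -e flag used here is the same one the dc answers below rely on.)

$ dc -e '2 100 ^ p'
1267650600228229401496703205376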

[1] awk: https://stackoverflow.com/a/450821/1574494, https://stackoverflow.com/a/25245025/1574494, or "Printing long integers in awk"

[2] bc: "Bash command to sum a column of numbers"

Answer 1:

The one I usually use is paste -sd+|bc:

$ time seq 1 20000000 | paste -sd+|bc
200000010000000

real    0m10.092s
user    0m10.854s
sys     0m0.481s

(For strict Posix compliance, paste needs to be provided with an explicit argument: paste -sd+ -|bc. Apparently that is necessary with the BSD paste implementation installed by default on OS X.)

However, that will fail for larger inputs, because bc buffers an entire expression in memory before evaluating it. On my system, bc ran out of memory trying to add 100 million numbers, although it was able to do 70 million. But other systems may have smaller capacities.

Since bc has variables, you can avoid long lines by repeatedly adding to a variable instead of constructing a single long expression. This is (as far as I know) 100% Posix compliant, but there is a 3x time penalty:

$ time seq 1 20000000|sed -e's/^/s+=/;$a\' -es|bc
200000010000000

real    0m29.224s
user    0m44.119s
sys     0m0.820s
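What the sed program does: it turns each input line n into the bc statement s+=n and appends a final line containing just s, so the program that bc actually evaluates consists only of short statements, with no long expression to buffer:

s+=1
s+=2
...
s+=20000000
s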

Another way to handle the case where the input size exceeds bc's buffering capacity would be to use the standard xargs tool to add the numbers in groups:

$ time seq 1 100000000 |
> IFS=+ xargs sh -c 'echo "$*"' _ | bc | paste -sd+ | bc
5000000050000000

real    1m0.289s
user    1m31.297s
sys     0m19.233s
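The IFS=+ prefix is what turns each xargs batch into a bc expression: inside the child sh, "$*" joins the positional parameters using the first character of IFS. The idiom in isolation (here IFS is set inside the child shell so the example is self-contained):

$ sh -c 'IFS=+; echo "$*"' _ 10 20 30
10+20+30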

The number of input lines used by each xargs evaluation will vary from system to system, but it will normally be in the hundreds and it might be much more. Obviously, the xargs | bc invocations could be chained arbitrarily to increase capacity.

It might be necessary to limit the size of the xargs expansion using the -s switch, on systems where ARG_MAX exceeds the capacity of the bc command. Aside from performing an experiment to establish the bc buffer limit, there is no portable way to establish what that limit might be, but it certainly should be no less than LINE_MAX, which is guaranteed to be at least 2048. Even with 100-digit addends, that will allow a reduction by a factor of 20, so a chain of 10 xargs|bc pipes would handle over 10^13 addends, assuming you were prepared to wait a couple of months for that to complete.
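A sketch of such a fixed-length chain, with each stage's expansion capped at a deliberately conservative 2048 bytes (the number of stages and the -s value are illustrative, not tuned):

$ seq 1 100000000 |
>   IFS=+ xargs -s 2048 sh -c 'echo "$*"' _ | bc |
>   IFS=+ xargs -s 2048 sh -c 'echo "$*"' _ | bc |
>   paste -sd+ - | bc
5000000050000000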

As an alternative to constructing a large fixed-length pipeline, you could use a function to recursively pipe the output from xargs|bc until only one value is produced:

radd () { 
    # If at least two values remain, push the two just read back in front
    # of the rest of the input, reduce in batches with xargs|bc, and
    # recurse on the (much shorter) output; otherwise only the final
    # total is left, so print it.
    if read a && read b; then
        { printf '%s\n%s\n' "$a" "$b"; cat; } |
          IFS=+ xargs -s "$MAXLINE" sh -c 'echo "$*"' _ |
          bc | radd
    else
        echo "$a"
    fi
}

If you use a very conservative value for MAXLINE, the above is quite slow, but with plausible larger values it is not much slower than the simple paste|bc solution:

$ time seq 1 20000000 | MAXLINE=2048 radd
200000010000000

real    1m38.850s
user    0m46.465s
sys     1m34.503s

$ time seq 1 20000000 | MAXLINE=60000 radd 
200000010000000

real    0m12.097s
user    0m17.452s
sys     0m5.090s

$ time seq 1 100000000 | MAXLINE=60000 radd 
5000000050000000

real    1m3.972s
user    1m31.394s
sys     0m27.946s

In addition to the bc solutions, I timed some other possibilities. As shown above, with an input of 20 million numbers, paste|bc took 10 seconds. That's almost identical to the time used by adding 20 million numbers with

gawk -M '{s+=$0} END{print s}'

Programming languages such as python and perl proved to be faster:

# 9.2 seconds to sum 20,000,000 integers
python -c $'import sys\nprint(sum(int(x) for x in sys.stdin))'
# 5.1 seconds
perl -Mbignum -lne '$s+=$_; END{print $s}'

I was unable to test dc -f - -e '[+z1<r]srz1<rp' on large inputs, since its performance appears to be quadratic (or worse); it summed 25 thousand numbers in 3 seconds, but it took 19 seconds to sum 50 thousand and 90 seconds to do 100 thousand.

Although bc is not the fastest and memory limitations require awkward workarounds, it has the advantage of working out of the box on Posix-compliant systems without the necessity to install enhanced versions of any standard utility (awk) or programming languages not required by Posix (perl and python).



Answer 2:

An optimal solution for dc(1) sums the inputs as they are read:

$ jot 1000000 | sed '2,$s/$/+/;$s/$/p/' | dc
500000500000
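jot is a BSD utility; on GNU systems the same pipeline can be driven with seq. The sed script appends "+" to every line except the first and a trailing "p" to the last line, so dc adds each number to its running total as it arrives and prints the result at the end:

$ seq 1000000 | sed '2,$s/$/+/;$s/$/p/' | dc
500000500000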


Answer 3:

You can use gawk with the -M flag:

$ seq 1 20000000 | gawk -M '{s+=$0} END{print s}'
200000010000000

Or Perl with bignum enabled:

$ seq 1 20000000 | perl -Mbignum -lne '$s+=$_; END{print $s}'
200000010000000


Answer 4:

$ seq 1000|(sum=0;while read num; do sum=`echo $sum+$num|bc -l`;done;echo $sum)
500500
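The same loop spread over several lines for readability (purely cosmetic; it still starts one bc process per input line, which is where the time goes):

seq 1000 | (
    sum=0
    while read num; do
        sum=$(echo "$sum + $num" | bc -l)
    done
    echo "$sum"
)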

Also, this one will not win a top-speed prize; however, it IS:

  • a one-liner, yes
  • portable
  • able to add lists of any length
  • able to add numbers of any precision (each number's length limited only by MAXLINE)
  • not reliant on external tools such as python/perl/awk/R, etc.

With a stretch, you may call it elegant too ;-) Come on, show a better way to do this!



Answer 5:

It seems that the following does the trick:

$ seq 1000|dc -f - -e '[+z1<r]srz1<rp'
500500

But is it the optimal solution?
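For reference, a breakdown of that dc program (GNU dc accepts # comments, so this annotated form runs as written):

[+z1<r]sr   # define macro r: add the top two stack values, push the stack
            # depth (z), and re-run r while that depth is still greater than 1
z1<rp       # start the loop if more than one number was read, then print the sum

Note that with this invocation the whole of standard input (-f -) appears to be pushed onto the stack before the -e program runs, in contrast to the streaming variant shown in another answer.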