How can I quickly sum all numbers in a file?

Posted 2020-01-24 03:22

I have a file which contains several thousand numbers, each on its own line:

34
42
11
6
2
99
...

I'm looking to write a script which will print the sum of all numbers in the file. I've got a solution, but it's not very efficient. (It takes several minutes to run.) I'm looking for a more efficient solution. Any suggestions?

27 Answers
一夜七次
#2 · 2020-01-24 04:01
sed ':a;N;s/\n/+/;ta' file|bc
太酷不给撩
#3 · 2020-01-24 04:04

Running R scripts

I've written an R script that takes a file name as an argument and sums the lines.

#! /usr/local/bin/R
file=commandArgs(trailingOnly=TRUE)[1]
sum(as.numeric(readLines(file)))

This can be sped up with the "data.table" or "vroom" package as follows:

#! /usr/local/bin/R
file=commandArgs(trailingOnly=TRUE)[1]
sum(data.table::fread(file))

#! /usr/local/bin/R
file=commandArgs(trailingOnly=TRUE)[1]
sum(vroom::vroom(file))

Benchmarking

Using the same benchmarking data as @glenn jackman:

for ((i=0; i<1000000; i++)) ; do echo $RANDOM; done > random_numbers

Compared with the inline R call above, running R 3.5.0 as a script is comparable to the other methods (all timings on the same Debian Linux server).

$ time R -e 'sum(scan("random_numbers"))'  
 0.37s user
 0.04s system
 86% cpu
 0.478 total

R script with readLines

$ time Rscript sum.R random_numbers
  0.53s user
  0.04s system
  84% cpu
  0.679 total

R script with data.table

$ time Rscript sum.R random_numbers     
 0.30s user
 0.05s system
 77% cpu
 0.453 total

R script with vroom

$ time Rscript sum.R random_numbers     
  0.54s user 
  0.11s system
  93% cpu
  0.696 total

Comparison with other languages

For reference, here are some of the other suggested methods, run on the same hardware.

Python 2 (2.7.13)

$ time python2 -c "import sys; print sum((float(l) for l in sys.stdin))" < random_numbers 
 0.27s user 0.00s system 89% cpu 0.298 total

Python 3 (3.6.8)

$ time python3 -c "import sys; print(sum((float(l) for l in sys.stdin)))" < random_numbers
0.37s user 0.02s system 98% cpu 0.393 total

Ruby (2.3.3)

$  time ruby -e 'sum = 0; File.foreach(ARGV.shift) {|line| sum+=line.to_i}; puts sum' random_numbers
 0.42s user
 0.03s system
 72% cpu
 0.625 total

Perl (5.24.1)

$ time perl -nle '$sum += $_ } END { print $sum' random_numbers
 0.24s user
 0.01s system
 99% cpu
 0.249 total

Awk (4.1.4)

$ time awk '{ sum += $0 } END { print sum }' random_numbers
 0.26s user
 0.01s system
 99% cpu
 0.265 total
$ time awk '{ sum += $1 } END { print sum }' random_numbers
 0.34s user
 0.01s system
 99% cpu
 0.354 total

C (clang version 3.3; gcc (Debian 6.3.0-18) 6.3.0 )

 $ gcc sum.c -o sum && time ./sum < random_numbers   
 0.10s user
 0.00s system
 96% cpu
 0.108 total

Update with additional languages

Lua (5.3.5)

$ time lua -e 'sum=0; for line in io.lines() do sum=sum+line end; print(sum)' < random_numbers 
 0.30s user 
 0.01s system
 98% cpu
 0.312 total

tr (8.26) — timed in bash, since the construct is not compatible with zsh

$ time { { tr "\n" + < random_numbers ; echo 0; } | bc; }
real    0m0.494s
user    0m0.488s
sys 0m0.044s

sed (4.4) — timed in bash, since the construct is not compatible with zsh

$  time { head -n 10000 random_numbers | sed ':a;N;s/\n/+/;ta' |bc; }
real    0m0.631s
user    0m0.628s
sys     0m0.008s
$  time { head -n 100000 random_numbers | sed ':a;N;s/\n/+/;ta' |bc; }
real    1m2.593s
user    1m2.588s
sys     0m0.012s

Note: the sed approach seems to run faster on systems with more memory available (hence the smaller datasets used to benchmark sed).

Julia (0.5.0)

$ time julia -e 'print(sum(readdlm("random_numbers")))'
 3.00s user 
 1.39s system 
 136% cpu 
 3.204 total
$  time julia -e 'print(sum(readtable("random_numbers")))'
 0.63s user 
 0.96s system 
 248% cpu 
 0.638 total

Notice that, as in R, different file I/O methods give different performance.

Animai°情兽
#4 · 2020-01-24 04:05

None of the solutions so far uses paste. Here's one:

paste -sd+ filename | bc

As an example, calculate Σn where 1 ≤ n ≤ 100000:

$ seq 100000 | paste -sd+ | bc -l
5000050000

(For the curious, seq n prints the sequence of numbers from 1 to n, given a positive integer n.)

我命由我不由天
#5 · 2020-01-24 04:05

C always wins for speed:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    ssize_t read;
    char *line = NULL;
    size_t len = 0;
    double sum = 0.0;

    /* Parenthesize the assignment: `!=` binds tighter than `=`, so
       without parentheses `read` would get the comparison result. */
    while ((read = getline(&line, &len, stdin)) != -1) {
        sum += atof(line);
    }

    free(line);
    printf("%f\n", sum);
    return 0;
}

Timing for 1M numbers (same machine/input as my python answer):

$ gcc sum.c -o sum && time ./sum < numbers 
5003371677.000000
real    0m0.188s
user    0m0.180s
sys     0m0.000s
该账号已被封号
#6 · 2020-01-24 04:10

I prefer to use GNU datamash for such tasks because it's more succinct and legible than perl or awk. For example:

datamash sum 1 < myfile

where 1 denotes the first column of data.

ゆ 、 Hurt°
#7 · 2020-01-24 04:10

Another one, just for fun:

sum=0;for i in $(cat file);do sum=$((sum+$i));done;echo $sum

or another bash-only approach:

s=0;while read l; do s=$((s+$l));done<file;echo $s

But the awk solution is probably best, as it's the most compact.
