I have a file which contains several thousand numbers, each on its own line:
34
42
11
6
2
99
...
I'm looking to write a script which will print the sum of all the numbers in the file. I've got a solution, but it's not very efficient (it takes several minutes to run). Any suggestions for a more efficient approach?
Running R scripts
I've written an R script that takes a file name as an argument and sums the lines.
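A minimal sketch of that kind of script, invoked from the shell (the expression and the file name numbers.txt are illustrative, not necessarily the original script):

```
# pass the file name as a trailing argument and sum its lines with readLines()
Rscript -e 'cat(sum(as.numeric(readLines(commandArgs(TRUE)[1]))), "\n")' numbers.txt
```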
This can be sped up with the "data.table" or "vroom" package as follows:
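Something along these lines (a sketch of the idea; the exact calls and the file name numbers.txt are my assumptions, not the author's code):

```
# data.table: fread() parses the single unnamed column quickly (it becomes V1)
Rscript -e 'suppressMessages(library(data.table)); cat(sum(fread("numbers.txt")[[1]]), "\n")'

# vroom: multithreaded reader; col_names = FALSE because the file has no header row
Rscript -e 'suppressMessages(library(vroom)); cat(sum(vroom("numbers.txt", col_names = FALSE)[[1]]), "\n")'
```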
Benchmarking
Same benchmarking data as @glenn jackman.
Compared to the R call above, running R 3.5.0 as a script is comparable to the other methods (on the same Debian Linux server).
R script with readLines
R script with data.table
R script with vroom
Comparison with other languages
For reference, here are some other methods suggested, run on the same hardware:
Python 2 (2.7.13)
Python 3 (3.6.8)
Ruby (2.3.3)
Perl (5.24.1)
Awk (4.1.4)
C (clang version 3.3; gcc (Debian 6.3.0-18) 6.3.0 )
Update with additional languages
Lua (5.3.5)
tr (8.26): must be timed in bash; not compatible with zsh
sed (4.4): must be timed in bash; not compatible with zsh
Note: sed calls seem to work faster on systems with more memory available (note the smaller datasets used for benchmarking sed)
Julia (0.5.0)
Notice that, as in R, different file I/O methods give different performance.
None of the solutions thus far uses paste. Here's one. As an example, let's calculate Σn where 1 <= n <= 100000.
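A sketch of the paste approach (infile stands for the actual input file):

```
# join all lines with '+' into one expression and hand it to bc
paste -sd+ infile | bc

# the worked example: sum the integers 1..100000
seq 100000 | paste -sd+ | bc
# prints 5000050000
```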
(For the curious, seq n prints a sequence of numbers from 1 to n, given a positive number n.)

C always wins for speed:
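A minimal sketch of the kind of C program being timed (the file names sum.c and numbers.txt are illustrative, and this is not necessarily the code that produced the timings below):

```
# write, compile, and time a small C summer (sketch)
cat > sum.c <<'EOF'
#include <stdio.h>

int main(void) {
    long long sum = 0, n;
    /* read one integer per line until EOF */
    while (scanf("%lld", &n) == 1)
        sum += n;
    printf("%lld\n", sum);
    return 0;
}
EOF
cc -O2 -o sum sum.c
time ./sum < numbers.txt
```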
Timing for 1M numbers (same machine/input as my python answer):
I prefer to use GNU datamash for such tasks because it's more succinct and legible than perl or awk. For example:
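Presumably something along the lines of (numbers.txt is a placeholder for the input file):

```
# sum the first (and only) column of the file
datamash sum 1 < numbers.txt
```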
where 1 denotes the first column of data.
Another one, just for fun:
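A sketch of that kind of pipeline (numbers.txt is a placeholder):

```
# turn newlines into '+', append a trailing 0 to close the expression, and let bc evaluate it
{ tr '\n' '+' < numbers.txt; echo 0; } | bc
```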
Or another, bash only:
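For instance (a sketch; numbers.txt is a placeholder):

```
# accumulate with shell arithmetic, no external commands
s=0
while read -r n; do
    s=$((s + n))
done < numbers.txt
echo "$s"
```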
But the awk solution is probably the best, as it's the most compact.
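For reference, the awk solution being referred to is presumably the usual one-liner (not shown earlier in this thread):

```
# sum the first field of every line, print the total at the end
awk '{ s += $1 } END { print s }' numbers.txt
```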