Consider this program:
```rust
use std::io;
use std::io::BufRead;

fn main() {
    let mut n = 0;
    let stdin = io::stdin();
    for _ in stdin.lock().lines() {
        n += 1;
    }
    println!("{}", n);
}
```
Why is it over 10x as slow as the GNU version of `wc`? Take a look at how I measure it:
```shell
$ yes | dd count=1000000 | wc -l
256000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 1.16586 s, 439 MB/s
$ yes | dd count=1000000 | ./target/release/wc
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 41.685 s, 12.3 MB/s
256000000
```
There are many reasons why your code is way slower than the original `wc`. There are a few things you pay for that you actually don't need at all. By removing those, you can already get quite a substantial speed boost.

## Heap allocations
`BufRead::lines()` returns an iterator which yields `String` elements. Due to this design, it has to allocate memory for every single line. The `lines()` method is convenient for writing code easily, but it isn't meant to be used in high-performance situations.

To avoid allocating heap memory for every single line, you can use
`BufRead::read_line()` instead. The code is a bit more verbose, but it lets us reuse the heap memory of a single `String` buffer `s` across all iterations.
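The answer's exact snippet isn't preserved in this copy, but a sketch of the `read_line()` approach could look like the following (the generic helper over `BufRead` is my addition, to keep the counting logic testable; the original likely worked on `stdin.lock()` directly):

```rust
use std::io::{self, BufRead};

/// Count lines by reusing one `String` buffer instead of
/// allocating a fresh `String` per line like `lines()` does.
fn count_lines<R: BufRead>(mut reader: R) -> io::Result<u64> {
    let mut n = 0u64;
    let mut s = String::new();
    loop {
        s.clear(); // keep the allocation, drop the contents
        if reader.read_line(&mut s)? == 0 {
            break; // 0 bytes read means EOF
        }
        n += 1;
    }
    Ok(n)
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    println!("{}", count_lines(stdin.lock())?);
    Ok(())
}
```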
On my notebook, this already improved things a lot, but it is still not equivalent to `wc`.
## UTF-8 validation
Since we are reading into a `String`, the raw input from stdin is validated to be proper UTF-8. This costs time! But we are only interested in the raw bytes, since we only need to count the newline characters (`0xA`). We can get rid of the UTF-8 checks by using a `Vec<u8>` and `BufRead::read_until()` instead.
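Again, the original snippet is missing here; a sketch of the `Vec<u8>` + `read_until()` variant, with the same hypothetical helper shape as before, might look like this:

```rust
use std::io::{self, BufRead};

/// Count lines from raw bytes: `read_until` performs no UTF-8
/// validation, and the `Vec<u8>` buffer is reused across iterations.
fn count_lines<R: BufRead>(mut reader: R) -> io::Result<u64> {
    let mut n = 0u64;
    let mut buf = Vec::new();
    loop {
        buf.clear(); // keep the allocation, drop the contents
        if reader.read_until(b'\n', &mut buf)? == 0 {
            break; // 0 bytes read means EOF
        }
        n += 1;
    }
    Ok(n)
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    println!("{}", count_lines(stdin.lock())?);
    Ok(())
}
```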
On my notebook, that's another 60% improvement. But the original `wc` is still faster by a factor of 3.5x!

## Further possible improvements
Now we've used up all the low-hanging fruit for boosting performance. In order to match `wc`'s speed, one would have to do some serious profiling, I think. In our current solution, `perf` reports that a notable share of time is spent in:

- `memchr`; I don't think this can be improved
- `<StdinLock as std::io::BufRead>::fill_buf()` and `<StdinLock as std::io::BufRead>::consume()`
A huge part of the remaining time is spent in `main` directly (due to inlining). From the looks of it, we are also paying a bit for the cross-platform abstraction: there is some time spent in `Mutex` methods and the like. But at this point I'm just guessing, because I don't have the time to look into this further. Sorry :<
But please note that `wc` is an old tool which is highly optimized for the platform it's running on and for the task it's performing. I guess that knowledge about Linux internals would help performance a lot. This is really specialized, so I wouldn't expect to match that performance easily.

Here's a version I got from Arnavion on the #rust-beginners IRC channel:
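Arnavion's exact code isn't preserved in this copy. The approach counts newline bytes directly in the `BufRead`'s internal buffer: `fill_buf()` exposes the buffered bytes, `consume()` marks them as read, so nothing is ever copied out. A minimal sketch of that idea:

```rust
use std::io::{self, BufRead};

/// Count newline bytes directly in the reader's internal buffer.
/// `fill_buf()` hands us the buffered bytes, `consume()` marks
/// them as read; no per-line buffer is needed at all.
fn count_newlines<R: BufRead>(mut reader: R) -> io::Result<u64> {
    let mut n = 0u64;
    loop {
        let len = {
            let buf = reader.fill_buf()?;
            if buf.is_empty() {
                break; // empty buffer at fill_buf means EOF
            }
            n += buf.iter().filter(|&&b| b == b'\n').count() as u64;
            buf.len()
        };
        reader.consume(len);
    }
    Ok(n)
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    println!("{}", count_newlines(stdin.lock())?);
    Ok(())
}
```

The plain iterator filter here is my stand-in; a real high-performance version would likely use a SIMD-accelerated byte counter (e.g. the `bytecount` crate, or `memchr`-based counting). Note this counts newline bytes like `wc -l` does, so a trailing line without a final newline is not counted.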
This gets performance very close to what the original `wc` does.

The underlying reason for the slowness is that your version is in no way equivalent to the GNU one, which doesn't allocate any memory for strings, but only moves the file pointer and increments different counters. In addition, it processes raw bytes, while Rust's `String` must be valid UTF-8. (See the GNU wc source.)