Why is my Rust version of “wc” slower than the one

2020-07-10 09:44发布

问题:

Consider this program:

use std::io::BufRead;
use std::io;

fn main() {
    let mut n = 0;
    let stdin = io::stdin();
    for _ in stdin.lock().lines() {
        n += 1;
    }
    println!("{}", n);
}

Why is it over 10x as slow as the GNU version of wc? Take a look at how I measure it:

$ yes | dd count=1000000 | wc -l
256000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 1.16586 s, 439 MB/s
$ yes | dd count=1000000 | ./target/release/wc
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 41.685 s, 12.3 MB/s
256000000

回答1:

There are many reasons why your code is way slower than the original wc. There are a few things you pay for which you actually don't need at all. By removing those, you can already get a quite substantial speed boost.

Heap allocations

BufRead::lines() returns an iterator which yields String elements. Due to this design, it will (it has to!) allocate memory for every single line. The lines() method is a convenient method to write code easily, but it isn't supposed to be used in high performance situations.

To avoid allocating heap memory for every single line, you can use BufRead::read_line() instead. The code is a bit more verbose, but as you can see, we are reusing the heap memory of s:

let mut n = 0;
let mut s = String::new();
let stdin = io::stdin();
let mut lock = stdin.lock();
loop {
    s.clear();
    let res = lock.read_line(&mut s);
    if res.is_err() || res.unwrap() == 0 {
        break;
    }
    n += 1;
}
println!("{}", n);

On my notebook, this results in:

$ yes | dd count=1000000 | wc -l
256000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 0,981827 s, 521 MB/s

$ yes | dd count=1000000 | ./wc 
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 6,87622 s, 74,5 MB/s
256000000

As you can see, it improved things a lot, but is still not equivalent.

UTF-8 validation

Since we are reading into a String, we are validating the raw input from stdin to be proper UTF-8. This costs time! But we are only interested in the raw bytes, since we only need to count the newline characters (0xA). We can get rid of UTF-8 checks by using a Vec<u8> and BufRead::read_until():

let mut n = 0;
let mut v = Vec::new();
let stdin = io::stdin();
let mut lock = stdin.lock();
loop {
    v.clear();
    let res = lock.read_until(0xA, &mut v);
    if res.is_err() || res.unwrap() == 0 {
        break;
    }
    n += 1;
}
println!("{}", n);

This results in:

1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 4,24162 s, 121 MB/s
256000000

That's a 60% improvement. But the original wc is still faster by a factor of 3.5x!

Further possible improvements

Now we used up all the low hanging fruit to boost performance. In order to match wcs speed, one would have to do some serious profiling, I think. In our current solution, perf reports the following:

  • around 11% of the time is spent in memchr; I don't think this can be improved
  • around 18% is spent in <StdinLock as std::io::BufRead>::fill_buf() and
  • around 6% is spent in <StdinLock as std::io::BufRead>::consume()

A huge part of the remaining time is spent in main directly (due to inlining). From the looks of it, we are also paying a bit for the cross platform abstraction. There is some time spent for Mutex methods and stuff.

But at this point, I'm just guessing, because I don't have the time to look into this further. Sorry :<

But please note, that wc is an old tool and is highly optimized for the platform it's running on and for the task it's performing. I guess that knowledge about Linux internal things would help performance a lot. This is really specialized, so I wouldn't expect to match the performance that easily.



回答2:

This is because your version is in no way equivalent to the GNU one which doesn't allocate any memory for strings, but only moves the file pointer and increments different counters. In addition, it processes raw bytes while Rust's String must be valid UTF-8.

GNU wc source



回答3:

Here's a version I got from Arnavion on #rust-beginners IRC:

use std::io::Read;

fn main() {
    let mut buffer = [0u8; 1024];
    let stdin = ::std::io::stdin();
    let mut stdin = stdin.lock();
    let mut wc = 0usize;
    loop {
        match stdin.read(&mut buffer) {
            Ok(0) => {
                break;
            },
            Ok(len) => {
                wc += buffer[0..len].into_iter().filter(|&&b| b == b'\n').count();
            },
            Err(err) => {
                panic!("{}", err);
            },
        }
    };
    println!("{}", wc);
}

This gets performance very close to what original wc does.