Is there any way to allocate a standard Rust array

2020-04-07 05:09发布

There are several questions already on Stack Overflow about allocating an array (say [i32]) on the heap. The general recommendation is boxing, e.g. Box<[i32]>. But while boxing works fine enough for smaller arrays, the problem is that the array being boxed has to first be allocated on the stack.

So if the array is too large (say 10 million elements), you will - even with boxing - get a stack overflow (one is unlikely to have a stack that large).

The suggestion then is using Vec<T> instead, that is Vec<i32> in our example. And while that does do the job, it does have a performance impact.

Consider the following program:

fn main() {
    const LENGTH: usize = 10_000;
    let mut a: [i32; LENGTH] = [0; LENGTH];
    for j in 0..LENGTH {
        for i in 0..LENGTH {
            a[i] = j as i32;
        }
    }
}

time tells me that this program takes about 2.9 seconds to run. I use 10'000 in this example, so I can allocate that on the stack, but I really want one with 10 million.

Now consider the same program but with Vec<T> instead:

fn main() {
    const LENGTH: usize = 10_000;
    let mut a: Vec<i32> = vec![0; LENGTH];
    for j in 0..LENGTH {
        for i in 0..LENGTH {
            a[i] = j as i32;
        }
    }
}

time tells me that this program takes about 5 seconds to run. Now time isn't super exact, but the difference of about 2 seconds for such a simple program is not an insignificant impact.

Storage is storage, the program with array is just as fast when the array is boxed. So it's not the heap slowing the Vec<T> version down, but the Vec<T> structure itself.

I also tried with a HashMap (specifically HashMap<usize, i32> to mimic an array structure), but that's far slower than the Vec<T> solution.

If my LENGTH had been 10 million, the first version wouldn't even have run.

If that's not possible, is there a structure that behaves like an array (and Vec<T>) on the heap, but can match the speed and performance of an array?

标签: arrays rust heap
2条回答
祖国的老花朵
2楼-- · 2020-04-07 05:18

If you really want a heap-allocated array, i.e. Box<[i32; LENGTH]> then you can use:

fn main() {
    const LENGTH: usize = 10_000_000;

    let mut a = {
        let mut v: Vec<i32> = Vec::with_capacity(LENGTH);

        // Explicitly set length which is safe since the allocation is
        // sized correctly.
        unsafe { v.set_len(LENGTH); };

        // While not required for this particular example, in general
        // we want to initialize elements to a known value.
        let mut slice = v.into_boxed_slice();
        for i in &mut slice[..] {
            *i = 0;
        }

        let raw_slice = Box::into_raw(slice);

        // Using `from_raw` is safe as long as the pointer is
        // retrieved using `into_raw`.
        unsafe {
            Box::from_raw(raw_slice as *mut [i32; LENGTH])
        }
    };

    // This is the micro benchmark from the question.
    for j in 0..LENGTH {
        for i in 0..LENGTH {
            a[i] = j as i32;
        }
    }
}

It's not going to be faster than using a vector since Rust does bounds-checking even on arrays, but it has a smaller interface which might make sense in terms of software design.

查看更多
闹够了就滚
3楼-- · 2020-04-07 05:24

Summary: your benchmark is flawed; just use a Vec (as described here), possibly with into_boxed_slice, as it is incredibly unlikely to be slower than a heap allocated array.


Unfortunately, your benchmarks are flawed. First of all, you probably didn't compile with optimizations (--release for cargo, -O for rustc). Because if you would have, the Rust compiler would have removed all of your code. See the assembly here. Why? Because you never observe the vector/array, so there is no need to do all that work in the first place.

Also, your benchmark is not testing what you actually want to test. You are comparing an stack-allocated array with a heap-allocated vector. You should compare the Vec to a heap allocated array.

Don't feel bad though: writing benchmarks is incredible hard for many reasons. Just remember: if you don't know a lot about writing benchmarks, better don't trust your own benchmarks without asking others first.


I fixed your benchmark and included all three possibilities: Vec, array on stack and array on heap. You can find the full code here. The results are:

running 3 tests
test array_heap  ... bench:   9,600,979 ns/iter (+/- 1,438,433)
test array_stack ... bench:   9,232,623 ns/iter (+/- 720,699)
test vec_heap    ... bench:   9,259,912 ns/iter (+/- 691,095)

Surprise: the difference between the versions are way less than the variance of the measurement. So we can assume they all are pretty equally fast.

Note that this benchmark is still pretty bad. The two loops can just be replaced by one loop setting all array elements to LENGTH - 1. From taking a quick look at the assembly (and from the rather long time of 9ms), I think that LLVM is not smart enough to actually perform this optimization. But things like this are important and one should be aware of that.


Finally, let's discuss why both solutions should be equally fast and whether there are actually differences in speed.

The data section of a Vec<T> has exactly the same memory layout as a [T]: just many Ts contiguously in memory. Super simple. This also means both exhibit the same caching-behavior (specifically, being very cache-friendly).

The only difference is that a Vec might have more capacity than elements. So Vec itself stores (pointer, length, capacity). That is one word more than a simple (boxed) slice (which stores (pointer, length)). A boxed array doesn't need to store the length, as it's already in the type, so it is just a simple pointer. Whether or not we store one, two or three words is not really important when you will have millions of elements anyway.

Accessing one element is the same for all three: we do a bounds check first and then calculate the target pointer via base_pointer + size_of::<T>() * index. But it's important to note that the array storing its length in the type means that the bounds check can be removed more easily by the optimizer! This can be a real advantage.

However, bounds checks are already usually removed by the smart optimizer. In the benchmark code I posted above, there are no bounds checks in the assembly. So while a boxed array could be a bit faster by removed bounds checks, (a) this will be a minor performance difference and (b) it's very unlikely that you will have a lot of situations where the bound check is removed for the array but not for the Vec/slice.

查看更多
登录 后发表回答