Why is it worse to initialize a two dimensional array like this?

Posted 2019-01-18 23:57

Question:

for(int i = 0; i<100; i++)

    for(int j = 0; j<100; j++)

         array[j][i] = 0;
         // array[i][j] = 0;

My professor said it was much more costly to initialize a two-dimensional array the first way as opposed to the second. Can someone explain what is going on under the hood that makes this the case? Or do the two means of initialization have equal performance?

Answer 1:

As @dlev mentioned, this is due to locality of reference and has to do with how the physical hardware in the computer works.

Inside the computer, there are many different types of memory. Typically, only certain memory locations (registers) can have actual operations performed on them; the rest of the time, if you're performing operations on data, you have to load it from memory into a register, perform some computation, then write it back.

Main memory (RAM) is much, much slower than registers, often by a factor of hundreds to thousands. Consequently, reading from memory should be avoided if at all possible. To address this, most computers typically have special regions of memory called caches. The job of the cache is to hold data that has recently been accessed from memory such that if that same memory region is accessed again, the value can be pulled from the cache (fast) rather than from main memory (slow). Typically, caches are designed so that if a value is read in from memory, that value, plus a whole bunch of adjacent values, are pulled into the cache. That way, if you iterate over an array, then after reading the first value, the rest of the values from the array will be sitting in the cache and can be accessed more efficiently.

The reason that your code is slower than it needs to be is that it doesn't access the array elements sequentially. In C, 2D arrays are laid out in row-major order, meaning that the memory is arranged as

A[0][0] A[0][1] A[0][2] ... A[1][0] A[1][1] A[1][2] ... A[2][0] A[2][1] A[2][2] ...
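To make that layout concrete, here is a minimal sketch (my own illustration, not part of the original answer) that checks the row-major address arithmetic for a small N×M array of ints:

#include <assert.h>
#include <stdio.h>

#define N 3
#define M 4

int main(void)
{
    int A[N][M];

    /* In row-major order, A[i][j] lives (i * M + j) elements past the
       start of the array, so consecutive values of j are adjacent in memory. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            assert(&A[i][j] == &A[0][0] + (i * M + j));

    printf("row-major layout confirmed for a %dx%d array\n", N, M);
    return 0;
}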

Consequently, if you use this for loop:

for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        // Do something with A[i][j]
    }
}

Then you get excellent locality, because you will be accessing array elements in the order in which they appear in memory. This makes the number of reads of main memory very small, since everything is typically in cache and ready to go.

However, if you interchange the loops, as you've done, your accesses jump around in memory and are not necessarily consecutive. This means that you will have a lot of cache misses in which the memory address you read next isn't in the cache. This increases the number of cache loads, which can dramatically slow down the program.

Compilers are starting to get smart enough to interchange loops like this automatically, but we're still a ways away from being able to ignore these details. As a general rule, when writing C or C++ code for multidimensional arrays, try to iterate in row-major order rather than column-major order. You can get noticeable speedups in your program.

Hope this helps!



Answer 2:

I'll probably get downvoted for this, but if you are programming C, then the "best" is most likely:

memset(array, 0, sizeof(array));

Then you can defer all responsibility of optimizing (which you are obviously worried about) to the implementation of memset. Any specific hardware advantages can be done there.
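As a sketch of what that looks like (assuming the array is declared as an actual array of known size in the current scope, e.g. int array[100][100] as in the question; the helper name is just for illustration):

#include <string.h>

int array[100][100];

void clear_array(void)
{
    /* sizeof(array) is the size of the whole 2D array in bytes here.
       This only works because array is a real array in this scope; if it
       had decayed to a pointer (e.g. a function parameter), sizeof would
       give the size of the pointer instead. */
    memset(array, 0, sizeof(array));
}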

http://en.wikipedia.org/wiki/Sizeof#Using_sizeof_with_arrays

http://www.cplusplus.com/reference/clibrary/cstring/memset/

Another observation: if you are initializing to zero, ask yourself why you need to at all. If your array is static (which, at this size, it probably is), then the C startup code will zero it for you. Again, this will probably use the most efficient method for your hardware.
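For example (assuming int elements, just to illustrate the point):

/* Objects with static storage duration are zero-initialized by the C
   startup code before main() runs, so no explicit loop or memset is needed. */
static int array[100][100];

int main(void)
{
    return array[42][7];  /* guaranteed to be 0 */
}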



Answer 3:

I'm a bit late to the party, and there is an excellent answer already. However, I thought I could contribute by demonstrating how one could answer this question experimentally using a profiling tool (on Linux).

I'll use the perf tool in the Ubuntu 10.10 package linux-tools-common.

Here's the little C program I wrote to answer this question:

// test.c
#define DIM 1024

int main()
{
    int v[DIM][DIM];
    unsigned i, j;

    for (i = 0; i < DIM; i++) {
        for (j = 0; j < DIM; j++) {
#ifdef ROW_MAJOR_ORDER
            v[i][j] = 0;
#else
            v[j][i] = 0;
#endif
        }
    }

    return 0;
}

Then compile the two different versions:

$ gcc test.c -O0 -DROW_MAJOR_ORDER -o row-maj
$ gcc test.c -O0 -o row-min

Note I've disabled optimization with -O0 so gcc has no chance to rearrange our loop to be more efficient.

We can list the performance counters available to perf by running perf list. In this case, we are interested in cache misses, which is the cache-misses event.
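For example, something along these lines should show the cache-related events on your machine (exact output varies by kernel and CPU):

$ perf list | grep cache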

Now it's as simple as running each version of the program numerous times and taking an average:

$ perf stat -e cache-misses -r 100 ./row-min

 Performance counter stats for './row-min' (100 runs):

             286468  cache-misses               ( +-   0.810% )

        0.016588860  seconds time elapsed   ( +-   0.926% )

$ perf stat -e cache-misses -r 100 ./row-maj

 Performance counter stats for './row-maj' (100 runs):

               9594  cache-misses               ( +-   1.203% )

        0.006791615  seconds time elapsed   ( +-   0.840% )

And now we've experimentally verified that you do in fact see far more cache misses (roughly 30 times more here) with the "row-minor" version.



Answer 4:

If you look at the memory locations accessed by each technique, the second will access consecutive bytes, while the first will hop around in leaps of a full row (100 elements, i.e. 400 bytes if the elements are 4-byte ints). The memory cache will work much more efficiently if you do it the second way.