What does vectorization mean?

2020-02-20 07:08发布

问题:

Is it a good idea to vectorize the code? What are good practices in terms of when to do it? What happens underneath?

回答1:

Vectorization means that the compiler detects that your independent instructions can be executed as one SIMD instruction. Usual example is that if you do something like

for(i=0; i<N; i++){
  a[i] = a[i] + b[i];
}

It will be vectorized as (using vector notation)

for (i=0; i<(N-N%VF); i+=VF){
  a[i:i+VF] = a[i:i+VF] + b[i:i+VF];
}

Basically the compiler picks one operation that can be done on VF elements of the array at the same time and does this N/VF times instead of doing the single operation N times.

It increases performance, but puts more requirement on the architecture.



回答2:

As mentioned above, vectorization is used to make use of SIMD instructions, which can perform identical operations of different data packed into large registers.

A generic guideline to enable a compiler to autovectorize a loop is to ensure that there are no flow- and anti-dependencies b/w data elements in different iterations of a loop.

http://en.wikipedia.org/wiki/Data_dependency

Some compilers like the Intel C++/Fortran compilers are capable of autovectorizing code. In case it was not able to vectorize a loop, the Intel compiler is capable of reporting why it could not do that. There reports can be used to modify the code such that it becomes vectorizable (assuming it's possible)

Dependencies are covered in depth in the book 'Optimizing Compilers for Modern Architectures: A Dependence-based Approach'



回答3:

It's SSE code Generation.

You have a loop with float matrix code in it matrix1[i][j] + matrix2[i][j] and the compiler generates SSE code.



回答4:

Vectorization need not be limited to single register which can hold large data. Like using '128' bit register to hold '4 x 32' bit data. It depends on architectural limitations. Some architecture have different execution units which have registers of their own. In that case, a part of the data can be fed to that execution unit and the result can be taken from a register corresponding to that execution unit.

For example, consider the below case.

for(i=0; i < N; i++)
{
a[i] = a[i] + b[i];
}



If I am working on an architecture which has two execution units, then my vector size is defined as two. The loop mentioned above will be reframed as

for(i=0; i<(N/2); i+=2)
{
a[i] = a[i] + b[i] ;


a[i+1] = a[i+1] + b[i+1];
}

NOTE: The 2 inside the for statement is derived from the vector size.

As I am having two execution units the two statements inside the loop will be fed into the two execution units. The sum will be accumulated in the execution units separately. Finally the sum of accumulated values (from two execution units) will be carried out.

The good practices are
1. The constraints like dependency (between different iterations of the loop) needs to be checked before vectorizing the loop.
2. Function calls needs to be prevented.
3. Pointer access can create aliasing and it needs to be prevented.



回答5:

Maybe also have a look at libSIMDx86 (source code).

A nice example well explained is:

Choosing to Avoid Branches: A Small Altivec Example