What is "vectorization"?

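Several times now I have run into this term in MATLAB, Fortran, and a few other places, but I have never found an explanation of what it means or what it does. So I am asking here: what is vectorization, and what does it mean, for example, that "a loop is vectorized"?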

Answer 1:

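Many CPUs have "vector" or "SIMD" instruction sets, which apply the same operation simultaneously to two, four, or more pieces of data. Modern x86 chips have the SSE instructions, many PPC chips have the "AltiVec" instructions, and even some ARM chips have a vector instruction set, called NEON.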

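"Vectorization" (simplified) is the process of rewriting a loop so that, instead of processing a single element of an array N times, it processes (say) 4 elements of the array simultaneously N/4 times.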

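(I chose 4 because that is what modern hardware is most likely to support directly. The term "vectorization" is also used for a higher-level software transformation, where you abstract the loop away altogether and simply describe operations on whole arrays rather than on the elements that make them up.)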

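The difference between vectorization and loop unrolling: consider the following very simple loop, which adds the elements of two arrays and stores the results in a third array.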

for (int i=0; i<16; ++i)
    C[i] = A[i] + B[i];

Unrolling this loop would transform it into something like this:

for (int i=0; i<16; i+=4) {
    C[i]   = A[i]   + B[i];
    C[i+1] = A[i+1] + B[i+1];
    C[i+2] = A[i+2] + B[i+2];
    C[i+3] = A[i+3] + B[i+3];
}

Vectorizing it, on the other hand, produces something like this:

for (int i=0; i<16; i+=4)
    addFourThingsAtOnceAndStoreResult(&C[i], &A[i], &B[i]);

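Here "addFourThingsAtOnceAndStoreResult" is a placeholder for whatever intrinsic(s) your compiler uses to specify vector instructions. Note that some compilers can auto-vectorize very simple loops like this one, which can often be enabled via a compile option. More complex algorithms still need help from the programmer to generate good vector code.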
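To make the placeholder concrete, here is a minimal sketch, assuming 32-bit floats and an x86 target with SSE. The function name is the hypothetical one from the answer, and add16 is a name made up for this example; _mm_loadu_ps, _mm_add_ps and _mm_storeu_ps are the SSE intrinsics from <xmmintrin.h>. On other targets the sketch falls back to a plain scalar loop.

```c
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper named after the placeholder in the answer:
   adds four floats from a and b with one vector add, stores them in c. */
void addFourThingsAtOnceAndStoreResult(float *c, const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);           /* load a[0..3] (unaligned load) */
    __m128 vb = _mm_loadu_ps(b);           /* load b[0..3] */
    _mm_storeu_ps(c, _mm_add_ps(va, vb));  /* c[0..3] = a[0..3] + b[0..3] */
}
#else
/* Scalar fallback for targets without SSE. */
void addFourThingsAtOnceAndStoreResult(float *c, const float *a, const float *b)
{
    for (int k = 0; k < 4; ++k)
        c[k] = a[k] + b[k];
}
#endif

/* The vectorized loop from the answer, for arrays of 16 floats. */
void add16(float *C, const float *A, const float *B)
{
    for (int i = 0; i < 16; i += 4)
        addFourThingsAtOnceAndStoreResult(&C[i], &A[i], &B[i]);
}
```

In practice you may not need to write this by hand: with GCC, for instance, the auto-vectorizer is part of -O3, and it can often produce equivalent code from the original scalar loop.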

Answer 2:

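Vectorization is the term for converting a scalar program into a vector program. Vectorized programs can run multiple operations from a single instruction, whereas scalar programs can operate on only one pair of operands at a time.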

From Wikipedia:

Scalar approach:

for (i = 0; i < 1024; i++)
{
   C[i] = A[i]*B[i];
}

Vectorized approach:

for (i = 0; i < 1024; i+=4)
{
   C[i:i+3] = A[i:i+3]*B[i:i+3];
}
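The slice notation C[i:i+3] above is pseudocode, not C. As a rough sketch of how the same loop might look with SSE intrinsics, assuming float arrays and an x86 target: multiply_arrays and HAVE_SSE are names invented for this example, and the trailing scalar loop handles lengths that are not a multiple of 4 (the Wikipedia example sidesteps this because 1024 is).

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  /* SSE intrinsics */
#define HAVE_SSE 1
#endif

/* Hypothetical routine: c[i] = a[i] * b[i] for i in [0, n). */
void multiply_arrays(float *c, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#ifdef HAVE_SSE
    /* Vector part: four products per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_mul_ps(va, vb));
    }
#endif
    /* Scalar remainder (or the whole array when SSE is unavailable). */
    for (; i < n; ++i)
        c[i] = a[i] * b[i];
}
```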

Answer 3:

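It refers to the ability to perform a single mathematical operation on a list (or "vector") of numbers in a single step. You often see it with Fortran because Fortran is associated with scientific computing, which in turn is associated with supercomputing, where vectorized arithmetic first appeared. Nowadays almost all desktop CPUs offer some form of vectorized arithmetic, through technologies like Intel's SSE. GPUs also offer a form of vectorized arithmetic.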

Answer 4:

Vectorization is used heavily in scientific computing, where huge chunks of data need to be processed efficiently.

In real programming applications, I know it is used in NumPy (I am not sure about others).

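NumPy (a package for scientific computing in Python) uses vectorization for fast manipulation of n-dimensional arrays, which is generally slower when done with Python's built-in options for handling arrays.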

Although tons of explanations are out there, here is how vectorization is defined in the NumPy documentation:

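Vectorization describes the absence of any explicit looping, indexing, etc., in the code. These things are taking place, of course, just "behind the scenes" in optimized, pre-compiled C code. Vectorized code has many advantages, among which are: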

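vectorized code is more concise and easier to read
fewer lines of code generally means fewer bugs
the code more closely resembles standard mathematical notation (making it easier, typically, to correctly code mathematical constructs)
vectorization results in more "Pythonic" code. Without vectorization, our code would be littered with inefficient and difficult to read for loops.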
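The "behind the scenes" part of that definition can be mimicked in C. This is only an analogy, not NumPy's actual implementation: array_scale_add and celsius_to_fahrenheit are hypothetical names. The point is that the caller expresses one whole-array operation while the explicit loop lives inside a compiled routine.

```c
#include <stddef.h>

/* Hypothetical library routine: out[i] = x[i] * scale + offset.
   The explicit loop lives here, in compiled code, not in the caller. */
void array_scale_add(double *out, const double *x, size_t n,
                     double scale, double offset)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = x[i] * scale + offset;
}

/* Caller-side code: one call describes the whole-array operation,
   much like writing out = x * 1.8 + 32 on a NumPy array. */
void celsius_to_fahrenheit(double *out, const double *celsius, size_t n)
{
    array_scale_add(out, celsius, n, 1.8, 32.0);
}
```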

Answer 5:

Vectorization, in simple words, means optimizing the algorithm so that it can utilize SIMD instructions in the processors.

AVX, AVX2 and AVX-512 are instruction sets (Intel) that perform the same operation on multiple data elements in one instruction. For example, AVX-512 lets you operate on 16 integer values (4 bytes each) at a time. Suppose you have a vector of 16 integers and you want to double each value and then add 10 to it. You can either load the values into general-purpose registers one at a time and perform the same operation 16 times, or you can load all 16 values into a single 512-bit SIMD register (zmm; the narrower xmm and ymm registers hold 4 and 8 such integers) and perform the operation once. This speeds up computation on vector data.
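A hedged sketch of the "double each value, then add 10" example. AVX-512 needs specific hardware and compiler flags, so this version assumes only SSE2, which processes 4 integers per instruction instead of 16; the pattern is the same with wider registers. double_plus_ten and HAVE_SSE2 are names invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#define HAVE_SSE2 1
#endif

/* Hypothetical routine: v[i] = 2 * v[i] + 10 for i in [0, n). */
void double_plus_ten(int32_t *v, size_t n)
{
    size_t i = 0;
#ifdef HAVE_SSE2
    const __m128i ten = _mm_set1_epi32(10);  /* {10, 10, 10, 10} */
    for (; i + 4 <= n; i += 4) {
        __m128i x = _mm_loadu_si128((const __m128i *)&v[i]);
        x = _mm_add_epi32(x, x);             /* double all four lanes */
        x = _mm_add_epi32(x, ten);           /* add 10 to all four lanes */
        _mm_storeu_si128((__m128i *)&v[i], x);
    }
#endif
    /* Scalar remainder (or the whole array when SSE2 is unavailable). */
    for (; i < n; ++i)
        v[i] = 2 * v[i] + 10;
}
```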

In vectorization we use this to our advantage, by remodelling our data so that we can perform SIMD operations on it and speed up the program.

One difficulty with vectorization is handling conditions, because conditions branch the flow of execution. This can be handled by masking, that is, by modelling the condition as an arithmetic operation. For example, suppose we want to add 10 to a value if it is greater than 100. We can either write:

if(x[i] > 100) x[i] += 10; // this will branch execution flow.

or we can model the condition as an arithmetic operation by creating a condition vector c:

c[i] = -(x[i] > 100);       // mask (signed int): all bits set (-1) where the condition holds, 0 elsewhere
x[i] = x[i] + (c[i] & 10);  // adds 10 where the mask is set, 0 otherwise

This is a very trivial example, though. Here c is our masking vector, which we use to perform a bitwise operation based on its value. This avoids branching the execution flow and enables vectorization.
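For what it is worth, this is roughly how the masking idea maps onto real SIMD instructions. The sketch assumes SSE2 and 32-bit signed integers; add_ten_if_above_100 and HAVE_SSE2 are names invented for this example. The vector compare _mm_cmpgt_epi32 produces all ones in each lane where the condition holds and zero elsewhere, which is what makes the bitwise AND with 10 work.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#define HAVE_SSE2 1
#endif

/* Hypothetical routine: adds 10 to every x[i] that is greater than 100. */
void add_ten_if_above_100(int32_t *x, size_t n)
{
    size_t i = 0;
#ifdef HAVE_SSE2
    const __m128i limit = _mm_set1_epi32(100);
    const __m128i ten   = _mm_set1_epi32(10);
    for (; i + 4 <= n; i += 4) {
        __m128i v    = _mm_loadu_si128((const __m128i *)&x[i]);
        __m128i mask = _mm_cmpgt_epi32(v, limit);  /* all ones where v > 100, else 0 */
        __m128i add  = _mm_and_si128(mask, ten);   /* 10 where mask is set, else 0 */
        _mm_storeu_si128((__m128i *)&x[i], _mm_add_epi32(v, add));
    }
#endif
    /* Scalar remainder, written branch-free in the same style. */
    for (; i < n; ++i) {
        int32_t mask = -(int32_t)(x[i] > 100);     /* -1 (all ones) or 0 */
        x[i] += mask & 10;
    }
}
```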

Vectorization is as important as parallelization, so we should make use of it as much as possible. Most modern processors have SIMD instructions for heavy compute workloads. We can optimize our code to use these SIMD instructions through vectorization; this is similar to parallelizing our code to run on the multiple cores available on modern processors.

I would like to close by mentioning OpenMP, which lets you vectorize code using pragmas. I consider it a good starting point. The same can be said for OpenACC.
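As a small illustration of the OpenMP route, here is a sketch using the simd directive (available since OpenMP 4.0). The function saxpy is a hypothetical example, and the sketch assumes that x and y do not overlap.

```c
#include <stddef.h>

/* Hypothetical routine: y[i] = a * x[i] + y[i] for i in [0, n).
   Assumes x and y point to non-overlapping arrays. */
void saxpy(float *y, const float *x, float a, size_t n)
{
    /* Asks the compiler to vectorize the loop below. It takes effect only
       when OpenMP (or OpenMP SIMD) support is enabled at compile time;
       otherwise the pragma is ignored and the loop runs as plain C. */
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

With GCC, for example, the directive is honored when compiling with -fopenmp or -fopenmp-simd.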