I have a column vector A which is 10 elements long. I have a matrix B which is 10 by 10. The memory storage for B is column major. I would like to overwrite the first row in B with the column vector A.
Clearly, I can do:
for (int i = 0; i < 10; i++)
{
    B[0 + 10 * i] = A[i];
}
where I've left the zero in 0 + 10 * i to highlight that B uses column-major storage (zero is the row index).
After some shenanigans in CUDA-land tonight, I wondered whether there is a CPU function that performs a strided memcpy. I guess that, at a low level, performance would depend on the existence of a strided load/store instruction, which I don't recall there being in x86 assembly.
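To make concrete what I mean by a strided memcpy, here is a minimal sketch of the interface I have in mind. The name strided_copy and its parameter order are hypothetical (this is not a standard library function), and I am assuming the elements are doubles, which may not match your case. It is just the loop above generalized to arbitrary strides, not a tuned implementation.

```c
#include <stddef.h>

/* Hypothetical helper, not part of any standard library.
   Copies n elements from src to dst. Strides are counted in elements,
   not bytes: element i is read from src[i * src_stride] and written
   to dst[i * dst_stride]. The element type (double) is an assumption. */
static void strided_copy(double *dst, size_t dst_stride,
                         const double *src, size_t src_stride,
                         size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        dst[i * dst_stride] = src[i * src_stride];
    }
}
```

With this, overwriting the first row of the column-major 10 by 10 matrix would be strided_copy(B, 10, A, 1, 10); and row r would be strided_copy(B + r, 10, A, 1, 10), since consecutive elements of a row sit 10 elements apart in memory.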