int MAX_DIM = 100;
float a[MAX_DIM][MAX_DIM] __attribute__ ((aligned(16)));
float b[MAX_DIM][MAX_DIM] __attribute__ ((aligned(16)));
float d[MAX_DIM][MAX_DIM] __attribute__ ((aligned(16)));
/*
 * I fill these arrays with some values
 */
for (int i = 0; i < MAX_DIM; i += 1) {
    for (int j = 0; j < MAX_DIM; j += 4) {
        for (int k = 0; k < MAX_DIM; k += 4) {
            __m128 result = _mm_load_ps(&d[i][j]);
            __m128 a_line = _mm_load_ps(&a[i][k]);
            __m128 b_line0 = _mm_load_ps(&b[k][j+0]);
            __m128 b_line1 = _mm_loadu_ps(&b[k][j+1]);
            __m128 b_line2 = _mm_loadu_ps(&b[k][j+2]);
            __m128 b_line3 = _mm_loadu_ps(&b[k][j+3]);
            result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0x00), b_line0));
            result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0x55), b_line1));
            result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0xaa), b_line2));
            result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0xff), b_line3));
            _mm_store_ps(&d[i][j], result);
        }
    }
}
I wrote the code above to do matrix multiplication using SSE. It works as follows: I take 4 elements from a row of a, multiply them by 4 elements from a column of b, then move on to the next 4 elements in the row of a and the next 4 elements in the column of b.
I get the error: Segmentation fault (core dumped)
I don't really know why.
I use gcc 5.4.0 on Ubuntu 16.04.5.
Edit:
The segmentation fault was solved by using _mm_loadu_ps.
Also, there is something wrong with the logic. I would be grateful if someone could help me find it.
The segmentation fault was solved by _mm_loadu_ps
Also there is something wrong with logic...
You're loading 4 overlapping windows on b[k][j+0..6]. (This is why you needed loadu.) Perhaps you meant to load b[k][j+0], +4, +8, +12? If so, you should align b by 64, so all four loads come from the same cache line (for performance). Strided access is not great, but using all 64 bytes of every cache line you touch is a lot better than getting row-major vs. column-major totally wrong in scalar code with no blocking.
I take 4 elements from row from a
multiply it by 4 elements from a column from b
I'm not sure your text description describes your code. Unless you've already transposed b, you can't load multiple values from the same column with a SIMD load, because they aren't contiguous in memory.
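To illustrate the point about transposing, here is a minimal sketch under stated assumptions: DIM, transpose, hsum_ps and matmul_bt are hypothetical names, not anything from the question, and the matrices are small and square with a size that is a multiple of 4. Once b has been transposed into bt, the elements b[k..k+3][j] sit next to each other in bt[j][k..k+3], so a single SIMD load can fetch four values of one column. Unaligned loads are used so the sketch does not depend on how the arrays are aligned.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

#define DIM 8  /* hypothetical size for this sketch; must be a multiple of 4 */

/* Write the transpose of src into dst, so dst[j][k] == src[k][j]. */
static void transpose(float src[DIM][DIM], float dst[DIM][DIM])
{
    for (int r = 0; r < DIM; r++)
        for (int c = 0; c < DIM; c++)
            dst[c][r] = src[r][c];
}

/* Add the four lanes of v together and return the scalar sum. */
static float hsum_ps(__m128 v)
{
    __m128 hi   = _mm_movehl_ps(v, v);               /* lanes: v2, v3, v2, v3 */
    __m128 sum2 = _mm_add_ps(v, hi);                 /* lane 0: v0+v2, lane 1: v1+v3 */
    __m128 odd  = _mm_shuffle_ps(sum2, sum2, 0x55);  /* broadcast lane 1 */
    return _mm_cvtss_f32(_mm_add_ss(sum2, odd));
}

/* d = a * b, where bt must already hold the transpose of b. */
static void matmul_bt(float a[DIM][DIM], float bt[DIM][DIM], float d[DIM][DIM])
{
    for (int i = 0; i < DIM; i++) {
        for (int j = 0; j < DIM; j++) {
            __m128 acc = _mm_setzero_ps();
            for (int k = 0; k < DIM; k += 4) {
                __m128 a_vec = _mm_loadu_ps(&a[i][k]);   /* a[i][k..k+3] */
                __m128 b_col = _mm_loadu_ps(&bt[j][k]);  /* b[k..k+3][j], now contiguous */
                acc = _mm_add_ps(acc, _mm_mul_ps(a_vec, b_col));
            }
            d[i][j] = hsum_ps(acc);  /* dot product of row i of a and column j of b */
        }
    }
}
```

This costs one horizontal sum per output element, so it is mainly useful for seeing the layout issue; a loop that broadcasts a[i][k] against contiguous rows of b avoids both the transpose and the horizontal sums.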
C multidimensional arrays are "row major": the last index is the one that varies most quickly when moving to the next higher memory address. Did you think that _mm_loadu_ps(&b[k][j+1]) was going to give you b[k+0..3][j+1]? If so, this is a duplicate of SSE matrix-matrix multiplication. (That question uses 32-bit integers, not 32-bit floats, but it has the same layout problem. See it for a working loop structure.)
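As a rough sketch of one loop structure that respects row-major layout, the version below keeps the shuffle-and-accumulate shape of the code in the question but pairs each broadcast element a[i][k+n] with the row b[k+n][j..j+3] instead of a shifted window of row k. It assumes the intent is d += a * b. DIM and matmul_sse are hypothetical names for this example, and unaligned loads and stores are used so it works regardless of alignment (the aligned variants could be substituted when every row start is 16-byte aligned).

```c
#include <xmmintrin.h>  /* SSE intrinsics */

#define DIM 8  /* hypothetical size for this sketch; must be a multiple of 4 */

/* d += a * b for row-major DIM x DIM float matrices. */
static void matmul_sse(float a[DIM][DIM], float b[DIM][DIM], float d[DIM][DIM])
{
    for (int i = 0; i < DIM; i++) {
        for (int j = 0; j < DIM; j += 4) {
            /* Keep the four outputs d[i][j..j+3] in a register across the k loop. */
            __m128 result = _mm_loadu_ps(&d[i][j]);
            for (int k = 0; k < DIM; k += 4) {
                __m128 a_line = _mm_loadu_ps(&a[i][k]);      /* a[i][k..k+3] */
                __m128 b_row0 = _mm_loadu_ps(&b[k + 0][j]);  /* b[k+0][j..j+3] */
                __m128 b_row1 = _mm_loadu_ps(&b[k + 1][j]);  /* b[k+1][j..j+3] */
                __m128 b_row2 = _mm_loadu_ps(&b[k + 2][j]);  /* b[k+2][j..j+3] */
                __m128 b_row3 = _mm_loadu_ps(&b[k + 3][j]);  /* b[k+3][j..j+3] */
                /* Each shuffle broadcasts one element of a_line to all four lanes. */
                result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0x00), b_row0));
                result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0x55), b_row1));
                result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0xaa), b_row2));
                result = _mm_add_ps(result, _mm_mul_ps(_mm_shuffle_ps(a_line, a_line, 0xff), b_row3));
            }
            _mm_storeu_ps(&d[i][j], result);
        }
    }
}
```

Because the function accumulates into d, the caller would zero d first to get a plain product. This sketch does no cache blocking, so it only shows the indexing, not a tuned matmul.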
To debug this, put a simple pattern of values into b[], like:
#include <stdalign.h>
alignas(64) float b[MAX_DIM][MAX_DIM];
// Row 0 holds 0, 1, 2, 3, ...; row 1 holds 100, 101, 102, ...; row 2 holds 200, 201, 202, ...
for (int i = 0; i < MAX_DIM; i++)
    for (int j = 0; j < MAX_DIM; j++)
        b[i][j] = 100 * i + j;
Then when you step through your code in the debugger, you can see what values end up in your vectors.
For your a[][] values, maybe use 90000.0 + 100 * i + j so if you're looking at registers (instead of C variables) you can still tell which values are a and which are b.
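The pattern trick can also be tried outside a debugger. The following is a small sketch, assuming a hypothetical 8x8 matrix and two hypothetical helpers (fill_pattern and load_window) that are not part of the question's code. It fills a matrix with base + 100 * i + j and copies out what a single _mm_loadu_ps actually fetches, so the row and column of every loaded value can be read straight from its digits.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

#define DIM 8  /* hypothetical small size, just for experimenting */

/* Fill m so that m[i][j] == base + 100 * i + j. */
static void fill_pattern(float m[DIM][DIM], float base)
{
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            m[i][j] = base + 100.0f * (float)i + (float)j;
}

/* Copy the four floats that _mm_loadu_ps(&m[k][j]) loads into out[0..3].
 * The caller must make sure j + 3 < DIM. */
static void load_window(float m[DIM][DIM], int k, int j, float out[4])
{
    __m128 v = _mm_loadu_ps(&m[k][j]);
    _mm_storeu_ps(out, v);
}
```

With the matrix filled using base 0, load_window(b, 2, 1, out) leaves 201, 202, 203, 204 in out: four neighbours in row 2, not the column values 201, 301, 401, 501.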
Related:
- Ulrich Drepper's What Every Programmer Should Know About Memory shows an optimized matmul with cache-blocking with SSE intrinsics for double-precision. It should be straightforward to adapt for float.
- How does BLAS get such extreme performance? (You might want to just use an optimized matmul library; tuning matmul for optimal cache-blocking is non-trivial but important.)
- Matrix Multiplication with blocks
- Poor maths performance in C vs Python/numpy has some links to other questions
- how to optimize matrix multiplication (matmul) code to run fast on a single processor core