Why do we need aligned memory for SSE/AVX?
One answer I often get is that an aligned memory load is much faster than an unaligned one. But why is an aligned load so much faster than an unaligned load?
This is not specific to SSE (or even to x86). On most architectures, loads and stores need to be naturally aligned; otherwise they either (a) generate an exception or (b) need two or more cycles plus some fix-up in order to handle the misaligned load/store transparently. On x86, (b) is true for data types smaller than 16 bytes, but (a) is true for SSE data types unless you explicitly use the misaligned versions of the load/store instructions (for example MOVUPS rather than MOVAPS), which can handle misaligned data.
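To make the distinction concrete, here is a minimal sketch in C using the SSE intrinsics from `<xmmintrin.h>`. The helper names `is_aligned_16` and `sum4` are made up for illustration; `_mm_load_ps` is the aligned load and `_mm_loadu_ps` is the misaligned-tolerant one. It assumes an x86 target with SSE enabled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_load_ps, _mm_loadu_ps */

/* Hypothetical helper: nonzero if p sits on a 16-byte boundary. */
static int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: sum four consecutive floats starting at p. */
static float sum4(const float *p)
{
    assert(p != NULL);

    __m128 v;
    if (is_aligned_16(p))
        v = _mm_load_ps(p);   /* aligned load: may fault if p is not 16-byte aligned */
    else
        v = _mm_loadu_ps(p);  /* unaligned load: accepts any address */

    float out[4];
    _mm_storeu_ps(out, v);    /* out has no guaranteed 16-byte alignment */
    return out[0] + out[1] + out[2] + out[3];
}
```

Calling `sum4(buf)` on a 16-byte-aligned `float buf[8]` takes the aligned path, while `sum4(buf + 1)` is 4 bytes off the boundary and takes the unaligned path. In real code you would normally guarantee alignment up front rather than branch on every load.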
You might wonder: why not just use the misaligned versions of these SSE load/store instructions regardless of alignment? The answer is that these instructions are typically slower than their aligned counterparts, because they generally behave as per (b) above. That typically makes them 2x or more slower, except on recent Intel CPUs such as the Core i7, where the penalty is much smaller but still not negligible.
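In practice, the usual way to avoid the misaligned instructions is to make sure the data is aligned in the first place. The following is only a sketch of that idea, assuming a C11 library that provides `aligned_alloc` and an x86 target with SSE; `alloc_floats_16` and `scale_aligned` are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>     /* aligned_alloc, free */
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: allocate room for n floats on a 16-byte boundary.
   The byte count is rounded up to a multiple of 16, since aligned_alloc
   expects the size to be a multiple of the alignment. Release with free(). */
static float *alloc_floats_16(size_t n)
{
    size_t bytes = (n * sizeof(float) + 15u) & ~(size_t)15u;
    return aligned_alloc(16, bytes);
}

/* Hypothetical helper: multiply n floats in place by factor.
   Assumes data is 16-byte aligned and n is a multiple of 4. */
static void scale_aligned(float *data, size_t n, float factor)
{
    assert(((uintptr_t)data & 15u) == 0);
    assert(n % 4 == 0);

    const __m128 f = _mm_set1_ps(factor);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(data + i);          /* aligned load */
        _mm_store_ps(data + i, _mm_mul_ps(v, f));  /* aligned store */
    }
}
```

Because every access in the loop starts at a multiple of 16 bytes from an aligned base, the aligned load and store can be used throughout and the misaligned variants are never needed.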