In a piece of C++ code that does something similar to (but not exactly) matrix multiplication, I load 4 contiguous doubles into 4 YMM registers like this:
// b is a 64-byte aligned array of double
__m256d b0 = _mm256_broadcast_sd(&b[4*k+0]);
__m256d b1 = _mm256_broadcast_sd(&b[4*k+1]);
__m256d b2 = _mm256_broadcast_sd(&b[4*k+2]);
__m256d b3 = _mm256_broadcast_sd(&b[4*k+3]);
I compiled the code with gcc-4.8.2 on a Sandy Bridge machine. Hardware event counters (Intel PMU) suggest that the CPU actually issues 4 separate loads from the L1 cache. Although at this point I'm not limited by L1 latency or bandwidth, I'm very interested to know whether there is a way to load the 4 doubles with one 256-bit load (or two 128-bit loads) and shuffle them into 4 YMM registers. I looked through the Intel Intrinsics Guide but couldn't find a way to accomplish the required shuffling. Is that possible?
(If the premise that the CPU doesn't combine the 4 consecutive loads is actually wrong, please let me know.)
In my matrix multiplication code I only have to use the broadcast once per kernel, but if you really want to load four doubles in one instruction and then broadcast them to four registers, you can do it like this:
#include <stdio.h>
#include <immintrin.h>

int main() {
    double in[] = {1,2,3,4};
    double out[4];
    __m256d x4 = _mm256_loadu_pd(in);                   // {1,2,3,4}
    __m256d t1 = _mm256_permute2f128_pd(x4, x4, 0x0);   // {1,2,1,2}
    __m256d t2 = _mm256_permute2f128_pd(x4, x4, 0x11);  // {3,4,3,4}
    __m256d broad1 = _mm256_permute_pd(t1, 0);          // {1,1,1,1}
    __m256d broad2 = _mm256_permute_pd(t1, 0xf);        // {2,2,2,2}
    __m256d broad3 = _mm256_permute_pd(t2, 0);          // {3,3,3,3}
    __m256d broad4 = _mm256_permute_pd(t2, 0xf);        // {4,4,4,4}
    _mm256_storeu_pd(out, broad1);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    _mm256_storeu_pd(out, broad2);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    _mm256_storeu_pd(out, broad3);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    _mm256_storeu_pd(out, broad4);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
}
Edit: Here is another solution based on Paul R's suggestion.
__m256d t1 = _mm256_broadcast_pd((__m128d*)&b[4*k+0]);
__m256d t2 = _mm256_broadcast_pd((__m128d*)&b[4*k+2]);
__m256d broad1 = _mm256_permute_pd(t1,0);
__m256d broad2 = _mm256_permute_pd(t1,0xf);
__m256d broad3 = _mm256_permute_pd(t2,0);
__m256d broad4 = _mm256_permute_pd(t2,0xf);
TL;DR: It's almost always best to just do four broadcast-loads using _mm256_set1_pd(). This is very good on Haswell and later, where vbroadcastsd ymm, [mem] doesn't require an ALU shuffle operation, and it's usually also the best option for Sandybridge/Ivybridge (where it's a 2-uop load + shuffle instruction).
It also means you don't need to care about alignment at all, beyond natural alignment for a double.
The first vector is ready sooner than if you did a two-step load + shuffle, so out-of-order execution can potentially get started on the code using these vectors while the first one is still loading. AVX512 can even fold broadcast-loads into memory operands for ALU instructions, so doing it this way will allow a recompile to take slight advantage of AVX512 with 256b vectors.
(It's usually best to use set1(x), not _mm256_broadcast_sd(&x); if the AVX2-only register-source form of vbroadcastsd isn't available, the compiler can choose to store -> broadcast-load or to do two shuffles. You never know when inlining will mean your code will run on inputs that are already in registers.)
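For example, here is a minimal sketch of the question's inner-kernel loads written this way (the helper name broadcast4 and its output-pointer interface are purely illustrative; b and k are from the question):

#include <immintrin.h>

// Sketch only: four broadcast-loads, as the TL;DR recommends.
// On Haswell and later each _mm256_set1_pd from memory should compile to a
// single vbroadcastsd ymm, [mem] load uop; on SnB/IvB it's load + p5 shuffle.
static inline void broadcast4(const double *b, long k,
                              __m256d *b0, __m256d *b1,
                              __m256d *b2, __m256d *b3)
{
    *b0 = _mm256_set1_pd(b[4*k+0]);
    *b1 = _mm256_set1_pd(b[4*k+1]);
    *b2 = _mm256_set1_pd(b[4*k+2]);
    *b3 = _mm256_set1_pd(b[4*k+3]);
}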
If you're really bottlenecked on load-port resource-conflicts or throughput, not total uops or ALU / shuffle resources, it might help to replace a pair of 64->256b broadcasts with a 16B->32B broadcast-load (vbroadcastf128 / _mm256_broadcast_pd) and two in-lane shuffles (vpermilpd or vunpckl/hpd (_mm256_shuffle_pd)).
Or with AVX2: load 32B and use 4 _mm256_permute4x64_pd shuffles to broadcast each element into a separate vector (see the sketch below).
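A minimal intrinsics sketch of that AVX2 variant (again, the helper name and interface are illustrative):

#include <immintrin.h>

// Sketch only (requires AVX2): one 32B load, then four lane-crossing shuffles.
// Compilers should turn the imm=0 case into vbroadcastsd ymm, xmm; each shuffle
// costs a port-5 uop on Haswell, so this trades load uops for shuffle uops.
static inline void broadcast4_avx2(const double *b, long k,
                                   __m256d *b0, __m256d *b1,
                                   __m256d *b2, __m256d *b3)
{
    __m256d v = _mm256_loadu_pd(&b[4*k]);     // {e0, e1, e2, e3}
    *b0 = _mm256_permute4x64_pd(v, 0x00);     // {e0, e0, e0, e0}
    *b1 = _mm256_permute4x64_pd(v, 0x55);     // {e1, e1, e1, e1}
    *b2 = _mm256_permute4x64_pd(v, 0xAA);     // {e2, e2, e2, e2}
    *b3 = _mm256_permute4x64_pd(v, 0xFF);     // {e3, e3, e3, e3}
}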
Source: Agner Fog's insn tables (and microarch pdf):
Intel Haswell and later:
vbroadcastsd ymm, [mem] and other broadcast-load insns are 1-uop instructions that are handled entirely by a load port (the broadcast happens "for free").
The total cost of doing four broadcast-loads this way is 4 instructions. Fused-domain: 4 uops. Unfused-domain: 4 uops for p2/p3. Throughput: two vectors per cycle.
Haswell only has one shuffle unit, on port5. Doing all your broadcast-loads with load+shuffle will bottleneck on p5.
Maximum broadcast throughput is probably with a mix of vbroadcastsd ymm, m64 and shuffles:
## Haswell maximum broadcast throughput with AVX1
vbroadcastsd ymm0, [rsi]
vbroadcastsd ymm1, [rsi+8]
vbroadcastf128 ymm2, [rsi+16] # p23 only on Haswell, also p5 on SnB/IvB
vunpckhpd ymm3, ymm2,ymm2
vunpcklpd ymm2, ymm2,ymm2
vbroadcastsd ymm4, [rsi+32] # or vaddpd ymm0, [rdx+something]
#add rsi, 40
Any of these addressing modes can be two-register indexed addressing modes, because they don't need to micro-fuse to be a single uop.
AVX1: 5 vectors per 2 cycles, saturating p2/p3 and p5. (Ignoring cache-line splits on the 16B load). 6 fused-domain uops, leaving only 2 uops per 2 cycles to use the 5 vectors... Real code would probably use some of the load throughput to load something else (e.g. a non-broadcast 32B load from another array, maybe as a memory operand to an ALU instruction), or to leave room for stores to steal p23 instead of using p7.
## Haswell maximum broadcast throughput with AVX2
vmovups ymm3, [rsi]
vbroadcastsd ymm0, xmm3 # special-case for the low element; compilers should generate this from _mm256_permute4x64_pd(v, 0)
vpermpd ymm1, ymm3, 0b01_01_01_01 # NASM binary syntax for 0x55
vpermpd ymm2, ymm3, 0b10_10_10_10
vpermpd ymm3, ymm3, 0b11_11_11_11
vbroadcastsd ymm4, [rsi+32]
vbroadcastsd ymm5, [rsi+40]
vbroadcastsd ymm6, [rsi+48]
vbroadcastsd ymm7, [rsi+56]
vbroadcastsd ymm8, [rsi+64]
vbroadcastsd ymm9, [rsi+72]
vbroadcastsd ymm10,[rsi+80] # or vaddpd ymm0, [rdx + whatever]
#add rsi, 88
AVX2: 11 vectors per 4 cycles, saturating p23 and p5. (Ignoring cache-line splits for the 32B load...). Fused-domain: 12 uops, leaving 2 uops per 4 cycles beyond this.
I think 32B unaligned loads are a bit more fragile in terms of performance than unaligned 16B loads like vbroadcastf128.
Intel SnB/IvB:
vbroadcastsd ymm, m64 is 2 fused-domain uops: p5 (shuffle) and p23 (load).
vbroadcastss xmm, m32 and movddup xmm, m64 are single-uop, load-port-only. Interestingly, vmovddup ymm, m256 is also a single-uop, load-port-only instruction, but like all 256b loads, it occupies a load port for 2 cycles. It can still generate a store-address in the 2nd cycle. This uarch doesn't deal well with cache-line splits on unaligned 32B loads, though. gcc defaults to using movups / vinsertf128 for unaligned 32B loads with -mtune=sandybridge / -mtune=ivybridge.
4x broadcast-load: 8 fused-domain uops: 4 p5 and 4 p23. Throughput: 4 vectors per 4 cycles, bottlenecking on port 5. Multiple loads from the same cache line in the same cycle don't cause a cache-bank conflict, so this is nowhere near saturating the load ports (which are also needed for store-address generation). Bank conflicts only happen between accesses to the same bank of two different cache lines in the same cycle.
Multiple 2-uop instructions with no other instructions between is the worst case for the decoders if the uop-cache is cold, but a good compiler would mix in single-uop instructions between them.
SnB has 2 shuffle units, but only the one on p5 can handle shuffles that have a 256b version in AVX. Using a p1 integer-shuffle uop to broadcast a double to both elements of an xmm register doesn't get us anywhere, since vinsertf128 ymm, ymm, xmm, i takes a p5 shuffle uop.
## Sandybridge maximum broadcast throughput: AVX1
vbroadcastsd ymm0, [rsi]
add rsi, 8
One vector per clock, saturating p5 but only using half the capacity of p23.
We can save one load uop at the cost of 2 more shuffle uops, throughput = two results per 3 clocks:
vbroadcastf128 ymm2, [rsi+16] # 2 uops: p23 + p5 on SnB/IvB
vunpckhpd ymm3, ymm2,ymm2 # 1 uop: p5
vunpcklpd ymm2, ymm2,ymm2 # 1 uop: p5
Doing a 32B load and unpacking it with 2x vperm2f128 -> 4x vunpckh/lpd might help if stores are part of what's competing for p23 (see the sketch below).
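In intrinsics, that shape could look roughly like this (vunpcklpd/vunpckhpd are _mm256_unpacklo_pd/_mm256_unpackhi_pd; the helper name and interface are illustrative):

#include <immintrin.h>

// Sketch only (AVX1): one 32B load, two vperm2f128 lane shuffles, four in-lane
// unpacks. Saves load uops at the cost of extra p5 shuffle uops.
static inline void broadcast4_one_load(const double *b, long k,
                                       __m256d *b0, __m256d *b1,
                                       __m256d *b2, __m256d *b3)
{
    __m256d v  = _mm256_loadu_pd(&b[4*k]);            // {e0, e1, e2, e3}
    __m256d lo = _mm256_permute2f128_pd(v, v, 0x00);  // {e0, e1, e0, e1}
    __m256d hi = _mm256_permute2f128_pd(v, v, 0x11);  // {e2, e3, e2, e3}
    *b0 = _mm256_unpacklo_pd(lo, lo);                 // {e0, e0, e0, e0}
    *b1 = _mm256_unpackhi_pd(lo, lo);                 // {e1, e1, e1, e1}
    *b2 = _mm256_unpacklo_pd(hi, hi);                 // {e2, e2, e2, e2}
    *b3 = _mm256_unpackhi_pd(hi, hi);                 // {e3, e3, e3, e3}
}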
Here is a variant built upon Z Boson's original answer (before edit), using two 128-bit loads instead of one 256-bit load.
__m256d b01 = _mm256_castpd128_pd256(_mm_load_pd(&b[4*k+0]));
__m256d b23 = _mm256_castpd128_pd256(_mm_load_pd(&b[4*k+2]));
__m256d b0101 = _mm256_permute2f128_pd(b01, b01, 0);
__m256d b2323 = _mm256_permute2f128_pd(b23, b23, 0);
__m256d b0000 = _mm256_permute_pd(b0101, 0);
__m256d b1111 = _mm256_permute_pd(b0101, 0xf);
__m256d b2222 = _mm256_permute_pd(b2323, 0);
__m256d b3333 = _mm256_permute_pd(b2323, 0xf);
In my case this is slightly faster than using one 256-bit load, possibly because the first permute can start before the second 128-bit load completes.
Edit: gcc compiles the two loads and the first 2 permutes into
vmovapd (%rdi),%xmm8
vmovapd 0x10(%rdi),%xmm4
vperm2f128 $0x0,%ymm8,%ymm8,%ymm1
vperm2f128 $0x0,%ymm4,%ymm4,%ymm2
Paul R's suggestion of using _mm256_broadcast_pd() can be written as:
__m256d b0101 = _mm256_broadcast_pd((__m128d*)&b[4*k+0]);
__m256d b2323 = _mm256_broadcast_pd((__m128d*)&b[4*k+2]);
which compiles into
vbroadcastf128 (%rdi),%ymm6
vbroadcastf128 0x10(%rdi),%ymm11
and is faster than doing two vmovapd+vperm2f128 (tested).
In my code, which is bound by vector execution ports rather than by L1 cache accesses, this is still slightly slower than using 4x _mm256_broadcast_sd(), but I imagine code that is constrained by L1 bandwidth could benefit greatly from this.