How are the gather instructions in AVX2 implemented?

Posted 2019-01-17 10:48

Question:

Suppose I'm using AVX2's VGATHERDPS - this should load 8 single-precision floats using 8 DWORD indices.
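For concreteness, here is a minimal sketch of the operation being asked about. The helper name `gather8_ps` is hypothetical and not part of any library. When the compiler has AVX2 enabled it uses the `_mm256_i32gather_ps` intrinsic, which maps to VGATHERDPS. Otherwise it falls back to a scalar loop that produces the same result. The sketch shows only what the instruction computes, not how the hardware executes it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Gather 8 floats: out[i] = base[idx[i]] for i = 0..7.
   gather8_ps is a hypothetical helper name used for illustration only. */
static void gather8_ps(const float *base, const int32_t idx[8], float out[8])
{
    assert(base != NULL && idx != NULL && out != NULL);
#if defined(__AVX2__)
    /* One VGATHERDPS; scale = 4 because each DWORD index counts 4-byte floats. */
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx);
    __m256 v = _mm256_i32gather_ps(base, vindex, 4);
    _mm256_storeu_ps(out, v);
#else
    /* Scalar model of the same result, used when AVX2 is not enabled. */
    for (int i = 0; i < 8; ++i)
        out[i] = base[idx[i]];
#endif
}
```

Both paths return identical values, so timing one against the other is a simple way to compare the gather with plain scalar loads.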

What happens when the data to be loaded lies in different cache lines? Is the instruction implemented as a hardware loop that fetches cache lines one by one, or can it issue loads to multiple cache lines at once?
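To make "different cache lines" concrete, the following sketch counts how many distinct lines a set of 8 indices touches. It assumes 64-byte cache lines, which is typical on x86 but still an assumption here. `count_cache_lines` and `CACHE_LINE_BYTES` are hypothetical names introduced for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u /* assumption: 64-byte cache lines */

/* Count the distinct cache lines touched by the 8 loads base[idx[i]].
   Assumes base is 4-byte aligned, so no float straddles two lines. */
static int count_cache_lines(const float *base, const int32_t idx[8])
{
    uintptr_t lines[8];
    int n = 0;
    for (int i = 0; i < 8; ++i) {
        /* Line number = byte address divided by the line size. */
        uintptr_t line = (uintptr_t)(base + idx[i]) / CACHE_LINE_BYTES;
        int seen = 0;
        for (int j = 0; j < n; ++j) {
            if (lines[j] == line) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            lines[n++] = line;
    }
    return n;
}
```

Eight contiguous floats span 32 bytes, so they touch one or two lines depending on alignment. Indices spaced 16 floats (64 bytes) apart touch eight separate lines, which is the worst case the question describes.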

I read a couple of papers which state the former (and that's the one which makes more sense to me), but I would like to know a bit more about this.

Link to one paper: http://arxiv.org/pdf/1401.7494.pdf

Answer 1:

I benchmarked the AVX2 gather instructions, and the implementation appears to be fairly simple brute force: even when the elements to be loaded are contiguous, there still seems to be one read cycle per element, so performance is really no better than doing scalar loads.