Suppose I'm using AVX2's VGATHERDPS, which in its 256-bit form loads 8 single-precision floats using 8 DWORD indices.
What happens when the data to be loaded resides in different cache lines? Is the instruction implemented as a hardware loop that fetches the cache lines one by one? Or can it issue loads to multiple cache lines at once?
I have read a couple of papers that state the former (which is also the option that makes more sense to me), but I would like to know a bit more about how this actually works.
Link to one paper: http://arxiv.org/pdf/1401.7494.pdf
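
To make the scenario concrete, here is a minimal scalar sketch in C of what I understand the instruction to compute, plus a helper that counts how many distinct cache lines the 8 element addresses fall into. It models only the architectural result, not how the hardware actually issues the loads, which is exactly what I'm asking about. The function names are my own (hypothetical), and the 64-byte cache-line size, the scale of 4, and the all-ones mask are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. */
#define CACHE_LINE_BYTES 64u
#define GATHER_LANES 8

/* Scalar model of the 256-bit VGATHERDPS result, assuming scale = 4 and
   every mask bit set: out[i] = base[idx[i]] for each of the 8 lanes. */
static void gather8_ps_model(float out[GATHER_LANES], const float *base,
                             const int32_t idx[GATHER_LANES])
{
    for (int i = 0; i < GATHER_LANES; ++i)
        out[i] = base[idx[i]];
}

/* Count the distinct cache lines that the 8 element addresses fall into.
   Assumes base is 4-byte aligned, so no element straddles two lines. */
static int gather8_lines_touched(const float *base,
                                 const int32_t idx[GATHER_LANES])
{
    uintptr_t lines[GATHER_LANES];
    int count = 0;

    for (int i = 0; i < GATHER_LANES; ++i) {
        uintptr_t line = (uintptr_t)(base + idx[i]) / CACHE_LINE_BYTES;
        int seen = 0;

        for (int j = 0; j < count; ++j)
            if (lines[j] == line)
                seen = 1;
        if (!seen)
            lines[count++] = line;
    }

    assert(count >= 1 && count <= GATHER_LANES);
    return count;
}
```

With a 64-byte-aligned base, indices 0 through 7 all land in a single line, while indices 0, 16, 32, ..., 112 (a stride of 64 bytes) land in 8 different lines. The second case is the one my question is about.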