Is it possible to vectorize this nested for with S

I've never written assembly code for SSE optimization, so sorry if this is a noob question. This article explains how to vectorize a for loop with a conditional statement. However, my code (taken from here) is of the form:

   for (int j=-halfHeight; j<=halfHeight; ++j)
   {
      for(int i=-halfWidth; i<=halfWidth; ++i)
      {
         const float rx = ofsx + j * a12;
         const float ry = ofsy + j * a22;
         float wx = rx + i * a11;
         float wy = ry + i * a21;
         const int x = (int) floor(wx);
         const int y = (int) floor(wy);
         if (x >= 0 && y >= 0 && x < width && y < height)
         {
            // compute weights
            wx -= x; wy -= y;
            // bilinear interpolation
            *out++ =
               (1.0f - wy) * ((1.0f - wx) * im.at<float>(y,x)   + wx * im.at<float>(y,x+1)) +
               (       wy) * ((1.0f - wx) * im.at<float>(y+1,x) + wx * im.at<float>(y+1,x+1));
         } else {
            *out++ = 0;
         }
      }
   }

So, from my understanding, there are several differences from the linked article:

Here we have a nested for loop: I've only ever seen single-level loops vectorized, never a nested one.
The if condition is based on scalar values (x and y) and not on the array: how can I adapt the linked example to this?
The out index isn't based on i or j (so it's not out[i] or out[j]): how can I fill out in this way?

In particular, I'm confused because in the examples I've seen the loop indexes are always used as array indexes, while here they are used to compute variables, and the output pointer is simply incremented on each iteration.

I'm using icpc with -O3 -xCORE-AVX2 -qopt-report=5 and a bunch of other optimization flags. According to Intel Advisor, this loop is not vectorized, and using #pragma omp simd generates warning #15552: loop was not vectorized with "simd".

Bilinear interpolation is a rather tricky operation to vectorize, and I wouldn't try it for your first SSE trick. The problem is that the values you need to fetch are not nicely ordered: they're sometimes repeated, sometimes skipped. The good news is that interpolating images is a common operation, and you can likely find a pre-written library to do it, like OpenCV.
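To make the "repeated or skipped" access pattern concrete, here is a minimal sketch that lists the source column read by each consecutive output pixel along one row. The helper name sourceColumns is hypothetical; ofsx and a11 play the same role as in the question's loop.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper (not from the question): returns the integer source
// column floor(ofsx + i * a11) read by output pixels i = 0 .. count-1.
std::vector<int> sourceColumns(float ofsx, float a11, int count)
{
    std::vector<int> cols;
    for (int i = 0; i < count; ++i) {
        // Same computation as "x = (int) floor(wx)" in the original loop.
        cols.push_back(static_cast<int>(std::floor(ofsx + i * a11)));
    }
    return cols;
}
```

With a step of 0.5, sourceColumns(0.0f, 0.5f, 6) yields 0, 0, 1, 1, 2, 2 (each column is read twice); with a step of 2.0, sourceColumns(0.0f, 2.0f, 4) yields 0, 2, 4, 6 (every other column is skipped). In both cases a plain contiguous vector load does not deliver the needed values in order.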

remap() is always a good choice. Just build two arrays, one of wx values and one of wy values, which hold the (generally non-integer) source location of each output pixel, and let remap() do the interpolation.
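As a rough sketch of that first step, the two coordinate arrays can be filled with the same arithmetic as the question's loop, just without the interpolation. The SourceMaps struct and buildSourceMaps function are hypothetical names introduced for illustration, and non-negative halfWidth and halfHeight are assumed.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container: source coordinates for every output pixel,
// stored row-major with dimensions rows x cols.
struct SourceMaps {
    int cols = 0;            // output width  = 2 * halfWidth  + 1
    int rows = 0;            // output height = 2 * halfHeight + 1
    std::vector<float> x;    // wx for each output pixel
    std::vector<float> y;    // wy for each output pixel
};

SourceMaps buildSourceMaps(int halfWidth, int halfHeight,
                           float ofsx, float ofsy,
                           float a11, float a12, float a21, float a22)
{
    SourceMaps m;
    m.cols = 2 * halfWidth + 1;
    m.rows = 2 * halfHeight + 1;
    if (m.cols > 0 && m.rows > 0) {
        const std::size_t n = static_cast<std::size_t>(m.cols) * m.rows;
        m.x.reserve(n);
        m.y.reserve(n);
    }
    for (int j = -halfHeight; j <= halfHeight; ++j) {
        // Row-dependent part, hoisted out of the inner loop.
        const float rx = ofsx + j * a12;
        const float ry = ofsy + j * a22;
        for (int i = -halfWidth; i <= halfWidth; ++i) {
            m.x.push_back(rx + i * a11);
            m.y.push_back(ry + i * a21);
        }
    }
    return m;
}
```

With OpenCV, the two vectors could then be wrapped in CV_32FC1 cv::Mat headers of size rows x cols and passed as map1 and map2 to cv::remap with cv::INTER_LINEAR and cv::BORDER_CONSTANT. Pixels on the last row or column of the source may come out slightly different from the original loop, which reads x+1 and y+1 without a bounds check.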

However, in this case, it looks like an affine transform. That is, the fractional source location is related to the output pixel coordinates by a 2x3 matrix multiplication. The matrix entries come from the offsets and the a11/a12/a21/a22 variables. OpenCV has such a transform. Read about it here: http://docs.opencv.org/3.1.0/d4/d61/tutorial_warp_affine.html

All you'll have to do is map your input variables into matrix form and call the affine transform.
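One possible way to do that mapping is sketched below. It assumes output pixel (col, row) corresponds to i = col - halfWidth and j = row - halfHeight, as implied by the order in which the question's loop writes to out. The function name toAffineMatrix is hypothetical.

```cpp
#include <array>

// Hypothetical helper: returns the row-major 2x3 matrix
//   [ m0 m1 m2 ]
//   [ m3 m4 m5 ]
// such that, for output pixel (col, row),
//   wx = m0 * col + m1 * row + m2
//   wy = m3 * col + m4 * row + m5
// matches the question's wx and wy with i = col - halfWidth and
// j = row - halfHeight.
std::array<double, 6> toAffineMatrix(int halfWidth, int halfHeight,
                                     double ofsx, double ofsy,
                                     double a11, double a12,
                                     double a21, double a22)
{
    // Substituting i and j moves the centering into the translation column.
    const double tx = ofsx - a11 * halfWidth - a12 * halfHeight;
    const double ty = ofsy - a21 * halfWidth - a22 * halfHeight;
    return {{ a11, a12, tx,
              a21, a22, ty }};
}
```

With OpenCV, the six values could be copied into a 2x3 CV_64F cv::Mat and passed to cv::warpAffine with an output size of (2*halfWidth+1) x (2*halfHeight+1). Because this matrix maps output pixels to source locations, rather than source to output, the cv::WARP_INVERSE_MAP flag would be needed alongside cv::INTER_LINEAR.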