Explaining ARM Neon Image Sampling

I'm trying to write a better version of OpenCV's cv::resize(), and I came across this code: https://github.com/rmaz/NEON-Image-Downscaling/blob/master/ImageResize/BDPViewController.m The code downsamples an image by 2, but I can't follow the algorithm. I would like to first convert it to C and then modify it for learning purposes. Would it also be easy to adapt it to downsample by an arbitrary factor?

The function is:

static void inline resizeRow(uint32_t *dst, uint32_t *src, uint32_t pixelsPerRow)
{
    const uint32_t * rowB = src + pixelsPerRow;

    // force the number of pixels per row to a multiple of 8
    pixelsPerRow = 8 * (pixelsPerRow / 8);

    __asm__ volatile("Lresizeloop: \n" // start loop
                     "vld1.32 {d0-d3}, [%1]! \n" // load 8 pixels from the top row
                     "vld1.32 {d4-d7}, [%2]! \n" // load 8 pixels from the bottom row
                     "vhadd.u8 q0, q0, q2 \n" // average the pixels vertically
                     "vhadd.u8 q1, q1, q3 \n"
                     "vtrn.32 q0, q2 \n" // transpose to put the horizontally adjacent pixels in different registers
                     "vtrn.32 q1, q3 \n"
                     "vhadd.u8 q0, q0, q2 \n" // average the pixels horizontally
                     "vhadd.u8 q1, q1, q3 \n"
                     "vtrn.32 d0, d1 \n" // fill the registers with pixels
                     "vtrn.32 d2, d3 \n"
                     "vswp d1, d2 \n"
                     "vst1.64 {d0-d1}, [%0]! \n" // store the result
                     "subs %3, %3, #8 \n" // subtract 8 from the pixel count
                     "bne Lresizeloop \n" // repeat until the row is complete
: "=r"(dst), "=r"(src), "=r"(rowB), "=r"(pixelsPerRow)
: "0"(dst), "1"(src), "2"(rowB), "3"(pixelsPerRow)
: "q0", "q1", "q2", "q3", "cc"
);
}

To call it:

    // downscale the image in place
    for (size_t rowIndex = 0; rowIndex < height; rowIndex+=2)
    {
        void *sourceRow = (uint8_t *)buffer + rowIndex * bytesPerRow;
        void *destRow = (uint8_t *)buffer + (rowIndex / 2) * bytesPerRow;
        resizeRow(destRow, sourceRow, width);
    }

The algorithm is pretty straightforward. It reads 8 pixels from the current row and 8 from the row below, then uses the vhadd (halving add) instruction to average them vertically, byte by byte. Next it transposes the 32-bit lanes so that horizontally adjacent pixels end up in the same lane of two different registers, and does another set of halving adds to average each pair. That leaves the 4 result pixels in the even lanes of q0 and q1; the final vtrn/vswp sequence packs them into d0-d1, which are written to the destination. This algorithm could be rewritten to handle other integer scale factors, but as written it can only do a 2x2-to-1 reduction with averaging. Here's roughly equivalent C code. Note that it rounds the 4-pixel average, whereas vhadd truncates at each step, so the NEON result can be 1 lower per channel:

static void inline resizeRow(uint32_t *dst, uint32_t *src, uint32_t pixelsPerRow)
{
    uint8_t * pSrc8 = (uint8_t *)src;
    uint8_t * pDest8 = (uint8_t *)dst;
    int stride = pixelsPerRow * sizeof(uint32_t);
    int x;
    int r, g, b, a;

    for (x = 0; x + 1 < (int)pixelsPerRow; x += 2) // each iteration consumes 2 source pixels
    {
       r = pSrc8[0] + pSrc8[4] + pSrc8[stride+0] + pSrc8[stride+4];
       g = pSrc8[1] + pSrc8[5] + pSrc8[stride+1] + pSrc8[stride+5];
       b = pSrc8[2] + pSrc8[6] + pSrc8[stride+2] + pSrc8[stride+6];
       a = pSrc8[3] + pSrc8[7] + pSrc8[stride+3] + pSrc8[stride+7];
       pDest8[0] = (uint8_t)((r + 2)/4); // average with rounding
       pDest8[1] = (uint8_t)((g + 2)/4);
       pDest8[2] = (uint8_t)((b + 2)/4);
       pDest8[3] = (uint8_t)((a + 2)/4);
       pSrc8 += 8; // skip forward 2 source pixels
       pDest8 += 4; // skip forward 1 destination pixel
    }
}
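
As a rough sketch of the two points above, here is a scalar version that truncates the same way the two vhadd steps do, followed by a box filter for an arbitrary integer factor. The names hadd_u8, resizeRowHadd and boxDownscale are made up for illustration and are not part of the linked project. The sketch assumes 4 bytes per pixel and tightly packed rows, and boxDownscale writes a tightly packed output of width/n by height/n pixels rather than working in place like the caller in the question.

```c
#include <stdint.h>
#include <stddef.h>

/* Truncating halving add: the scalar counterpart of NEON vhadd.u8. */
static inline uint8_t hadd_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b) >> 1);
}

/* 2x2 reduction of one row pair using the same order of operations as the
   NEON loop: average vertically, then horizontally, truncating both times.
   Assumes the second row starts pixelsPerRow pixels after src. */
static void resizeRowHadd(uint32_t *dst, const uint32_t *src, uint32_t pixelsPerRow)
{
    const uint8_t *top = (const uint8_t *)src;
    const uint8_t *bot = (const uint8_t *)(src + pixelsPerRow);
    uint8_t *out = (uint8_t *)dst;

    for (uint32_t x = 0; x + 1 < pixelsPerRow; x += 2) {
        for (int c = 0; c < 4; c++) {
            uint8_t left  = hadd_u8(top[c],     bot[c]);      /* vertical average, left pixel  */
            uint8_t right = hadd_u8(top[4 + c], bot[4 + c]);  /* vertical average, right pixel */
            out[c] = hadd_u8(left, right);                    /* horizontal average            */
        }
        top += 8;  /* 2 source pixels     */
        bot += 8;
        out += 4;  /* 1 destination pixel */
    }
}

/* n x n box filter for any integer factor n >= 1, with rounding.
   src is width x height pixels, dst is (width / n) x (height / n) pixels;
   leftover rows and columns that do not fill a whole block are dropped. */
static void boxDownscale(uint8_t *dst, const uint8_t *src,
                         uint32_t width, uint32_t height, uint32_t n)
{
    uint32_t outW = width / n;
    uint32_t outH = height / n;
    uint32_t area = n * n;

    for (uint32_t oy = 0; oy < outH; oy++) {
        for (uint32_t ox = 0; ox < outW; ox++) {
            uint32_t sum[4] = {0, 0, 0, 0};
            for (uint32_t dy = 0; dy < n; dy++) {
                /* first pixel of this block on source row oy * n + dy */
                const uint8_t *p = src + (((size_t)oy * n + dy) * width + (size_t)ox * n) * 4;
                for (uint32_t dx = 0; dx < n; dx++, p += 4)
                    for (int c = 0; c < 4; c++)
                        sum[c] += p[c];
            }
            uint8_t *q = dst + ((size_t)oy * outW + ox) * 4;
            for (int c = 0; c < 4; c++)
                q[c] = (uint8_t)((sum[c] + area / 2) / area);
        }
    }
}
```

For a 2x2 block whose top row is 1 and bottom row is 0 in every channel, resizeRowHadd produces 0 while the rounding versions produce 1, which is the off-by-one difference between truncating and rounding. With n = 2, boxDownscale computes the same rounded average as the C function above; larger n just widens the block being summed.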