I need to improve a loop, because is called by my application thousands of times. I suppose I need to do it with Neon, but I don´t know where to begin.
Assumptions / pre-conditions:
w
is always 320 (multiple of 16/32).pa
andpb
are 16-byte alignedma
andmb
are positive.
int whileInstruction (const unsigned char *pa,const unsigned char *pb,int ma,int mb,int w)
{
int sum=0;
do {
sum += ((*pa++)-ma)*((*pb++)-mb);
} while(--w);
return sum;
}
This attempt at vectorizing it is not working well, and isn't safe (missing clobbers), but demonstrates what I'm trying to do:
int whileInstruction (const unsigned char *pa,const unsigned char *pb,int ma,int mb,int w)
{
asm volatile("lsr %2, %2, #3 \n"
".loop: \n"
"# load 8 elements: \n"
"vld4.8 {d0-d3}, [%1]! \n"
"vld4.8 {d4-d7}, [%2]! \n"
"# do the operation: \n"
"vaddl.u8 q7, d0, r7 \n"
"vaddl.u8 q8, d1, d8 \n"
"vmlal.u8 q7, q7, q8 \n"
"# Sum the vector a save in sum (this is wrong):\n"
"vaddl.u8 q7, d0, r7 \n"
"subs %2, %2, #1 \n" // Decrement iteration count
"bne .loop \n" // Repeat unil iteration count is not zero
:
: "r"(pa), "r"(pb), "r"(w),"r"(ma),"r"(mb),"r"(sum)
: "r4", "r5", "r6","r7","r8","r9"
);
return sum;
}
Here is a simple NEON implementation. I have tested this against the scalar code to make sure that it works. Note that for best performance both
pa
andpb
should be 16 byte aligned.Well another solution for my problem taken the perfect solution by Paul R, in the case the w is equal to 8, what happens usually it is possible to use this function:
Maybe it is possible to improve it.