My image processing project works with grayscale images. My platform is an ARM Cortex-A8 processor, and I want to make use of NEON.
I have a grayscale image (consider the example below), and in my algorithm I only have to add up the columns.
How can I load four 8-bit pixel values (uint8_t) in parallel as four uint32_t into one of the 128-bit NEON registers? Which intrinsic do I have to use to do this?
I mean:
I must load them as 32 bits because, if you look carefully, the moment I do 255 + 255 I get 510, which can't be held in an 8-bit register.
e.g.
255 255 255 255 ......... (640 pixels)
255 255 255 255
255 255 255 255
255 255 255 255
.
.
.
.
.
(480 pixels)
I recommend that you spend a bit of time understanding how SIMD works on ARM. Take a look at:
- http://blogs.arm.com/software-enablement/161-coding-for-neon-part-1-load-and-stores/
- http://blogs.arm.com/software-enablement/196-coding-for-neon-part-2-dealing-with-leftovers/
- http://blogs.arm.com/software-enablement/241-coding-for-neon-part-3-matrix-multiplication/
- http://blogs.arm.com/software-enablement/277-coding-for-neon-part-4-shifting-left-and-right/
to get you started. You can then implement your SIMD code using inline assembler or the corresponding ARM intrinsics, as recommended by domen.
It depends on your compiler and its (possible lack of) intrinsics extensions.
E.g. for GCC, this might be a starting point: http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html
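As a rough, hedged sketch of what using those GCC intrinsics can look like (the compiler flags and function name below are my assumptions, not from the question):

/* Compile with something like: gcc -O2 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp sum.c */
#include <arm_neon.h>

/* Add two rows of 8 pixels each into 16-bit lanes, so 255 + 255 cannot overflow. */
uint16x8_t add_two_rows(const uint8_t *row0, const uint8_t *row1)
{
    uint8x8_t a = vld1_u8(row0);   /* load 8 pixels from the first row      */
    uint8x8_t b = vld1_u8(row1);   /* load 8 pixels from the second row     */
    return vaddl_u8(a, b);         /* widening add: 8-bit + 8-bit -> 16-bit */
}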
If you need to sum up to 480 8-bit values, then you technically need 17 bits of intermediate storage. However, if you perform the additions in two stages, i.e. the top 240 rows and then the bottom 240 rows, each half fits in 16 bits. Then you can add the results from the two halves to get the final answer.
There is actually a NEON instruction that is suitable for your algorithm, called vaddw. It adds a dword vector to a qword vector, with the latter containing elements that are twice as wide as the former. In your case, vaddw.u8 can be used to add 8 pixels to 8 16-bit accumulators. Then vaddw.u16 can be used to add the two sets of 8 16-bit accumulators into one set of 8 32-bit ones; note that you must use the instruction twice to get both halves.
If necessary, you can also convert the values back to 16-bit or 8-bit by using vmovn or vqmovn.
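As a rough sketch with intrinsics of how that could fit together (the 640x480 dimensions and the 240-row split come from the question; the function and variable names are mine, and a real implementation would still need to handle strides and leftovers):

#include <arm_neon.h>
#include <stdint.h>

/* Column sums of a 640x480 grayscale image, 8 columns at a time.
   Each half of 240 rows fits in 16-bit accumulators (240 * 255 = 61200 < 65535);
   vaddw.u16 then folds both halves into 32-bit totals. */
void column_sums(const uint8_t *image, uint32_t *sums /* 640 entries */)
{
    const int width = 640, height = 480, half = height / 2;

    for (int x = 0; x < width; x += 8) {
        uint16x8_t top = vdupq_n_u16(0);
        uint16x8_t bottom = vdupq_n_u16(0);

        /* vaddw.u8: add 8 pixels to 8 16-bit accumulators. */
        for (int y = 0; y < half; ++y)
            top = vaddw_u8(top, vld1_u8(image + y * width + x));
        for (int y = half; y < height; ++y)
            bottom = vaddw_u8(bottom, vld1_u8(image + y * width + x));

        /* vaddw.u16, used twice per q-register, to accumulate into 32-bit sums. */
        uint32x4_t lo = vdupq_n_u32(0), hi = vdupq_n_u32(0);
        lo = vaddw_u16(lo, vget_low_u16(top));
        lo = vaddw_u16(lo, vget_low_u16(bottom));
        hi = vaddw_u16(hi, vget_high_u16(top));
        hi = vaddw_u16(hi, vget_high_u16(bottom));

        vst1q_u32(sums + x, lo);
        vst1q_u32(sums + x + 4, hi);
    }
}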
There is no instruction that can load your four 8-bit values directly into four 32-bit lanes.
You must load them and then widen them twice (for example with vshll or vmovl).
Because NEON can't load bytes straight into 32-bit lanes, you'll end up working on 8 pixels at a time (and not 4).
You can also stick to 16-bit elements; it should be enough...
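A minimal sketch of that load-then-widen-twice idea with intrinsics (the function name is mine; vmovl is used for the widening, and vshll with a shift of #0 would also work):

#include <arm_neon.h>

/* Widen 8 pixels twice: 8-bit -> 16-bit -> 32-bit (two vectors of four). */
void widen_8_pixels(const uint8_t *src, uint32x4_t *lo, uint32x4_t *hi)
{
    uint8x8_t  pixels = vld1_u8(src);        /* load 8 x uint8_t       */
    uint16x8_t wide16 = vmovl_u8(pixels);    /* widen to 8 x uint16_t  */
    *lo = vmovl_u16(vget_low_u16(wide16));   /* first four as uint32_t */
    *hi = vmovl_u16(vget_high_u16(wide16));  /* last four as uint32_t  */
}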
Load the 4 bytes using a single-lane load instruction (vld1 <register>[<lane>], [<address>]) into a d-register (the low half of a q-register), then use two move-long instructions (vmovl) to promote them first to 16 and then to 32 bits. The result should be something like this (in GNU syntax):
vld1.32   {d0[0]}, [<address>]  @ Now d0 = (*<addr>, *<addr+1>, *<addr+2>, *<addr+3>, <junk>, ..., <junk>)
vmovl.u8  q1, d0                @ Now q1 = (d2, d3) = ((uint16_t)*<addr>, ..., (uint16_t)*<addr+3>, <junk>, ..., <junk>)
vmovl.u16 q0, d2                @ Now q0 = (d0, d1) = ((uint32_t)*<addr>, ..., (uint32_t)*<addr+3>)
If you can guarantee that <address> is 4-byte aligned, then write [<address>:32] instead in the load instruction to save a cycle or two. If you do that and the address isn't aligned, however, you'll get a fault.
Um, I just realized you want to use intrinsics, not assembly, so here's the same thing with intrinsics.
uint32x2_t v8 = vdup_n_u32(0);  // Will actually hold 4 uint8_t in its first lane
v8 = vld1_lane_u32((const uint32_t *)ptr, v8, 0);  // cast assumes ptr can be read as one 32-bit word
const uint16x4_t v16 = vget_low_u16(vmovl_u8(vreinterpret_u8_u32(v8)));
const uint32x4_t v32 = vmovl_u16(v16);
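The widened vector can then go straight into a running 32-bit accumulator for the column sums, e.g. (acc is a hypothetical uint32x4_t, initialised once per group of four columns):

uint32x4_t acc = vdupq_n_u32(0);  // running sums for these 4 columns
acc = vaddq_u32(acc, v32);        // add the four widened pixels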