Fast way to swap endianness using OpenCL

Posted 2019-08-26 01:18

I'm reading and writing lots of FITS and DNG images, which may contain data whose endianness differs from that of my platform and/or OpenCL device.

Currently I swap the byte order in host memory when necessary, which is very slow and requires an extra step.

Is there a fast way to pass a buffer of int/float/short values with the wrong endianness to an OpenCL kernel?

Using an extra kernel run just for fixing the endianness would be OK; some overhead-free read/write operation that fixes it automatically would be perfect.

I know about the variable attribute __attribute__((endian(host/device))), but this doesn't help with a big-endian FITS file on a little-endian platform using a little-endian device.

I thought about a solution like this one (neither implemented nor tested, yet):

uchar4 mask = (uchar4)(3, 2, 1, 0); // shuffle() mask elements must be the same size as the data elements
uchar4 swappedEndianness = shuffle(originalEndianness, mask);
// to be applied on a float/int-buffer somehow

Hoping there's a better solution out there.

Thanks in advance, runtimeterror

2 Answers
疯言疯语
#2 · 2019-08-26 01:56

Most processor architectures perform best with instructions that operate on a full register width, for example 32 or 64 bits. When a CPU or GPU executes byte-wise operations, such as the .wxyz subscripts on a uchar4, it has to mask each byte out of the integer, shift it, and then combine it into the result with an integer add or OR. For an endianness swap, the processor has to perform that AND, shift, and add/OR sequence four times because there are four bytes.
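To make the cost concrete, here is a minimal sketch of that byte-wise sequence in plain C (not OpenCL C, so it can be checked on the host). The function name swap32_bytewise is made up for illustration; it is one possible spelling of what the hardware ends up doing, not what any particular compiler emits.

```c
#include <stdint.h>

/* Byte-wise 32-bit swap (illustrative name): mask out each of the four
   bytes, shift it to its mirrored position, and OR the pieces together.
   That is four AND, four shift and three OR operations per value. */
static uint32_t swap32_bytewise(uint32_t n)
{
    return ((n & 0x000000FFu) << 24) |   /* byte 0 -> byte 3 */
           ((n & 0x0000FF00u) << 8)  |   /* byte 1 -> byte 2 */
           ((n & 0x00FF0000u) >> 8)  |   /* byte 2 -> byte 1 */
           ((n & 0xFF000000u) >> 24);    /* byte 3 -> byte 0 */
}
```

For example, swap32_bytewise(0xAABBCCDD) gives 0xDDCCBBAA, and applying it twice returns the original value.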

The most efficient way is as follows:

#define EndianSwap(n) (rotate((n) & 0x00FF00FF, 24U) | (rotate((n), 8U) & 0x00FF00FF))

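n can be any gentype with 32-bit elements, for example a uint4 variable (the masks and rotate counts assume 32-bit elements; for vector types the rotate count may have to be written as a vector of the same type). Because OpenCL C does not support C++-style function overloading, a macro is the best choice.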
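As a sanity check of the mask-and-rotate idea, here is a small plain C sketch for a single 32-bit value. rotl32 is a hypothetical helper standing in for OpenCL's built-in rotate(), and swap32_rotate is likewise an illustrative name; neither is part of OpenCL.

```c
#include <stdint.h>

/* Hypothetical stand-in for OpenCL's rotate(): rotate v left by k bits.
   Only valid for 0 < k < 32, which covers the two counts used below. */
static uint32_t rotl32(uint32_t v, unsigned k)
{
    return (v << k) | (v >> (32u - k));
}

/* Same structure as the macro: two rotates, two ANDs and one OR.
   Bytes 0 and 2 are rotated left by 24 (i.e. right by 8),
   bytes 1 and 3 are rotated left by 8 and then masked. */
static uint32_t swap32_rotate(uint32_t n)
{
    return rotl32(n & 0x00FF00FFu, 24u) | (rotl32(n, 8u) & 0x00FF00FFu);
}
```

With n = 0xAABBCCDD, the first term is 0xDD00BB00 and the second is 0x00CC00AA, which OR together to 0xDDCCBBAA.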

小情绪 Triste *
#3 · 2019-08-26 02:05

Sure. Since you have a uchar4, you can simply swizzle the components and write them back.

output[tid] = input[tid].wzyx;

Swizzling is also very cheap on SIMD architectures, so you should be able to combine it with other operations in your kernel.
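If the data is really a float or int buffer, the same reordering can be applied by treating each element as four raw bytes. Below is a rough host-side sketch in plain C of what the .wzyx swizzle does to each 4-byte element; swap_buffer_4byte is a made-up name, and in a kernel this would presumably correspond to viewing the buffer as uchar4 elements.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reverse the bytes of every 4-byte element in a buffer, in place.
   Only raw bytes are touched, so the element type (float, int32_t,
   uint32_t) does not matter. count is the number of 4-byte elements. */
static void swap_buffer_4byte(void *buf, size_t count)
{
    unsigned char *p = (unsigned char *)buf;
    for (size_t i = 0; i < count; ++i, p += 4) {
        unsigned char b[4];
        memcpy(b, p, 4);      /* copy out the original x, y, z, w bytes */
        p[0] = b[3];          /* w */
        p[1] = b[2];          /* z */
        p[2] = b[1];          /* y */
        p[3] = b[0];          /* x */
    }
}
```

Calling it twice on the same buffer restores the original contents, which is a handy check when wiring it into a read/write path.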

Hope this helps!
