I've got a flat array of byte RGB values that goes R1 G1 B1 R2 G2 B2 R3 G3 B3 ... Rn Gn Bn
. So my data looks like:
char imageData[WIDTH * HEIGHT * 3];
But I want to pass a WIDTH*HEIGHT array to an existing C library that expects a single plane of this data. That would be a sequence of just the R values (or just the G, or just the B).
It's easy enough to allocate a new array and copy the data (duh). But the images are very large. If it weren't a C library but took some kind of iteration interface to finesse the "slicing" traversal, that would be great. But I can't edit the code I'm calling...it wants a plain old pointer to a block of sequential memory.
HOWEVER I have write access to this array. It is viable to create a routine that would sort it into color planes. I'd also need a reverse transformation that would put it back, but by definition the same method that sorted it into planes could be applied to unsort it.
How efficiently can I (in place) turn this array into R1 R2 R3 ... Rn G1 G2 G3 ... Gn B1 B2 B3 ... Bn
and then back again? Any non-naive algorithms?
If you only need one plane, this seems pretty easy. If you need all 3 you will probably have better luck with a more sophisticated algorithm.
It shouldn't be too hard to run the loop backwards from high to low to reverse the process.
This paper "A Simple In-Place Algorithm for In-Shuffle" describes how to transpose matrix of 2*N and gives a hint how to do it for other cases, so 3*N may be also possible. This answer to other question shows that it is indeed possible.
Or use an algorithm which writes each value to its transposed place, then does the same for the value from that place, and so on until cycle is connected. Flag processed values in a bit vector. And continue until this vector is all 1s.
Both algorithms are not cache-friendly. Probably some clever use of PREFETCH instruction can improve this.
Edit:
C++, RGB to single planes, not optimized:
My original intent is to use a bit vector of size WIDTHHEIGHT, which gives overhead of WIDTHHEIGHT/8. But it is always possible to sacrifice speed for space. The bit vector may be of size WIDTH or HEIGHT or any desirable value, even 0. The trick is to maintain a pointer to the cell, before which all values are transposed. The bit vector is for cells, starting from this pointer. After it is all 1s, It is moved to next position, then all the algorithm steps are performed except actual data movement. And the bit vector is ready to continue transposing. This variant is O(N^2) instead of O(N).
Edit2:
PREFITCH optimization is not difficult to implement: just calculate indexes, invoke PREFETCH, and put indexes to a queue (ringbuffer), then get indexes from the queue and move data.
Edit3:
The idea of other algorithm, which is O(1) size, O(N*log(N)) time, is cache-friendly and may be faster than "cycle" algorithm (for image sizes < 1Gb):
this function do this R1 R2 R3 ... Rn G1 G2 G3 ... Gn B1 B2 B3 ... Bn,,