可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I'm not sure the exact term for what I'm trying to do. I have an 8x8
block of bits
stored in 8 bytes
, each byte stores one row. When I'm finished, I'd like each byte to store one column.
For example, when I'm finished:
Byte0out = Byte0inBit0 + Byte1inBit0 + Byte2inBit0 + Byte3inBit0 + ...
Byte1out = Byte0inBit1 + Byte1inBit1 + Byte2inBit1 + Byte3inBit1 + ...
What is the easiest way to do this in C which performs well?
回答1:
This code is cribbed directly from "Hacker's Delight" - Figure 7-2 Transposing an 8x8-bit matrix, I take no credit for it:
void transpose8(unsigned char A[8], int m, int n,
unsigned char B[8]) {
unsigned x, y, t;
// Load the array and pack it into x and y.
x = (A[0]<<24) | (A[m]<<16) | (A[2*m]<<8) | A[3*m];
y = (A[4*m]<<24) | (A[5*m]<<16) | (A[6*m]<<8) | A[7*m];
t = (x ^ (x >> 7)) & 0x00AA00AA; x = x ^ t ^ (t << 7);
t = (y ^ (y >> 7)) & 0x00AA00AA; y = y ^ t ^ (t << 7);
t = (x ^ (x >>14)) & 0x0000CCCC; x = x ^ t ^ (t <<14);
t = (y ^ (y >>14)) & 0x0000CCCC; y = y ^ t ^ (t <<14);
t = (x & 0xF0F0F0F0) | ((y >> 4) & 0x0F0F0F0F);
y = ((x << 4) & 0xF0F0F0F0) | (y & 0x0F0F0F0F);
x = t;
B[0]=x>>24; B[n]=x>>16; B[2*n]=x>>8; B[3*n]=x;
B[4*n]=y>>24; B[5*n]=y>>16; B[6*n]=y>>8; B[7*n]=y;
}
I didn't check if this rotates in the direction you need, if not you might need to adjust the code.
Also, keep in mind datatypes & sizes - int
& unsigned (int)
might not be 32 bits on your platform.
BTW, I suspect the book (Hacker's Delight) is essential for the kind of work you're doing... check it out, lots of great stuff in there.
回答2:
If you are looking for the simplest solution:
/* not tested, not even compiled */
char bytes_in[8];
char bytes_out[8];
/* please fill bytes_in[] here with some pixel-crap */
memset(bytes_out, 0, 8);
for(int i = 0; i < 8; i++) {
for(int j = 0; j < 8; j++) {
bytes_out[i] = (bytes_out[i] << 1) | ((bytes_in[j] >> (7 - i)) & 0x01);
}
}
If your are looking for the fastest solution:
How to transpose a bit matrix in the assembly by utilizing SSE2.
回答3:
Lisp prototype:
(declaim (optimize (speed 3) (safety 0)))
(defun bit-transpose (a)
(declare (type (simple-array unsigned-byte 1) a))
(let ((b (make-array 8 :element-type '(unsigned-byte 8))))
(dotimes (j 8)
(dotimes (i 8)
(setf (ldb (byte 1 i) (aref b j))
(ldb (byte 1 j) (aref a i)))))
b))
This is how you can run the code:
#+nil
(bit-transpose (make-array 8 :element-type 'unsigned-byte
:initial-contents '(1 2 3 4 5 6 7 8)))
;; => #(85 102 120 128 0 0 0 0)
Occasionally I disassemble code to check that there are no unnecessary calls to safety functions.
#+nil
(disassemble #'bit-transpose)
This is a benchmark. Run the function often enough to process a (binary) HDTV image.
#+nil
(time
(let ((a (make-array 8 :element-type 'unsigned-byte
:initial-contents '(1 2 3 4 5 6 7 8)))
(b (make-array 8 :element-type 'unsigned-byte
:initial-contents '(1 2 3 4 5 6 7 8))))
(dotimes (i (* (/ 1920 8) (/ 1080 8)))
(bit-transpose a))))
That took only took 51ms. Note that I'm consing quite a lot because the function allocates new 8 byte arrays all the time. I'm sure an implementation in C can be tweaked a lot more.
Evaluation took:
0.051 seconds of real time
0.052004 seconds of total run time (0.052004 user, 0.000000 system)
101.96% CPU
122,179,503 processor cycles
1,048,576 bytes consed
Here are some more test cases:
#+nil
(loop for j below 12 collect
(let ((l (loop for i below 8 collect (random 255))))
(list l (bit-transpose (make-array 8 :element-type 'unsigned-byte
:initial-contents l)))))
;; => (((111 97 195 202 47 124 113 164) #(87 29 177 57 96 243 111 140))
;; ((180 192 70 173 167 41 30 127) #(184 212 221 232 193 185 134 27))
;; ((244 86 149 57 191 65 129 178) #(124 146 23 24 159 153 35 213))
;; ((227 244 139 35 38 65 214 64) #(45 93 82 4 66 27 227 71))
;; ((207 62 236 89 50 64 157 120) #(73 19 71 207 218 150 173 69))
;; ((89 211 149 140 233 72 193 192) #(87 2 12 57 7 16 243 222))
;; ((97 144 19 13 135 198 238 33) #(157 116 120 72 6 193 97 114))
;; ((145 119 3 85 41 202 79 134) #(95 230 202 112 11 18 106 161))
;; ((42 153 67 166 175 190 114 21) #(150 125 184 51 226 121 68 58))
;; ((58 232 38 210 137 254 19 112) #(80 109 36 51 233 167 170 58))
;; ((27 245 1 197 208 221 21 101) #(239 1 234 33 115 130 186 58))
;; ((66 204 110 232 46 67 37 34) #(96 181 86 30 0 220 47 10)))
Now I really want to see how my code compares to Andrejs Cainikovs' C solution
(Edit: I think its wrong):
#include <string.h>
unsigned char bytes_in[8]={1,2,3,4,5,6,7,8};
unsigned char bytes_out[8];
/* please fill bytes_in[] here with some pixel-crap */
void bit_transpose(){
memset(bytes_out, 0, 8);
int i,j;
for(i = 0; i < 8; i++)
for(j = 0; j < 8; j++)
bytes_out[i] = (bytes_out[i] << 1) | ((bytes_in[j] >> (7 - i)) & 0x01);
}
int
main()
{
int j,i;
for(j=0;j<100;j++)
for(i=0;i<(1920/8*1080/8);i++)
bit_transpose();
return 0;
}
And benchmarking it:
wg@hp:~/0803/so$ gcc -O3 trans.c
wg@hp:~/0803/so$ time ./a.out
real 0m0.249s
user 0m0.232s
sys 0m0.000s
Each loop over the HDTV image takes 2.5ms. That is quite a lot faster than my unoptimized Lisp.
Unfortunately the C code doesn't give the same results like my lisp:
#include <stdio.h>
int
main()
{
int j,i;
bit_transpose();
for(i=0;i<8;i++)
printf("%d ",(int)bytes_out[i]);
return 0;
}
wg@hp:~/0803/so$ ./a.out
0 0 0 0 1 30 102 170
回答4:
This sounds a lot like a so-called "Chunky to planar" routine used on displays that use bitplanes. The following link uses MC68K assembler for its code, but provides a nice overview of the problem (assuming I understood the question correctly):
http://membres.multimania.fr/amycoders/sources/c2ptut.html
回答5:
You really want to do something like this with SIMD instructions with something like the GCC vector vector support: http://ds9a.nl/gcc-simd/example.html
回答6:
If you wanted an optimized solution you would use the SSE extensions in x86.
You'd need to use 4 of these SIMD opcodes.
MOVQ - move 8 bytes
PSLLW - packed shift left logical words
PMOVMSKB - packed move mask byte
And 2 regular x86 opcodes
LEA - load effective address
MOV - move
byte[] m = byte[8]; //input
byte[] o = byte[8]; //output
LEA ecx, [o]
// ecx = the address of the output array/matrix
MOVQ xmm0, [m]
// xmm0 = 0|0|0|0|0|0|0|0|m[7]|m[6]|m[5]|m[4]|m[3]|m[2]|m[1]|m[0]
PMOVMSKB eax, xmm0
// eax = m[7][7]...m[0][7] the high bit of each byte
MOV [ecx+7], al
// o[7] is now the last column
PSLLW xmm0, 1
// shift 1 bit to the left
PMOVMSKB eax, xmm0
MOV [ecx+6], al
PSLLW xmm0, 1
PMOVMSKB eax, xmm0
MOV [ecx+5], al
PSLLW xmm0, 1
PMOVMSKB eax, xmm0
MOV [ecx+4], al
PSLLW xmm0, 1
PMOVMSKB eax, xmm0
MOV [ecx+3], al
PSLLW xmm0, 1
PMOVMSKB eax, xmm0
MOV [ecx+2], al
PSLLW xmm0, 1
PMOVMSKB eax, xmm0
MOV [ecx+1], al
PSLLW xmm0, 1
PMOVMSKB eax, xmm0
MOV [ecx], al
25 x86 opcodes/instructions as opposed to the stacked for...loop solution with 64 iterations.
Sorry the notation is not the ATT style syntax that c/c++ compilers accept.
回答7:
This is similar to the get column in a bitboard problem and can be solved efficiently by considering those input bytes as 8 bytes of a 64-bit integer. If bit 0 is the least significant one and byte 0 is the first byte in the array then I assume you want to do the following
b07 b06 b05 b04 b03 b02 b01 b00 b70 b60 b50 b40 b30 b20 b10 b00
b17 b16 b15 b14 b13 b12 b11 b10 b71 b61 b51 b41 b31 b21 b11 b01
b27 b26 b25 b24 b23 b22 b21 b20 b72 b62 b52 b42 b32 b22 b12 b02
b37 b36 b35 b34 b33 b32 b31 b30 => b73 b63 b53 b43 b33 b23 b13 b03
b47 b46 b45 b44 b43 b42 b41 b40 => b74 b64 b54 b44 b34 b24 b14 b04
b57 b56 b55 b54 b53 b52 b51 b50 b75 b65 b55 b45 b35 b25 b15 b05
b67 b66 b65 b64 b63 b62 b61 b60 b76 b66 b56 b46 b36 b26 b16 b06
b77 b76 b75 b74 b73 b72 b71 b70 b77 b67 b57 b47 b37 b27 b17 b07
with bXY is byte X's bit number Y. Masking out all the first 7 columns and read the array as an uint64_t we'll have
0000000h 0000000g 0000000f 0000000e 0000000d 0000000c 0000000b 0000000a
in little endian, with abcdefgh
are b00 to b70 respectively. Now we just need to multiply that value with the magic number 0x2040810204081 to make a value with hgfedcba
in the MSB which is the flipped form in the result
uint8_t get_byte(uint64_t matrix, unsigned col)
{
const uint64_t column_mask = 0x8080808080808080ull;
const uint64_t magic = 0x2040810204081ull;
return ((matrix << (7 - col)) & column_mask) * magic >> 56;
}
// You may need to change the endianness if you address the data in a different way
uint64_t block8x8 = ((uint64_t)byte[7] << 56) | ((uint64_t)byte[6] << 48)
| ((uint64_t)byte[5] << 40) | ((uint64_t)byte[4] << 32)
| ((uint64_t)byte[3] << 24) | ((uint64_t)byte[2] << 16)
| ((uint64_t)byte[1] << 8) | (uint64_t)byte[0];
for (int i = 0; i < 8; i++)
byte_out[i] = get_byte(block8x8, i);
In reality you should read directly into an 8-byte array so that you don't need to combine the bytes later, but you need to align the array properly
In AVX2 Intel introduced the PDEP instruction (accessible via the _pext_u64
intrinsic) in the BMI2 instruction set for this purpose so the function can be done in a single instruction
data[i] = _pext_u64(matrix, column_mask << (7 - col));
More ways to transpose the array can be found in the chess programming wiki