Calculating Hamming weight efficiently in matlab

2020-02-04 11:05发布

Given a MATLAB uint32 to be interpreted as a bit string, what is an efficient and concise way of counting how many nonzero bits are in the string?

I have a working, naive approach which loops over the bits, but that's too slow for my needs. (A C++ implementation using std::bitset count() runs almost instantly).

I've found a pretty nice page listing various bit counting techniques, but I'm hoping there is an easy MATLAB-esque way.

http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive


Update #1

Just implemented the Brian Kernighan algorithm as follows:

w = 0;
while ( bits > 0 )
    bits = bitand( bits, bits-1 );
    w = w + 1;
end

Performance is still crappy, over 10 seconds to compute just 4096^2 weight calculations. My C++ code using count() from std::bitset does this in subsecond time.


Update #2

Here is a table of run times for the techniques I've tried so far. I will update it as I get additional ideas/suggestions.

Vectorized Scheiner algorithm                =>    2.243511 sec
Vectorized Naive bitget loop                 =>    7.553345 sec
Kernighan algorithm                          =>   17.154692 sec
length( find( bitget( val, 1:32 ) ) )        =>   67.368278 sec
nnz( bitget( val, 1:32 ) )                   =>  349.620259 sec
Justin Scheiner's algorithm, unrolled loops  =>  370.846031 sec
Justin Scheiner's algorithm                  =>  398.786320 sec
Naive bitget loop                            =>  456.016731 sec
sum(dec2bin(val) == '1')                     => 1069.851993 sec


Comment: The dec2bin() function in MATLAB seems to be very poorly implemented. It runs extremely slow.

Comment: The "Naive bitget loop" algorithm is implemented as follows:

w=0;
for i=1:32
   if bitget( val, i ) == 1
       w = w + 1;
   end
end

Comment: The loop unrolled version of Scheiner's algorithm looks as follows:

function w=computeWeight( val )
w = val;
w = bitand(bitshift(w, -1), uint32(1431655765)) + ...
    bitand(w, uint32(1431655765));

w = bitand(bitshift(w, -2), uint32(858993459)) + ...
    bitand(w, uint32(858993459));

w = bitand(bitshift(w, -4), uint32(252645135)) + ...
    bitand(w, uint32(252645135));

w = bitand(bitshift(w, -8), uint32(16711935)) + ...
    bitand(w, uint32(16711935));

w = bitand(bitshift(w, -16), uint32(65535)) + ...
    bitand(w, uint32(65535));

9条回答
手持菜刀,她持情操
2楼-- · 2020-02-04 11:40

Implemented the "Best 32 bit Algorithm" from the Stanford link at the top. The improved algorithm reduced processing time by 6%. Also optimized the segment size and found that 32K is stable and improves time by 15% over 4K. Expect 4Kx4K time to be 40% of Vectorized Scheiner Algorithm.

function w = Ham(w)
% Input uint32
% Output vector of Ham wts
 for i=1:32768:length(w)
  w(i:i+32767)=Ham_seg(w(i:i+32767));
 end
end

% Segmentation gave reduced time by 50%

function w=Ham_seg(w)
 %speed
 b1=uint32(1431655765); 
 b2=uint32(858993459);
 b3=uint32(252645135);
 b7=uint32(63); % working orig binary mask

 w = bitand(bitshift(w, -1), b1) + bitand(w, b1);
 w = bitand(bitshift(w, -2), b2) + bitand(w, b2);
 w =bitand(w+bitshift(w, -4),b3);
 w =bitand(bitshift(w,-24)+bitshift(w,-16)+bitshift(w,-8)+w,b7);

end
查看更多
贼婆χ
3楼-- · 2020-02-04 11:42

Did some timing comparisons on Matlab Cody. Determined a Segmented Modified Vectorized Scheiner gives optimimum performance.

Have >50% time reduction based on Cody 1.30 sec to 0.60 sec change for an L=4096*4096 vector.

function w = Ham(w)
% Input uint32
% Output vector of Ham wts

 b1=uint32(1431655765); % evaluating saves 15% of time 1.30 to 1.1 sec
 b2=uint32(858993459);
 b3=uint32(252645135);
 b4=uint32(16711935);
 b5=uint32(65535);

 for i=1:4096:length(w)
  w(i:i+4095)=Ham_seg(w(i:i+4095),b1,b2,b3,b4,b5);
 end
end

% Segmentation reduced time by 50%

function w=Ham_seg(w,b1,b2,b3,b4,b5)
 % Passing variables or could evaluate b1:b5 here


 w = bitand(bitshift(w, -1), b1) + bitand(w, b1);
 w = bitand(bitshift(w, -2), b2) + bitand(w, b2);
 w = bitand(bitshift(w, -4), b3) + bitand(w, b3);
 w = bitand(bitshift(w, -8), b4) + bitand(w, b4);
 w = bitand(bitshift(w, -16), b5) + bitand(w, b5);

end





vt=randi(2^32,[4096*4096,1])-1;
% for vt being uint32 the floor function gives unexpected values
tic
v=num_ones(mod(vt,65536)+1)+num_ones(floor(vt/65536)+1); % 0.85 sec
toc
% a corrected method is
v=num_ones(mod(vt,65536)+1)+num_ones(floor(double(vt)/65536)+1);
toc
查看更多
Animai°情兽
4楼-- · 2020-02-04 11:44
num_ones=uint8(zeros(intmax('uint32')/2^6,1));
% one time load of array not implemented here
tic
for i=1:4096*4096
 %v=num_ones(rem(i,64)+1)+num_ones(floor(i/64)+1); % 1.24 sec
 v=num_ones(mod(i,64)+1)+num_ones(floor(i/64)+1); % 1.20 sec
end
toc
tic
num_ones=uint8(zeros(65536,1));
for i=0:65535
 num_ones(i+1)=length( find( bitget( i, 1:32 ) ) ) ;
end
toc
% 0.43 sec to load
% smaller array to initialize
% one time load of array
tic
for i=1:4096*4096
 v=num_ones(mod(i,65536)+1)+num_ones(floor(i/65536)+1); %  0.95 sec
 %v=num_ones(mod(i,65536)+1)+num_ones(bitshift(i,-16)+1); % 16 sec for 4K*1K
end
toc
%vectorized
tic
num_ones=uint8(zeros(65536,1));
for i=0:65535
 num_ones(i+1)=length( find( bitget( i, 1:32 ) ) ) ;
end % 0.43 sec
toc
vt=randi(2^32,[4096*4096,1])-1;
tic
v=num_ones(mod(vt,65536)+1)+num_ones(floor(vt/65536)+1); % 0.85 sec
toc
查看更多
Explosion°爆炸
5楼-- · 2020-02-04 11:45

Try splitting the job into smaller parts. My guess is that if you want to process all data at once, matlab is trying to do each operation on all integers before taking successive steps and the processor's cache is invalidated with each step.

for i=1:4096,
    «process bits(i,:)»
end
查看更多
混吃等死
6楼-- · 2020-02-04 11:46

Unless this is a MATLAB implementation exercise, you might want to just take your fast C++ implementation and compile it as a mex function, once per target platform.

查看更多
Animai°情兽
7楼-- · 2020-02-04 11:49

I'd be interested to see how fast this solution is:

function r = count_bits(n)

shifts = [-1, -2, -4, -8, -16];
masks = [1431655765, 858993459, 252645135, 16711935, 65535];

r = n;
for i=1:5
   r = bitand(bitshift(r, shifts(i)), masks(i)) + ...
      bitand(r, masks(i));
end

Going back, I see that this is the 'parallel' solution given on the bithacks page.

查看更多
登录 后发表回答