As you all might know that the MIPS instruction set supports clz (count leading zero) as follows:
clz $t0,$t1 count leading zeros t0 = # of leading zeros in t1
I am writing a single cycle datapath in verilog and was just wondering what the ALU needs to support in order for me to do this... any ideas??
Here's a possible approach (I'm ignoring the case of an input of 0, which is probably best treated as a special case):
- The number of leading zeros in a 32-bit number is either:
- the number of leading zeros in the top 16 bits, if any of the top 16 bits are non-zero; or
- 16, plus the number of leading zeros in the bottom 16 bits, if the top 16 bits are all zero
- That gives the top bit of the 5-bit result (ignoring the special case of an input of 0...).
- Now you need to find the number of leading zeros in a 16-bit number, so apply the same principle again.
- etc.
In Verilog, it might look something like this:
result[4] = (value[31:16] == 16'b0);
val16 = result[4] ? value[15:0] : value[31:16];
result[3] = (val16[15:8] == 8'b0);
val8 = result[3] ? val16[7:0] : val16[15:8];
result[2] = (val8[7:4] == 4'b0);
val4 = result[2] ? val8[3:0] : val8[7:4];
result[1] = (val4[3:2] == 2'b0);
result[0] = result[1] ? ~val4[1] : ~val4[3];
The simplest implementation I can think of (not very optimized) is checking the word against 32 (in case of 32-bit) masks, longest first, deciding which fits first and returning its number.
Something like (pseudocode):
if word == 0: return 32
elsif (word & 1) == 0: return 31
elsif (word & 3) == 0: return 30
etc.
Build a clz16 unit which looks at 16 bits, and has a 4-bit result (0..15) and 'allzero' output. Put two of these together to make clz32, you need a mux to select which 4 lower bits and a bit of logic for the upper 2 output bits.
The clz16 is made of two clz8 in the same way. The clz8 is made of two clz4.
The clz4 is just three boolean functions of <= 4 inputs, so it doesn't matter much how you do it, synth will boil it down to a few gates.
This hierarchical approach is larger than Matthew Slattery's solution with the cascaded muxes, but probably not by that much (it doesn't need the wide gates to switch the muxes), and I believe it allows a lower prop. delay. Both approaches scale well to larger sizes (e.g 64, 128 bits) with delay prop. to log2(n).