I want to design a synthesizable module in Verilog which will take only one cycle in calculating square root of given input of 32 bit.
相关问题
- Do the Java Integer and Double objects have unnece
- Finding k smallest elements in a min heap - worst-
- binary search tree path list
- High cost encryption but less cost decryption
- How to get a fixed number of evenly spaced points
相关文章
- What are the problems associated to Best First Sea
- Coin change DP solution to keep track of coins
- Algorithm for partially filling a polygonal mesh
- Robust polygon normal calculation
- Algorithm for maximizing coverage of rectangular a
- How to measure complexity of a string?
- Select unique/deduplication in SSE/AVX
- How to smooth the blocks of a 3D voxel world?
[Edit1] repaired code
Recently found the results where off even if tests determine all was OK so I dig deeper and found out that I had a silly bug in my equation and due to name conflicts with my pgm environment the tests got false positives so I overlooked it before. Now it work in all cases as it should.
The best thing I can think of (except approximation or large LUT) is binary search without multiplication, here C++ code:
Standard binary search
sqrt(xx)
is setting bits ofx
from MSB to LSB so that result ofx*x <= xx
. Luckily we can avoid the multiplication by simply rewrite the thing as incrementing multiplicant... in each iteration the olderx*x
result can be used like this:Where
x0
is value ofx
from last iteration andx1
is actual value. Them
is weight of actual processed bit. The(2*m)
and(m*m)
are constant and can be used as LUT and bit-shift so no need to multiply. Only addition is needed. Sadly the iteration is bound to sequential computation forbid paralelisation so the result is16T
at best.In the code
a0
represents lastx*x
anda1
represents actual iteratedx*x
As you can see the
sqrt
is done in16 x (BitShiftLeft,BitShiftRight,OR,Plus,Compare)
where the bit shift and LUT can be hardwired.If you got super fast gates for this in comparison to the rest you can multiply the input clock by
16
and use that as internal timing for SQRT module. Something similar to the old days when there was MC clock as Division of source CPU clock in old Intel CPU/MCUs ... This way you can get1T
timing (or multiple of it depends on the multiplication ratio).I got the code here it is
There is conversion to a logarithm, halving, and converting back.
For an idea how to implement "combinatorial" log and antilog, see Michael Dunn's EDN article showing priority encoder, barrel shifter & lookup table, with three log variants in System Verilog for down-load.
(Priority encoder, barrel shifter & lookup table look promising for "one-step-Babylonian/Heron/Newton/-Raphson. But that would probably still need a 128K by 9 bits lookup table.)
While not featuring "verilog",
Tole Sutikno: "An Optimized Square Root Algorithm for Implementation in FPGA Hardware" shows a combinatorial implementation of a modified (binary) digit-by-digit algorithm.
The usual means of doing this in hardware is using a CORDIC. A general implementation allows the calculation of a variety of transcendental functions (cos/sin/tan) and... square roots depending on how you initialize and operate the CORDIC.
It's an iterative algorithm so to do it in a single cycle you'd unroll the loop into as many iterations as you require for your desired precision and chain the instances together.
Specifically if you operated the CORDIC in vectoring mode, initialize it with [x, 0] and rotate to 45 degrees the [x', y'] final output will be a multiplicative constant away. i.e. sqrt(x) = x' * sqrt(2) * K