Fastest way to get the integer part of sqrt(n)?

Posted 2019-01-10 09:09

Question:

As we know, if n is not a perfect square, then sqrt(n) is not an integer. Since I need only the integer part, I feel that calling sqrt(n) might not be that fast, as it also spends time computing the fractional part.

So my question is,

Can we get only the integer part of sqrt(n) without calculating the actual value of sqrt(n)? The algorithm should be faster than the sqrt(n) defined in <math.h> or <cmath>.

If possible, feel free to write the code in an asm block as well.

Answer 1:

I would try the Fast Inverse Square Root trick.

It's a way to get a very good approximation of 1/sqrt(n) without any branches; it is based on bit-twiddling, so it is not portable (notably between 32-bit and 64-bit platforms).

Once you have it, you just need to invert the result and take the integer part.

There might be faster tricks, of course, since this one is a bit of a roundabout.

EDIT: let's do it!

First a little helper:

// benchmark.h
#include <cstddef>    // size_t
#include <sys/time.h>

template <typename Func>
double benchmark(Func f, size_t iterations)
{
  f(); // warm-up call, not timed

  timeval a, b;
  gettimeofday(&a, 0);
  for (; iterations --> 0;)
  {
    f();
  }
  gettimeofday(&b, 0);
  return (b.tv_sec * (unsigned int)1e6 + b.tv_usec) -
         (a.tv_sec * (unsigned int)1e6 + a.tv_usec);
}

Then the main body:

#include <iostream>

#include <cassert>   // assert
#include <cmath>
#include <cstdint>   // std::int32_t
#include <cstdlib>   // atoi
#include <cstring>   // std::memcpy

#include "benchmark.h"

class Sqrt
{
public:
  Sqrt(int n): _number(n) {}

  int operator()() const
  {
    double d = _number;
    return static_cast<int>(std::sqrt(d) + 0.5); // note: rounds to nearest, not floor
  }

private:
  int _number;
};

// http://www.codecodex.com/wiki/Calculate_an_integer_square_root
class IntSqrt
{
public:
  IntSqrt(int n): _number(n) {}

  int operator()() const 
  {
    int remainder = _number;
    if (remainder < 0) { return 0; }

    int place = 1 << (sizeof(int) * 8 - 2);

    while (place > remainder) { place /= 4; }

    int root = 0;
    while (place)
    {
      if (remainder >= root + place)
      {
        remainder -= root + place;
        root += place*2;
      }
      root /= 2;
      place /= 4;
    }
    return root;
  }

private:
  int _number;
};

// http://en.wikipedia.org/wiki/Fast_inverse_square_root
class FastSqrt
{
public:
  FastSqrt(int n): _number(n) {}

  int operator()() const
  {
    float number = _number;

    float x2 = number * 0.5F;
    float y = number;
    std::int32_t i;                        // must be 32 bits; 'long' is 64-bit on many platforms
    std::memcpy(&i, &y, sizeof i);         // type-pun via memcpy instead of an aliasing pointer cast
    //i = 0x5fe6ec85e7de30da - (i >> 1);   // corresponding constant for the double-precision version
    i = 0x5f3759df - (i >> 1);
    std::memcpy(&y, &i, sizeof y);

    y = y * (1.5F - (x2*y*y));
    y = y * (1.5F - (x2*y*y)); // let's be precise

    return static_cast<int>(1/y + 0.5f);
  }

private:
  int _number;
};


int main(int argc, char* argv[])
{
  if (argc != 3) {
    std::cerr << "Usage: " << argv[0] << " integer iterations\n";
    return 1;
  }

  int n = atoi(argv[1]);
  int it = atoi(argv[2]);

  assert(Sqrt(n)() == IntSqrt(n)() &&
          Sqrt(n)() == FastSqrt(n)() && "Different Roots!");
  std::cout << "sqrt(" << n << ") = " << Sqrt(n)() << "\n";

  double time = benchmark(Sqrt(n), it);
  double intTime = benchmark(IntSqrt(n), it);
  double fastTime = benchmark(FastSqrt(n), it);

  std::cout << "Number iterations: " << it << "\n"
               "Sqrt computation : " << time << "\n"
               "Int computation  : " << intTime << "\n"
               "Fast computation : " << fastTime << "\n";

  return 0;
}

And the results:

sqrt(82) = 9
Number iterations: 4096
Sqrt computation : 56
Int computation  : 217
Fast computation : 119

// Note: had to tweak the program here, as the Int version returns -1 for this input :/
sqrt(2147483647) = 46341 // real answer sqrt(2 147 483 647) = 46 340.95
Number iterations: 4096
Sqrt computation : 57
Int computation  : 313
Fast computation : 119

As expected, the Fast computation performs much better than the Int computation.

Oh, and by the way, sqrt is faster :)



Answer 2:

Edit: this answer is foolish. Use (int)sqrt(i) instead: after profiling with proper settings (-march=native -m64 -O3), it was a lot faster than the code below.


Alright, a bit of an old question, but the "fastest" answer has not been given yet. The fastest (I think) is the Binary Square Root algorithm, explained fully in this Embedded.com article.

It basically comes down to this:

#include <stdint.h>

/* 'a' must be a 32-bit value: on LP64 platforms 'unsigned long' is 64 bits,
   which breaks the "a >> 30" step, so uint32_t is used explicitly. */
unsigned short isqrt(uint32_t a) {
    uint32_t rem = 0;
    uint32_t root = 0;
    int i;

    for (i = 0; i < 16; i++) {
        root <<= 1;
        rem <<= 2;
        rem += a >> 30;   /* bring down the next two bits of the input */
        a <<= 2;

        if (root < rem) {
            root++;
            rem -= root;
            root++;
        }
    }

    return (unsigned short) (root >> 1);
}

On my machine (Q6600, Ubuntu 10.10) I profiled by taking the square root of the numbers 1-100000000. Using isqrt(i) took 2750 ms. Using (unsigned short) sqrt((float) i) took 3600 ms. This was done using g++ -O3. With the -ffast-math compile option the times were 2100 ms and 3100 ms respectively. Note this is without using even a single line of assembler, so it could probably still be much faster.

The above code works for both C and C++ and with minor syntax changes also for Java.

What works even better for a limited range is a binary search. On my machine this blows the version above out of the water by a factor of 4. Sadly it's very limited in range:

#include <stdint.h>

const uint16_t squares[] = {
    0, 1, 4, 9,
    16, 25, 36, 49,
    64, 81, 100, 121,
    144, 169, 196, 225,
    256, 289, 324, 361,
    400, 441, 484, 529,
    576, 625, 676, 729,
    784, 841, 900, 961,
    1024, 1089, 1156, 1225,
    1296, 1369, 1444, 1521,
    1600, 1681, 1764, 1849,
    1936, 2025, 2116, 2209,
    2304, 2401, 2500, 2601,
    2704, 2809, 2916, 3025,
    3136, 3249, 3364, 3481,
    3600, 3721, 3844, 3969,
    4096, 4225, 4356, 4489,
    4624, 4761, 4900, 5041,
    5184, 5329, 5476, 5625,
    5776, 5929, 6084, 6241,
    6400, 6561, 6724, 6889,
    7056, 7225, 7396, 7569,
    7744, 7921, 8100, 8281,
    8464, 8649, 8836, 9025,
    9216, 9409, 9604, 9801,
    10000, 10201, 10404, 10609,
    10816, 11025, 11236, 11449,
    11664, 11881, 12100, 12321,
    12544, 12769, 12996, 13225,
    13456, 13689, 13924, 14161,
    14400, 14641, 14884, 15129,
    15376, 15625, 15876, 16129,
    16384, 16641, 16900, 17161,
    17424, 17689, 17956, 18225,
    18496, 18769, 19044, 19321,
    19600, 19881, 20164, 20449,
    20736, 21025, 21316, 21609,
    21904, 22201, 22500, 22801,
    23104, 23409, 23716, 24025,
    24336, 24649, 24964, 25281,
    25600, 25921, 26244, 26569,
    26896, 27225, 27556, 27889,
    28224, 28561, 28900, 29241,
    29584, 29929, 30276, 30625,
    30976, 31329, 31684, 32041,
    32400, 32761, 33124, 33489,
    33856, 34225, 34596, 34969,
    35344, 35721, 36100, 36481,
    36864, 37249, 37636, 38025,
    38416, 38809, 39204, 39601,
    40000, 40401, 40804, 41209,
    41616, 42025, 42436, 42849,
    43264, 43681, 44100, 44521,
    44944, 45369, 45796, 46225,
    46656, 47089, 47524, 47961,
    48400, 48841, 49284, 49729,
    50176, 50625, 51076, 51529,
    51984, 52441, 52900, 53361,
    53824, 54289, 54756, 55225,
    55696, 56169, 56644, 57121,
    57600, 58081, 58564, 59049,
    59536, 60025, 60516, 61009,
    61504, 62001, 62500, 63001,
    63504, 64009, 64516, 65025
};

inline int isqrt(uint16_t x) {
    const uint16_t *p = squares;

    if (p[128] <= x) p += 128;
    if (p[ 64] <= x) p +=  64;
    if (p[ 32] <= x) p +=  32;
    if (p[ 16] <= x) p +=  16;
    if (p[  8] <= x) p +=   8;
    if (p[  4] <= x) p +=   4;
    if (p[  2] <= x) p +=   2;
    if (p[  1] <= x) p +=   1;

    return p - squares;
}

A 32 bit version can be downloaded here: https://gist.github.com/3481770



Answer 3:

While I suspect you can find plenty of options by searching for "fast integer square root", here are some potentially new ideas that might work well (each can be used independently, or you may be able to combine them):

  1. Make a static const array of all the perfect squares in the domain you want to support, and perform a fast branchless binary search on it. The resulting index in the array is the square root.
  2. Convert the number to floating point and break it into mantissa and exponent. Halve the exponent and multiply the mantissa by some magic factor (your job to find it). This should give you a very close approximation. Include a final step to adjust it if it's not exact (or use it as a starting point for the binary search above); a rough sketch follows below.
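
For what it's worth, here is a minimal sketch of idea 2. The function name, the 0.59 + 0.42*m linear fit standing in for the "magic factor", and the Newton/adjustment clean-up are my own illustration, not a tuned implementation: split n into mantissa and exponent with frexp, make the exponent even, scale the mantissa, then sharpen the guess and adjust it to the exact floor.

#include <cmath>
#include <cstdint>

static std::uint32_t isqrt_frexp(std::uint32_t n) {
    if (n < 2) return n;
    int e;
    double m = std::frexp(static_cast<double>(n), &e); // n = m * 2^e, m in [0.5, 1)
    if (e & 1) { m *= 2.0; --e; }                      // make the exponent even
    // crude linear "magic factor" for sqrt(m) on [0.5, 2)
    std::uint32_t r = static_cast<std::uint32_t>(std::ldexp(0.59 + 0.42 * m, e / 2));
    if (r == 0) r = 1;
    r = (r + n / r) / 2;                               // two Newton steps sharpen the guess
    r = (r + n / r) / 2;
    while (r > n / r) --r;                             // final adjustment: largest r with r*r <= n
    while (r + 1 <= n / (r + 1)) ++r;
    return r;
}

The trailing adjustment loops guarantee the exact integer part even though the initial fit is rough, and they only run a couple of steps because the Newton iterations already land within a few units of the answer.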


Answer 4:

A Google search turns up good articles such as "Calculate an integer square root", which discuss many possible ways of computing it quickly and point to good references. I doubt anyone here can do better than those (and anyone who could would probably publish a paper about it first). But if you read them and something is ambiguous, then maybe we can help you.



Answer 5:

If you don't mind an approximation, how about this integer sqrt function I cobbled together?

int sqrti(int x)
{
    union { float f; int x; } v; 

    // convert to float (note: assumes x > 0; zero and negative inputs are not handled)
    v.f = (float)x;

    // fast aprox sqrt
    //  assumes float is in IEEE 754 single precision format 
    //  assumes int is 32 bits
    //  b = exponent bias
    //  m = number of mantissa bits
    v.x  -= 1 << 23; // subtract 2^m 
    v.x >>= 1;       // divide by 2
    v.x  += 1 << 29; // add ((b + 1) / 2) * 2^m

    // convert to int
    return (int)v.f;
}

It uses the algorithm described in this Wikipedia article. On my machine it's almost twice as fast as sqrt :)



Answer 6:

To do integer sqrt you can use this specialization of Newton's method:

def isqrt(N):
    a = 1
    b = N
    while abs(a - b) > 1:
        b = N // a
        a = (a + b) // 2
    while a * a > N:    # the loop can overshoot by one; step down to the floor
        a -= 1
    return a

Basically, for any guess x, sqrt(N) lies between x and N/x, so each iteration replaces the guess with their average. It's sort of like a binary search, but it converges much faster.

Once the guess is close, convergence is quadratic, so only a handful of iterations remain (although starting from a = 1 and b = N, the early iterations roughly halve the guess, so getting close takes on the order of log(N) steps). It also doesn't use floating point at all, and it works just as well for arbitrary-precision integers.



Answer 7:

Why has nobody suggested the quickest method?

If:

  1. the range of numbers is limited
  2. memory consumption is not crucial
  3. application launch time is not critical

then create an int[MAX_X] array filled (at launch) with sqrt(x) for every x (you don't even need to call sqrt() to build it; a sketch follows below).

All these conditions fit my program quite well. In particular, an int[10000000] array consumes only 40 MB.
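
A minimal sketch of that table idea (MAX_X and the incremental fill loop are my own illustration, not this answer's code): build floor(sqrt(x)) for every x below MAX_X once at startup, without ever calling sqrt(), and afterwards every query is a single array load.

#include <vector>

static const int MAX_X = 10000000;

std::vector<int> build_isqrt_table() {
    std::vector<int> table(MAX_X);
    int r = 0;
    for (int x = 0; x < MAX_X; ++x) {
        if ((long long)(r + 1) * (r + 1) <= x) ++r;  // next perfect square reached
        table[x] = r;                                 // table[x] == floor(sqrt(x))
    }
    return table;
}

// usage: static const std::vector<int> isqrt_table = build_isqrt_table();
//        int root = isqrt_table[x];   // valid for 0 <= x < MAX_X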

What are your thoughts on this?



Answer 8:

In many cases the exact integer sqrt value is not even needed; a good approximation of it is enough. (For example, this often happens in DSP optimization, when a 32-bit signal should be compressed to 16 bits, or 16-bit to 8-bit, without losing much precision around zero.)

I've found this useful equation:

k = ceil(MSB(n)/2)                          - MSB(n) is the most significant bit of "n"

sqrt(n) ~= 2^(k-2) + (2^(k-1)) * n / 2^(2*k)  - all multiplications and divisions here are by powers of two, so they are very DSP-friendly.

This equation generates a smooth curve (n, sqrt(n)); its values are not very different from the real sqrt(n), so it can be useful when approximate accuracy is enough.
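
For illustration only, here is a literal translation of that equation into code. Everything in it beyond the formula itself is my own assumption: MSB(n) is read as the 1-based index of the highest set bit, GCC/Clang's __builtin_clz is used to find it, and very small inputs are special-cased so no shift count goes negative.

#include <cstdint>

static std::uint32_t approx_sqrt_pow2(std::uint32_t n) {
    if (n < 4) return n ? 1 : 0;                  // keep every shift count non-negative
    int msb = 32 - __builtin_clz(n);              // 1-based index of the highest set bit
    int k = (msb + 1) / 2;                        // k = ceil(MSB(n) / 2)
    // sqrt(n) ~= 2^(k-2) + 2^(k-1) * n / 2^(2*k): shifts only, as the answer notes
    return (1u << (k - 2)) +
           static_cast<std::uint32_t>((static_cast<std::uint64_t>(n) << (k - 1)) >> (2 * k));
}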



Answer 9:

If you need performance when computing square roots, I guess you will compute a lot of them. Then why not cache the answers? I don't know the range of N in your case, nor whether you will compute the square root of the same integer many times, but if so, you can cache the result each time your method is called (an array would be the most efficient, if the range is not too large).
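
As a minimal sketch of the caching idea (my own illustration, using an unordered_map so it works for an arbitrary range; a plain array indexed by n, as suggested above, is faster when the range is small):

#include <cmath>
#include <unordered_map>

int cached_isqrt(int n) {
    static std::unordered_map<int, int> cache;    // n -> floor(sqrt(n))
    auto it = cache.find(n);
    if (it != cache.end()) return it->second;     // seen before: reuse the stored root
    int root = static_cast<int>(std::sqrt(static_cast<double>(n)));
    cache.emplace(n, root);
    return root;
}

Note that a hash lookup is not free, so this only pays off when the same values recur often; otherwise the array-based table is the better fit for this idea.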



Answer 10:

This is so short that it gets inlined 99% of the time.

static inline int sqrtn(int num) {
    int i;
    __asm__ (
        "pxor %%xmm0, %%xmm0\n\t"   // clean xmm0 for cvtsi2ss
        "cvtsi2ss %1, %%xmm0\n\t"   // convert num to float, put it to xmm0
        "sqrtss %%xmm0, %%xmm0\n\t" // square root xmm0
        "cvttss2si %%xmm0, %0"      // float to int
        :"=r"(i):"r"(num):"%xmm0"); // i: result, num: input, xmm0: scratch register
    return i;
}

Why clear xmm0 first? From the documentation of cvtsi2ss:

The destination operand is an XMM register. The result is stored in the low doubleword of the destination operand, and the upper three doublewords are left unchanged.

GCC Intrinsic version (runs only on GCC):

#include <xmmintrin.h>
int sqrtn2(int num) {
    register __v4sf xmm0 = {0, 0, 0, 0};
    xmm0 = __builtin_ia32_cvtsi2ss(xmm0, num);
    xmm0 = __builtin_ia32_sqrtss(xmm0);
    return __builtin_ia32_cvttss2si(xmm0);
}

Intel Intrinsic version (tested on GCC, Clang, ICC):

#include <xmmintrin.h>
int sqrtn2(int num) {
    register __m128 xmm0 = _mm_setzero_ps();
    xmm0 = _mm_cvt_si2ss(xmm0, num);
    xmm0 = _mm_sqrt_ss(xmm0);
    return _mm_cvt_ss2si(xmm0);
}

^^^^ All of the above require only SSE1 (not even SSE2).



Answer 11:

On my computer with gcc, with -ffast-math, converting a 32-bit integer to float and using sqrtf takes 1.2 s per 10^9 ops (without -ffast-math it takes 3.54 s).

The following algorithm uses 0.87 s per 10^9 ops at the expense of some accuracy: errors can be as large as -7 or +1, although the RMS error is only 0.79:

#include <math.h>
#include <stdint.h>

uint16_t SQRTTAB[65536];

inline uint16_t approxsqrt(uint32_t x) { 
  const uint32_t m1 = 0xff000000;
  const uint32_t m2 = 0x00ff0000;
  if (x&m1) {
    return SQRTTAB[x>>16];
  } else if (x&m2) {
    return SQRTTAB[x>>8]>>4;
  } else {
    return SQRTTAB[x]>>8;
  }
}

The table is constructed using:

void maketable() {
  for (int x=0; x<65536; x++) {
    double v = x/65535.0;
    v = sqrt(v);
    int y = int(v*65535.0+0.999);
    SQRTTAB[x] = y;
  }
}

I found that refining the bisection using further if statements does improve accuracy, but it also slows things down to the point that sqrtf is faster, at least with -ffast-math.