Avoiding Calls to floor()

Posted 2020-05-26 16:25

Question:

I am working on a piece of code where I need to deal with UVs (2D texture coordinates) that are not necessarily in the 0 to 1 range. As an example, sometimes I will get a UV whose u component is 1.2. To handle this, I am implementing wrapping that causes tiling, by doing the following:

u -= floor(u)
v -= floor(v)

Doing this causes 1.2 to become 0.2, which is the desired result. It also handles negative cases, such as -0.4 becoming 0.6.

However, these calls to floor are rather slow. I have profiled my application using Intel VTune and I am spending a huge amount of cycles just doing this floor operation.

Having done some background reading on the issue, I have come up with the following function, which is a bit faster but still leaves a lot to be desired (I am still incurring type-conversion penalties, etc.).

inline int fasterfloor(const float x) { return x > 0 ? (int)x : (int)x - 1; }

I have seen a few tricks accomplished with inline assembly, but nothing that seems to work exactly correctly or to offer any significant speed improvement.

Does anyone know any tricks for handling this kind of scenario?

Answer 1:

So you want a really fast float->int conversion? AFAIK int->float conversion is fast, but on at least MSVC++ a float->int conversion invokes a small helper function, _ftol(), which does some complicated stuff to ensure a standards-compliant conversion is done (truncation toward zero, regardless of the FPU's current rounding mode). If you don't need such strict conversion, you can do some assembly hackery, assuming you're on an x86-compatible CPU.

Here's a function for a fast float-to-int conversion using MSVC++ inline assembly syntax (it should give you the right idea on other compilers too). Note that fistp converts using the FPU's current rounding mode, which is round-to-nearest by default, rather than truncating:

inline int ftoi_fast(float f)
{
    int i;

    __asm
    {
        fld f    ; push the float onto the x87 FPU stack
        fistp i  ; convert using the current rounding mode, store to i, pop
    }

    return i;
}

On 64-bit MSVC++ you'll need an external .asm file, since the 64-bit compiler rejects inline assembly. The function simply uses the raw x87 FPU instructions: load float (fld), then store as integer and pop (fistp). (A word of warning: you can change the rounding mode used here by directly tweaking the FPU control word, but don't do that; you'll break a lot of other things, including MSVC's implementations of sin and cos!)

If you can assume SSE support on the CPU (or there's an easy way to make an SSE-supporting codepath) you can also try:

#include <xmmintrin.h>  // SSE1 intrinsics (the SSE2 header emmintrin.h also works)

inline int ftoi_sse1(float f)
{
    return _mm_cvtt_ss2si(_mm_load_ss(&f));  // SSE1: cvttss2si, float->int with truncation
}

...which is basically the same idea (load the float, then store it as an integer) but using SSE instructions, which are a bit faster. Note that cvttss2si (the extra "t" is for "truncate") always rounds toward zero, regardless of the FPU rounding mode.

One of those should cover the expensive float-to-int case, and any int-to-float conversions should still be cheap. Sorry to be Microsoft-specific here, but this is where I've done similar performance work, and I got big gains this way. If portability or other compilers are an issue you'll have to look at something else, but these functions compile to maybe two instructions taking under 5 clocks, as opposed to a helper function that takes 100+.
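As a portability footnote (my addition, not part of the answer above): C99's lrintf from <math.h> converts using the FPU's current rounding mode and, under optimization, typically compiles down to the same single cvtss2si or fistp instruction:

#include <math.h>

// Sketch: portable fast float->int using the current rounding mode
// (round-to-nearest by default); usually one instruction when optimized.
inline int ftoi_portable(float f)
{
    return (int)lrintf(f);
}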



Answer 2:

Old question, but I came across it and it made me convulse slightly that it hasn't been satisfactorily answered.

TL;DR: **Don't** use inline assembly, intrinsics, or any of the other solutions given here! Instead, compile with fast/unsafe math optimizations ("-ffast-math -funsafe-math-optimizations -fno-math-errno" in g++). The reason floor() is so slow is that it changes global state if the cast would overflow (FLT_MAX does not fit in a scalar integer type of any size), which also makes it impossible to vectorize unless you disable strict IEEE-754 compatibility, which you should probably not rely on anyway. Compiling with these flags disables the problem behavior.
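To illustrate the point (a sketch of my own; the function name is illustrative): with those flags, even the naive loop is free to auto-vectorize:

#include <math.h>

// Sketch: built with e.g.
//   g++ -O2 -march=native -ffast-math -funsafe-math-optimizations -fno-math-errno
// this loop can auto-vectorize, because floorf no longer has to preserve
// errno or strict overflow behavior.
void wrap_uvs(float *u, int n)
{
    for (int i = 0; i < n; ++i)
        u[i] -= floorf(u[i]);
}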

Some remarks:

  1. Inline assembly that operates on scalar registers is not vectorizable, which drastically inhibits performance when compiling with optimizations. It also requires any relevant values currently held in vector registers to be spilled to the stack and reloaded into scalar registers, which defeats the purpose of hand-optimization.

  2. Inline assembly using SSE cvttss2si in the way you've outlined is actually slower on my machine than a simple for loop with compiler optimizations. This is likely because your compiler will allocate registers and avoid pipeline stalls better if you let it vectorize whole blocks of code together. For a short piece of code like this, with few internal dependency chains and almost no chance of register spillage, it has very little chance to do worse than hand-optimized code surrounded by asm().

  3. Inline assembly is unportable, unsupported in Visual Studio 64-bit builds, and insanely hard to read. Intrinsics suffer from the same caveats as well as the ones listed above.

  4. All the other listed ways are simply incorrect, which is arguably worse than being slow, and in each case they give such a marginal performance improvement that it doesn't justify the coarseness of the approach. (int)(x+16.0)-16.0 is so bad I won't even touch it, but your method is also wrong: it gives floor(-1) as -2. It's also a very bad idea to include branches in math code that is so performance-critical the standard library won't do the job for you. So your (still incorrect) way should look more like ((int) x) - (x < 0.0), maybe with an intermediate so you don't have to perform the FPU move twice (see the sketch after this list). A mispredicted branch flushes the pipeline, which can completely negate any increase in performance; also, if math errno is disabled, casting to int is the biggest remaining bottleneck of any floor() implementation. If you really don't care about getting correct values for negative integers, it may be a reasonable approximation, but I wouldn't risk it unless you know your use case very well.

  5. I tried bitwise casting and rounding-via-bitmask, like what Sun's fdlibm-derived code in newlib does in fmodf, but it took a very long time to get right and was several times slower on my machine, even without the relevant compiler optimization flags. Very likely that code was written for some ancient CPU where floating-point operations were comparatively very expensive and there were no vector extensions, let alone vector conversion instructions; this is no longer the case on any common architecture AFAIK. The fast inverse sqrt() routine popularized by Quake 3 has a similar story: most architectures now provide a dedicated instruction for it. One of the biggest pitfalls of micro-optimizations is that they become outdated quickly.
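Picking up the branchless form from remark 4, a minimal sketch (my illustration):

// The comparison yields 0 or 1, so there is no branch to mispredict,
// and the float->int conversion happens only once.
inline int fastfloor_branchless(float x)
{
    int i = (int)x;         // single conversion, reused
    return i - (x < 0.0f);  // still off by one for negative whole numbers
}

(Comparing against the intermediate instead, i.e. i - (x < (float)i), fixes the negative-whole-number case at the cost of one extra int->float conversion.)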



Answer 3:

The operation you want can be expressed using the fmod function (fmodf for floats rather than doubles):

#include <math.h>
u = fmodf(u, 1.0f);

Chances are reasonably good that your compiler will do this in the most efficient way that works.
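One caveat worth adding (my note, not part of the answer above): fmodf preserves the sign of its argument, so fmodf(-0.4f, 1.0f) is -0.4f, not the 0.6f that u -= floor(u) produces. A minimal sketch of a sign-corrected wrap (the function name is illustrative):

#include <math.h>

// Sketch: wrap into [0, 1) with fmodf, shifting negative results up so
// that e.g. -0.4 maps to 0.6, matching u -= floor(u).
float wrap01(float u)
{
    u = fmodf(u, 1.0f);  // result is in (-1, 1), same sign as the input
    if (u < 0.0f)
        u += 1.0f;       // move negatives into [0, 1)
    return u;
}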

Alternately, how concerned are you about last-bit precision? Can you put a lower bound on your negative values, such as knowing that they're never below -16.0? If so, something like this will save you a conditional, which is quite likely to be useful if the branch can't be reliably predicted with your data:

u = (u + 16.0f);  // Does not affect the fractional part, aside from roundoff error.
u -= (int)u;      // Recovers the fractional part, now that u is guaranteed non-negative.

(For that matter, depending on what your data looks like and the processor you're using, if a large fraction of the values are negative but very few are below -16.0, you might find that adding 16.0f before doing your conditional int-cast gives you a speedup because it makes the branch predictable. Or your compiler may be doing that with something other than a conditional branch, in which case it's not useful; it's hard to say without testing and looking at the generated assembly.)



Answer 4:

If you are using Visual C++, check the "Enable Intrinsic Functions" compiler setting (/Oi). If enabled, it should make most math functions faster (including floor). The downside is that the handling of edge cases (like NaN) could be incorrect, but for a game you might not care.
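For reference, a sketch of the equivalent command line (the /fp:fast switch is a further, optional relaxation of strict floating-point semantics; the file name is illustrative):

cl /O2 /Oi /fp:fast texture.cpp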



Answer 5:

Another silly idea that might just work if the range is small...

Extract the exponent from the float using bitwise operations, then use a lookup table to find a mask that wipes unwanted bits from the mantissa. Use this to find the floor (wipe bits below the point) to avoid renormalising issues.
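A minimal sketch of the idea (my illustration; it handles non-negative values only, which is exactly the +ve vs. -ve issue mentioned in the edit below, and it computes the mask directly rather than via a lookup table):

#include <stdint.h>
#include <string.h>

// Sketch: floor a non-negative IEEE-754 float by wiping the mantissa
// bits below the binary point. For negative inputs this truncates
// toward zero, which is not floor.
float floor_positive(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);             // bitwise view of the float
    int exp = (int)((bits >> 23) & 0xFF) - 127; // unbiased exponent
    if (exp < 0)   return 0.0f;                 // 0 <= x < 1: floor is 0
    if (exp >= 23) return x;                    // no fractional bits remain
    bits &= ~((1u << (23 - exp)) - 1u);         // wipe bits below the point
    memcpy(&x, &bits, sizeof x);
    return x;
}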

EDIT: I deleted this as "too silly, plus with a +ve vs. -ve issue". Since it got upvoted anyway, it's undeleted, and I'll leave it to others to decide how silly it is.



Answer 6:

If the range of values that may occur is sufficiently small, perhaps you can binary-search the floor value. For example, if values -2 <= x < 2 can occur...

if (u < 0.0)
{
  if (u < -1.0)
  {
    //  floor is -2
  }
  else
  {
    //  floor is -1
  }
}
else
{
  if (u < 1.0)
  {
    //  floor is 0
  }
  else
  {
    //  floor is 1
  }
}

I make no guarantees about this - I don't know how the cost of the comparisons stacks up against floor - but it may be worth trying.



Answer 7:

What is the maximum input range of your u, v values? If it's a fairly small range, e.g. -5.0 to +5.0, it will be quicker to repeatedly add/subtract 1.0 until you get within range than to call an expensive function such as floor.
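A minimal sketch of that approach (my illustration, assuming the stated small range, so each loop runs at most a few iterations):

// Sketch: wrap u into [0, 1) by repeated add/subtract. Only sensible
// when |u| is known to be small (e.g. within -5.0 to +5.0).
inline float wrap_small_range(float u)
{
    while (u < 0.0f)  u += 1.0f;
    while (u >= 1.0f) u -= 1.0f;
    return u;
}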



Answer 8:

This one doesn't solve the casting cost, but it should be mathematically correct (unlike the version in the question, it special-cases negative values that are already whole numbers):

inline int fasterfloor(const float x) { return x < 0 ? ((int)x == x ? (int)x : (int)x - 1) : (int)x; }


Answer 9:

If you are looping and using u and v as index coordinates, then instead of flooring a float to get the coordinates, keep both a float and an int holding the same value and increment them together. This gives you a corresponding integer to use whenever one is needed.
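A minimal sketch of the idea (my illustration; the unit step is an assumption, but any whole-number step works the same way):

#include <stdio.h>

int main(void)
{
    // Sketch: advance a float coordinate and its integer mirror in
    // lockstep, so no float->int conversion happens inside the loop.
    float u  = 0.0f;
    int   iu = 0;
    for (int i = 0; i < 8; ++i) {
        printf("u = %.1f, floor(u) = %d\n", u, iu);
        u  += 1.0f;  // the float coordinate advances...
        iu += 1;     // ...and the int mirror stays equal to floor(u)
    }
    return 0;
}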