This question already has an answer here:
-
How to alter a float by its smallest increment (or close to it)?
7 answers
A float
(a.k.a. single) value is a 4-byte value, and supposed to represent any real-valued number. Because of the way it is formatted and the finite number of bytes it is made off, there is a minimum value and a maximum value it can represent, and it has a finite precision, depending on it's own value.
I would like to know if there is a way to get the closest possible value above or below some reference value, given the finite precision of a float. With integers, this is trivial: one simply adds or subtracts 1. But with a float
, you can't simply add or subtract the minimum float value and expect it to be different from your original value. I.e.
float FindNearestSmaller (const float a)
{
return a - FLT_MIN; /* This doesn't necessarily work */
}
In fact, the above will almost never work. In the above case, the return will generally still equal a
, as the FLT_MIN
is far beyond the precision of a
. You can easily try this out for yourself: it works for e.g. 0.0f
, or for very small numbers of order FLT_MIN
, but not for anything between 0 and 100.
So how would you get the value that is closest but smaller or larger than a
, given floating point precision?
Note: Though i am mainly interested in a C/C++ answer, I assume the answer will be applicable for most programming languages.
The standard way to find a floating-point value's neighbors is the function nextafter
for double
and nextafterf
for float
. The second argument gives the direction. Remember that infinities are legal values in IEEE 754 floating-point, so you can very well call nextafter(x, +1.0/0.0)
to get the value immediately above x
, and this will work even for DBL_MAX
(whereas if you wrote nextafter(x, DBL_MAX)
, it would return DBL_MAX
when applied for x == DBL_MAX
).
Two non-standard ways that are sometimes useful are:
access the representation of the float
/double
as an unsigned integer of the same size, and increment or decrement this integer. The floating-point format was carefully designed so that for positive floats, and respectively for negative floats, the bits of the representation, seen as an integer, evolve monotonously with the represented float.
change the rounding mode to upward, and add the smallest positive floating-point number. The smallest positive floating-point number is also the smallest increment that there can be between two floats, so this will never skip any float. The smallest positive floating-point number is FLT_MIN * FLT_EPSILON
.
For the sake of completeness, I will add that even without changing the rounding mode from its “to nearest” default, multiplying a float by (1.0f + FLT_EPSILON)
produces a number that is either the immediate neighbor away from zero, or the neighbor after that. It is probably the cheapest if you already know the sign of the float you wish to increase/decrease and you don't mind that it sometimes does not produce the immediate neighbor. Functions nextafter
and nextafterf
are specified in such a way that a correct implementation on the x86 must test for a number of special values and FPU states, and is thus rather costly for what it does.
To go towards zero, multiply by 1.0f - FLT_EPSILON
.
This doesn't work for 0.0f
, obviously, and generally for the smaller denormalized numbers.
The values for which multiplying by 1.0f + FLT_EPSILON
advance by 2 ULPS are just below a power of two, specifically in the interval [0.75 * 2p … 2p). If you don't mind doing a multiplication and an addition, x + (x * (FLT_EPSILON * 0.74))
should work for all normal numbers (but still not for zero nor for all the small denormal numbers).
Look at the "nextafter" function, which is part of Standard C (and probably C++, but I didn't check).
I tried it out on my machine. And all three approaches:
1. adding with 1 and memcopying
2. adding FLT_EPSILON
3. multiplying by (1.0f + FLT_EPSILON)
seems to give the same answer.
see the result here
bash-3.2$ cc float_test.c -o float_test; ./float_test 1.023456 10
Original num: 1.023456
int added = 1.023456 01-eps added = 1.023456 mult by 01*(eps+1) = 1.023456
int added = 1.023456 02-eps added = 1.023456 mult by 02*(eps+1) = 1.023456
int added = 1.023456 03-eps added = 1.023456 mult by 03*(eps+1) = 1.023456
int added = 1.023456 04-eps added = 1.023456 mult by 04*(eps+1) = 1.023456
int added = 1.023457 05-eps added = 1.023457 mult by 05*(eps+1) = 1.023457
int added = 1.023457 06-eps added = 1.023457 mult by 06*(eps+1) = 1.023457
int added = 1.023457 07-eps added = 1.023457 mult by 07*(eps+1) = 1.023457
int added = 1.023457 08-eps added = 1.023457 mult by 08*(eps+1) = 1.023457
int added = 1.023457 09-eps added = 1.023457 mult by 09*(eps+1) = 1.023457
int added = 1.023457 10-eps added = 1.023457 mult by 10*(eps+1) = 1.023457
Code
#include <float.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>
int main(int argc, char *argv[])
{
if(argc != 3) {
printf("Usage: <binary> <floating_pt_num> <num_iter>\n");
exit(0);
}
float f = atof(argv[1]);
int count = atoi(argv[2]);
assert(count > 0);
int i;
int num;
float num_float;
printf("Original num: %f\n", f);
for(i=1; i<=count; i++) {
memcpy(&num, &f, 4);
num += i;
memcpy(&num_float, &num, 4);
printf("int added = %f \t%02d-eps added = %f \tmult by %2d*(eps+1) = %f\n", num_float, i, f + i*FLT_EPSILON, i, f*(1.0f + i*FLT_EPSILON));
}
return 0;
}