How can I convert an integer to a half precision float (which is to be stored into an array unsigned char[2])? The range of the input int will be 1-65535. Precision is really not a concern.
I am doing something similar to convert a 16-bit int into an unsigned char[2], but I understand there is no half precision float C++ datatype. Example of this below:
int16_t position16int = (int16_t)data;
memcpy(&dataArray, &position16int, 2);
It's very straightforward; all the info you need is on Wikipedia.
Sample implementation:
#include <stdio.h>

unsigned int2hfloat(int x)
{
    unsigned sign = x < 0;
    unsigned absx = ((unsigned)x ^ -sign) + sign; // safe abs(x)
    unsigned tmp = absx, manbits = 0;
    int exp = 0, truncated = 0;

    // calculate the number of bits needed for the mantissa
    while (tmp)
    {
        tmp >>= 1;
        manbits++;
    }

    // half-precision floats have 11 bits in the mantissa (10 stored + 1 implicit).
    // truncate the excess or insert the lacking 0s until there are 11.
    if (manbits)
    {
        exp = 10; // exp bias because 1.0 is at bit position 10
        while (manbits > 11)
        {
            truncated |= absx & 1;
            absx >>= 1;
            manbits--;
            exp++;
        }
        while (manbits < 11)
        {
            absx <<= 1;
            manbits++;
            exp--;
        }
    }

    // overflow only if |x| > 65504: either the exponent is out of range, or
    // the result is the largest finite half (exp 15, mantissa 0x7FF) and
    // nonzero bits were truncated. Testing exp + truncated > 15 alone would
    // also send inexact values such as 32769 to infinity.
    if (exp > 15 || (exp == 15 && absx == 0x7FF && truncated))
    {
        // absx was too big, force it to +/- infinity
        exp = 31; // special infinity value
        absx = 0;
    }
    else if (manbits)
    {
        // normal case, absx > 0
        exp += 15; // bias the exponent
    }

    return (sign << 15) | ((unsigned)exp << 10) | (absx & ((1u<<10)-1));
}

int main(void)
{
    printf(" 0: 0x%04X\n", int2hfloat(0));
    printf("-1: 0x%04X\n", int2hfloat(-1));
    printf("+1: 0x%04X\n", int2hfloat(+1));
    printf("-2: 0x%04X\n", int2hfloat(-2));
    printf("+2: 0x%04X\n", int2hfloat(+2));
    printf("-3: 0x%04X\n", int2hfloat(-3));
    printf("+3: 0x%04X\n", int2hfloat(+3));
    printf("-2047: 0x%04X\n", int2hfloat(-2047));
    printf("+2047: 0x%04X\n", int2hfloat(+2047));
    printf("-2048: 0x%04X\n", int2hfloat(-2048));
    printf("+2048: 0x%04X\n", int2hfloat(+2048));
    printf("-2049: 0x%04X\n", int2hfloat(-2049)); // first inexact integer
    printf("+2049: 0x%04X\n", int2hfloat(+2049));
    printf("-2050: 0x%04X\n", int2hfloat(-2050));
    printf("+2050: 0x%04X\n", int2hfloat(+2050));
    printf("-32752: 0x%04X\n", int2hfloat(-32752));
    printf("+32752: 0x%04X\n", int2hfloat(+32752));
    printf("-32768: 0x%04X\n", int2hfloat(-32768));
    printf("+32768: 0x%04X\n", int2hfloat(+32768));
    printf("-65504: 0x%04X\n", int2hfloat(-65504)); // legal maximum
    printf("+65504: 0x%04X\n", int2hfloat(+65504));
    printf("-65505: 0x%04X\n", int2hfloat(-65505)); // infinity from here on
    printf("+65505: 0x%04X\n", int2hfloat(+65505));
    printf("-65535: 0x%04X\n", int2hfloat(-65535));
    printf("+65535: 0x%04X\n", int2hfloat(+65535));
    return 0;
}
Output (ideone):
0: 0x0000
-1: 0xBC00
+1: 0x3C00
-2: 0xC000
+2: 0x4000
-3: 0xC200
+3: 0x4200
-2047: 0xE7FF
+2047: 0x67FF
-2048: 0xE800
+2048: 0x6800
-2049: 0xE800
+2049: 0x6800
-2050: 0xE801
+2050: 0x6801
-32752: 0xF7FF
+32752: 0x77FF
-32768: 0xF800
+32768: 0x7800
-65504: 0xFBFF
+65504: 0x7BFF
-65505: 0xFC00
+65505: 0x7C00
-65535: 0xFC00
+65535: 0x7C00
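To sanity-check bit patterns like the ones above, it can help to decode them back into a number. The following is only a sketch: hfloat2double is a hypothetical helper name, not part of the answer, and it assumes the standard binary16 layout (1 sign bit, 5 exponent bits with bias 15, 10 stored mantissa bits).

```c
#include <assert.h>
#include <math.h>   /* only for the NAN and INFINITY macros */

/* Hypothetical helper: decode a binary16 bit pattern, held in the low
 * 16 bits of h, into a double. Assumes the IEEE 754 half layout. */
double hfloat2double(unsigned h)
{
    unsigned sign, exp, man;
    double value;

    assert(h <= 0xFFFFu);            /* only 16 bits are meaningful */
    sign = h >> 15;
    exp  = (h >> 10) & 0x1F;         /* 5-bit biased exponent */
    man  = h & 0x3FF;                /* 10 stored mantissa bits */

    if (exp == 31)                   /* all ones: infinity or NaN */
        value = man ? NAN : INFINITY;
    else if (exp == 0)               /* zero or subnormal: man * 2^-24 */
        value = (double)man / 16777216.0;
    else                             /* normal: (1024 + man) * 2^(exp - 25) */
        value = (double)(man | 0x400) * (double)(1u << exp) / 33554432.0;

    return sign ? -value : value;
}
```

For example, hfloat2double(0x6801) gives 2050.0 and hfloat2double(0x7BFF) gives 65504.0, matching the table above.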
I previously asked how to convert a 32-bit float to a 16-bit float:
Float32 to Float16
From that, you could simply convert the int to a 32-bit float and then use the answers to that question to create a 16-bit float. I would suggest this is probably much easier than going from int directly to 16-bit float. By converting to a 32-bit float you have effectively done most of the hard work, and then you just need to shift a few bits around.
Edit: Looking at Alexey's excellent answer, I think it's likely that using a hardware int-to-float conversion and then bit shifting the result would be a fair bit faster than his method. It might be worth profiling both methods and comparing them.
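As a rough illustration of the "shift a few bits around" idea, here is a minimal sketch. It is not the code from the linked question: float_to_half_trunc and int_to_half_via_float are hypothetical names, the code assumes float is IEEE 754 binary32, it truncates instead of rounding, and it flushes anything below the normal half range to zero.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: narrow a binary32 float to binary16 bits by
 * truncation. Too-small values become zero; too-large values,
 * infinities and NaNs all become infinity. */
uint16_t float_to_half_trunc(float f)
{
    uint32_t bits, man;
    uint16_t sign;
    int32_t exp;

    assert(sizeof(float) == sizeof(uint32_t));  /* assumption: 32-bit float */
    memcpy(&bits, &f, sizeof bits);             /* well-defined type pun */

    sign = (uint16_t)((bits >> 16) & 0x8000u);
    exp  = (int32_t)((bits >> 23) & 0xFF) - 127 + 15;  /* rebias 127 -> 15 */
    man  = bits & 0x7FFFFFu;

    if (exp <= 0)                    /* zero, or below half's normal range */
        return sign;
    if (exp >= 31)                   /* overflow, infinity, or NaN */
        return (uint16_t)(sign | 0x7C00u);

    /* keep the top 10 of the 23 stored mantissa bits */
    return (uint16_t)(sign | ((uint32_t)exp << 10) | (man >> 13));
}

/* int -> float32 is handled by the compiler/FPU; only the narrowing is manual. */
uint16_t int_to_half_via_float(int x)
{
    return float_to_half_trunc((float)x);
}
```

Note one difference in behavior: because this sketch truncates, inputs 65505 to 65535 come out as 0x7BFF (65504) rather than infinity.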
Following @kbok's comment on the question, I used the first part of this answer to get the half float and then copied it into the array:
uint16_t position16float = float_to_half_branch(data);
memcpy(&dataArray, &position16float, 2);
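One caveat with memcpy is that it copies the two bytes in the host's byte order. If the array is later sent to a different machine or written to a file, it may be safer to fix the order explicitly. A minimal sketch, assuming least-significant-byte-first is wanted (store_half_le is a hypothetical name):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: write the 16 half-float bits into out[0..1],
 * least significant byte first, regardless of host endianness. */
void store_half_le(unsigned char out[2], uint16_t h)
{
    assert(out != 0);
    out[0] = (unsigned char)(h & 0xFF);  /* low byte */
    out[1] = (unsigned char)(h >> 8);    /* high byte */
}
```

On a little-endian host this produces the same bytes as the memcpy above; swap the two assignments if the receiver expects big-endian.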
If you are targeting supported hardware, you can use intrinsics:
https://software.intel.com/en-us/articles/performance-benefits-of-half-precision-floats
https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-7679FF37-257B-4F90-8668-5B3AA62587AD.htm
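As a sketch of what that could look like on x86 with GCC or Clang: the F16C extension provides the scalar intrinsic _cvtss_sh in immintrin.h, available when compiling with -mf16c (which defines __F16C__). The wrapper name int_to_half_f16c and the software fallback are assumptions added for illustration, and the function is limited to 0..65535, which covers the question's input range.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__F16C__)
#include <immintrin.h>
#endif

/* Hypothetical wrapper: int -> binary16 bits, rounding toward zero.
 * Restricted to 0..65535 so both code paths give identical results. */
uint16_t int_to_half_f16c(int x)
{
    float f;

    assert(x >= 0 && x <= 65535);
    f = (float)x;                    /* exact: |x| < 2^24 */

#if defined(__F16C__)
    /* hardware path: F16C scalar float -> half conversion */
    return (uint16_t)_cvtss_sh(f, _MM_FROUND_TO_ZERO);
#else
    /* portable fallback, assuming IEEE 754 binary32 floats */
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);
        if (bits == 0)
            return 0;
        /* rebias the exponent (127 -> 15) and keep the top 10 mantissa bits */
        return (uint16_t)((((bits >> 23) - 112u) << 10)
                          | ((bits & 0x7FFFFFu) >> 13));
    }
#endif
}
```

With round-toward-zero, 65535 becomes 0x7BFF (65504) instead of infinity; pick a different rounding constant if round-to-nearest is preferred.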