I have to write a program that will simulate floating point multiplication. For this program, we assume that a single precision floating point number is stored in unsigned long a. I have to multiply the number stored in a by 2 using only the following operators: << >> | & ~ ^
I understand the functions of these operators, but I'm confused on the logic of how to go about implementing this. Any help would be greatly appreciated.
have to multiply the number stored in a by 2 using only the following operators: << >> | & ~ ^
Since we are given an unsigned long to simulate a single-precision float value, we're supposed to handle every value that could be simulated. ref
First, let us assume the float is encoded as binary32 and that unsigned is 32-bit. C does not require either of these.
Next, isolate the exponent to deal with the float sub-groups: sub-normal, normal, infinity and NaN.
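As a rough illustration of the exponent test described above, here is a minimal sketch that sorts a bit pattern into those sub-groups. It assumes the binary32 layout; the enum and the function name `classify_bits` are hypothetical and only for illustration. It uses `if` and `!` for control flow, while the bit manipulation itself sticks to `&` and `~`.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. */
enum flt_class { FLT_ZERO_OR_SUBNORMAL, FLT_NORMAL, FLT_INF_OR_NAN };

/* Classify a binary32 bit pattern by its 8-bit exponent field (bits 23..30). */
enum flt_class classify_bits(uint32_t x) {
    uint32_t expo = x & 0x7F800000u;       /* isolate the exponent field */
    if (!expo) {
        return FLT_ZERO_OR_SUBNORMAL;      /* all exponent bits are 0 */
    }
    if (!(~expo & 0x7F800000u)) {
        return FLT_INF_OR_NAN;             /* all exponent bits are 1 */
    }
    return FLT_NORMAL;                     /* anything in between */
}
```

The sign bit plays no part in the classification, so -0.0 (0x80000000) lands in the same group as +0.0.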
Below is some lightly tested code. I'll review it later; for now, consider it a pseudo-code template.
#define FLT_SIGN_MASK 0x80000000u
#define FLT_MANT_MASK 0x007FFFFFu
#define FLT_EXPO_MASK 0x7F800000u
#define FLT_EXPO_LESSTHAN_MAXVALUE(e) ((~(e)) & FLT_EXPO_MASK)
#define FLT_EXPO_MAX FLT_EXPO_MASK
#define FLT_EXPO_LSBit 0x00800000u

// Add 1 to the exponent field by propagating a carry with ^, & and <<
unsigned increment_expo(unsigned a) {
    unsigned carry = FLT_EXPO_LSBit;
    do {
        unsigned sum = a ^ carry;
        carry = (a & carry) << 1;
        a = sum;
    } while (carry);
    return a;
}

unsigned float_x2_simulated(unsigned x) {
    unsigned expo = x & FLT_EXPO_MASK;
    if (expo) { // x is a normal, infinity or NaN
        if (FLT_EXPO_LESSTHAN_MAXVALUE(expo)) { // x is a normal
            expo = increment_expo(expo); // Double the number
            if (FLT_EXPO_LESSTHAN_MAXVALUE(expo)) { // no overflow
                return (x & (FLT_SIGN_MASK | FLT_MANT_MASK)) | expo;
            }
            // overflow: return infinity with the sign of x
            return (x & FLT_SIGN_MASK) | FLT_EXPO_MAX;
        }
        // x is an infinity or NaN
        return x;
    }
    // x is a sub-normal (or zero)
    unsigned m = (x & FLT_MANT_MASK) << 1; // Double the value
    if (m & FLT_EXPO_LSBit) { // the carry out of the mantissa lands in bit 23
        // Doubling caused sub-normal to become normal
        // Special code not needed here and the "carry" becomes the 1 exponent.
    }
    return (x & FLT_SIGN_MASK) | m;
}
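The sub-normal branch above relies on the mantissa's top bit shifting straight into the exponent field. A minimal sketch of just that step, assuming the binary32 layout (the helper name `double_subnormal` is hypothetical and only meant for inputs whose exponent field is zero):

```c
#include <stdint.h>

/* Hypothetical helper: double a zero or sub-normal binary32 bit pattern. */
uint32_t double_subnormal(uint32_t x) {
    uint32_t sign = x & 0x80000000u;        /* keep the sign as it is */
    uint32_t m = (x & 0x007FFFFFu) << 1;    /* bit 22 may move into bit 23 */
    return sign | m;                        /* bit 23 set means exponent field 1 */
}
```

For example, 0x00400000 (2^-127, the largest power of two that is still sub-normal) becomes 0x00800000 (2^-126, the smallest normal number) with no extra handling, because a sub-normal and a normal with exponent field 1 share the same scale.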
Here is my code that uses bitwise operators.
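This code multiplies a single-precision floating-point number by 2 by increasing its exponent by 1, using only bitwise operators to do the increment. It also keeps track of bits 31 and 30: bit 31 is the sign of the number, and bit 30 is the most significant bit of the biased exponent field (bits 23-30), which this code treats like an exponent sign.
It doesn't pretend to cover all aspects of floating-point handling.
The code reports an overflow if bits 30 and/or 31 are changed by the increment. Note that bit 30 also changes when the biased exponent goes from 127 to 128 (for example when 1.0 becomes 2.0), so this check is conservative.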
#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>

int main()
{
    float f = -23.45F;
    uint32_t *i=(uint32_t *)(&f);
    uint32_t sgn;
    uint32_t c,sc;

    printf("%08X %f\n",*i,f);
    sgn = *i & (0xC0000000); // copies bits 31 and 30
    c = *i & (1U<<23);       // carry out of bit 23, if it was already set
    *i ^= (1U<<23);          // add 1 to the lowest exponent bit
    while(c)                 // propagate the carry into the higher bits
    {
        sc = c << 1;
        c = *i & sc;
        *i ^= sc;
    }
    // the inner parentheses are required: != binds tighter than &
    if (sgn != (*i & (0xC0000000))) {
        puts("Exponent overflow");
    }
    printf("%08X %f\n",*i,f);
    return 0;
}
See also: Wikipedia Single-precision floating point
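The pointer cast above reads a float through a uint32_t pointer, which C's aliasing rules do not guarantee to work. As a hedged alternative, here is a minimal sketch of the same exponent increment that copies the bits with memcpy instead. The function names are hypothetical, and it assumes float is a 32-bit binary32 value.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: copy the object representation in and out. */
uint32_t float_to_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

float bits_to_float(uint32_t u) {
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Add 1 to the exponent field using only ^, & and <<.
 * No overflow or special-value handling: normal inputs only. */
uint32_t bump_exponent(uint32_t bits) {
    uint32_t carry = 1u << 23;              /* lowest exponent bit */
    while (carry) {
        uint32_t next = (bits & carry) << 1; /* carry into the next bit up */
        bits ^= carry;                       /* sum bit without the carry */
        carry = next;
    }
    return bits;
}
```

With these, `bits_to_float(bump_exponent(float_to_bits(-23.45f)))` gives the doubled value for a normal input such as -23.45f.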
This is simple code using the + operator. It doesn't pretend to cover all aspects of floating-point handling. This solution shows that incrementing by 1 the exponent of a single-precision floating-point number gives you multiplication by 2. The exponent field occupies bits 23-30; this code treats bit 30, its most significant bit, like an exponent sign.
This code uses bitwise operators only to preserve bits 31 and 30 and to avoid a possible exponent overflow.
#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>

int main()
{
    float f = 23.45F;
    uint32_t *i=(uint32_t *)(&f);
    uint32_t app;

    printf("%08X %f\n",*i,f);
    app = *i & (0xC0000000); // copies bits 31 and 30
    *i += (1U<<23);          // add 1 to the exponent field
    *i &= ~(0xC0000000);     // clear bits 31 and 30
    *i |= app;               // set original bits 31 and 30
    printf("%08X %f\n",*i,f);
    return 0;
}
See also: Wikipedia Single-precision floating-point
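To see why adding 1<<23 doubles a normal number, it can help to pull the three fields apart. The following is a minimal sketch assuming the IEEE-754 binary32 layout (1 sign bit, 8 exponent bits biased by 127, 23 mantissa bits); the helper names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helpers, assuming the binary32 field layout. */
uint32_t sign_bit(uint32_t bits)       { return bits >> 31; }
uint32_t exponent_field(uint32_t bits) { return (bits >> 23) & 0xFFu; }
uint32_t mantissa_field(uint32_t bits) { return bits & 0x007FFFFFu; }

/* 23.45f is stored as 0x41BB999A:
 *   sign 0, exponent field 131 (2^(131-127) = 16), mantissa 0x3B999A.
 * Adding 1u << 23 gives 0x423B999A:
 *   sign 0, exponent field 132 (2^5 = 32), same mantissa, i.e. 46.9f. */
```

Only the exponent field changes, so the value is scaled by exactly 2 as long as the field neither starts at 0 nor reaches 255.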
Function fpmul_by_2() below implements the desired functionality, under the assumptions that 'unsigned long' is a 32-bit integer type and 'float' is a 32-bit floating-point type mapped to IEEE-754 'binary32'. It is further assumed that we are to mimic IEEE-754 multiplication with exceptions disabled, producing the masked response prescribed by the standard.
Two helper functions are used that implement 32-bit integer addition and comparison for equality, respectively. The addition is based on the definition of sum and carry bits in binary addition (see this previous question for a detailed explanation), while equality comparison makes use of the fact that (a^b) == 0 iff a == b.
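As a small worked example of the sum-and-carry idea, here is a sketch with a hand trace in the comment. It uses uint32_t rather than unsigned long so the width is fixed, and the name `add32` is hypothetical.

```c
#include <stdint.h>

/* XOR gives the sum bits without carries; AND followed by << 1 gives the
 * carries, which are then added in again until none remain.
 *
 * Trace for add32(3, 1):
 *   a = 011, carry = 001  ->  sum = 010, carry = 010
 *   a = 010, carry = 010  ->  sum = 000, carry = 100
 *   a = 000, carry = 100  ->  sum = 100, carry = 000   (result 4)
 */
uint32_t add32(uint32_t a, uint32_t b) {
    uint32_t carry = b;
    while (carry) {
        uint32_t sum = a ^ carry;     /* add without carrying */
        carry = (a & carry) << 1;     /* positions that generate a carry */
        a = sum;
    }
    return a;
}
```

A carry shifted out of bit 31 is simply dropped, so the result wraps modulo 2^32 and the loop always terminates.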
The processing of the floating-point argument needs to broadly distinguish three classes of operands: Denormals and zeros, normals, infinity and NaNs. Doubling of normals is accomplished by bumping the exponent, since we operate on a binary floating-point format. Overflow can occur, in which case infinity must be returned. Infinity and NaNs are returned unchanged, except that SNaNs are converted to QNaNs, which is the IEEE-754 prescribed masked response. Denormals and zeros are handled by literally doubling the significand. The handling of zeros, subnormals, and infinities may destroy the sign bit, so the sign bit of the argument is forced on the result.
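To make the SNaN-to-QNaN step concrete, here is a minimal sketch. It assumes the common encoding in which the top mantissa bit (bit 22) marks a quiet NaN, as recommended by IEEE 754-2008. The macro and function names are hypothetical, and the predicates use == and && for readability rather than sticking to the restricted operator set.

```c
#include <stdint.h>

#define F32_EXPO_MASK 0x7F800000u
#define F32_MANT_MASK 0x007FFFFFu
#define F32_QUIET_BIT 0x00400000u   /* assumed: set means quiet NaN */

/* NaN: exponent field all ones and a non-zero mantissa. */
int is_nan_bits(uint32_t x) {
    return (x & F32_EXPO_MASK) == F32_EXPO_MASK && (x & F32_MANT_MASK) != 0;
}

/* Signaling NaN: a NaN whose quiet bit is clear. */
int is_snan_bits(uint32_t x) {
    return is_nan_bits(x) && (x & F32_QUIET_BIT) == 0;
}

/* Quieten a NaN by setting the quiet bit; sign and payload are kept. */
uint32_t quieten(uint32_t x) {
    return x | F32_QUIET_BIT;
}
```

For instance, the signaling NaN 0x7F800001 becomes 0x7FC00001, while infinity (0x7F800000) is not a NaN and is left alone by the caller.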
The test framework included below tests fpmul_by_2() exhaustively, which will only take a couple of minutes on a modern PC. I used the Intel compiler on an x64 platform running Windows.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// assumptions:
// 'unsigned long' is a 32-bit type
// 'float' maps to IEEE-754 'binary32'. Exceptions are disabled

// add using definition of sum and carry bits in binary addition
unsigned long add (unsigned long a, unsigned long b)
{
    unsigned long sum, carry;
    carry = b;
    do {
        sum = a ^ carry;
        carry = (a & carry) << 1;
        a = sum;
    } while (carry);
    return sum;
}

// return 1 if a == b, else 0
int eq (unsigned long a, unsigned long b)
{
    unsigned long t = a ^ b;
    // OR all bits into lsb
    t = t | (t >> 16);
    t = t | (t >> 8);
    t = t | (t >> 4);
    t = t | (t >> 2);
    t = t | (t >> 1);
    return ~t & 1;
}

// compute 2.0f * a
unsigned long fpmul_by_2 (unsigned long a)
{
    unsigned long expo_mask = 0x7f800000UL;
    unsigned long expo_lsb  = 0x00800000UL;
    unsigned long qnan_mark = 0x00400000UL;
    unsigned long sign_mask = 0x80000000UL;
    unsigned long zero      = 0x00000000UL;
    unsigned long r;

    if (eq (a & expo_mask, zero)) { // subnormal or zero
        r = a << 1; // double significand
    } else if (eq (a & expo_mask, expo_mask)) { // INF, NaNs
        if (eq (a & ~sign_mask, expo_mask)) { // INF
            r = a;
        } else { // NaN
            r = a | qnan_mark; // quieten SNaNs
        }
    } else { // normal
        r = add (a, expo_lsb); // double by bumping exponent
        if (eq (r & expo_mask, expo_mask)) { // overflow
            r = expo_mask;
        }
    }
    return r | (a & sign_mask); // result has sign of argument
}

float uint_as_float (unsigned long a)
{
    float r;
    memcpy (&r, &a, sizeof r);
    return r;
}

unsigned long float_as_uint (float a)
{
    unsigned long r;
    memcpy (&r, &a, sizeof r);
    return r;
}

int main (void)
{
    unsigned long res, ref, a = 0;
    do {
        res = fpmul_by_2 (a);
        ref = float_as_uint (2.0f * uint_as_float (a));
        if (res != ref) {
            printf ("error: a=%08lx res=%08lx ref=%08lx\n", a, res, ref);
            return EXIT_FAILURE;
        }
        a++;
    } while (a);
    printf ("test passed\n");
    return EXIT_SUCCESS;
}