
IEEE 754-2008 contains a half precision that is only 16 bits wide. The leftmost


Question

IEEE 754-2008 contains a half precision that is only 16 bits wide. The leftmost bit is still the sign bit, the exponent is 5 bits wide and has a bias of 15, and the mantissa is 10 bits long. A hidden 1 is assumed. Write down the bit pattern to represent -1.5625 x 10^-2 assuming a version of this format. Calculate the sum of 2.6125 x 10^2 and 4.150390625 x 10^-1 by hand, assuming both numbers are stored in the 16-bit half precision described above. Assume 1 guard, 1 round bit, and 1 sticky bit, and round to the nearest even. Show all the steps. Write down the bit pattern in the fraction of value 1/6 assuming a floating point format that uses binary numbers in the fraction. Assume there are 24 bits, and you do not need to normalize. Is this representation exact?

Explanation / Answer

Part 1

We want to convert the number -1.5625 x 10^-2 into the given half precision format.

The first thing to do is to get rid of the decimal exponent, so we get -0.015625.

Now let's convert this into binary using negative powers of two. The first negative power is 2^-1 = 0.5, which is too large. The next is 2^-2 = 0.25, also too large, and the same holds for 2^-3, 2^-4 and 2^-5. Finally 2^-6 = 0.015625 fits; subtracting it from 0.015625 leaves 0. Therefore

-0.015625 (decimal) = -0.000001 (binary)
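The "does this power fit?" procedure above can be sketched in a few lines of Python. This is only an illustration: the function name `fraction_to_binary` and the `max_bits` cut-off are made up for this example, and exact rational arithmetic is used so that no rounding sneaks in.

```python
from fractions import Fraction

def fraction_to_binary(value, max_bits=24):
    """Greedy conversion of 0 <= value < 1 into a binary fraction string.

    Hypothetical helper: tries 2^-1, 2^-2, ... in turn and subtracts
    each power that fits, exactly as in the hand calculation.
    Returns the bit string and whatever remainder is left over.
    """
    remainder = Fraction(value)
    bits = []
    for k in range(1, max_bits + 1):
        power = Fraction(1, 2 ** k)      # 2^-k
        if power <= remainder:           # this power "fits"
            bits.append("1")
            remainder -= power
        else:
            bits.append("0")
        if remainder == 0:               # nothing left to represent
            break
    return "0." + "".join(bits), remainder

bits, leftover = fraction_to_binary("0.015625")
print(bits, leftover)  # 0.000001 0
```

A leftover of 0 means the value is represented exactly by the bits produced.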

Let's normalise now. First we bring the first '1' to the left of the '.' by factoring out a power of two:

-0.000001 = -1.0 x 2^-6

Next we drop the '1' before the '.' (it is the hidden bit) and add the bias of 15 to the exponent, -6 + 15 = 9, which leaves a 10-bit fraction of all zeros:

.0000000000 x 2^9

We convert the biased exponent to binary as well, padded to 5 bits:

.0000000000 x 2^01001

Finally we need to set the sign bit to '1' because the number is negative, so the result is:

1 01001 0000000000

where the first bit is the sign, the next 5 bits are the exponent, and the remaining 10 are the mantissa.
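One way to double-check this bit pattern is Python's standard `struct` module, whose `e` format code packs a float as IEEE 754 binary16. The helper name `half_fields` is made up for this sketch; it just splits the 16 bits into the three fields used above.

```python
import struct

def half_fields(x):
    """Return (sign, exponent, fraction) bit strings of x stored as binary16.

    Hypothetical helper: packs x as a half precision float, reads the
    same two bytes back as an unsigned 16-bit integer, then slices the
    bit string into 1 + 5 + 10 bits.
    """
    (raw,) = struct.unpack(">H", struct.pack(">e", x))
    bits = f"{raw:016b}"
    return bits[0], bits[1:6], bits[6:]

print(" ".join(half_fields(-1.5625e-2)))  # 1 01001 0000000000
```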

The range and accuracy of this 16 bit format are lower than those of the single precision IEEE 754 format, because we have fewer bits for the exponent and the mantissa.

Now to the next exercise.

First let's convert the numbers 2.6125 x10^2 and 4.150390625 x 10^-1 to the above format.

2.6125 x10^2 in binary is 100000101.01 = 1.0000010101 x 2^8
4.150390625 x 10^-1 in binary is 0.0110101001 = 1.10101001 x 2^-2

Next we convert these two numbers in the 16 bit format described above so we get:

0 10111 0000010101
0 01101 1010100100

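In order to add them we first need to shift the second number 10 places to the right, for the exponents must match:

1.0000010101 x 2^8 +
1.1010100100 x 2^-2
=
1.0000010101           x 2^8 +
0.00000000011010100100 x 2^8

Note however that, since we have only 10 bits for the mantissa, only the first 10 fraction bits of the second number are kept. The next two bits shifted out become the guard bit (1) and the round bit (0), and the sticky bit is the OR of all the remaining bits (10100100), which is 1:

=
1.0000010101 000 x 2^8 +
0.0000000001 101 x 2^8
-------------------------
1.0000010110 101 x 2^8

Now we round the resulting number to 10 bits. The guard, round and sticky bits are 101, which is more than half a unit in the last place (100), so we round up:

1.0000010111 x 2^8

which in 16 bit format is:

0 10111 0000010111

This is 261.75 in decimal; the exact sum, 261.6650390625, is not representable with 10 mantissa bits.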
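The alignment and rounding steps can also be traced with integer arithmetic. The sketch below is an illustration, not a full IEEE 754 adder: the function name `add_half_grs` is made up, and it assumes two positive normal numbers, the larger exponent first, with no overflow or subnormal handling. Significands are passed as 11-bit integers with the hidden 1 included.

```python
def add_half_grs(sig_a, exp_a, sig_b, exp_b):
    """Add two positive binary16 values using guard, round and sticky bits.

    Hypothetical helper. sig_* are 11-bit significands (hidden 1
    included), exp_* are unbiased exponents, and exp_a >= exp_b.
    Returns (rounded significand, exponent, GRS bits before rounding).
    """
    shift = exp_a - exp_b
    # Append three low bits to each significand: guard, round, sticky.
    a = sig_a << 3
    b_wide = sig_b << 3
    b = b_wide >> shift
    if b_wide & ((1 << shift) - 1):      # any 1 shifted out sets sticky
        b |= 1
    total = a + b
    exp = exp_a
    if total >> 14:                      # carry out of the 11-bit significand
        total = (total >> 1) | (total & 1)
        exp += 1
    grs = total & 0b111
    sig = total >> 3
    # Round to nearest, ties to even.
    if grs > 0b100 or (grs == 0b100 and sig & 1):
        sig += 1
        if sig >> 11:                    # rounding carried out of the significand
            sig >>= 1
            exp += 1
    return sig, exp, grs

# 1.0000010101 x 2^8  +  1.1010100100 x 2^-2
sig, exp, grs = add_half_grs(0b10000010101, 8, 0b11010100100, -2)
print(f"GRS = {grs:03b}")                          # GRS = 101
print(f"0 {exp + 15:05b} {sig & 0x3FF:010b}")      # 0 10111 0000010111
print(sig * 2.0 ** (exp - 10))                     # 261.75
```

Keeping everything in integers makes the guard, round and sticky bits visible as the three lowest bits of `total`, which is the point of the exercise.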

Part 2

We want to write 1/6 = 0.1666… as a binary fraction. The first negative power (2^-1 = 0.5) is too large to fit. The next one is 2^-2 = 0.25 and it also doesn't fit. 2^-3 = 0.125 fits, so let's subtract that: 0.1666… - 0.125 = 0.0416… The next power that fits is 2^-5 = 0.03125. Subtracting that from the remaining result we get 0.0416… - 0.03125 = 0.0104… Continuing in this way we get:

0.166666… (decimal) = 0.0010101010101010… (binary)

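As we don't need to normalise the number, the exponent stays 0 and the fraction is read directly from the expansion. With 24 bits for the fraction the pattern is:

001010101010101010101010

This representation is not exact. The group '10' repeats forever in the binary expansion of 1/6, so any finite number of bits cuts it off; the 24 bits above represent a value slightly smaller than 1/6.

If the 24 bits are instead taken as the total width of the format, note that single precision floating point numbers are made of three parts: the sign (1 bit), the exponent (8 bits) and the stored mantissa (23 bits, 24 with the hidden 1) for a total of 32 bits. With only 24 bits in total, the exponent and the mantissa have to share 23 bits. Let's say for instance that we keep 6 bits for the exponent and 17 bits for the mantissa.

In this case the sign bit is 0 (1 bit) because the number is non-negative.

The mantissa is 00101010101010101 (17 bits), the first 17 bits after the binary point.

The exponent is 000000 (6 bits), assuming no bias is applied.

So the resulting floating point representation, written as sign, mantissa, exponent, is 0 00101010101010101 000000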
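To check whether 1/6 can be exact in a 24-bit binary fraction, as the question asks, here is a small sketch using exact rational arithmetic. It assumes the 24 bits are all fraction bits with no normalisation, and that the extra bits are simply truncated; the variable names are only for illustration.

```python
from fractions import Fraction

value = Fraction(1, 6)

# Shift 24 fraction bits to the left of the binary point, then truncate.
scaled = value * 2 ** 24
pattern = scaled.numerator // scaled.denominator
bits = f"{pattern:024b}"

# Difference between 1/6 and what the 24 bits actually represent.
error = value - Fraction(pattern, 2 ** 24)

print(bits)    # 001010101010101010101010
print(error)   # 1/25165824
```

A non-zero error confirms that the 24-bit pattern is only an approximation of 1/6.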