Recall that the number line looks like the following. We only show the positive side, but the negative side is symmetric.
+Infinity | |
================================== | |
Biggest Positive Normalized Number | Nmax |
+Normalized Numbers | |
Smallest Positive Normalized Number | Nmin |
================================== | |
Biggest Positive Denormalized Number | Dmax |
+Denormalized Numbers | D |
Smallest Positive Denormalized Number | Dmin |
================================== | |
Zero |
The first step to converting floating point representations is to determine the dividing line between each kind of number above. That is, what are the values for Nmax, Nmin, Dmax, Dmin?
In the table above, D is the total number of denormalized values. In a sense, the denormalized values divide up the smallest normalized value (Nmin) into D evenly spaced slices. We will use D later to calculate Dmax and Dmin.
For our desired floating point format, let's call the number of exponent bits k and fractional bits m. Then the binary representation of a floating point number looks like:
sign bit | exponent bits | fractional bits |
s | k | m |
To be concrete for the examples below, let's take s=1, k=3, m=5.
Recall the following for normalized values:
Normalized Exponent values = 2**k - 2, excluding all 0's (denormalized value) and all 1's (infinity and NaN)
2**3 - 2 = 6
So encoded exponent values range from 1 to 6.
Normalized Exponent Bias = 2**(k-1) - 1, subtracted from the encoded exponent to get the real exponent
2**(3-1) - 1 = 3
So actual exponent values (encoded - bias) range from -2 to 4.
Fractional part has an implied 1.
To find Nmax, we find the largest possible exponent value and the largest possible fractional value. With their forces combined, they form the largest possible normalized value.
Maximum, non-infinite exponent = all 1's except the last bit.
110 = 6
6 - (bias of 3) = 3
Maximum fractional part = all 1's
11111 = 31
Plus implied one = 1 and 31/32
Nmax = (1 and 31/32) * (2**3) = 15.75 |
So if you are trying to convert a number greater than 15.75, then you will need to round it to positive infinity instead.
To find Nmin, we find the smallest possible exponent value and the smallest possible fractional value (which is all 0's, plus the implied 1).
Minimum, normalized exponent = all 0's except 1 in the least significant bit
001 = 1
1 - (bias of 3) = -2
Minium fractional part = all 0's
00000 = 0
Plus implied one = 1
Nmin = (1) * (2**-2) = 0.25 |
So if you are trying to convert a number that is between 0 and 0.25, you will need to make it a denormalized value (see next section!)
Denormalized values are all evenly-spaced. You can think of them as evenly dividing Nmin, the smallest normalized number, into D = 2**m equal slices. How big are these slices?
Size of a denormalized slice: Nmin / 2**m
= (0.25) / (2**5)
= (1/4) / 32
= 1 / 512
Remember that there is no implied 1 for denormalized values. If there were, we would be encoding the same value as some normalized value, which is inefficient and confusing.
So now this slice tells us the spacing between each denormalized number. It also tells us the value of Dmax and Dmin, which are separated from Nmin and 0, respectively, by the size of this slice.
In our calculation below, we multiply Nmin by the number 1, written as (32/32) to make subtracting fractions easier. This comes from D = 2**5 = 32.
Dmax = Nmin - (1/D) = (32/32)*(1/4) - (1/512) = (32/512) - (1/512) = 31/512 |
Dmin = (1/D) = (1/512) |
Recall the following simple program from several sections ago:
#include <stdio.h>
int func(int a, int b) {
return a+b;
}
int main(void) {
int d = func(22,33);
printf("%d\n", d);
return 0;
}
The func
function disassembles to the following code.
0x00001faa | <func+0>: | push %ebp | Push the old ebp onto the stack |
0x00001fab | <func+1>: | mov %esp,%ebp | Change the ebp to point to the current top of stack. |
0x00001fad | <func+3>: | sub $0x8,%esp | Allocate some temporary scratch space (never used) |
0x00001fb0 | <func+6>: | mov 0xc(%ebp),%eax | Move the second argument into %eax |
0x00001fb3 | <func+9>: | add 0x8(%ebp),%eax | Add the first argument to the second argument, still in %eax |
0x00001fb6 | <func+12>: | leave | Restore %esp to %ebp, and %ebp to old value from stack |
0x00001fb7 | <func+13>: | ret | Pop return address from stack and return there. |
When we first enter the function func
, before we execute instruction <func_0>,
the stack and ebp/esp look like below:
%ebp = 0xbffff338
%esp = 0xbffff30c
0xbffff30c: | 0x00001fd9 | Current $esp and return address to main |
0xbffff310: | 0x00000016 | First argument, 0x16 = 22 |
0xbffff314: | 0x00000021 | Second argument, 0x21 = 33 |
Right after we execute instruction <func_0>, we've just pushed the current %ebp onto the stack, and things now look like this:
%ebp = 0xbffff338
%esp = 0xbffff308
0xbffff308: | 0xbffff338 | Old %ebp |
0xbffff30c: | 0x00001fd9 | return address to main |
0xbffff310: | 0x00000016 | First argument, 0x16 = 22 |
0xbffff314: | 0x00000021 | Second argument, 0x21 = 33 |
Right after we execute instruction <func_1>, we've moved the base pointer %ebp to point to the current stack top. But the stack contents haven't changed since the last instruction.
%ebp = 0xbffff308
%esp = 0xbffff308
Right after we execute instruction <func_2>, we've subtracted 8 bytes from the stack pointer %esp for some temporary scratch space. These are initialized to zero, and we never actually use them.
%ebp = 0xbffff308
%esp = 0xbffff300
0xbffff300: | 0x00000000 | Temporary scratch space |
0xbffff304: | 0x00000000 | Temporary scratch space |
0xbffff308: | 0xbffff338 | Old %ebp |
0xbffff30c: | 0x00001fd9 | return address to main |
0xbffff310: | 0x00000016 | First argument, 0x16 = 22 |
0xbffff314: | 0x00000021 | Second argument, 0x21 = 33 |
The <func_3> instruction moves the second argument in %eax. The mov
address
references %ebp + 12, which in this case is 0xbffff308 + 0xc = 0xbffff314.
From the stack in the previous section, we move the contents at that address (0x21) into %eax.
The <func_4> instruction adds the first argument to the contents of %eax (0x21 from above).
The add
address
references %ebp + 8, which in this case is 0xbffff308 + 0x8 = 0xbffff310.
From the stack in the previous section, we add the contents at that address (0x16) to the value
in %eax and store the total again in %eax.
Neither of these instructions changes the stack, so we won't redisplay that here. Okay, now back to our normally scheduled, stack-changing programming.
Right after we execute instruction <func_3>, we've executed a leave
instruction which returns the stack pointer %esp to %ebp, and all pops the old %ebp from
the stack into %ebp. So now %ebp and %esp are the same as they were at the beginning of the
function, before we executed <func_0>. Here's what it looks like:
%ebp = 0xbffff338
%esp = 0xbffff30c
0xbffff30c: | 0x00001fd9 | return address to main |
0xbffff310: | 0x00000016 | First argument, 0x16 = 22 |
0xbffff314: | 0x00000021 | Second argument, 0x21 = 33 |
Right after we execute instruction <func_3>, we've popped the old return address off the stack and returned to the main function. As a result, the stack changes back to the way it was before we left the main function to call func in the first place.
%ebp = 0xbffff338
%esp = 0xbffff310
0xbffff310: | 0x00000016 | First argument, 0x16 = 22 |
0xbffff314: | 0x00000021 | Second argument, 0x21 = 33 |
I spent a disproportionate amount of time on unions and struct packing in section. Rest assured, they are a small part of the course material, and nothing to stress about. However, here are some notes and a demo, just for completeness.
How does C allocate memory for members of a struct that have different sizes? For example, a char is 1 byte in size, but a double is 8 bytes. Let's write a program to find out.
#include <stdio.h>
struct type_a {
char a;
int b;
double c;
char d;
char e;
};
union union_a {
struct type_a a;
char bytes[16];
};
int main(void) {
int i;
union union_a u;
for (i = 0; i < sizeof(union union_a); i++) {
u.bytes[i] = 0x00;
printf("union_a[%u] = %hx\n", i, u.bytes[i]);
}
puts("\n");
u.a.a = 'z';
u.a.d = 'A';
u.a.e = 'B';
u.a.b = 123456789;
u.a.c = 3.141592653956;
for (i = 0; i < sizeof(union union_a); i++) {
printf("union_a[%u] = %hx\n", i, u.bytes[i]);
}
puts("\n");
printf("sizeof(union union_a): %d\n", sizeof(union union_a));
return 0;
}
From the output of this program, we can tell that C aligns ints to the nearest 4 byte boundary, but that it packs chars to fit within the same 4-byte word. All the space in between (padding) is just ignored and wasted space.
Here is how the struct is packed in memory, where each line represents a 4-byte word.
a (char) | |||
b (int) | |||
c (double, upper bits) | |||
c (double, lower bits) | |||
d (char) | e (char) |
union_a[0] = 7a
union_a[1] = 0
union_a[2] = 0
union_a[3] = 0
union_a[4] = 15
union_a[5] = ffcd
union_a[6] = 5b
union_a[7] = 7
union_a[8] = 49
union_a[9] = ffc2
union_a[10] = 50
union_a[11] = 54
union_a[12] = fffb
union_a[13] = 21
union_a[14] = 9
union_a[15] = 40
union_a[16] = 41
union_a[17] = 42
union_a[18] = 0
union_a[19] = 0
sizeof(union union_a): 20