Answers to puzzles on slide 105 in:
http://courses.cs.washington.edu/courses/cse351/15au/lectures/03-integersfloats_15au.pdf
1) No, we lose precision going from int to float
2) Yes
3) Yes
4) No, not enough range or precision going from double to float
5) Yes (negating just flips the sign bit of a float or double)
6) No, integer division truncates, so the left-hand side is 0
7) No, because floating-point addition is not associative. For example, if d is BIG and d2 is SMALL, adding them together may lose d2 entirely (d + d2 rounds back to d). We then end up with 0 on the left-hand side and d2 on the right-hand side.
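If you want to try answers 5-7 yourself, here is a rough sketch in C. The puzzle expressions are on the slide and are not copied into this post, so the expressions and function names below are my own assumptions, chosen to match the reasoning in the answers.

```c
#include <stdbool.h>

/* Puzzle 5 style: negation only flips the sign bit, so it round-trips. */
bool negation_round_trips(float f) {
    return f == -(-f);
}

/* Puzzle 6 style: 2 / 3 is integer division (gives 0),
   while 2 / 3.0 is double division (gives about 0.667). */
bool int_and_fp_division_agree(void) {
    return 2 / 3 == 2 / 3.0;
}

/* Puzzle 7 style: fp addition is not associative, so a small d2
   can vanish when it is added to a big d. */
bool add_then_subtract_recovers(double d, double d2) {
    volatile double sum = d + d2;  /* volatile: force rounding to a double */
    return sum - d == d2;
}
```

With d = 1e20 and d2 = 1.0 the last function returns false, because 1.0 is far below the spacing between doubles near 1e20. With d = 1.0 and d2 = 2.0 it returns true.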
More comments on conversions (see slide 86):
The main thing, which might seem sort of obvious, is that in
conversions from integer types to floating point types (unlike
conversions from unsigned to signed integer types) we actually *DO*
change the bit pattern!!
Going from int->float, these are both 32-bit values, and we will
definitely have enough bits to represent any exponent we might need
(we are going from a max value of 2^31 - 1 to something that can be a
1.fraction * 2^127), so overflow will not be possible. However, we are
going from having 32 bits of *precision* to only having 23 fraction
bits (24 significant bits counting the implicit leading 1) in the
world of floats. So we might have to round our result.
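A small sketch of that rounding, assuming a 32-bit int and an IEEE 754 single-precision float (through_float is just a name I made up for this example):

```c
/* Convert an int to float and back out to an integer type.
   Assumes 32-bit int and IEEE 754 single-precision float. */
long long through_float(int x) {
    volatile float f = (float)x;  /* rounds to 24 significant bits */
    return (long long)f;          /* long long so that 2^31 still fits */
}
```

For example, through_float(16777216) gives back 16777216 (that is 2^24, which fits), but through_float(16777217) also gives 16777216, because 2^24 + 1 needs 25 significant bits. through_float(2147483647) gives 2147483648: rounded, but nowhere near overflowing the float.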
Going from an int (32 bits) or float (32 bits) to a double (64 bits),
we only get *more* precision (52 fraction bits) and *more* exponent
range (up to 2^1023), so the conversion is exact.
Going from a long int (machine word size, so 32 bits or 64 bits;
typically the size of a pointer) to a double (64 bits), we get an
exact answer if our long int is 32 bits, or a possibly rounded one if
long int is 64 bits (kind of like the 32-bit int -> float conversion
above).
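Here is a sketch of the two cases. I use the fixed-width types int32_t and int64_t instead of long int so the example does not depend on your machine's word size; the function names are made up for illustration, and double is assumed to be IEEE 754 double precision.

```c
#include <stdbool.h>
#include <stdint.h>

/* Every 32-bit integer fits in a double's significand, so this
   round trip is always exact. */
bool int32_survives_double(int32_t v) {
    return (int32_t)(double)v == v;
}

/* A 64-bit integer may need more significant bits than a double has. */
bool int64_survives_double(int64_t v) {
    volatile double d = (double)v;
    /* If v rounded up to 2^63, it no longer fits in int64_t, so it
       certainly changed; return early to avoid an out-of-range cast. */
    if (d >= 9223372036854775808.0) {
        return false;
    }
    return (int64_t)d == v;
}
```

int64_survives_double(9007199254740992) is true (that is 2^53), but int64_survives_double(9007199254740993) is false, since 2^53 + 1 gets rounded.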
Going from double (64-bit) or float to int, we no longer have a way
to store any fractional part, so any fraction gets truncated (rounded
toward zero). But also, even a 32-bit float could have been
representing a value much *larger* in magnitude (remember the max
exponent is 127 for a float) than can be represented by a 32-bit
integer (max value 2^31 - 1). In this case, the value of the int is
set to Tmin = -2^31 (this is what x86 hardware produces; the C
standard itself leaves the result undefined). Generally this happens
whether the value that won't fit is a really large positive number or
a really large-magnitude negative number, OR if the float/double was
storing a NaN value.
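Since C does not promise what an out-of-range cast gives you, here is a sketch of a checked conversion that spells out the behavior described above. The name double_to_int_checked is hypothetical, and returning INT_MIN for the bad cases is my choice to mimic Tmin, not something the language guarantees. A float argument is promoted to double exactly, so the same function covers both types.

```c
#include <limits.h>

/* Convert a double to int, truncating toward zero.
   Returns INT_MIN (Tmin) for NaN and for values that do not fit. */
int double_to_int_checked(double d) {
    if (d != d) {                       /* NaN is the only value not equal to itself */
        return INT_MIN;
    }
    if (d >= (double)INT_MAX + 1.0) {   /* too large: 2^31 or above for 32-bit int */
        return INT_MIN;
    }
    if (d <= (double)INT_MIN - 1.0) {   /* too negative: -2^31 - 1 or below */
        return INT_MIN;
    }
    return (int)d;                      /* in range: the fraction is dropped */
}
```

So double_to_int_checked(3.99) is 3, double_to_int_checked(-3.99) is -3, and double_to_int_checked(1e10) is INT_MIN.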
----------------
Why do we represent the exponent using a biased notation? I did not
really touch on this in lecture, but one reason is that we can then
look at the bit patterns of two fp numbers and compare them in much
the same way as we compare integers. The sign bit serves the same
purpose as it did in the integer world, and the exponents go up from
0000 0001 (for -126) to 1111 1110 (for 127), so that in the example
here A will be < B just like it would if they were 32-bit ints, even
though the bits represent floats. (This works directly for
non-negative values; for two negative floats the order is reversed,
since floats are sign-and-magnitude.)
A = 0 10000001 00000000000000000000000 = 1.0 * 2^2 (exp = 129-127 = 2)
B = 0 10000010 00000000000000000000000 = 1.0 * 2^3 (exp = 130-127 = 3)
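You can check the A and B example by copying the bits of a float into an integer. This is a sketch that assumes float is a 32-bit IEEE 754 value; float_bits and biased_exponent are helper names I made up for it.

```c
#include <stdint.h>
#include <string.h>

/* Return the raw 32-bit pattern of a float (assumes 32-bit IEEE 754). */
uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);  /* copy the bits; no numeric conversion happens */
    return u;
}

/* Extract the 8-bit biased exponent field (bits 23 through 30). */
unsigned biased_exponent(float f) {
    return (float_bits(f) >> 23) & 0xFFu;
}
```

Here float_bits(4.0f) is 0x40800000 (A, biased exponent 129) and float_bits(8.0f) is 0x41000000 (B, biased exponent 130), so comparing the two bit patterns as integers gives the same answer as comparing the floats.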
I would also encourage you to take a look at the videos for floating point.
Hope this helps,
Ruth
-------------------------