Answers to puzzles on slide 105 in: http://courses.cs.washington.edu/courses/cse351/15au/lectures/03-integersfloats_15au.pdf

1) No, we lose precision going from int to float.
2) Yes.
3) Yes.
4) No, a float does not have enough range or precision to hold every double.
5) Yes (negating just flips the sign bit of a float or double).
6) No, integer division truncates, so the left-hand side is 0.
7) No, because floating point operations are not associative. For example, if d is BIG and d2 is SMALL, adding them together may lose d2 entirely. We then end up with 0 on the left-hand side and d2 on the right-hand side.

More comments on conversions (see slide 86):

The main thing, which might seem sort of obvious, is that in conversions from integer types to floating point types (unlike conversions from unsigned to signed integer types) we actually *DO* change the bit pattern!!

Going from int->float: these are both 32-bit values. We will definitely have enough bits to represent any exponent we might need (we are going from a max value of 2^31 - 1 to something that can be 1.fraction * 2^127), so overflow is not possible. However, we are going from 32 bits of *precision* to only 23 fraction bits (24 significant bits once you count the implicit leading 1) in the world of floats, so we might have to round our result.

Going from an int (32 bits) or float (32 bits) to a double (64 bits): we only get *more* precision (52 fraction bits) and *more* exponent range (up to 2^1023), so it is an exact conversion.

Going from a long int (machine word size, so 32 bits or 64 bits; typically the size of a pointer) to a double (64 bits): we get an exact answer if our long int is 32 bits, or a possibly rounded one if long int is 64 bits (kind of like the 32-bit int->float conversion above).

Going from double (64 bits) or float to int: we no longer have a way to store any fractional part, so any fraction gets truncated.
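To make a few of these answers concrete, here is a minimal C sketch. It assumes float and double are IEEE 754 single and double precision. The helper names (survives_float, lost_addend, int_div_equals_fp_div, check_conversion_examples) and the specific test values are made up for illustration and are not taken from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if a 32-bit int is unchanged by conversion to float, else 0.
 * The comparison is done in double, which holds every 32-bit int and
 * every float exactly, so we never convert an out-of-range float to int. */
int survives_float(int32_t x) {
    volatile float f = (float)x;   /* volatile: force rounding to a real 32-bit float */
    return (double)f == (double)x;
}

/* Computes (d + d2) - d. If d is huge and d2 is tiny, d2 is lost in the sum. */
double lost_addend(double d, double d2) {
    volatile double sum = d + d2;  /* volatile: force rounding to a 64-bit double */
    return sum - d;
}

/* Assumed operands for the integer-division answer: 2/3 truncates to 0,
 * while 2/3.0 is computed in floating point. */
int int_div_equals_fp_div(void) {
    return 2 / 3 == 2 / 3.0;
}

/* Hypothetical self-check collecting the claims above. */
void check_conversion_examples(void) {
    assert(survives_float(16777216));        /* 2^24 fits in 24 significant bits */
    assert(!survives_float(16777217));       /* 2^24 + 1 needs 25 bits, so it is rounded */
    assert(lost_addend(1e20, 1.0) == 0.0);   /* BIG + SMALL loses SMALL */
    assert(!int_div_equals_fp_div());        /* 0 on the left, about 0.667 on the right */
}
```

Calling check_conversion_examples() from a main function should return without tripping an assertion on a machine with IEEE 754 floats.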
A float or double can also hold a value that is too big for an int: even a 32-bit float can represent a much *larger* (in magnitude) value (remember the max exponent is 127 for a float, so up to 1.fraction * 2^127) than can be represented by a 32-bit integer (max value 2^31 - 1). In this case, we set the value of the int to be Tmin = -2^31 (this is what x86 hardware produces; the C standard itself leaves the result undefined). Generally we will do this whether the value that won't fit is a really large positive number or a really large (in magnitude) negative number, OR if the float/double was storing a NaN value.

----------------

Why do we represent the exponent using a biased notation?

I did not really touch on this in lecture, but one reason is that we can then look at the bit pattern of a fp number and compare two floats in the same way as we compare integers. The sign bit serves the same purpose as it did in the integer world, and the exponents go up from 0000 0001 (for -126) to 1111 1110 (for 127), so that A will be < B just like it would if they were 32-bit ints, even though the bits represent a float.

A = 0 10000001 00000000000000000000000 = 1.0 * 2^2 (exp = 129-127 = 2)
B = 0 10000010 00000000000000000000000 = 1.0 * 2^3 (exp = 130-127 = 3)

I would also encourage you to take a look at the videos for floating point.

Hope this helps,
Ruth

-------------------------
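As an illustration of the biased-exponent comparison described above, here is a minimal C sketch. It assumes float is a 32-bit IEEE 754 single and only compares non-negative values. The helper names (float_bits, exponent_field, bits_less_than, check_bias_examples) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copies the raw 32 bits of a float into an unsigned integer. */
uint32_t float_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* well-defined way to reinterpret the bytes */
    return bits;
}

/* Extracts the 8-bit biased exponent field (bits 30..23). */
uint32_t exponent_field(float f) {
    return (float_bits(f) >> 23) & 0xFFu;
}

/* Compares two non-negative floats by comparing their bit patterns as integers. */
int bits_less_than(float a, float b) {
    return float_bits(a) < float_bits(b);
}

/* Hypothetical self-check using A = 1.0 * 2^2 and B = 1.0 * 2^3 from the post. */
void check_bias_examples(void) {
    float a = 4.0f, b = 8.0f;
    assert(exponent_field(a) == 129);   /* 129 - 127 = 2 */
    assert(exponent_field(b) == 130);   /* 130 - 127 = 3 */
    assert(bits_less_than(a, b));       /* integer order matches float order */
    assert(a < b);
}
```

Calling check_bias_examples() from a main function should return without tripping an assertion. For negative floats the format is sign-magnitude rather than two's complement, so this simple integer comparison would need extra handling.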