key -> hashcode -> compression -> array[index] -> value O(1) behavior if designed well worst hashcode of a student in class: 42 bad hashcode of a student in class: Age better hashcode of a student in class: Bday best hashcode of a student in class: social security number A hashcode must satisfy: o1.equals(o2) == true => o1.hashCode() == o2.hashCode() Satisfying the converse as much as possible is not required but strongly recommended, as it improves the performance. Data is often similar. Hashcodes should destroy any consistencies/similarities/coincidences. Hashcode of x+y for a Point is bad because (0,7) -> 7 (1,6) -> 7 (2,5) -> 7 (3,4) -> 7 (4,3) -> 7 (5,2) -> 7 (6,1) -> 7 (7,0) -> 7 37 x + y is much better (0,7) -> 7 (1,6) -> 43 (2,5) -> 79 (3,4) -> 115 (4,3) -> 151 (5,2) -> 187 (6,1) -> 223 (7,0) -> 259 Now compression. Suppose our array is size 10 and we use mod 10. (0,7) -> 7 -> 7 (1,6) -> 43 -> 3 (2,5) -> 79 -> 9 (3,4) -> 115 -> 5 (4,3) -> 151 -> 1 (5,2) -> 187 -> 7 (6,1) -> 223 -> 3 (7,0) -> 259 -> 9 Lots of collision, only the odd numbers being used. Let's use a prime number instead: 11, with compression mod 11. (0,7) -> 7 -> 7 (1,6) -> 43 -> 10 (2,5) -> 79 -> 2 (3,4) -> 115 -> 5 (4,3) -> 151 -> 8 (5,2) -> 187 -> 0 (6,1) -> 223 -> 3 (7,0) -> 259 -> 6 Much better. To avoid clustering and spread things out more, we can do (a * i + b) mod N, instead of just i mod N. Again, N should be prime for best performance. Suppose we had 3 coords... then we could do 37^2 *x + 37 * y + z, or (x*37+y)*37+z. We can't always eliminate collisions, so we need to have a way to resolve them. Chaining - in each bin, store a list of items sharing the same key (worst case O(n) -- all items may be stored in that list) Open-Addressing - if a bin is occupied, find the next unoccupied bin - may be as simple as linear probing or as complicated as double hashing. (worst case O(n) -- have to go through all indices before finding the empty one) Deleting is a little tricky for open-addressing -- need a placeholder marking the spot as once holding something so that items placed after it can still be found. Load Factor -- number of elements stored / size of hashtable chaining: 75%-100% is good. open-addressing: 25%-50% is good. (if too low, wasted space; if too high, poor performance) hash this string! http://www.amazon.com/gp/product/0930410963/102-7078559-5346557?v=glance&n=283155&5Fencoding=UTF8&v=glance when hash table gets full (from lots of elements or a lot of deleted fillers from probing), need to enlarge and rehash -- expensive, though not done often.