CSE 373, Summer 2019: P2 - Hashing

Table of Contents

  1. Summary

  2. Expectations

  3. Set up project

  4. ArrayDictionary constructor

  5. ChainedHashDictionary

  6. ChainedHashSet

  7. More test cases

  8. Group write-up

  9. Individual feedback survey

Summary

In this homework, you will implement a hash dictionary (also known as a hash map) and a hash set. We will be using these two data structures extensively in the next project.

This entire project is due on Wednesday, July 31 at 11:59pm.

You will use these files from your prior assignments:

  • src/main/java/datastructures/dictionaries/ArrayDictionary.java
  • src/main/java/datastructures/lists/DoubleLinkedList.java

If you have chosen a new partner for this assignment, choose either of your submissions from HW2 and verify that these are functioning properly.

You will be modifying the following files:

  • src/main/java/datastructures/dictionaries/ArrayDictionary.java
  • src/main/java/datastructures/dictionaries/ChainedHashDictionary.java
  • src/main/java/datastructures/sets/ChainedHashSet.java

Additionally, here are a few more files that you might want to review while completing the assignment (note that this is just a starting point, not necessarily an exhaustive list):

  • src/test/java/datastructures/dictionaries/BaseTestDictionary.java
  • src/test/java/datastructures/dictionaries/TestChainedHashDictionary.java
  • src/test/java/datastructures/sets/TestChainedHashSet.java
  • src/main/java/datastructures/dictionaries/IDictionary.java
  • src/main/java/datastructures/sets/ISet.java
  • src/main/java/analysis/experiments/*

Here's another video overview. Note: this video is from 19wi, so some info in this video may be a little outdated.

Expectations

Here are some baseline expectations we expect you to meet in all projects:

  • Follow the course collaboration policies

  • DO NOT use any classes from java.util.*. There are only two exceptions to this rule:

    1. You may import and use the following classes:

      • java.util.Iterator
      • java.util.NoSuchElementException
      • java.util.Objects
      • java.util.Arrays
    2. You may import and use anything from java.util.* within your testing code.

  • DO NOT make modifications to instructor-provided code (unless told otherwise). If you need to temporarily change our code for debugging, make sure to change it back afterwards.

Section a: Set up project

  1. Clone the starter code from GitLab and open the project in your IDE. See the instructions from Project 0 if you need a reminder on how to do this.

  2. Copy your DoubleLinkedList.java and ArrayDictionary.java files from Project 1 to this new one.

  3. Copy your DoubleLinkedList delete tests from Project 1 and paste them directly into TestDoubleLinkedList.java.

  4. Next make sure everything works.

    Try running SanityCheck.java, and try running Checkstyle. Checkstyle should still report the same 5 errors with SanityCheck.java as it did with Project 0.

    Try running TestDoubleLinkedList and TestArrayDictionary, and make sure the tests still pass.

Section b: Implement new ArrayDictionary constructor

In order to run one of the upcoming experiments, you will add an extra constructor to the existing ArrayDictionary class. This constructor will be used when you implement ChainedHashDictionary in the next part. The constructor should take in an integer representing the initial capacity of the pairs array.

Below is the constructor stub you should implement:

public ArrayDictionary(int initialCapacity) {
    pairs = makeArrayOfPairs(initialCapacity);
    // ... initialize any extra fields you made as necessary
}

Tip: to make sure this constructor is working, we can refactor our code to always use this new constructor. Try replacing your existing 0-argument constructor with the following code that will call your new constructor:

private static final int DEFAULT_INITIAL_CAPACITY = /* ... some value */;
// feel free to reuse what value you were using originally here

public ArrayDictionary() {
    this(DEFAULT_INITIAL_CAPACITY);
}

Afterwards, make sure the tests in TestArrayDictionary still pass.

Section c: Implement ChainedHashDictionary

Task: Complete the ChainedHashDictionary class.

In this task, you will implement a hash table that uses separate chaining as its collision resolution strategy.

Correctly implementing your iterator will be tricky—don't leave it to the last minute! Try to finish the other methods in ChainedHashDictionary as soon as possible so you can move on to implementing iterator().

In class, when we covered separate chaining hash tables, we used LinkedList as the chaining data structure. In this task, instead of LinkedList, you will use your ArrayDictionary (from Project 1) as the chaining data structure.

When you first create your chains array, it will contain null pointers. As key-value pairs are inserted into the table, you need to create the chains (ArrayDictionary objects) as required. Let's say you created an array of size 5 (you can create an array of any size), and you inserted the key-value pair ("a", 11).

IDictionary<String, Integer> map = new ChainedHashDictionary<>();
map.put("a", 11);

Your hash table should look something like the following figure. In this example, the key "a" lands in index 2, but it might be in a different index depending on your table size. Also, in this example, the ArrayDictionary (chain) has a capacity of 3, but you can choose a different capacity for your ArrayDictionary.

ChainedHashDictionary internal state 1

Now, suppose you inserted a few more keys:

map.put("f", 13);
map.put("c", 12);

Your internal hash table should now look like the figure below. In this example, keys "a" and "f" both hash to the same index (2).

ChainedHashDictionary internal state 2

Notes:

  • The constructor you implement will take in a few parameters:
      resizingLoadFactorThreshold: if the ratio of items to buckets exceeds this, you should resize
      initialChainCount: how many chains/buckets there are initially
      chainInitialCapacity: the initial capacity of each ArrayDictionary inner chain
  • For the other, 0-argument constructor, you'll need to define some reasonable defaults in the final fields at the top of the class.
  • Use ArrayDictionary for your internal chains/buckets.
    • Whenever you make a new ArrayDictionary, be sure to use your new ArrayDictionary constructor to correctly set its initial capacity.
  • If your ChainedHashDictionary receives a null key, use a hashcode of 0 for that key.
  • You may implement any resizing strategy covered in lecture—we recommend doubling the number of chains on every resize since it's the simplest to implement, though.
  • We will be asking about your implementation design decisions later on, so it may be helpful to read ahead so you can keep this in mind while you implement ChainedHashDictionary.
  • Do not try to implement your own hash function. Use the hash method hashCode() that Java provides for all classes: so to get the hash of a key, use keyHash = key.hashCode(). This method returns an integer, which can be negative or greater than chains.length. How would you handle this?
  • Recall that operations on a hash table slow down as the load factor increases, so you need to resize (expand) your internal array. When resizing your ArrayDictionary, you just copied over items from the old array to the new one. Here, how would you move items from one hash table to another?
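
To make the hashCode() note above concrete, here is a minimal sketch of one way to turn an arbitrary hash code into a valid array index. The names HashIndexSketch and indexFor are made up for illustration and are not part of the starter code; your own design may handle this differently.

```java
// Sketch only: HashIndexSketch and indexFor are illustrative names,
// not part of the starter code.
final class HashIndexSketch {
    private HashIndexSketch() {}

    /** Maps any key (including null) to an index in [0, chainCount). */
    static int indexFor(Object key, int chainCount) {
        // The notes above say a null key uses a hash code of 0.
        int hash = (key == null) ? 0 : key.hashCode();
        // Math.floorMod never returns a negative result for a positive
        // divisor, unlike the % operator (for example, -7 % 5 == -2).
        return Math.floorMod(hash, chainCount);
    }
}
```

With 5 chains, indexFor("a", 5) and indexFor("f", 5) both give 2, which matches the figures above. With 10 chains, indexFor("a", 10) gives 7 instead, so a key's index depends on the current number of chains. Keep that in mind when thinking about the resizing question.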

Notes on the ChainedHashDictionaryIterator

Restrictions, assumptions, etc.:

  • You may not create any new data structures. Iterators are meant to be lightweight and so should not be copying the data contained in your dictionary to some other data structure.
  • However, you may (and probably should) call the .iterator() method on each IDictionary inside your chains array, since creating an iterator from an existing data structure is cheap in both space and time.
  • You may and should add extra fields to keep track of your iteration state. You can add as many fields as you want. If it helps, our reference implementation uses three (including the one we gave you).
  • Your iterator doesn't need to yield the pairs in any particular order.
  • You should assume that a client will not modify your underlying data structure (the ChainedHashDictionary) while you iterate over it. For example, the following will never happen:
    Iterator<KVPair<String,Integer>> itr = dictionary.iterator();
    itr.next();
    dictionary.put("hi", 373); // this line will never happen and you can ignore this case
    itr.next();
    
    Note that there are some tests that do something that looks similar but is different: they modify the dictionary in between creating new iterator objects, which is allowed behavior—it's okay to modify your data structure and then loop over it again, as long as you do not modify it while looping over it.

Tips for planning your implementation:

  • Before you write any code, try designing an algorithm using pencil and paper and run through a few examples by hand. This means you should draw a chains array that has some varying number of ArrayDictionary objects scattered throughout, and you should try to simulate what your algorithm does.

  • Try to come up with some invariants for your code. These invariants must always be true after the constructor finishes, and must always be true both before and after you call any method in your class.

    Having good invariants will greatly simplify the code you need to write, since they reduce the number of cases you need to consider while writing code. For example, if you decide that some field should never be null and write your code to ensure that it always gets updated to be non-null before the method terminates, you'll never need to do null checks for that field at the start of your methods.

    As another example, it's possible to pose the DoubleLinkedList iterator's implementation in terms of invariants:

    1. As long as the iterator has more values, the next field is always non-null and contains the next node to output.
    2. When the iterator has no more values, the next field is null.

    Additional notes:

    • Once you've decided on some invariants, write them down in a comment somewhere so that you don't forget about them. We'll ask about these again in the writeup as well.
    • You may also find it useful to write a helper method that checks your invariants and throws an exception if they're violated. You can then call this helper method at the start and end of each method if you're running into issues while debugging. (Be sure to disable this method once your iterator is fully working.)
    • It may be helpful to revisit your main ChainedHashDictionary code and add additional invariants there to reduce the number of cases for the chains array.
  • We strongly recommend you spend some time designing your iterator before coding. Getting the invariants correct can be tricky, and running through your proposed algorithm using pencil and paper is a good way of helping you iron them out.
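
As an illustration of the two DoubleLinkedList iterator invariants listed above, here is a minimal sketch of a linked-list iterator written around them. The Node class here is a simplified, hypothetical stand-in (singly linked, with assumed field names) rather than your actual DoubleLinkedList node, and this is not the ChainedHashDictionary iterator itself.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Sketch only: a tiny singly linked node stands in for DoubleLinkedList's nodes.
final class InvariantIteratorSketch {
    static final class Node<T> {
        final T data;
        final Node<T> next;

        Node(T data, Node<T> next) {
            this.data = data;
            this.next = next;
        }
    }

    static final class NodeIterator<T> implements Iterator<T> {
        // Invariant 1: while values remain, next is non-null and holds the next node to output.
        // Invariant 2: once no values remain, next is null.
        private Node<T> next;

        NodeIterator(Node<T> front) {
            this.next = front; // both invariants hold once the constructor finishes
        }

        @Override
        public boolean hasNext() {
            return next != null; // follows directly from the invariants
        }

        @Override
        public T next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            T result = next.data;
            next = next.next; // re-establish the invariants before returning
            return result;
        }
    }
}
```

Because the invariants are restored at the end of every method, hasNext() needs only a single null check and never has to search ahead.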

Section d: Implement ChainedHashSet

Task: Complete the ChainedHashSet class.

In section c, you implemented the dictionary ADT with a hash table. You can also implement the set ADT using hash tables. Recall that sets store only a key, not a key-value pair. In sets, the primary operation of interest is contains(key), which returns whether a key is part of the set. Hash tables provide an efficient implementation for this operation.

Notes:

  • To avoid code duplication, we will use an internal dictionary of type ChainedHashDictionary<KeyType, Boolean> to store items (keys) in your ChainedHashSet. Since there are no key-value pairs (only keys) in sets, we will completely ignore the values in your dictionary: use a placeholder boolean whenever necessary.
  • Your code for this class should be very simple: your inner dictionary should be doing most of the work.
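
The general idea of a set that delegates to a dictionary can be sketched as follows. To keep the example runnable on its own, java.util.HashMap stands in for ChainedHashDictionary (your ChainedHashSet must not use it), and the class and method names are assumptions for illustration rather than the ISet interface.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: java.util.HashMap is a stand-in for ChainedHashDictionary
// so that this example compiles without the project code.
final class DictionaryBackedSetSketch<T> {
    // The value is never read, so one shared placeholder is enough.
    private static final Boolean PLACEHOLDER = Boolean.TRUE;

    private final Map<T, Boolean> map = new HashMap<>();

    void add(T item) {
        map.put(item, PLACEHOLDER); // adding a duplicate just overwrites the placeholder
    }

    boolean contains(T item) {
        return map.containsKey(item); // membership is decided by the key alone
    }

    void remove(T item) {
        map.remove(item);
    }

    int size() {
        return map.size();
    }
}
```

Every method is a one-line delegation, which is the sense in which the inner dictionary does most of the work.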

Section e (highly recommended): Consider more test cases

In this homework assignment, we won't be grading the tests that you write. But, to be thorough and to foster good habits, we strongly encourage you to write additional tests for your code, since they may help you spot different bugs or make you more confident in the correctness of your implementation. (Also remember that we have "secret" tests that you will be graded on in addition to the provided tests, so it's in your best interest to test more cases.)

For this assignment, you should focus in particular on edge cases. Whenever you see conditional logic in your code (if statements, loop conditions, etc.) you should consider writing a test to check its edge cases. To make sure your tests are truly comprehensive, you should make sure that your test cases end up running every possible conditional branch in your code.

Although you can see the inner workings of your own code, it may sometimes be difficult to write code to check the actual state of your data structures by calling their public methods. Instead, you can test the state of your data structures by accessing your private fields directly; see the "Testing fields" note below for more details.

Testing fields

Just like in Project 1, we've specified your fields to be package-private, which means the tests located in the same package can actually access the internal fields of your ChainedHashDictionary and ChainedHashSet.

The constructor tests in TestChainedHashDictionary.java use a helper method to access the array of chains in your ChainedHashDictionary; feel free to use the same helper method in your own tests.
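
Here is a minimal sketch of what testing through package-private fields can look like. GrowableBagSketch and GrowableBagSketchCheck are made-up classes for illustration, not part of the starter code, and the check is written as a plain method rather than with the course's test framework.

```java
import java.util.Arrays;

// Sketch only: GrowableBagSketch is a made-up class, not part of the starter code.
final class GrowableBagSketch {
    int[] items = new int[2]; // no access modifier, so the field is package-private
    int size;

    void add(int value) {
        if (size == items.length) {
            items = Arrays.copyOf(items, items.length * 2); // double the capacity when full
        }
        items[size++] = value;
    }
}

// A test-style check that lives in the same package as GrowableBagSketch.
final class GrowableBagSketchCheck {
    static boolean capacityDoublesWhenFull() {
        GrowableBagSketch bag = new GrowableBagSketch();
        bag.add(1);
        bag.add(2);
        int capacityBefore = bag.items.length; // read the internal state directly
        bag.add(3);
        return capacityBefore == 2 && bag.items.length == 4 && bag.size == 3;
    }
}
```

The public methods of this class could not reveal the internal capacity, but a check in the same package can read the array's length directly.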

Section f: Complete group write-up

Task: Complete a write-up containing answers to the following questions.

You and your partner will work together on this write-up: your completed write-up MUST be in PDF format. You will submit it to Gradescope. Log into Gradescope with your @uw.edu email address. When you submit, mark which pages correspond to the questions we ask, and afterwards, the partner who uploaded the PDF must add the other partner as a group member on Gradescope. Do not have each member submit individually. A video showing how to do this can be found here. We may deduct points if you do not correctly submit as a group or do not properly assign your pages/problems.

Design decisions

Before we get to the experiments, here are some questions about design decisions you made while doing the programming portion of the assignment.

ChainedHashDictionary

For this first prompt, reflect on a design decision you deliberately made while implementing your ChainedHashDictionary. The specifications for this assignment are deliberately loose, so you should have needed to make some decisions on your own. Consider 1 such decision you had to make, and answer the following questions:

  • What was the situation—what functionality were you implementing, and what details about it were left unspecified in the instructions?
  • Describe at least two viable implementations that you considered.
  • Describe the pros and cons of each solution you listed.
  • What was your final choice? Why did you choose it over the other(s)—why did its pros and cons outweigh the pros and cons of the other(s)?

Your responses will be graded partially based on effort, and partially based on their clarity and how well they described your design decision. If you choose a design decision that had an obviously-best solution, or a problem in which you arbitrarily decided on a final solution, you may lose points.


ChainedHashDictionaryIterator

For this prompt, briefly describe the invariant(s) you chose for your ChainedHashDictionary iterator. Did you find them useful while implementing the iterator? Also note down any invariants you discarded for any reason (e.g., they were too inefficient/impossible to enforce, or they simply weren't useful).

This section will be graded based on completion.

Experiments

For each of the experiments, answer the bolded questions (and questions in the orange boxes) in your write-up. Just like before, a plot will automatically be generated to display the results of the experiments; include PNGs of the plots inside your write-up PDF.

The hypothesis/predict-based-on-the-code portions of your write-up will be graded based on completion, so just write down your honest thoughts before running the experiment. The post-experiment analysis portions will be graded based on the clarity of your explanations and whether they match up with your plot.


Experiment 1: Chaining with different hashCodes vs. AVL trees

This experiment explores how different hash code functions affect the runtime of a ChainedHashDictionary, and compares that to the runtime of an AVLTreeDictionary.

First, we’ll look at the tests involving the ChainedHashDictionary: test1, test2, and test3. Each uses a different class (FakeString1, FakeString2, and FakeString3 respectively) as keys for a ChainedHashDictionary. Each of the fake string classes represents a string (by storing an array of chars), but each has a different implementation of the hashCode method. Read over these implementations of hashCode and take a look at the corresponding histograms (each plot shows the distribution of outputs of the hashCode method across 80,000 randomly generated fake strings).

hashCode histograms

Below is the histogram for FakeString1, the type of key used in test1.

FakeString1 Histogram

Below is the histogram for FakeString2 (test2).

FakeString2 Histogram

Below is the histogram for FakeString3 (test3).

FakeString3 Histogram

Now, predict which test method will have the fastest and slowest asymptotic runtime growths.

Questions (after running the experiment)

  1. Note that the distributions for the hash codes used in test1 and test2 look very similar in shape, but in the graphs you produce, you should see that test2 (FakeString2) tends to run much faster than test1 (FakeString1). Why is that? Hint: look at the x-axis scale labels.
  2. You should also see that FakeString3 produces the fastest runtimes when used as keys for ChainedHashDictionary. Why is this? Explain using the histogram above, citing at least 1 observation about the histogram.

Now, we’ll consider test4, which uses the AVLTreeDictionary. This test uses a fake string class that does not provide a working hashCode method, but is comparable in a way that mimics regular String comparisons. You should see that the AVL-tree-based implementation performs much better than the chained-hash implementation when used with bad key objects, but looks like it’s only about a constant factor worse than chained-hashing with good keys.

  1. What functionality does ChainedHashDictionary require from its keys? What else must be true of its keys in order for the dictionary to perform well, if anything?
  2. What functionality does AVLTreeDictionary require from its keys? What else must be true of its keys in order for the dictionary to perform well, if anything?
  3. Which of these two has a runtime with a better (slower) asymptotic growth: the AVLTreeDictionary, or the ChainedHashDictionary (with good keys)? (Your answer should be based on the properties of these implementations, not just the results of this graph.)

Experiment 2: Load factor thresholds for resizing

This experiment tests the runtime for ChainedHashDictionary's put method across different values of the load factor threshold for resizing.

First, answer the following prompts:

  1. Briefly describe the difference between test1 and test2.
  2. Which test do you expect to run faster? What about asymptotically faster?

Questions (after running the experiment)

  1. Why is using the load factor of 300 slow? Explain at a high level how this affects the ChainedHashDictionary.put behavior.
  2. This was not a part of this experiment, but explain why very small load factor thresholds (much less than 0.75; e.g., 0.05) might be wasteful.
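
As a reminder of what the load factor threshold controls, here is a small sketch of the arithmetic. It follows the constructor note's wording (resize when the ratio of items to buckets exceeds the threshold). The class and method names are made up for illustration, and your implementation may organize this check differently.

```java
// Sketch only: LoadFactorSketch is an illustrative name, not part of the starter code.
final class LoadFactorSketch {
    private LoadFactorSketch() {}

    /** Ratio of stored items to chains, which is also the average chain length. */
    static double loadFactor(int itemCount, int chainCount) {
        return (double) itemCount / chainCount; // cast avoids integer division
    }

    /** True when the ratio of items to chains exceeds the threshold. */
    static boolean shouldResize(int itemCount, int chainCount, double threshold) {
        return loadFactor(itemCount, chainCount) > threshold;
    }
}
```

For example, with 10 chains and a threshold of 0.75, shouldResize(8, 10, 0.75) is true because 8 / 10 = 0.8. With a threshold of 300, the same 10 chains would hold more than 3000 items before a resize is triggered.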

Experiment 3: Initial internal chain capacities

This experiment tests the runtime for inserting elements into ChainedHashDictionary with different initial capacities for the internal ArrayDictionary chains.

Briefly describe the differences between the three tests.

Questions (after running the experiment)

Note that although the runtimes when using the three initial sizes are similar, using an initial capacity of 2 results in the fewest spikes and is generally the fastest. Why would a lower initial ArrayDictionary capacity result in a more consistent and faster runtime?


Experiment 4: Data structure memory usage, take 2

This last experiment will estimate the amount of memory used by DoubleLinkedList, ArrayDictionary, and AVLTreeDictionary as they grow in size. Predict the complexity class (constant, logarithmic, linear, \(n\log(n)\), quadratic, exponential) of memory usage for each of the 3 data structures as its size increases.

Note: You may get the following warnings when running the experiment; just ignore them:

  • WARNING: Unable to get Instrumentation. Dynamic Attach failed. You may add this JAR as -javaagent manually, or supply -Djdk.attach.allowAttachSelf
  • WARNING: Unable to attach Serviceability Agent. You can try again with escalated privileges. Two options: a) use -Djol.tryWithSudo=true to try with sudo; b) echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope

Questions (after running the experiment)

  1. Describe the overall shapes of the graphs. Explain why two of them are similar but one is different.
  2. You should see that test1 uses less memory than test3. Is the actual difference on your plot a difference in complexity classes or constant factors? What are some possible reasons that DoubleLinkedList uses less memory than AVLTreeDictionary?

Section g: Complete individual feedback survey

Task: Submit a response to the feedback survey.

After finishing the project, take a couple minutes to complete this individual feedback survey on Canvas. (Each partner needs to submit their own individual response.)

Deliverables

The following deliverables are due on Wednesday, July 31 at 11:59pm.

Before submitting, be sure to double-check that:

Submit by pushing your code to GitLab and submitting your writeup to Gradescope. If you intend to submit late, fill out this late submission form when you submit.