CSE 373, Summer 2019: Project 3

Overview

Due Wednesday, August 7 at 11:59pm.

Submit by pushing your code to GitLab and submitting your writeup to Gradescope. If you intend to submit late, fill out this late submission form when you submit.

In this assignment, will practice writing more JUnit tests, and you will implement data structures and sorting algorithms for future use.

You will use these files from your prior assignments:

  • src/main/java/datastructures/dictionaries/ArrayDictionary.java
  • src/main/java/datastructures/dictionaries/ChainedHashDictionary.java
  • src/main/java/datastructures/sets/ChainedHashSet.java
  • src/main/java/datastructures/lists/DoubleLinkedList.java

If you have chosen a new partner for this assignment, choose either of your submissions from prior assignments and verify that these are functioning properly.

You will be modifying the following files:

  • src/main/java/datastructures/priorityqueues/ArrayHeapPriorityQueue.java
  • src/main/java/misc/Sorter.java
  • src/test/java/datastructures/priorityqueues/TestArrayHeapPriorityQueue.java
  • src/test/java/misc/TestSorter.java

Additionally, here are a few more files that you might want to review while completing the assignment (note that this is just a starting point, not necessarily an exhaustive list):

  • src/main/java/datastructures/priorityqueues/IPriorityQueue.java
  • src/test/java/datastructures/priorityqueues/IntPair.java
  • src/main/java/datastructures/dictionaries/IDictionary.java
  • src/main/java/analysis/experiments/*
  • src/main/java/analysis/utils/SortingAlgorithms.java

Expectations (added 8/5, but same as previous projects)

Here are some baseline expectations you should meet in all projects:

  • Follow the course collaboration policies.

  • DO NOT use or import any classes from java.util.*. There are only two exceptions to this rule:

    1. You may import and use the following classes:

      • java.util.Iterator
      • java.util.NoSuchElementException
      • java.util.Objects
      • java.util.Arrays
    2. You may import and use anything from java.util.* within your testing code.

  • DO NOT make modifications to instructor-provided code (unless told otherwise). If you need to temporarily change our code for debugging, make sure to change it back afterwards.

Part 1a: Set up your project and fix your data structures

Task: Set up Project 3 in your IDE and make sure all the tests for DoubleLinkedList, ArrayDictionary, ChainedHashSet and ChainedHashDictionary pass.

  1. Import your homework repo from GitLab. See the instructions from Project 0 if you need a reminder on how to do this.

  2. Copy your DoubleLinkedList.java, ArrayDictionary.java, ChainedHashDictionary.java, and ChainedHashSet.java files from Project 2 to this new one.

  3. Copy any extra tests you wrote to TestDoubleLinkedList.java or TestChainedHashDictionary.java

  4. Finally, make sure everything works.

    Try running the following tests and make sure all the tests pass.

    1. TestDoubleLinkedList.java
    2. TestArrayDictionary.java
    3. TestChainedHashDictionary.java
    4. TestChainedHashSet.java

    Try running SanityCheck.java, and try running Checkstyle. Checkstyle should still report the same 5 errors with SanityCheck.javaas it did in the previous projects.

Part 1b: Implement tests for ArrayHeapPriorityQueue and Sorter.topKSort(...)

Task: add tests for ArrayHeapPriorityQueue and Sorter.topKSort(...) to TestArrayHeapPriorityQueue, and TestSorter.

In parts 1c and 1d, you will be asked to implement an ArrayHeapPriorityQueue and the top-\(k\) sorting algorithm.

After you finish reading the specifications for these classes below, start by adding tests for these two classes. If you look through the source code of these two test classes, you'll notice that they're almost completely empty! It's now your responsibility to write comprehensive tests for these two classes.

We recommend that you write at least some tests before you implement ArrayHeapPriorityQueue and Sorter.topKSort. Doing so will help you better understand what your methods need to do (before you jump into coding), as well as provide a reference to check how well your code is working. Additionally, it may feel better to write your tests first and fix bugs as you implement the code—the alternative would be to implement all the code first, then write tests, only to realize that there is actually much more work remaining in fixing bugs. And of course, there's no real downside to having more tests, so feel free to add more after implementing the code—you'll probably come up with more test cases after you start implementing the code, anyways.

Note: This testing portion is a step up in difficulty compared to the DoubleLinkedList.delete testing from Project 1, so make sure not to take this portion lightly, and make use of the resources we provide, including the tips below, your partner, and TAs at office hours. Writing a good test suite for your code will take time and effort, and the scoring for this assignment will reflect that—the testing and regular code implementation portions of this assignment will have comparable weighting.

Just like we did in Project 1 part 1's TestDoubleLinkedListDelete.java, we will grade your tests by running them against many different buggy versions of ArrayHeapPriorityQueue and Sorter.topKSort(...). In order to help us do so efficiently, we ask that you keep all tests in TestArrayHeapPriorityQueue and TestSorter fast. Every test you write should take at most 1 second to run. Tests that take longer than this may time out and fail during grading. As a rule of thumb, your tests shouldn't need to add more than a thousand elements to your data structures. You may add as many tests as you want to these two files, however.

Test-writing advice

Getting started

  1. To start, come up with a comprehensive set of tests for the empty cases (an empty heap or input list)—for every single method, write tests that examine every possible case where the data structure/input is empty. For example, check that the size method properly returns 0 on the empty heap, or that contains returns false on any input. By planning comprehensive tests from the empty cases, you lay out a foundation to build future tests on: implementations that fail on the empty case are likely to fail on the other cases (especially for data structures, which always start off empty), and conversely, testing only a complex case without testing the basic cases will make it hard to determine whether any bugs found pertain to that specific complex case, or result from a more fundamental issue that would affect even the basic cases. These basic tests may also help us find patterns and scenarios to reuse later cases with larger sizes.

  2. Now, build off the tests you wrote for the empty cases by writing all possible tests for size 1. There should be more cases possible now that you can actually have values in your heap or list (for example, you can trigger a situation where contains should return true).

  3. Repeat process with increasing numbers of elements (it's up to you to determine sizes that will allow for new, distinct cases and method behavior).

    It's impossible to write tests for every possible input size, as there are infinitely many possible sizes, and increasingly many possible cases for larger sizes; but, by actively considering what sizes and inputs can actually trigger new cases, we can write a reasonably comprehensive test suite.

  4. After implementing your code, review it to examine all your conditional branches and loop conditions, and try to create tests that execute each path through in your code. (For loops consider cases where the loop runs 0 times, 1 time, 2 times, etc.) These types of tests are called clear-box tests, as you consider the internals of the implementation. Reviewing your code should help you gather more ideas and more specific tests than before.

General advice

  1. Revisit the testing categories from Project 1 part 1, and try to write tests for each applicable category for each method.

  2. Remember to assert that all of the state is as it should be (check fields, and use the various getter methods) after your test code. It's relatively common for students to write relevant test cases, but fail to catching bugs simply because the tests don't have the proper asserts to check the state thoroughly enough. So, whenever possible, check the values of all getter methods and/or internal fields.

  3. Try to write tests with succinct, but descriptive names. This should help you organize your tests when coming up with cases and should help whenever you fail a test and need to remember what it does. Additionally, this should encourage you to write tests that capture a single case only, which will make it easier to determine what's wrong when your tests fail. On the other hand, if you have huge mega-tests that check many things, it may be difficult to know exactly how to continue writing new tests or debug failed tests.

  4. Check that the correct exceptions are being thrown in the specific situations described in the documentation.

  5. Don't forget to test cases involving larger sizes. Some cases in your code may not appear until you have more than a handful of elements, especially for the heap code. Remember that you can use loops to generate values to set up your test cases with larger input sizes, so you don't need to type everything out by hand.

  6. Calling methods multiple times or in combination with other methods could be useful for revealing new cases.

  7. You can also refer back to previous provided test files for inspiration or example syntax.

ArrayHeapPriorityQueue

  1. When testing your ArrayHeapPriorityQueue, you may notice that there's no simple way to inspect all the data in the heap. One way to verify that your heap works properly is by calling removeMin repeatedly and checking that it returns the correct values. Whether your input sequence is random, sorted, reverse-sorted, or anything else, after removeMining, the data should all come out in sorted order. So, one way to test the heap is to make sure each value from removeMin is greater than or equal to the return value of the previous removeMin call (according to compareTo).

  2. You may find that using only removeMin is somewhat restrictive. As an alternative, you can use directly access the internal array from your ArrayHeapPriorityQueue class. The provided test testAddEmptyInternalArray demonstrates the provided helper method for accessing this internal array. After storing the array, you can check whether the heap is valid according to the min-heap invariant (each node is greater than or equal to its parent).

    If you want your code to be concise and consistent, you might want to make a helper method that checks the heap invariants at every single index in the heap.

    WARNING: Be careful about testing the internals specifically. If you test and assert too strictly (perhaps checking behavior not actually specified by heaps and only present for your own implementation), you may cause you to lose points due to our correct solution failing your tests. The safest useful thing to assert about the internal array is this heap invariant.

  3. Just like you can access the package-private array field, heap, to test its state, you can also access the indices IDictionary. As this is also part of your ArrayHeapPriorityQueue's state, it is worth checking that the correct keys and values are maintained.

  4. Once you read the instructions for implementing ArrayHeapPriorityQueue below, you will realize that you cannot have duplicate values according to equals in your heap. This means it would be difficult to test your code when a.compareTo(b) == 0, as in many cases, this would only be true if a.equals(b). So, we have provided a class called IntPair that defines equality and compareTo in a way that will allow for compareTo ties while not being equals. By using this class as the values for your heap, you can then test your code executing when a.compareTo(b) == 0. See the class comment for IntPair for more details.

TopKSort

  1. To test your Sorter.topKSort, consider the fact that there are two real parameters: k and the input list. To be thorough, consider all the edge cases for k and the input list, and test combinations of both.

  2. As a reminder, you may use classes from java.util.* when writing unit tests. You may find Collections.sort(...) to be particularly helpful for generating expected output. You may need to write some extra logic to convert between an IList and java.util.List, however.

    As always, you may not use anything from java.util.* for anything outside of your tests, unless explicitly told otherwise.

  3. You should not test cases when the preconditions (from the method comments) for a method are not met. In such situations, the behavior is undefined, so any resulting behavior should be correct, and any test you write that expect any particular behavior will be incorrect. The postcondition, however, defines expected behavior that is safe to test.

Part 1c: Implement ArrayHeapPriorityQueue

Task: complete the ArrayHeapPriorityQueue class.

Notes:

  1. In lecture, we covered how to implement and work with 2-heaps: a heap where each node has up to 2 children.

    For this assignment, you must implement a 4-heap: a heap where each node has up to four children. You will need to modify the equations you were given in lecture in order to successfully do so.

    Additionally, you must implement this as being indexed from 0, rather than some other arbitrary index.

  2. Make sure you're implementing a min-heap, not a max-heap.

  3. Unlike all the other data structures we've implemented so far, your heap will not attempt to handle null entries. (There's no sensible way to compare a null against a non-null element).

    Your heap should throw an exception when given a null element—see the IPriorityQueue interface for more details.

  4. The data structure you'll be implementing has some additional functionality: we'll be able to remove and replace elements in the heap. In exchange, the heap will not support duplicate elements according to equals.

    We recommend leaving the implementation of these remove, replace, and contains methods until after you've implemented add and removeMin. The code for these methods may require some changes to code used by add and removeMin, but it will probably be easier to think about these methods after you've implemented the basic code for add and removeMin.

    In particular, you'll need to track the indices of each element in the heap by using a dictionary to map from element to index in the heap. (Take a moment to consider why we want to do this, and what the alternative is.) Tracking the indices means that we'll need to update our dictionary of indices every time the heap changes from adding, removing, or percolating; fortunately, adding this logic is relatively painless if you implement add and removeMin using our provided method stubs for the helper methods percolateUp, percolateDown, and swap.

    Note: this dictionary of element to index is one reason why we do not support duplicate elements in our heap—it would imply us keeping track of two identical keys in our dictionary pointing to different indices.

    Here are some more tips for each method:

    • contains: Since we're tracking the indices of each element in the heap, implementing this is as simple as calling containsKey. Make sure that contains runs in approximately constant time.
    • replace: This method replaces an element in the heap with a new element. When this method gets called, you'll need to figure out whether the new item needs to percolate upwards or downwards in the heap, and update the heap accordingly. replace should have \(\mathcal{O}(log(n))\) runtime.
      void replace(oldItem, newItem):
          throw exception if necessary
          index = dictionary.remove(oldItem)
          heap[index] = newItem
          dictionary.put(newItem, index)
      
          percolate(index)
      
      void percolate(index):
          if index is first in heap:
              percolateDown(index)
          else:
              parent = get parent index
              if heap[index] < heap[parent]:
                  percolateUp(index)
              else if heap[index] > heap[parent]:
                  percolateDown(index)
    • remove: The code for this method should actually be fairly simple after you've implemented replace, since you can simply reuse its functionality. This method will end up looking somewhat similar to removeMin: you just need to replace the input item with the last item in the heap, and then use the percolation logic from replace to determine how to fix the heap invariants. remove should have \(\mathcal{O}(log(n))\) runtime.
      void remove(item):
          throw exception if necessary
          index = dictionary.get(item)
      
          if index is last index in heap:
              remove last element in heap
          else:
              swap(index, last index in heap)
              remove last element in heap
              percolate(index)
  5. In lecture, we assumed that every element has some numerical priority, and focused on comparing each element by that priority.

    In practice, the exact "priority" an element has is an implementation detail of the items the client gives us, and isn't a value we can see directly.

    What you should do instead is to compare items using their compareTo method. The ArrayHeapPriorityQueue is specifically defined so that all items MUST implement the Comparable interface—if they don't, the code won't compile.

Review: working with objects that implement Comparable

In Java, the way that we indicate an object is "comparable" in some way is to (a) have that object implement the Comparable interface and (b) have that object implement the compareTo(...) method, which returns an int.

That int indicates how the two elements are "ordered" in comparison to each other. For example, suppose we do a.compareTo(b).

  • If the returned int is negative, that means a < b.
  • If the returned int is equal to zero, that means a == b.
  • If the returned int is positive, that means a > b.

This syntax is unfortunately clunky and a little difficult to remember. Here are two strategies that may help.

The first strategy is to remember that if you ever want to do something like "a OP b" (where "OP" is some comparison operator), you can always automatically translate it toa.compareTo(b) OP 0. For example, if I wanted to check if a <= b, you would write a.compareTo(b) <= 0.

The second strategy is to use helper methods so you can minimize the amount of ugly code you need to write. For example, instead of writing a.compareTo(b) <= 0 everywhere, try writing and using a helper method like:

private boolean leq(T a, T b) {
    return a.compareTo(b) ≤= 0;
}

This will help make your code a little more readable and easier to debug. (Instead of having to worry about whether you used .compareTo(...) correctly in multiple places, you can focus on getting it right just once.)

Part 1d: Implement Sorter.topKSort(...)

Task: complete the Sorter.topKSort(...) method.

Your Sorter.topKSort(k, list) method is responsible for sorting and returning the top (largest) \(k\) elements from a list containing \(n\) comparable elements. You can find the Sorter class inside the misc package.

One naive way of implementing this algorithm would be to implement some sorting algorithm such as quicksort or merge sort, sort the list then return the last \(k\) elements.

However, this would take \(\mathcal{O}(n\log(n))\) time to run! We can do better: your implementation must run in \(\mathcal{O}(n\log(k))\) time.

Notes:

  1. There is an additional test that runs when you push to Gitlab, called testIsNLogK. This test counts operations in your code to make sure the number of comparisons is properly \(\mathcal{O}(n\log(k))\). Your code may also time out if it takes more than a second to run, to mirror the timeout during grading.

  2. There are pre- and post-conditions defined in the method header for topKSort. Per the definition of "precondition," you should assume the preconditions are true when you write your code, and your may do anything if the preconditions are not met. Additionally, be aware the postcondition defines an additional condition that must be true at the end of the method.

    • In general, postconditions are used to define any noteworthy side effects a method may have; in this case, we're using one to define a lack of side effect: topKSort should not mutate its input list. (We normally wouldn't document this, since it should be assumed by default that a method will not have major side effects such as mutating its inputs, but we're doing it here for emphasis.)
  3. Because you will be putting values from the IList parameter into your heap, yourtopKSort will not support sorting lists with any duplicate values (according to the values' equals method).

  4. Hint 1: your method should construct exactly two data structures: a single ArrayHeapPriorityQueue, and a single DoubleLinkedList to store the output to return.

  5. Hint 2: you should need to loop over your input list exactly once.

Section f: Complete group write-up

Task: Complete a write-up containing answers to the following questions.

You and your partner will work together on this write-up: your completed write-up MUST be in PDF format. You will submit it to Gradescope. Log into Gradescope with your @uw.edu email address. When you submit, mark which pages correspond to the questions we ask, and afterwards, the partner who uploaded the PDF must add the other partner as a group member on Gradescope. Do not have each member submit individually. A video showing how to do this can be found here. We may deduct points if you do not correctly submit as a group or do not properly assign your pages/problems.

For each of the experiments, answer the bolded questions (and questions in the orange boxes) in your write-up. Just like before, a plot will automatically be generated to display the results of the experiments; include PNGs of the plots inside your write-up PDF.

The post-experiment analysis portions will be graded based on the clarity of your explanations and whether they match up with your plot.

Experiments 1 and 2: fixing variables for topKSort

In this experiment, we're running topKSort and producing two plots for the runtime. Experiment1 has \(k\) as a fixed value and plots the runtime as \(n\) varies, and Experiment2 has \(n\) as the fixed value and plots the runtime as \(k\) varies.

Questions (after running the experiment)

You should also notice that in your plots, experiment 1 results in a linear shape, but experiment 2 results in a curve. Why are they different and where do those shapes come from? You might find it helpful to reference general knowledge of complexity classes in your response and the axes of the two graphs.

Experiment 3

In this experiment, we'll run our topKSort against other implementations that use insertion sort and merge sort to find the top \(k\) elements.

  1. Describe the order of the input list. You will have to look inside the makeDoubleLinkedList method, but the order is the same for all 3 sorting tests. Is it sorted? Reverse-sorted? Random? Some other pattern?
  2. For this input case ordering, predict the simplified \(\Theta\) running time of each of the tests:
    1. your topKSort
    2. insertion sort to find the top \(k\) elements
    3. merge sort to find the top \(k\) elements
    Hint: see lecture slides.

Questions (after running the experiment)

There's a version of quicksort inside SortingAlgorithms.java, but we didn't use it in this experiment. Give the simplified \(\Theta\) running time for using this implementation of quicksort on this specific input ordering to find the top \(k\) elements. Hint: Pay attention to the pivot selection strategy.

Experiment 4

In this experiment, we'll again run our topKSort against other sorting methods, but this time, we'll look at quick sort and merge sort, and with a different input ordering.

  1. Describe the order of the input list. You will have to look inside the makeRandomDoubleLinkedList method, but the order is the same for all 3 sorting tests. Is it sorted? Reverse-sorted? Random? Some other pattern?
  2. For this input case ordering, predict the simplified \(\Theta\) running time of each of the tests:
    1. your topKSort
    2. merge sort to find the top \(k\) elements
    3. quicksort to find the top \(k\) elements
    Hint: see lecture slides.

Questions (after running the experiment)

How do well do merge sort and quicksort perform relatively? Which is runs faster on this case? By how much: constant difference, constant factor difference, or differing in asymptotic behavior?

The insertion sort from the previous experiment was not run in this experiment. What case(s) of insertion sort might this test's input ordering represent? Pick from worst, best, average, and in-practice. Guess the \(\Theta\) running time of insertion sort on this case. (This question will be graded mostly on completion, although a full-scoring answer must still be consistent with the other cases of insertion sort presented in the lecture slides.)

Section g: Complete individual feedback survey

Task: Submit a response to the feedback survey.

After finishing the project, take a couple minutes to complete this individual feedback survey on Canvas. (Each partner needs to submit their own individual response.)

Deliverables

The following deliverables are due on Wednesday, August 7 at 11:59pm.

Submit by pushing your code to GitLab and submitting your writeup to Gradescope. If you intend to submit late, fill out this late submission form when you submit.

Before the deadline, be sure to double-check that...