Link

k-d Tree

Composing and testing a complex data structure from scratch, again.1

Learning Goals

  • Compose algorithms via iterative improvement.

    Algorithms can be tricky to implement. One way we can manage the complexity of an algorithm is to develop it in stages, starting from something simple that works and then progressing to more and more complex optimizations while ensuring the program works at every step along the way. This way, we can maintain correctness as we add complexity to our programs.

  • Encapsulate complex behavior beneath a simple interface.

    The best abstractions hide complex behavior beneath a simple interface. Programs often gradually grow more and more complex, so a key challenge of implementing complex data structures is in managing this complexity while maintaining the illusion of a simple interface.

  • Write randomized tests for your programs.

    While testing our programs with handwritten tests can be helpful for debugging, they don’t necessarily cover all of the complexities of real-world datasets. To verify the correctness of our data structures and algorithms, we need to develop more robust testing and validation methods. One way to do this is to write randomized tests to check that our program behaves well under more realistic, large-scale workloads.

Table of contents

  1. Getting Started
  2. Point
  3. PointSet API
  4. NaivePointSet
  5. KDTreePointSet
  6. Testing
  7. Submission

Getting Started

To get the skeleton (starter) code for the homework, open your CSE 373 IntelliJ project. In the menu bar, tap on the VCS item, then hover over the Git dropdown, tap the Pull… menu item when it’s revealed, and pull the skeleton repository to get the kdtree assignment.

Before you begin the assignment, you might find the following visual resources helpful:

If you end up incorporating any resources into your own code, be sure to cite them!

Point

We provide a class, Point, that represents a location with an x and y coordinate. You may not modify this class. The key methods in this class are:

  • public double x(): Returns the x coordinate value.
  • public double y(): Returns the y coordinate value.
  • public double distanceSquaredTo(Point p): Returns the squared Euclidean distance between two points.
  • public double distanceSquaredTo(double px, double py): Returns the squared Euclidean distance between two points.
Note
The distance methods returns squared distances, not the actual distance as shown in lecture.

PointSet API

We also provide an interface called PointSet that represents a set of such points. It has only one method.

  • public Point nearest(double x, double y): Returns the point in the set nearest to the given (x, y) coordinate.

The PointSet will be able to search through the list of Point objects stored, finding the point closest to a given coordinate. In HuskyMaps, we’ll be using this to find the roadway that is nearest to the user’s mouse-click location.

Point p1 = new Point(1.1, 2.2); // Point with x = 1.1, y = 2.2
Point p2 = new Point(3.3, 4.4);
Point p3 = new Point(-2.9, 4.2);

PointSet nn = new NaivePointSet(List.of(p1, p2, p3));
int x = 3.0, y = 4.0;           // Mouse-click at (3, 4)
Point ret = nn.nearest(x, y);   // ret == p2
ret.getX();                     // Evaluates to 3.3
ret.getY();                     // Evaluates to 4.4

NaivePointSet

Before you create your complex-but-fast KDTreePointSet class, you will first create a simple-but-slow linear-time implementation of the PointSet interface. The goal of creating this class is to provide an alternative, albeit slower, solution that you can use to easily verify if the result of your k-d tree’s nearest is correct.

Note that your NaivePointSet class should be immutable, meaning that it should be impossible to add or or remove points from it—even if the user modifies the constructor’s input list of points after after instantiating the PointSet. (Since the Point class is immutable, this is as simple as making a defensive, shallow copy of the list.)

NaivePointSet will only be tested for correctness, not efficiency.

For reference, our solution for this class is 32 lines total from top to bottom. You should not spend significantly more than half an hour to complete this part.

KDTreePointSet

Now, write the KDTreePointSet class with the same constructor and method signature as NaivePointSet. Your k-d tree should guarantee O(log N) access time on real-world data, i.e., you don’t need to worry about the absolute worst-possible case runtime.

As with NaivePointSet, your KDTreePointSet class should be immutable. You may assume that the input to the constructor has random ordering (i.e., is not sorted in any way). While k-d trees can theoretically work for any number of dimensions, your implementation will only work for the 2-dimensional case when our points have only x and y coordinates.

About a quarter of the points for this homework will be allocated towards evaluating the correctness of KDTreePointSet. The remainder (minus the points for NearestPointSet) will be allocated towards evaluating the efficiency of KDTreePointSet, but these tests will only provide credit if the implementation is fully correct.

Tips

For implementing nearest, we recommend that you write a simple version and verify that it works before adding in the various optimizations from the pseudocode. For example, you might start by writing a nearest method that recursively traverses the entire tree in Ω(N) time. Then, you might add code so that it always visits the closer side before checking the further side (changing traversal order without improving runtime). Then, after verifying that this works, you could finally implement the optimization of only traversing the farther side if necessary, bringing the runtime down to O(log N). (As a reminder, feel free to consult online resources for more thorough explanations about this pruning optimization.) By doing things this way, we can test and iterate on our program one idea one at a time instead of expecting everything to work immediately.

Commit frequently. Don’t be afraid to throw away your solution and start from a previous commit (or from scratch) if you feel that debugging is not getting anywhere. Restarting gives us a fresh perspective, but we get to keep all of our experiences with developing the program that we picked along the way so that our next try will be more robust. Our solution to the nearest method is about 30 lines of code, though it’s possible to efficiently solve the problem with less.

Testing

We will not provide any skeleton tests nor autograder messages (beyond a basic sanity check) for this homework. You will be responsible for writing your own tests and ensuring the correctness of your code. While you need to submit your tests to the autograder, they won’t be graded.

There are a number of different ways that you can construct tests for this part of the project, but we will go over our recommended approach: randomized testing.

Creating test cases that span all of the different KDTreePointSet edge cases will be significantly more difficult than with ArrayMinHeapPQ due to the complexity of the problem we are solving. To avoid thinking about all possible strange edge cases, we can turn towards techniques other than creating specific, deterministic tests to cover all of the possible errors.

Our suggestion is to use randomized testing which will allow you to test your code on a large sample of points. This should encompass most, if not all, of the edge cases you might run into. With randomized tests, you can generate a large number of random doubles to construct random points to add to your tree, as well as points to query nearest. To verify the correctness of the results, compare the results of KDTreePointSet.nearest to the results to NaivePointSet.nearest. If our implementation can pass a very large number of randomly-generated scenarios, we can have some confidence in the correctness of our data structure and its algorithms.

An issue is that randomized tests are not deterministic. This mean each time we run the tests, different points will be generated. This makes debugging nearly impossible as we won’t have any ability to rerun and closely analyze failed cases. However, randomness in computers is almost never truly random. Instead, computer scientists developed pseudorandom number generators (PRNGs) which are algorithms that generate random-looking sequences of numbers. Since PRNGs are algorithms, we can make them follow the same steps each time by seeding them with the same number. We suggest you use the Java Random class to generate random coordinate values.

int seed = 42; // Or any number you like...
Random random = new Random(seed);
double randomDouble = random.nextDouble();

Submission

Commit and push your changes to GitLab before submitting your homework to Gradescope.

  1. Josh Hug. 2019. Project 2AB: Extrinsic PQ and KDTree. In CS 61B: Data Structures, Spring 2019. https://sp19.datastructur.es/materials/proj/proj2ab/proj2ab