k-d Tree

Composing and testing a complex data structure from scratch, again.1

Learning Goals

  • Compose algorithms via iterative improvement.

    Algorithms can be tricky to implement. One way we can manage the complexity of an algorithm is to develop it in stages, starting from something simple that works and then progressing to more and more complex optimizations while ensuring the program works at every step along the way. This way, we can maintain correctness as we add complexity to our programs.

  • Encapsulate complex behavior beneath a simple interface.

    The best abstractions hide complex behavior beneath a simple interface. Programs often gradually grow more and more complex, so a key challenge of implementing complex data structures is in managing this complexity while maintaining the illusion of a simple interface.

Table of contents

  1. Overview
  2. Getting the Assignment
  3. Point
  4. PointSet API
    1. NaivePointSet
    2. KDTreePointSet
    3. Testing
  5. Submission

Overview

After spending a lecture on k-d trees, it is now time for you to implement your own! We’ll be using this later in the assignment for HuskyMaps, our map server. To get you started, we’ve provided a Point class with some helpful methods to calculate distances. Using this class, you’ll create two implementations of the PointSet nearest-neighbor-finding interface: a basic one using a naive linear search, and a faster-but-more-complex k-d tree implementation.

Getting the Assignment

Task
Pull the skeleton repository in IntelliJ to get the kdtree assignment.

To get the skeleton (starter) code for the homework, open your CSE 373 IntelliJ project. In the menu bar, open the VCS menu, hover over the Git dropdown, and choose the Pull… menu item to pull the skeleton repository and get the assignment.

Before you begin the assignment, you might find the following visual resources helpful:

If you end up incorporating any resources into your own code, be sure to cite them!

Unsure where to start?

Read over Point and begin implementing NaivePointSet per the spec.

Point

Background
Read through the point class you’ll be working with.

We provide a class, Point, that represents a location with an x and y coordinate. You may not modify this class. The key methods in this class are:

  • double x(): Returns the x coordinate value.
  • double y(): Returns the y coordinate value.
  • double distanceSquaredTo(Point p): Returns the squared Euclidean distance (L2 norm) between this point and p.
  • double distanceSquaredTo(double px, double py): Returns the squared Euclidean distance (L2 norm) between this point and (px, py).
Note
The distance methods return squared distances. Make sure you don’t try to compare those squared distances to un-squared distances!
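For instance, comparing a squared distance against an un-squared radius silently gives wrong answers. Here is a minimal illustration, using a stand-in Point record since the provided class isn't reproduced here (its actual fields and internals may differ):

```java
// Stand-in for the provided Point class (assumed API, for illustration only).
record Point(double x, double y) {
    double distanceSquaredTo(double px, double py) {
        double dx = x - px, dy = y - py;
        return dx * dx + dy * dy;
    }
}

public class SquaredDistanceDemo {
    public static void main(String[] args) {
        Point p = new Point(0.0, 0.0);
        double distSq = p.distanceSquaredTo(3.0, 4.0); // 25.0 (squared!), not 5.0
        boolean wrong = distSq <= 5.0;        // squared vs. un-squared: false!
        boolean right = distSq <= 5.0 * 5.0;  // square the threshold first: true
        System.out.println(distSq + " " + wrong + " " + right);
    }
}
```

Since squaring is monotonic for non-negative values, comparing squared distances to each other is always safe; only mixing squared and un-squared values causes trouble.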

PointSet API

Background
Read through the interface you’ll be implementing.

The PointSet<T extends Point> interface represents a set of such points. The generic type parameter T represents the exact type of the points in the set; extends Point indicates that the class only supports types that extend the functionality of Point. (The only class that matches this description in this assignment is Point itself, but this parameter will become useful later in the HuskyMaps assignment.)

  • default T nearest(double x, double y): Returns the point in this set nearest to (x, y).
  • T nearest(Point target): Returns the point in this set nearest to target.
  • List<T> allPoints(): Returns a list of all points in this set.

Remarks

  • The bulk of the functionality of PointSet lies in its nearest methods, which search through the list of Point objects stored to find the point closest to a given coordinate. In HuskyMaps, we’ll be using these to find the roadway that is nearest to the user’s mouse-click location.
    Point p1 = new Point(1.1, 2.2); // Point with x = 1.1, y = 2.2
    Point p2 = new Point(3.3, 4.4);
    Point p3 = new Point(-2.9, 4.2);
    
    PointSet<Point> nn = new NaivePointSet<>(List.of(p1, p2, p3));
    double x = 3.0, y = 4.0;        // Mouse-click at (3, 4)
    Point ret = nn.nearest(x, y);   // ret == p2
    ret.x();                        // Evaluates to 3.3
    ret.y();                        // Evaluates to 4.4
    
  • Since the interface includes a default definition of the first nearest method that uses the second, you’ll only need to implement the second version.
  • The allPoints method is primarily used by the grader to format test feedback. Other than that, it’s not very useful.
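To see how a default method can delegate like this, here is one plausible shape for such an interface. This is a sketch with a stand-in Point record, not the actual skeleton code, which may differ in details:

```java
import java.util.List;

// Stand-in for the provided Point class (assumed API).
record Point(double x, double y) {}

// Sketch: the default nearest(double, double) wraps the coordinates in a
// Point and delegates to the abstract nearest(Point).
interface PointSet<T extends Point> {
    T nearest(Point target);                 // the version you implement

    default T nearest(double x, double y) {  // provided "for free" by delegation
        return nearest(new Point(x, y));
    }

    List<T> allPoints();
}

// Trivial implementation, just to exercise the default method.
class SinglePointSet implements PointSet<Point> {
    private final Point only;
    SinglePointSet(Point only) { this.only = only; }
    public Point nearest(Point target) { return only; }
    public List<Point> allPoints() { return List.of(only); }
}

public class DefaultMethodDemo {
    public static void main(String[] args) {
        PointSet<Point> ps = new SinglePointSet(new Point(1.0, 2.0));
        System.out.println(ps.nearest(0.0, 0.0)); // routes through nearest(Point)
    }
}
```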

NaivePointSet

Task
Write a basic implementation of PointSet.

Before you create your faster KDTreePointSet class, first create a simple-but-slow linear-time implementation. The goal here is to provide an alternative, albeit slower, solution that you can use to easily verify results from your k-d tree’s nearest method.

NaivePointSet will only be tested for correctness, not efficiency.

For reference, our solution for this class is 32 lines total from top to bottom. You should not spend significantly more than half an hour to complete this part.
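A linear scan over all points, tracking the smallest squared distance seen so far, is enough. Here is a rough sketch with a stand-in Point record; your real class works against the provided Point and implements the full PointSet interface:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the provided Point class (assumed API).
record Point(double x, double y) {
    double distanceSquaredTo(Point p) {
        double dx = x - p.x(), dy = y - p.y();
        return dx * dx + dy * dy;
    }
}

// Sketch of a linear-scan nearest-neighbor set.
class NaiveScan {
    private final List<Point> points;

    NaiveScan(List<Point> points) {
        this.points = new ArrayList<>(points); // defensive copy
    }

    Point nearest(Point target) {
        Point best = null;
        double bestDistSq = Double.POSITIVE_INFINITY;
        for (Point p : points) {               // O(N): check every point
            double d = p.distanceSquaredTo(target);
            if (d < bestDistSq) {
                bestDistSq = d;
                best = p;
            }
        }
        return best;
    }
}

public class NaiveScanDemo {
    public static void main(String[] args) {
        NaiveScan nn = new NaiveScan(List.of(
                new Point(1.1, 2.2), new Point(3.3, 4.4), new Point(-2.9, 4.2)));
        System.out.println(nn.nearest(new Point(3.0, 4.0))); // Point[x=3.3, y=4.4]
    }
}
```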

KDTreePointSet

Task
Write a k-d tree implementation of PointSet.

Now, implement the KDTreePointSet class. Your nearest methods should achieve O(log N) runtime on real-world data (i.e., you don’t need to worry about the absolute worst-case runtime).

Your constructor may assume that its input has random ordering (i.e., is not sorted in any way). In particular, do not randomly shuffle the given points, since that makes it difficult to test your class reliably. Instead, we’ve included a static factory method createAfterShuffling to do the shuffling before calling your constructor—you can call this to defend against sorted data, while still maintaining testability by writing tests that call the constructor directly.

While k-d trees can theoretically work for any number of dimensions, your implementation will only work for the 2-dimensional case where our points have only x and y coordinates.
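In the 2-dimensional case, this means alternating between comparing x and comparing y at each level of the tree. A construction sketch is below; the node layout and method names are assumptions for illustration, not the skeleton's:

```java
// Sketch of 2-d tree insertion that alternates split dimension by depth.
// Node layout is an assumption; the skeleton may structure this differently.
class KdNode {
    final double x, y;
    KdNode left, right;
    KdNode(double x, double y) { this.x = x; this.y = y; }
}

class KdInsertSketch {
    KdNode root;

    void insert(double x, double y) {
        root = insert(root, x, y, 0);
    }

    // Even depth: partition on x. Odd depth: partition on y.
    private KdNode insert(KdNode node, double x, double y, int depth) {
        if (node == null) return new KdNode(x, y);
        double cmp = (depth % 2 == 0) ? x - node.x : y - node.y;
        if (cmp < 0) {
            node.left = insert(node.left, x, y, depth + 1);
        } else {
            node.right = insert(node.right, x, y, depth + 1); // ties go right
        }
        return node;
    }
}

public class KdInsertDemo {
    public static void main(String[] args) {
        KdInsertSketch tree = new KdInsertSketch();
        tree.insert(2, 3);  // root: splits on x
        tree.insert(1, 5);  // x = 1 < 2, goes left of root
        tree.insert(4, 2);  // x = 4 >= 2, goes right of root
        System.out.println(tree.root.left.x + " " + tree.root.right.x); // 1.0 4.0
    }
}
```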

About a quarter of the points for this homework will be allocated towards evaluating the correctness of KDTreePointSet. The remainder (minus the points for NaivePointSet) will be allocated towards evaluating the efficiency of KDTreePointSet, but these tests will only provide credit if the implementation is fully correct.

Tips

  • For implementing nearest, we recommend that you write a simple version and verify that it works before adding in the various optimizations from the pseudocode. For example, you might start by writing a nearest method that recursively traverses the entire tree in Ω(N) time. Then, you might add code so that it always visits the closer side before checking the farther side (changing traversal order without improving runtime). Then, after verifying that this works, you could finally implement the optimization of only traversing the farther side when necessary, bringing the runtime down to O(log N). (As a reminder, feel free to consult online resources for more thorough explanations of this pruning optimization.) By doing things this way, we can test and iterate on our program one idea at a time instead of expecting everything to work immediately.
  • Commit frequently. Don’t be afraid to throw away your solution and start from a previous commit (or from scratch) if you feel that debugging is not getting anywhere. Restarting gives us a fresh perspective while keeping all of the experience we picked up along the way, so our next attempt will be more robust. Our solution to the nearest method is about 30 lines of code, though it’s possible to efficiently solve the problem with fewer.
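The staged approach described above might eventually land on a shape like the following for the pruned search. This is a sketch with assumed node fields and helper names, not the reference solution:

```java
// Sketch of pruned nearest-neighbor search on a 2-d tree; node layout and
// helper names are assumptions, not the reference solution.
class Node {
    final double x, y;
    Node left, right;
    Node(double x, double y) { this.x = x; this.y = y; }
}

public class NearestSketch {
    static Node nearest(Node node, double qx, double qy, Node best, int depth) {
        if (node == null) return best;
        if (best == null || distSq(node, qx, qy) < distSq(best, qx, qy)) {
            best = node;    // this node is the closest seen so far
        }
        // Signed distance from the query to this node's splitting line.
        double diff = (depth % 2 == 0) ? qx - node.x : qy - node.y;
        Node closer  = (diff < 0) ? node.left  : node.right;
        Node farther = (diff < 0) ? node.right : node.left;
        best = nearest(closer, qx, qy, best, depth + 1);   // good side first
        // Prune: only search the far side if the splitting line is closer
        // to the query than the best point found so far.
        if (diff * diff < distSq(best, qx, qy)) {
            best = nearest(farther, qx, qy, best, depth + 1);
        }
        return best;
    }

    static double distSq(Node n, double qx, double qy) {
        double dx = n.x - qx, dy = n.y - qy;
        return dx * dx + dy * dy;
    }

    public static void main(String[] args) {
        Node root = new Node(2, 3);    // splits on x
        root.left = new Node(1, 5);    // splits on y
        root.right = new Node(4, 2);
        Node result = nearest(root, 3, 4, null, 0);
        System.out.println(result.x + " " + result.y); // 2.0 3.0
    }
}
```

Note how the simple Ω(N) version is just this method with the pruning `if` removed: the recursion then unconditionally visits both sides.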

Testing

Recommended
Write your own tests. They’ll probably need to be randomized.

We only provide a few tests for this homework, and most of the scoring done by the autograder will be based on large randomized tests that don’t provide meaningful feedback.

Creating a test suite that spans all possible k-d tree edge cases is significantly more difficult than previous assignments due to the complexity of the problem we are solving—there’s a whole extra dimension to deal with. Since it’s tedious (or even infeasible) to enumerate all edge cases, we instead turn to randomized testing, a lazier style of testing than unit testing, and we recommend that you do the same if you decide to write additional tests.

To verify the correctness of our k-d tree implementation, we can generate a large number of random doubles to construct random points for our point sets, as well as target points to query nearest. Then, we can simply compare the results of KDTreePointSet.nearest to the results of NaivePointSet.nearest (which is simpler and thus easier to unit test normally).

Of course, this method of testing is not useful for debugging: if you discover any bugs this way, you’ll likely have a difficult time reproducing them in a smaller unit test, since your test case will be both large and randomly changing on every execution. To address the second issue, you should seed your random number generator to produce fixed test data. We suggest you use the Java Random class to generate random coordinate values:

int seed = 42; // Or any number you like
Random random = new Random(seed);
double x = random.nextDouble(); // returns 0.727564...
double y = random.nextDouble(); // returns 0.683223...

However, this won’t change the fact that your test case may be too large to effectively use the debugger. Often, it may just be easiest to take a break and come back later to re-read your code with a fresh mindset.
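Putting these pieces together, a seeded differential test might follow the shape below. It is sketched with two simple reference implementations of nearest so it stands alone; in your suite, one side would be NaivePointSet and the other KDTreePointSet:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Stand-in for the provided Point class (assumed API).
record Point(double x, double y) {
    double distanceSquaredTo(double px, double py) {
        double dx = x - px, dy = y - py;
        return dx * dx + dy * dy;
    }
}

public class RandomizedNearestTest {
    // Reference implementation #1: explicit linear scan.
    static Point nearestByScan(List<Point> pts, double x, double y) {
        Point best = pts.get(0);
        for (Point p : pts) {
            if (p.distanceSquaredTo(x, y) < best.distanceSquaredTo(x, y)) best = p;
        }
        return best;
    }

    // Reference implementation #2: stream min. In your test, this role is
    // played by KDTreePointSet.nearest instead.
    static Point nearestByMin(List<Point> pts, double x, double y) {
        return pts.stream()
                .min(Comparator.comparingDouble(p -> p.distanceSquaredTo(x, y)))
                .orElseThrow();
    }

    public static void main(String[] args) {
        Random random = new Random(42);   // fixed seed => reproducible failures
        List<Point> pts = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            pts.add(new Point(random.nextDouble(), random.nextDouble()));
        }
        for (int i = 0; i < 100; i++) {   // many random query points
            double x = random.nextDouble(), y = random.nextDouble();
            Point a = nearestByScan(pts, x, y);
            Point b = nearestByMin(pts, x, y);
            if (!a.equals(b)) {
                throw new AssertionError("mismatch at query (" + x + ", " + y + ")");
            }
        }
        System.out.println("all queries agree");
    }
}
```

Because the seed is fixed, a failing query point is the same on every run, which is what makes shrinking it into a small hand-written unit test feasible.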

Submission

Task
Commit and push your changes to GitLab before submitting your homework to Gradescope.
  1. Josh Hug. 2019. Project 2AB: Extrinsic PQ and KDTree. In CS 61B: Data Structures, Spring 2019. https://sp19.datastructur.es/materials/proj/proj2ab/proj2ab