
Multi-Dimensional Data

Table of contents

  1. 2-d range search
  2. K-d trees
  3. Nearest neighbor search

In the Autocomplete homework, we implemented the range search operation on strings to find all strings that start with a given prefix. Autocomplete is an example of a 1-dimensional range search because each string can be ordered according to lexicographic (dictionary) order and stored in a sorted array.

What is the runtime for 1-d range search as a function of N (the number of keys) and M (the number of matching keys)?

O(M + log N). We need to run two binary searches to get the first index and the last index, each taking O(log N) time. Collecting the M matching keys can be done in linear time by copying all of the items from the first index to the last index in the sorted array.
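
As a concrete sketch, here's how 1-d range search on a sorted list might look in Python using the standard library's bisect module (the names keys, lo, and hi are our own, not from the homework):

from bisect import bisect_left, bisect_right

def range_search(keys, lo, hi):
    # Two binary searches find the boundary indices in O(log N) time each.
    first = bisect_left(keys, lo)
    last = bisect_right(keys, hi)
    # Copying out the M matching keys takes O(M) time.
    return keys[first:last]

range_search(["app", "apple", "apply", "bear"], "app", "apply")
# ['app', 'apple', 'apply']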

What is the runtime for inserting a new key into the sorted array?

O(N). Even if we binary search to the correct insertion point in O(log N) time, we still need to shift all of the following items over by one index to make space for the new key.
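
In code, even the standard library's bisect.insort reflects this cost: the binary search is fast, but the list insertion is not.

from bisect import insort

keys = ["app", "bear", "cat"]
# insort finds the insertion point in O(log N) time, but inserting into
# the list still shifts every later item over, so the total cost is O(N).
insort(keys, "apple")  # keys is now ["app", "apple", "bear", "cat"]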

While sorted arrays are fast for range search queries, they are slow at inserting new keys. One way we can improve insertion runtime is by switching to search trees. 1-d range search on a binary search tree is similar to the sorted array algorithm but combines searching for keys with collecting keys.

  1. Recursively find all keys in left subtree, if any could fall in the range.
  2. Check if the key in the current node matches. If so, add it to the result.
  3. Recursively find all keys in right subtree, if any could fall in the range.

If we use a left-leaning red-black tree, this algorithm runs in O(M + log N) time. The real improvement, however, is insertion, which now takes just O(log N) time.
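
Here's a minimal Python sketch of this range search; the Node class with key, left, and right fields is our own assumption rather than code from this course.

class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def range_search(n, lo, hi, result):
    if n is None:
        return result
    # 1. Only descend left if a key in [lo, hi] could be there.
    if lo < n.key:
        range_search(n.left, lo, hi, result)
    # 2. Check whether the current node's key falls in the range.
    if lo <= n.key <= hi:
        result.append(n.key)
    # 3. Only descend right if a key in [lo, hi] could be there.
    if n.key < hi:
        range_search(n.right, lo, hi, result)
    return result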

The key to this algorithm is the recursive decision-making done at every node. Earlier, we learned that the binary search invariant introduces a lower and an upper bound on the possible values at any given node. This invariant is exactly what enables us to implement 1-d range search.

It turns out, however, that 2-d data like locations on Earth cannot be captured by this lower and upper bounding approach: no single ordering of points respects both dimensions at once. The University of Washington, for example, is located at (47.655 degrees north, 122.308 degrees west). A 2-d range search query could ask for all of the points in a rectangular region surrounding the university: one mile north, east, south, and west of the quad.

Why is this problem hard to solve using the data structures we’ve already learned? Consider how we might use the same binary search tree approach to collect all of the data points less than x = -1.5 from the following dataset.

Storing items in a binary search tree according to either the x-coordinate or the y-coordinate optimizes 2-d range search runtime when the query orientation matches the tree representation. However, we lose the runtime benefit when the tree is not in the same orientation as the range search query.

The fundamental problem with 2-d data lies in the recursive decision-making process. At each node of a binary search tree, we partition the space of all data points left-right, pruning a fraction of the points each time. Translating this partitioning strategy into 2-d space requires two kinds of decisions: not only left-right but also up-down.

Figure: 1-d vs. 2-d Partitioning

K-d trees

Several data structures have been invented to solve the 2-d range search problem, including range trees, quadtrees, and k-d trees. In this class, we’ll focus on the 2-d special case of k-d trees (k=2, or 2-d trees). 2-d trees solve this problem by using a binary search tree data structure but cycling through the recursive decisions at each level: even-depth nodes partition left-right while odd-depth nodes partition up-down.

Figure: 1-d data vs. 2-d tree

In the following diagram, the root (A, depth 0) partitions the entire space left-right. The left child of the root (E, depth 1) partitions the subspace to the left of the root up-down. The right child of the root (B, depth 1) partitions the right side up-down. All of the remaining, grayed-out items above the node labeled B are therefore also to the right of the root.

Figure: Simple 2-d tree
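
A minimal sketch of how such a 2-d tree could be built, with points as (x, y) tuples and the comparison dimension cycling with depth (the Node class is again our own assumption):

class Node:
    def __init__(self, point):
        self.point = point   # an (x, y) tuple
        self.left = None     # the "less" side: left or down
        self.right = None    # the "greater" side: right or up

def insert(n, point, depth=0):
    if n is None:
        return Node(point)
    # Even depths partition left-right on x; odd depths partition up-down on y.
    d = depth % 2
    if point[d] < n.point[d]:
        n.left = insert(n.left, point, depth + 1)
    else:
        n.right = insert(n.right, point, depth + 1)
    return n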

While k-d trees can be used to answer k-d range search queries, we’ll focus primarily on using k-d trees to solve a different problem known as nearest neighbor search: finding the nearest data point to an arbitrary search location that is not necessarily in the dataset. In HuskyMaps, the user can get navigation directions between any two arbitrary locations on the map, even starting or ending at places that aren’t roads. Somehow, our app is able to find the nearest roadway to start and end the navigation directions.

Why can't we use the set or map abstract data types to solve nearest neighbor search?

The search location might not be in the set or map at all, yet we still need to return the “nearest” data point for some definition of “nearest”. Since sets and maps only answer exact-match queries, the best we can do is iterate over all N items and keep track of the nearest seen so far.

Note
For simplicity, we’ll always assume 2-d Euclidean geometry. In other words, all points are on a perfectly flat x-y plane so that the distance between two points is given by the Euclidean (straight-line) distance. The data point with the smallest Euclidean distance to the search location is the nearest neighbor.
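
Under these assumptions, the brute-force scan described above is a one-liner in Python (math.dist computes Euclidean distance; the function name is our own):

from math import dist

def nearest_brute_force(points, goal):
    # Scan all N points, tracking the nearest seen so far: Theta(N) time.
    return min(points, key=lambda p: dist(p, goal))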

We can solve the nearest neighbor search problem using k-d trees by keeping track of the nearest neighbor as we explore the tree.

def nearest(Node n, Point goal, Node best):
    # An empty subtree cannot improve on the best found so far.
    if n is null:
        return best
    # Update best if the current node is closer to the goal.
    if n.distance(goal) < best.distance(goal):
        best = n
    # Recursively search both subtrees, keeping the better result.
    best = nearest(n.leftChild, goal, best)
    best = nearest(n.rightChild, goal, best)
    return best
What is the runtime for the above algorithm in terms of N, the number of data points?

Theta(N) since the two recursive calls explore the entire k-d tree.

We can optimize nearest neighbor search in most real-world scenarios by exploring the “good side” (closer to the goal) before the “bad side”. Instead of always exploring both the left and right children, we can selectively explore the “bad side” only when it could have a better nearest neighbor.

def nearest(Node n, Point goal, Node best):
    if n is null:
        return best
    if n.distance(goal) < best.distance(goal):
        best = n
    # Compare goal to n along n's splitting dimension (x at even
    # depths, y at odd depths) to pick the closer, "good" side.
    if goal < n:
        goodSide = n.leftChild
        badSide = n.rightChild
    else:
        goodSide = n.rightChild
        badSide = n.leftChild
    # Explore the good side first so that best shrinks early.
    best = nearest(goodSide, goal, best)
    # Prune: skip the bad side unless the splitting line through n
    # is closer to the goal than the best distance found so far.
    if badSide could have a better nearest neighbor:
        best = nearest(badSide, goal, best)
    return best
What is the worst-case runtime for the optimized algorithm in terms of N, the number of data points?

Theta(N) since we could still need to consider all of the points in the k-d tree: for example, if every point lies at nearly the same distance from the goal, the pruning check never rules out a subtree.
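
To make the pruning condition concrete, here is a runnable Python sketch of the optimized algorithm on a 2-d tree. It mirrors the pseudocode above, but the pruning test itself, comparing the best distance against the distance from the goal to the splitting line through n, is our reading of “badSide could have a better nearest neighbor,” not code provided by this course.

from math import dist

class Node:
    def __init__(self, point, left=None, right=None):
        self.point = point  # an (x, y) tuple
        self.left = left
        self.right = right

def nearest(n, goal, best, depth=0):
    if n is None:
        return best
    if dist(n.point, goal) < dist(best.point, goal):
        best = n
    d = depth % 2  # splitting dimension: x at even depths, y at odd
    if goal[d] < n.point[d]:
        goodSide, badSide = n.left, n.right
    else:
        goodSide, badSide = n.right, n.left
    best = nearest(goodSide, goal, best, depth + 1)
    # The bad side can only hold a closer point if the splitting line
    # through n is closer to the goal than the best distance so far.
    if abs(goal[d] - n.point[d]) < dist(best.point, goal):
        best = nearest(badSide, goal, best, depth + 1)
    return best

# Usage: a tiny hand-built tree whose root partitions left-right at x = 2.
root = Node((2, 3), left=Node((1, 5)), right=Node((4, 2)))
print(nearest(root, (5, 1), best=root).point)  # (4, 2)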