Sets, Maps, and BSTs

Collections
Search structures
Binary search trees

A variable’s data type (or simply type) determines its possible values and operations. An abstract data type (ADT) is a data type that does not specify any one implementation. Data structures, such as resizable arrays or linked nodes, implement ADTs.

Collections

In Java, abstract data types and their data structure implementations are provided by the Java Collections framework, which includes familiar interfaces like List as well as classes like ArrayList and LinkedList. Two other commonly used interfaces are the Set and Map interfaces.

Sets (java.util.Set<E>) represent an unordered collection of unique elements.

add(E element): Add the given element to the set.
contains(Object o): Return true if o is in the set.
remove(Object o): Remove o from the set.
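As a rough illustration of these three operations, the sketch below uses java.util.HashSet, one of the standard Set implementations. The class name SetDemo and the helper countDistinct are made up for this example.

```java
import java.util.HashSet;
import java.util.Set;

class SetDemo {
    // Counts distinct words by relying on the set's uniqueness guarantee.
    static int countDistinct(String[] words) {
        Set<String> seen = new HashSet<>();
        for (String word : words) {
            seen.add(word); // adding a duplicate leaves the set unchanged
        }
        return seen.size();
    }

    public static void main(String[] args) {
        Set<String> fruits = new HashSet<>();
        fruits.add("apple");
        fruits.add("banana");
        fruits.add("apple"); // duplicate, so the set still has two elements
        System.out.println(fruits.contains("apple")); // true
        fruits.remove("apple");
        System.out.println(fruits.contains("apple")); // false
        System.out.println(countDistinct(new String[] {"a", "b", "a"})); // 2
    }
}
```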

Maps (java.util.Map<K, V>) represent sets of key-value pairs. A key is a unique identifier used to access an item in a map. Depending on the context, “key” can be used interchangeably with “item” or “element”. The values associated with keys in a map are not necessarily unique.

put(K key, V value): Associate the value with the given key.
containsKey(Object key): Return true if the key is in the map.
get(Object key): Return the value corresponding to the given key.
remove(Object key): Remove key from this map.
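The map operations can be sketched in the same way. This example uses java.util.HashMap to count word occurrences; MapDemo and countWords are hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class MapDemo {
    // Maps each distinct word (the key) to the number of times it occurs (the value).
    static Map<String, Integer> countWords(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            if (counts.containsKey(word)) {
                counts.put(word, counts.get(word) + 1); // replace the old value
            } else {
                counts.put(word, 1);                    // first occurrence
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords(new String[] {"the", "cat", "the"});
        System.out.println(counts.get("the"));         // 2
        counts.remove("the");
        System.out.println(counts.containsKey("the")); // false
    }
}
```

Note that two different keys may share the same value, but putting an existing key again replaces its value.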

Search structures

Arrays enable fast access to items given a particular index. Linked nodes enable fast insertion or removal of items next to a given node, assuming that we already have a reference to that node. Both arrays and linked nodes are fundamentally linear data structures because they arrange their items in a single sequence.

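We know that binary search is a logarithmic-time algorithm for finding an item in a sorted array. Unfortunately, arrays are fixed in size, and keeping the array sorted means shifting every larger item over to make room for a new one. Adding a new item to a sorted array-based set therefore takes linear time in the worst case.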
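To see where the cost comes from, here is a minimal sketch of a sorted array-based set of ints. SortedArraySet is a made-up class for this example; it uses the standard Arrays.binarySearch to locate a key in logarithmic time, but still has to shift items (and occasionally copy into a larger array) when adding.

```java
import java.util.Arrays;

class SortedArraySet {
    private int[] items = new int[4]; // only items[0..size-1] are in use
    private int size = 0;

    // Logarithmic time: binary search over the sorted prefix.
    boolean contains(int x) {
        return Arrays.binarySearch(items, 0, size, x) >= 0;
    }

    // Worst-case linear time: every larger item shifts one slot to the right.
    boolean add(int x) {
        int i = Arrays.binarySearch(items, 0, size, x);
        if (i >= 0) {
            return false;                 // already present, nothing to do
        }
        int insertAt = -(i + 1);          // binarySearch encodes the insertion point
        if (size == items.length) {
            items = Arrays.copyOf(items, 2 * size); // array is full: copy to a larger one
        }
        System.arraycopy(items, insertAt, items, insertAt + 1, size - insertAt);
        items[insertAt] = x;
        size++;
        return true;
    }

    int[] toArray() {
        return Arrays.copyOf(items, size);
    }
}
```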

Goal: Develop a data structure that supports both log-time insertion and log-time search.

Why are linked nodes slow for implementing sets?

Given a particular item, we have no idea where it is in the linked nodes data structure. Furthermore, in order for binary search to work, we need to maintain the sorted order of items in the data structure. This slows down insertion to worst-case linear time.

One optimization is to change the entry point so that it’s more convenient for binary search. Instead of entering from the beginning of the linked nodes, keep a reference to the middle node. In order to access items in the left half of the data structure, flip the direction of the edges in that half.

However, binary search then needs to access the middle item of either the lower half or the upper half. We can modify the left and right edges to make longer hops directly to those middle items.
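One way to picture the result is to build the linked structure from a sorted array, always entering at the middle item and hopping to the middle of each half. The sketch below is only an illustration of that idea; the names MiddleFirst, Node, build, and height are assumptions, not part of these notes.

```java
class MiddleFirst {
    static class Node {
        int item;
        Node left, right;
        Node(int item) { this.item = item; }
    }

    // Links the items in sorted[lo..hi] so that the middle item is the entry point,
    // its left edge hops to the middle of the lower half, and its right edge hops
    // to the middle of the upper half.
    static Node build(int[] sorted, int lo, int hi) {
        if (lo > hi) {
            return null;
        }
        int mid = lo + (hi - lo) / 2;
        Node node = new Node(sorted[mid]);
        node.left = build(sorted, lo, mid - 1);
        node.right = build(sorted, mid + 1, hi);
        return node;
    }

    // Number of nodes on the longest path from this node down to a leaf.
    static int height(Node node) {
        if (node == null) {
            return 0;
        }
        return 1 + Math.max(height(node.left), height(node.right));
    }
}
```

For the sorted items 1 through 7, the entry point is 4, its left hop lands on 2, and its right hop lands on 6.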

Binary search trees

Tree Terminology

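class BST<Key> {
    Key key;            // the item stored in this node
    BST<Key> left;      // subtree of keys less than key
    BST<Key> right;     // subtree of keys greater than key

    BST(Key key) {
        this.key = key;
    }
}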

Binary tree: A tree where each node has 0, 1, or 2 children.
Binary search tree (BST): A binary tree with the added binary search invariant.

Ordering invariant

Binary search trees inherit the binary search invariant from ordered linked sets. In binary search, a decision is made at each node to go either left or right based on the current value. This only works because of the underlying order maintained in the structure of the tree. Formally, this ordering between items in a tree is a total order satisfying three mathematical properties.

Connex: also known as totality; for all v and w, (v ≤ w) or (w ≤ v) or both.
Antisymmetric: for all v and w, if (v ≤ w) and (w ≤ v), then v = w.
Transitive: for all v, w, and x, if (v ≤ w) and (w ≤ x), then v ≤ x.

In Java, data type implementers can implement the Comparable interface to define a default total ordering.
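As a small sketch of what this looks like, the hypothetical Version class below orders versions by major number and then by minor number. The class and its fields are invented for this example; only Comparable and Integer.compare come from the Java standard library.

```java
class Version implements Comparable<Version> {
    final int major;
    final int minor;

    Version(int major, int minor) {
        this.major = major;
        this.minor = minor;
    }

    // Negative if this < other, zero if equal, positive if this > other.
    @Override
    public int compareTo(Version other) {
        if (major != other.major) {
            return Integer.compare(major, other.major);
        }
        return Integer.compare(minor, other.minor);
    }
}
```

A fuller class would normally also override equals and hashCode so that they agree with compareTo.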

Since we will be implementing sets and maps, we simplify the total ordering rule to disallow duplicates.

Search

Searching for an item in a binary search tree involves making a decision at each node based on the binary search invariant.

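// Returns the node containing sk, or null if sk is not in the tree.
// Assumes compareTo returns 0 exactly when equals returns true.
static <Key extends Comparable<Key>> BST<Key> contains(BST<Key> T, Key sk) {
    if (T == null)
        return null;
    int cmp = sk.compareTo(T.key);
    if (cmp == 0)                      // sk equals T.key
        return T;
    else if (cmp < 0)                  // sk ≺ T.key
        return contains(T.left, sk);
    else                               // sk ≻ T.key
        return contains(T.right, sk);
}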

What is the runtime for search in a "bushy" BST with N nodes?

The analysis for search in a “bushy” BST is similar to binary search in a sorted array. In the best case, the search key happens to be the root item, which takes constant time. In the worst case, the search key is not in the BST and the search follows a path from the root all the way down to a leaf, so the runtime depends on the height of the tree. If the height of a bushy BST is proportional to log N, then the worst-case runtime for search is also proportional to log N.
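To put numbers on this, the sketch below counts the levels in the bushiest possible binary tree with N nodes, which is floor(log2 N) + 1. A worst-case search visits at most one node per level. BushyHeight and levels are illustrative names only.

```java
class BushyHeight {
    // Number of levels in the shortest binary tree that can hold n nodes.
    // Each level can hold twice as many nodes as the one above it, so the
    // count grows by one every time n doubles.
    static int levels(int n) {
        int levels = 0;
        while (n > 0) {
            n /= 2;
            levels++;
        }
        return levels;
    }

    public static void main(String[] args) {
        System.out.println(levels(7));         // 3
        System.out.println(levels(1_000_000)); // 20
    }
}
```

So a bushy BST holding a million keys needs at most about 20 comparisons per search.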

Insert

Insertion into a binary search tree must preserve the binary search invariant. To maintain uniqueness of items in the set, we first search for the key in the tree. If it already exists in the BST, then we don’t need to do anything. Otherwise, we add it as a leaf node in the place where we would expect to find it given the current arrangement of the tree.

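// Returns the root of the tree after inserting ik; a duplicate key leaves the tree unchanged.
static <Key extends Comparable<Key>> BST<Key> add(BST<Key> T, Key ik) {
    if (T == null)
        return new BST<>(ik);
    int cmp = ik.compareTo(T.key);
    if (cmp < 0)                       // ik ≺ T.key
        T.left = add(T.left, ik);
    else if (cmp > 0)                  // ik ≻ T.key
        T.right = add(T.right, ik);
    return T;
}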
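Putting search and insert together, a set backed by a binary search tree might look roughly like the sketch below. This is one possible arrangement, not a definitive implementation: BSTSet, its nested Node class, the size counter, and inOrder are all assumptions added for the example, and search is written as a loop rather than recursively.

```java
import java.util.ArrayList;
import java.util.List;

class BSTSet<K extends Comparable<K>> {
    private class Node {
        K key;
        Node left, right;
        Node(K key) { this.key = key; }
    }

    private Node root;
    private int size;

    // Walks down from the root, going left or right at each node.
    boolean contains(K sk) {
        Node node = root;
        while (node != null) {
            int cmp = sk.compareTo(node.key);
            if (cmp == 0) {
                return true;
            }
            node = (cmp < 0) ? node.left : node.right;
        }
        return false;
    }

    void add(K ik) {
        root = add(root, ik);
    }

    // Inserts ik as a new leaf where a search for it would have ended.
    private Node add(Node node, K ik) {
        if (node == null) {
            size++;
            return new Node(ik);
        }
        int cmp = ik.compareTo(node.key);
        if (cmp < 0) {
            node.left = add(node.left, ik);
        } else if (cmp > 0) {
            node.right = add(node.right, ik);
        } // cmp == 0: key already present, so nothing changes
        return node;
    }

    int size() {
        return size;
    }

    // Left subtree, then node, then right subtree yields the keys in sorted order.
    List<K> inOrder() {
        List<K> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private void collect(Node node, List<K> out) {
        if (node == null) {
            return;
        }
        collect(node.left, out);
        out.add(node.key);
        collect(node.right, out);
    }
}
```

Adding 5, 2, 8, 2, 9, 1 in that order produces a set of five keys, and inOrder returns them as 1, 2, 5, 8, 9.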

Remove

Our goal is to preserve the binary search invariant while making as few changes to the overall structure as possible to keep the algorithm simple. There are three cases for removing a key from a tree. (Assume that the key exists in the tree.)

Key has no children. Remove the edge from its parent.
Key has one child. Reassign the edge from its parent to the removed node’s only child.
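Key has two children. Use the Hibbard deletion procedure: replace the node with either its immediate predecessor (the largest key in its left subtree) or its immediate successor (the smallest key in its right subtree), then remove that predecessor or successor from its original position. Since the predecessor or successor has at most one child, this second removal falls under one of the first two cases.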
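The three cases above can be sketched as follows, using int keys to keep the code short. This version makes a few choices the notes leave open: it always picks the successor, it copies the successor's key into the node rather than relinking nodes, and it returns the tree unchanged if the key is absent. BSTRemove, Node, and inOrder are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class BSTRemove {
    static class Node {
        int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    static Node add(Node t, int k) {
        if (t == null) return new Node(k);
        if (k < t.key) t.left = add(t.left, k);
        else if (k > t.key) t.right = add(t.right, k);
        return t;
    }

    // Returns the root of the tree after removing k.
    static Node remove(Node t, int k) {
        if (t == null) return null;               // k is not in the tree
        if (k < t.key) {
            t.left = remove(t.left, k);
        } else if (k > t.key) {
            t.right = remove(t.right, k);
        } else {
            // No children, or only a right child: the parent's edge now points
            // at the right child (which is null when there are no children).
            if (t.left == null) return t.right;
            // Only a left child.
            if (t.right == null) return t.left;
            // Two children: find the successor, the smallest key on the right.
            Node successor = t.right;
            while (successor.left != null) {
                successor = successor.left;
            }
            t.key = successor.key;                    // successor takes this node's place
            t.right = remove(t.right, successor.key); // then leaves its old position
        }
        return t;
    }

    static List<Integer> inOrder(Node t) {
        List<Integer> out = new ArrayList<>();
        collect(t, out);
        return out;
    }

    private static void collect(Node t, List<Integer> out) {
        if (t == null) return;
        collect(t.left, out);
        out.add(t.key);
        collect(t.right, out);
    }
}
```

For a tree built by adding 5, 2, 8, 1, 3, 7, 9, removing 5 promotes its successor 7 to the root, and the in-order traversal stays sorted.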