Disjoint Sets Reading

The disjoint sets abstract data type supports two key operations: union and find. When we discussed Kruskal’s algorithm, we discovered that we needed an ADT to represent components and the connections between them. The runtime of the implementation we choose for this ADT will determine the runtime of Kruskal’s algorithm.

public interface DisjointSets {
  /** Connects two items P and Q. */
  void union(char p, char q);

  /** Returns a unique id representing P's component. Can be used to
   *  determine if two items P and Q are in the same component. */
  int find(char p);
}

In the interface above, each character can be seen as a vertex in a graph, and each set as a connected component. In Kruskal’s algorithm, we consider each edge in order of increasing edge weight. Before we can add an edge (p, q) to the MST, we first need to check that the two vertices (p and q) are not already connected (a find on each); if they aren’t, we need to connect them (a union). As a result, both union and find matter in our runtime analysis, so we need to make sure they’re fast. Let’s set our goal to be logarithmic time for each operation.
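To make the check-then-connect step concrete, here is a minimal sketch of the Kruskal’s loop written against the interface. The KruskalSketch class, its Edge type, and the mst method are hypothetical names introduced only for illustration, and the interface is repeated so the sketch compiles on its own. It assumes the DisjointSets passed in starts with every vertex in its own set.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Copy of the interface above so this sketch compiles on its own. */
interface DisjointSets {
    void union(char p, char q);
    int find(char p);
}

/** Hypothetical helper class; not part of the reading. */
class KruskalSketch {
    /** A weighted, undirected edge between vertices p and q. */
    static class Edge {
        final char p;
        final char q;
        final double weight;

        Edge(char p, char q, double weight) {
            this.p = p;
            this.q = q;
            this.weight = weight;
        }
    }

    /** Returns the edges kept by Kruskal's algorithm.
     *  Assumes ds starts with every vertex in its own set. */
    static List<Edge> mst(List<Edge> edges, DisjointSets ds) {
        // Consider edges in order of increasing weight.
        List<Edge> sorted = new ArrayList<>(edges);
        sorted.sort(Comparator.comparingDouble(e -> e.weight));

        List<Edge> result = new ArrayList<>();
        for (Edge e : sorted) {
            // Two finds: are p and q already in the same component?
            if (ds.find(e.p) != ds.find(e.q)) {
                // One union: connect the two components and keep the edge.
                ds.union(e.p, e.q);
                result.add(e);
            }
        }
        return result;
    }
}
```

Every edge costs two find calls and at most one union call, which is why the speed of both operations shows up directly in the runtime of the loop.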

List of Sets

One slow but correct implementation is to use a List<Set<Character>>. This is the most literal implementation of disjoint sets: each set in the list directly and literally represents the set in the ADT.

For instance, if we have N = 7 elements and nothing has been connected yet, our list of sets looks like {A}, {B}, {C}, {D}, {E}, {F}, {G}. After a few union operations, it might look like {A, B, C, E}, {D, F}, {G}.

What is the runtime for find in terms of N, the total number of elements?

O(N). We don’t necessarily know which set p is in, so we may need to check all of the sets. If only a few sets have been connected and p is in a set at the end of the list, then it’ll take order of N time to iterate over all the ~N sets.

What is the runtime for union in terms of N, the total number of elements?

O(N), for the same reason as find. In order to connect the items in two sets, we first need to know what’s in each set. But the only way to know what’s in each set is to find the set containing p and the set containing q in the list.

Data structures are only fast when we expect to know exactly or approximately where an item will be in the data structure. The List<Set<Character>> representation is slow because we don’t know where each p and q will be in the data structure.
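A minimal sketch of the list-of-sets idea might look like the following. The class name ListOfSetsDS and its constructor are assumptions made for illustration, and the sketch uses the position of a set in the list as its id, which is one possible choice rather than anything the reading prescribes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical list-of-sets implementation with the same methods as DisjointSets. */
class ListOfSetsDS {
    private final List<Set<Character>> sets = new ArrayList<>();

    /** Starts with every item in its own singleton set. */
    ListOfSetsDS(char[] items) {
        for (char c : items) {
            Set<Character> single = new HashSet<>();
            single.add(c);
            sets.add(single);
        }
    }

    /** Returns the list position of the set containing p. Positions can shift
     *  after a union, so only compare ids obtained between the same unions. */
    public int find(char p) {
        // We do not know where p lives, so we may scan every set: O(N).
        for (int i = 0; i < sets.size(); i++) {
            if (sets.get(i).contains(p)) {
                return i;
            }
        }
        throw new IllegalArgumentException("unknown item: " + p);
    }

    /** Merges the set containing q into the set containing p. */
    public void union(char p, char q) {
        int i = find(p);  // O(N) scan
        int j = find(q);  // O(N) scan
        if (i == j) {
            return;       // already connected
        }
        sets.get(i).addAll(sets.get(j));
        sets.remove(j);   // removes by index, since j is an int
    }
}
```

Both methods begin with the same linear scan, which is exactly the cost described above.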

QuickFind

QuickFind is the first of several faster implementations for Disjoint Sets. Suppose we know our number of items N up front. Create an int[N] in which each array index corresponds to one element, and the value at that index is the id of the set the element belongs to. The following image represents the disjoint sets {A, B, C, E}, {D, F}, {G} (we assume that the first set has id 123, the second 456, and the third 789).

Disjoint Sets QuickFind Before Union

If we are able to map elements to their array indices in constant time (for example, you could imagine a hash table of elements to array indices), then we can implement find() in constant time as well, since the lookup into the ids array is also constant-time.

Turning our attention to union(C, D), we can compute find(C) and find(D) in constant time. After the union operation, let’s arbitrarily choose find(D) == 456 as the id of the new set. To implement this, we only need to iterate through the ids array, setting all instances of 123 to 456. This last iteration step takes order of N time.

Disjoint Sets QuickFind After Union
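The idea can be sketched in a few lines of Java. The class name QuickFindDS, its constructor, and the HashMap from elements to array indices are assumptions made for illustration. The sketch also uses each element’s starting index as its initial set id instead of the arbitrary ids (123, 456, 789) in the figures, and it assumes the items are distinct and that find is only called on known items.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical QuickFind implementation with the same methods as DisjointSets. */
class QuickFindDS {
    /** Maps each element to its array index (assumed constant-time lookup). */
    private final Map<Character, Integer> indexOf = new HashMap<>();
    /** ids[i] is the id of the set that element i belongs to. */
    private final int[] ids;

    /** Starts with every item in its own set. Assumes items are distinct. */
    QuickFindDS(char[] items) {
        ids = new int[items.length];
        for (int i = 0; i < items.length; i++) {
            indexOf.put(items[i], i);
            ids[i] = i;  // initial id is simply the element's index
        }
    }

    /** Constant time: one map lookup, then one array lookup. */
    public int find(char p) {
        return ids[indexOf.get(p)];
    }

    /** Order of N: relabels every member of p's set with q's id. */
    public void union(char p, char q) {
        int pid = find(p);
        int qid = find(q);
        if (pid == qid) {
            return;  // already connected
        }
        for (int i = 0; i < ids.length; i++) {
            if (ids[i] == pid) {
                ids[i] = qid;
            }
        }
    }
}
```

As in the union(C, D) example, the merged set adopts the id of the second argument’s set; picking the first argument’s id instead would work equally well.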

We call this data structure QuickFind because it accepts a slow, order of N union operation in exchange for a constant-time find operation. In lecture, we’ll see a family of other data structures based on the ideas introduced by QuickFind.