Autocomplete

Designing and analyzing autocomplete.

  1. Autocomplete interface
    1. Reference implementation
  2. Design and implement
    1. SequentialSearchAutocomplete
    2. BinarySearchAutocomplete
    3. TernarySearchTreeAutocomplete
  3. Analyze and compare
    1. Asymptotic analysis
    2. Experimental analysis
  4. Above and beyond
    1. Wordscapes
    2. Count of range sum
    3. Trie implementation
    4. Algorithmic fairness

Autocomplete is a feature that helps a user select valid search results by showing possible inputs as they type. For example, in a map app, autocomplete might allow the user to enter a prefix such as Sea and automatically suggest the city, Seattle.

In addition to autocompleting names, places, or things, autocomplete can also be a useful abstraction for implementing DNA subsequence search. Instead of indexing a list of all the city names or places, a DNA data structure can index all the suffixes of a very long DNA sequence. Autocompleting the DNA suffixes enables efficient search across all the DNA substrings for medical applications, genomics, and forensics.

In this project, we will compare 4 implementations (described later) and 2 applications (city search and DNA search) of autocomplete. By the end of this project, students will be able to:

  • Design and implement tree-based and array-based search data structures.
  • Analyze and compare runtimes using asymptotic and experimental analysis.
Can I work with someone else on this project?

Although this project requires an individual submission, you may find it useful to discuss overarching ideas with your classmates. Our primary rule is that we ask that you do not claim to be responsible for work that is not yours. If you get some help from someone else or from an online resource, cite it. I believe that there is a lot of value in learning from others so long as you do not deprive yourself (or others) of the opportunity to learn.

We are comfortable doing this because each submission in this class comes in the form of a video that you record. Your video is a demonstration of everything that you learned throughout the process of working on an assignment. Our goal is for students to support each other and find community through this course. The real advantage of taking a course on-campus at a university is to be able to connect with others who share common interests in learning.

What am I submitting at the end of this project?

Satisfactory completion of the project requires one Canvas submission including: a video-recorded individual presentation, a link to your filled-out slide template for Autocomplete and a link to your src/main/java/autocomplete folder on GitLab from your projects repo that addresses all deliverables (in green).

The project instructions contain a lot of details to provide context, clarify common confusions, and help students get started. Your Canvas submissions only need to address tasks that are described in green callouts like this one.

Your video presentation should meet the following requirements:

  • Your presentation should not be much longer than 8 minutes (9 minutes maximum) and should include your voiceover. (Your video is appreciated but not necessary.)
  • Your video presentation must include or use our provided slide template for Autocomplete. This is done intentionally to help guide you toward the correct solution.
  • After submitting to Canvas, add a submission comment linking to your slides or document and a link to your src/main/java/autocomplete folder on GitLab. Remember to commit and push to Gitlab and check the associated status for your latest Pipeline is a green checkmark! Please be sure to check access permissions on your slides, so any TA can view your work.

Students are expected to follow Washington state law on the Student Conduct Code for the University of Washington. In this course, students must:

  • Indicate on assignment submissions any assistance received, including materials distributed in this course.
  • Not receive, generate, or otherwise acquire any substantial portion or walkthrough for an assignment.
  • Not aid, assist, attempt, or tolerate prohibited academic conduct in others.

Within the provided slide template, you must cite any external sources you used, either through collaboration or online resources. If you consulted any Generative AI on this project, you must include all prompts and outputs. Online sources should be cited with a simple link. Collaboration with peers must include names and the extent of that collaboration. The more detail, the better!

Submitted work that is not consistent with sources will require makeups in Office Hours.

Autocomplete interface

Implementations of Autocomplete must provide the following methods:

void addAll(Collection<? extends CharSequence> terms)
Adds all the terms to the autocompletion dataset. Each term represents a possible autocompletion search result. Behavior is not defined if duplicate terms are added to the dataset.
Collection
The parent interface to lists and sets in Java. Using Collection rather than List lets clients use any list or set or other collection that they’ve already created in their program.
CharSequence
An interface that generalizes the concept of a String of characters. Using CharSequence rather than String lets clients define specialized implementations for long strings like DNA.
Collection<? extends CharSequence>
The type of the parameter, read: a Collection of any type of elements that extend CharSequence. The ? extends lets clients call the method with a Collection<String> or a Collection<SuffixSequence> instead of having to strictly use a Collection<CharSequence>.
List<CharSequence> allMatches(CharSequence prefix)
Returns a new list of all terms that begin with the same characters as the given prefix. Note that this should not be a view of the list.

In Java, a view is a clever way of working with a part of a data structure without making a copy of it. For example, the ArrayList class has a subList method with the following method signature.

public List<E> subList(int fromIndex, int toIndex)

subList returns another List. But instead of constructing a new ArrayList and copying over all the elements from the fromIndex to the toIndex, the Java developers defined a SubList class that provides a view of the data structure using the following fields (some details omitted).

private static class SubList<E> implements List<E> {
    private final ArrayList<E> root;
    private final int offset;
    private int size;
}

The SubList class keeps track of its ArrayList root (which contains the actual data array), an int offset representing the start of the sublist, and the int size of the sublist. The sublist serves as an intermediary that implements get(index) by checking that the index is in the sublist before returning the offset index.

public E get(int index) {
    if (index < 0 || index >= size) {
        throw new IndexOutOfBoundsException();
    }
    return root.elementData.get(offset + index);
}

Given the terms [alpha, delta, do, cats, dodgy, pilot, dog], allMatches("do") should return [do, dodgy, dog] in any order. Try this example yourself by writing a new test case in the AutocompleteTests class. You can write additional test cases like this to assist in debugging.

@Test
void compareSimple() {
    List<CharSequence> terms = List.of(
        "alpha", "delta", "do", "cats", "dodgy", "pilot", "dog"
    );
    CharSequence prefix = "do";
    List<CharSequence> expected = List.of("do", "dodgy", "dog");

    Autocomplete testing = createAutocomplete();
    testing.addAll(terms);
    List<CharSequence> actual = testing.allMatches(prefix);
    assertEquals(expected.size(), actual.size());
    assertTrue(expected.containsAll(actual));
    assertTrue(actual.containsAll(expected));
}

The compareSimple() test will be used in your code tracing deliverables in this specification. We recommend that you add this to AutocompleteTests.

Reference implementation

The project code includes a fully functional TreeSetAutocomplete implementation that stores all the terms in a TreeSet. The class contains a single field for storing the underlying TreeSet of terms. Rather than declare the field as a Set, we’ve chosen to use the more specialized subtype NavigableSet because it includes helpful methods that can be used to find the first term that matches the prefix.

private final NavigableSet<CharSequence> elements;

The constructor assigns a new TreeSet collection to this field. In Java, TreeSet is implemented using a red-black tree, a type of balanced search tree where access to individual elements are worst-case logarithmic time with respect to the size of the set. CharSequence::compare tells the TreeSet to use the natural dictionary order when comparing any two elements.

public TreeSetAutocomplete() {
    elements = new TreeSet<>(CharSequence::compare);
}

If you’ve ever used a TreeSet<String>, you might be surprised to see the argument CharSequence::compare. This is not necessary for TreeSet<String>, but it is necessary for TreeSet<CharSequence> because CharSequence does not implement Comparable<CharSequence>. You can read more in the Java developers mailing list.

The addAll method calls TreeSet.addAll to add all the terms to the underlying TreeSet.

@Override
public void addAll(Collection<? extends CharSequence> terms) {
    elements.addAll(terms);
}

The allMatches method:

  1. Ensures the prefix is valid. If the prefix is null or empty, returns an empty list.
  2. Finds the first matching term by calling TreeSet.ceiling, which returns “the least element in this set greater than or equal to the given element, or null if there is no such element.”
  3. Collects the remaining matching terms by iterating over the TreeSet.tailSet, which is “a view of the portion of this set whose elements are greater than or equal to fromElement.”
  4. If we reach a term that no longer matches the prefix, returns the list of results.
@Override
public List<CharSequence> allMatches(CharSequence prefix) {
    List<CharSequence> result = new ArrayList<>();
    if (prefix == null || prefix.length() == 0) {
        return result;
    }
    CharSequence start = elements.ceiling(prefix);
    if (start == null) {
        return result;
    }
    for (CharSequence term : elements.tailSet(start)) {
        if (Autocomplete.isPrefixOf(prefix, term)) {
            result.add(term);
        } else {
            return result;
        }
    }
    return result;
}

Design and implement

Design and implement 3 implementations of the Autocomplete interface.

We highly recommend you study Autocomplete.java to identify a method that may be useful for checking for prefix matches in allMatches. This is also used in the reference implementation.

SequentialSearchAutocomplete

Terms are added to an ArrayList in any order.

Because the elements are not stored in any sorted order, the allMatches method must scan across the entire list and check every term to see if it matches the prefix.

Recall what the idea of sequential search is from the Merge Sort lesson.

BinarySearchAutocomplete

Terms are added to a sorted ArrayList. When additional terms are added, the entire list is re-sorted using Collections.sort.

Since the terms are in a list sorted according to natural dictionary order, all matches must be located in a contiguous sublist. Collections.binarySearch can find the start index for the first match. See further details below on how to determine the start index when it is negative.

After the first match is found, we can collect all remaining matching terms by iterating to the right until it no longer matches the prefix. Consider how this impacts your code.

List<CharSequence> elements = new ArrayList<>();
elements.add("alpha");
elements.add("delta");
elements.add("do");
elements.add("cats");

System.out.println("before: " + elements);
Collections.sort(elements, CharSequence::compare);
System.out.println(" after: " + elements);

CharSequence prefix = "bay";
System.out.println("prefix: " + prefix);
int i = Collections.binarySearch(elements, prefix, CharSequence::compare);
System.out.println("     i: " + i);

This program produces the following output.

before: [alpha, delta, do, cats]
 after: [alpha, cats, delta, do]
prefix: bay
     i: -2

The index i is negative because Collections.binarySearch returns a negative value to report that an exact match for the prefix was not found in the sorted list.

Returns
the index of the search key, if it is contained in the list; otherwise, (-(insertion point) - 1). The insertion point is defined as the point at which the key would be inserted into the list: the index of the first element greater than the key, or list.size() if all elements in the list are less than the specified key. Note that this guarantees that the return value will be >= 0 if and only if the key is found.

Since the prefix often will not exactly match an element in the list, we can use algebra to recover the insertion point. The start value represents the index of the first term that could match the prefix.

int start = i;
if (i < 0) {
    start = -(start + 1);
}

Trace through the compareSimple test for the BinarySearchAutocomplete implementation by adding it to AutocompleteTests. First, without walking through the code, create a diagram of your ArrayList that showcases its state after calling testing.addAll(terms). Then, explain how your allMatches method works to collect terms into the actual list using the prefix “do”, connecting each operation to the input and output of the test. Lastly, explain how the actual and expected lists compare for each of the 3 assertion checks.

Once you explain a method one time, you do not need to repeat an explanation of its operations in your code trace.

We recommend that you utilize your visual diagram when explaining allMatches to ensure continuous explanations in connection to the compareSimple test.

TernarySearchTreeAutocomplete

Terms are added to a ternary search tree using the TST.java class as a reference.

  1. Skim TST.java. What do you notice will work for Autocomplete? What needs to change?
    • Begin by understanding the fields and constructor for the TST. How does this align with your conceptual understanding of the structure? How is data stored?
  2. Identify methods in the TST class that are most similar to Autocomplete.
    • For addAll, see if you can identify a method to insert one term into a TST and understand its recursive pattern. Then, determine how you can repeat this for several terms.
    • For allMatches, see if you can identify a method to find one prefix match from the TST, and understand its recursive pattern. Then, determine how you can repeat this to collect all the remaining prefix matches from the TST.
    • Recall the importance of base cases and edge cases in your code as well.
  3. Adapt the code to implement the Autocomplete interface.
    • Descriptive variable names and helper methods are crucial! The reference class can be difficult to parse from its shorthand variable-naming and we discourage this.

Don’t copy and paste code! Most of the time, we will need to make many changes, and we might introduce subtle bugs when we copy code that we don’t fully understand. Instead, rewrite the code in your own words after making sense of the purpose of each line. We often don’t need all the lines of code, and the code can be rewritten in ways that are more suitable for the problem at hand.

It’s okay if your TernarySearchTreeAutocomplete throws a StackOverflowError when running the DNASearch class. This is caused by Java’s built-in limit on recursive depth. There are different ways to work around this limit, but it’s not relevant to this project.

Trace through the compareSimple test for the TernarySearchTreeAutocomplete implementation by adding it to AutocompleteTests. First, without walking through the code, create a diagram of your TST that showcases its state after calling testing.addAll(terms). Then, explain how your allMatches method works to collect terms into the actual list using the prefix “do”, connecting each operation (across all helper methods) to the input and output of the test. Lastly, explain how the actual and expected lists compare for each of the 3 assertion checks.

Once you explain a method one time, you do not need to repeat an explanation of its operations in your code trace.

We recommend that you utilize your visual diagram when explaining allMatches to ensure continuous explanations in connection to the compareSimple test.

Make sure to pass all the tests for SequentialSearchAutocomplete, BinarySearchAutocomplete, and TernarySearchTreeAutocomplete, push your code to Gitlab, and check the associated status for your latest Pipeline. Include a link to the passing pipeline of your projects repository in the provided slide template.

Analyze and compare

Asymptotic analysis

Give a big-theta bound for the worst-case runtime of the addAll and allMatches methods for each of the 4 implementations, including TreeSetAutocomplete, with respect to N, the total number of terms already stored in the data structure. Explain the runtime of each implementation in a couple sentences while referencing the code. You must use the assumptions we provide below for the runtime analysis.

As you perform your asymptotic analysis, make sure to carefully read through and keep in mind the assumptions and hints given below.

What does the underlying data structure look like in the worst case? How are terms organized? Based on that worst case, analyze the runtime of operations performed on that data structure.

addAll Assumptions and Resources
Assume a constant number of terms are added to a dataset that already contains N terms.
Assume that arrays can accommodate all the new terms without resizing.
Collections.sort uses Timsort, an optimized version of merge sort with runtime in O(N log N) where N is the size or length of the collection or array.
allMatches Assumptions and Resources
Consider the relationship between the added terms and the prefix. How many matches will we have if all the terms in the dataset begin with A and the prefix is A? How many matches will we have if all the terms in the data set begin with B and the prefix is A?
Assume all strings have a constant length.
TreeSet is implemented using a red-black tree, which has the same asymptotic runtime as a left-leaning red-black tree or a 2-3 tree.

Your final runtimes should only include variations of N and 1 as variables. Use of K, L, M, or other asymptotic variables is incorrect, and often arises from outside sources that are inaccurate.

Experimental analysis

Now that you’ve predicted runtimes across the 4 implementations, let’s compare results for the cities.tsv dataset.

RuntimeExperiments are disabled by default. Comment-out the @Disabled tag above the RuntimeExperiments class header to enable it.

Run the provided RuntimeExperiments in AutocompleteTests to compare the real-world runtime of each implementation. For each implementation, RuntimeExperiments constructs an empty instance and records the number of seconds to add N terms to the dataset and then compute all matches for the prefix (such as Sea).

  • The first column denotes N, the total number of terms.
  • The second column denotes the average runtime for addAll in seconds.
  • The third column denotes the average runtime for allMatches in seconds.

Copy-paste the text into plotting software such as Desmos. Plot the runtimes of all 4 implementations on addAll and allMatches. To ensure that your plots are legible, please include labels to match with the colors for each each of points. We also recommend creating only one plot for addAll, and one plot for allMatches.

Once you are done running RuntimeExperiments, please uncomment the @Disabled line to ensure that our pipeline to check your tests will work!

Display both legible and labelled plots for addAll and allMatches to compare runtimes across all 4 implementations. Which implementation is the fastest for addAll, and why? Which implementation is the slowest for allMatches, and why? What benefits and drawbacks arise for implementations based on tree-like structures?

There are apparent disagreements between the runtimes you hypothesized in asymptotic analysis and the runtimes you observed in your experimental graphs. Explain why based on worst case vs. average case differences, the settings used in RuntimeExperiments for the cities dataset, and the assumptions you referenced in your asymptotic analysis.

If you are unsure of how to approach the above deliverable, you may structure your response according to these questions:

  1. In the RuntimeExperiments class, the prefix Sea is used against a cities dataset, and our instances of Autocomplete begin empty. How does this compare to the asymptotic analysis setup?
  2. During the asymptotic analysis portion, you held certain variables constant. How does this compare to the experimental analysis setup?
  3. During the asymptotic analysis portion, you assumed worst-case orientations for structures and their search times. How does this compare to the experimental analysis setup?

Above and beyond

Optionally, apply what you’ve learned by working on these project ideas.

Wordscapes

Wordscapes is an app that has multiple levels where you try to fill out a crossword puzzle using only the letters provided. Although a set of words can create several valid words, only a select few are considered to fill out the puzzle (perfect place to use isTerm). You can choose an implementation that you feel is most appropriate for the game and also create your own list of words to use for the crosswords you create.

Count of range sum

LeetCode 327. Count of Range Sum

Given an integer array nums and two integers lower and upper, return the number of range sums that lie in [lower, upper] inclusive. Range sum S(i, j) is defined as the sum of the elements in nums between indices i and j inclusive, where i <= j.

In order to build a solution for this LeetCode problem, it is essential to understand a data structure called a Segment Tree: a tree data structure for efficiently querying of values within intervals aka segments. Watch the video below in order to get a comprehensive understanding and overview of the segment tree data structure. The video covers everything you will need in order to come up with a Segment Tree approach solution for the LeetCode problem discussed above.

Trie implementation

During class, we compared the ternary search tree data structure to the trie data structure. Like a TST, a trie can also be used to implement the Autocomplete interface! Using this example TrieST.java class, identify and adapt relevant portions of the code to implement the Autocomplete interface. Most of the code in the TrieST.java class is not needed for implementing Autocomplete. Use the Trie Visualization to see what you expect your tree to look like!

Algorithmic fairness

In your project, you may have wondered how search engines display their results, and which sources they chose to display first. In this talk, Chirag Shah expands on these ideas, explores how ranking of search results impact users, and presents some algorithms and statistical methods that can be used to increase fairness and diversity in search result rankings.