Autocomplete

Find all terms beginning with a given prefix, sorted by weight.¹

Learning Goals

Develop Java programs in IntelliJ independently.
One of the themes for the following assignments is not only writing your own tests, but also writing your own programs. In previous homeworks, we provided significant guidance to help you identify what subproblems to solve and how to solve them. In this homework, we will fade away some of this scaffolding and encourage greater independence.
Increase confidence in code correctness by testing programs with simulated datasets.
How do we know a program is truly correct? The answer is that we don’t, usually. This is a hard problem in computer science because bugs can come up in the most unsuspect places. In this course, we will use a software testing technique which attempts to cover most of the possible edge cases by simulating thousands or millions of real world inputs. The tricky test case in ArrayDeque from HW 2 was actually generated through one of these simulations.
Formally introduce asymptotic runtime requirements and exception handling.
A program can be correct on all inputs, but too slow to be useful. In previous homeworks, we informally introduced timing and memory tests as a means of providing feedback. For this assignment, we will enforce runtime requirements since our implementation needs to be fast if we want to use it as part of HuskyMaps. Additionally, as our programs begin to take real world inputs, they also need to handle malformed data.

Getting Started
Introduction
SimpleTerm
Term
Autocomplete API
LinearRangeSearch
BinaryRangeSearch
Testing
Interactive GUI
Optional Extension: Tries
Submission

Getting Started

Pull the skeleton repository to get the autocomplete assignment.

Introduction

Write a program to implement autocomplete for a given set of terms, where a term is a query string and an associated non-negative weight. That is, given a prefix, find all queries that start with the given prefix, in descending order of weight.

Autocomplete is pervasive in modern applications. As the user types, the program predicts the complete query (typically a word or phrase) that the user intends to type. IMDB uses it to display the names of movies as the user types; search engines use it to display suggestions as the user enters web search queries; cell phones use it to speed up text input.

In these examples, the application predicts how likely it is that the user is typing each query and presents to the user a list of the top-matching queries, in descending order of weight. These weights are determined by historical data, such as box office revenue for movies, frequencies of search queries from other Google users, or the typing history of a cell phone user. For the purposes of this assignment, you will have access to a set of all possible queries and associated weights (and these queries and weights will not change).

The performance of autocomplete functionality is critical in many systems. For example, consider a search engine which runs an autocomplete application on a server farm. According to one study, the application has only about 50ms to return a list of suggestions for it to be useful to the user. Moreover, in principle, it must perform this computation for every keystroke typed into the search bar and for every user!

In this assignment, you will implement autocomplete by sorting the terms by query string; binary searching to find all query strings that start with a given prefix; and sorting the matching terms by weight.

SimpleTerm

Implement a class SimpleTerm that represents an autocomplete term—a query string and an associated integer weight.

public SimpleTerm(String query, long weight): Initializes a term with the given query string and weight.
public String query(): Returns this term’s query.
public String queryPrefix(int r): Returns the first r characters of this query.
public long weight(): Returns this term’s weight.

It turns out that IntelliJ can generate almost all of this code for you. Explore the code generation features in IntelliJ to generate the two-argument constructor, getter methods for query and weight, toString(), as well as equals and hashCode. Then, implement queryPrefix on your own using methods from the String class.

Note: The generated code might not pass style checks. Add additional braces and tidy-up the code so that it meets style requirements.

Term

Implement the default methods in the Term interface, which support comparing terms by three different orders:

public int compareTo(Term that): Compares the two terms in lexicographic order (dictionary order) by query.
public int compareToByReverseWeightOrder(Term that): Compares the two terms in descending order by weight.
public int compareToByPrefixOrder(Term that, int r): Compares the two terms in lexicographic order, but using only the first r characters of each query.

The Java documentation for compareTo methods provides more details about comparison between two objects is defined in Java. The last order may seem a bit odd, but you will use it to later find all query strings that start with a given autocomplete query prefix of length r.

Hint: The Java String and Long class have some methods that may help. Look for methods called compare and compareTo.

See the method comments for more details on edge cases and exception handling.

The string comparison functions must take time proportional to the number of characters needed to resolve the comparison.

Autocomplete API

The Autocomplete interface consists of a single method.

public List<Term> allMatches(String prefix): Returns all terms that start with the given prefix, in descending order of weight.

You will develop two classes that implement the Autocomplete interface: a slow but easier-to-verify solution called LinearRangeSearch and a much faster but harder-to-verify solution called BinaryRangeSearch.

LinearRangeSearch

Develop a class LinearRangeSearch that implements Autocomplete. It should have the following constructor and allMatches method.

public LinearRangeSearch(Collection<Term> terms): Stores the terms for future autocompletion by the allMatches method.
public List<Term> allMatches(String prefix): Returns all terms that start with the given prefix, in descending order of weight.

When allMatches is called with the given prefix, LinearRangeSearch scans the entire collection of terms and collects all of the terms that start with the given prefix. To return the terms in descending order of weight, you can use the Java Collections.sort or List.sort methods together with Term.byReverseWeightOrder().

There are no particular runtime requirements for LinearRangeSearch.

This class should not take more than half an hour to implement. Remember that you can search the online Java documentation for helpful methods in the String class, for example. You can also borrow small snippets of code from other online sources so long as they’re cited as comments in your code.

BinaryRangeSearch

Develop a class BinaryRangeSearch that implements Autocomplete with a much faster but harder-to-verify binary search algorithm. The gist of the algorithm is to:

Sort the terms in lexicographic order.
Find all query strings that start with the given prefix. Run two separate binary searches: one to find the first query with the prefix, and another to find the last.
Sort the matching terms in descending order by weight.

Tip: Since you’ll be writing code to find the first and last of a sequence of items, your search method won’t be able to return early and will need to search all the way until the search interval is empty.

As with LinearRangeSearch, your class should have the following constructor and allMatches method.

public BinaryRangeSearch(Collection<Term> terms): Stores the terms for future autocompletion by the allMatches method.
public List<Term> allMatches(String prefix): Returns all terms that start with the given prefix, in descending order of weight.

Your implementation must achieve each of the following runtime requirements.

The constructor must make O(N log N) compares, where N is the number of terms.
The allMatches method must make O(log N + M log M) compares, where M is the number of matching terms.

To interpret these runtime requirements, recall that we studied the runtime of merge sort in lecture. Java’s sorting methods use an optimized variant of merge sort that takes O(N log N) time to sort a list of N terms.

Testing

Write tests in the BinaryRangeSearchTest class. These tests won’t be graded, but you should still practice writing some since the autograder tests for this assignment are all randomized—this means that they’ll be hard to interpret.

After writing a few unit tests, also try writing a randomized test. You may have noticed a setUp method that reads in the cities.txt dataset. If we assume that our LinearRangeSearch implementation is correct, and our BinaryRangeSearch returns the same results on thousands of inputs, then we can assume that BinaryRangeSearch is also probably correct. We can generate a randomized test in a few steps:

Randomly choose a word from the collection of terms. Say we happen to choose the city name “Seattle”.
Randomly pick a prefix length between, say, 1 and 10. Suppose we happen to pick the number 4.
Substring the city name to get “Seat”.
Compute linearAuto.allMatches("Seat") and binaryAuto.allMatches("Seat").
Call assertEquals to check that the results are the same.

We can repeat this process for thousands of iterations, simulating a new autocompletion query each time.

Interactive GUI

The AutocompleteGUI class provides a GUI for the user to enter queries. It presents the top K matching terms in real time. When the user selects a term, the GUI opens up the results from a Google search for that term in a browser.

AutocompleteGUI Demo

Optional Extension: Tries

Implement the Autocomplete interface with tries. Tries are efficient search tree structures for implementing string sets and string maps. The textbook contains implementations for TrieST.java (R-way trie) and TST.java (ternary search trie).

Submission

Commit and push your changes to GitLab before submitting your homework to Gradescope.

Kevin Wayne. 2016. Autocomplete-Me. In Nifty Assignments 2016. http://nifty.stanford.edu/2016/wayne-autocomplete-me/
Matthew Drabick and Kevin Wayne. 2019. Autocomplete. In COS 226: Algorithms and Data Structures, Spring 2019. https://www.cs.princeton.edu/courses/archive/spring19/cos226/assignments/autocomplete/specification.php ↩