Autocomplete
Find all terms beginning with a given prefix, sorted by weight.1
Learning Goals
Develop Java programs in IntelliJ independently.
One of the themes for the following assignments is not only writing your own tests, but also writing your own programs. In previous homeworks, we provided significant guidance to help you identify what subproblems to solve and how to solve them. In this homework, we will fade away some of this scaffolding and encourage greater independence.
Increase confidence in code correctness by testing programs with simulated datasets.
How do we know a program is truly correct? The answer is that we don’t, usually. This is a hard problem in computer science because bugs can come up in the most unsuspect places. In this course, we will use a software testing technique which attempts to cover most of the possible edge cases by simulating thousands or millions of real world inputs. The tricky test case in
ArrayDeque
from HW 2 was actually generated through one of these simulations.Formally introduce asymptotic runtime requirements and exception handling.
A program can be correct on all inputs, but too slow to be useful. In previous homeworks, we informally introduced timing and memory tests as a means of providing feedback. For this assignment, we will enforce runtime requirements since our implementation needs to be fast if we want to use it as part of HuskyMaps. Additionally, as our programs begin to take real world inputs, they also need to handle malformed data.
Table of contents
- Getting Started
- Introduction
- SimpleTerm
- Term
- Autocomplete API
- LinearRangeSearch
- BinaryRangeSearch
- Testing
- Interactive GUI
- Optional Extension: Tries
- Submission
Getting Started
Pull the skeleton repository to get the autocomplete
assignment.
Introduction
Write a program to implement autocomplete for a given set of terms, where a term is a query string and an associated non-negative weight. That is, given a prefix, find all queries that start with the given prefix, in descending order of weight.
Autocomplete is pervasive in modern applications. As the user types, the program predicts the complete query (typically a word or phrase) that the user intends to type. IMDB uses it to display the names of movies as the user types; search engines use it to display suggestions as the user enters web search queries; cell phones use it to speed up text input.
In these examples, the application predicts how likely it is that the user is typing each query and presents to the user a list of the top-matching queries, in descending order of weight. These weights are determined by historical data, such as box office revenue for movies, frequencies of search queries from other Google users, or the typing history of a cell phone user. For the purposes of this assignment, you will have access to a set of all possible queries and associated weights (and these queries and weights will not change).
The performance of autocomplete functionality is critical in many systems. For example, consider a search engine which runs an autocomplete application on a server farm. According to one study, the application has only about 50ms to return a list of suggestions for it to be useful to the user. Moreover, in principle, it must perform this computation for every keystroke typed into the search bar and for every user!
In this assignment, you will implement autocomplete by sorting the terms by query string; binary searching to find all query strings that start with a given prefix; and sorting the matching terms by weight.
SimpleTerm
Implement a class SimpleTerm
that represents an autocomplete term—a query string and an associated integer weight.
public SimpleTerm(String query, long weight)
: Initializes a term with the given query string and weight.public String query()
: Returns this term’s query.public String queryPrefix(int r)
: Returns the firstr
characters of this query.public long weight()
: Returns this term’s weight.
It turns out that IntelliJ can generate almost all of this code for you. Explore the code generation features in IntelliJ to generate the two-argument constructor, getter methods for query
and weight
, toString()
, as well as equals
and hashCode
. Then, implement queryPrefix
on your own using methods from the String
class.
- Note
- The generated code might not pass style checks. Add additional braces and tidy-up the code so that it meets style requirements.
Term
Implement the default methods in the Term
interface, which support comparing terms by three different orders:
public int compareTo(Term that)
: Compares the two terms in lexicographic order (dictionary order) by query.public int compareToByReverseWeightOrder(Term that)
: Compares the two terms in descending order by weight.public int compareToByPrefixOrder(Term that, int r)
: Compares the two terms in lexicographic order, but using only the firstr
characters of each query.
The Java documentation for compareTo
methods provides more details about comparison between two objects is defined in Java. The last order may seem a bit odd, but you will use it to later find all query strings that start with a given autocomplete query prefix of length r
.
- Hint
- The Java
String
andLong
class have some methods that may help. Look for methods calledcompare
andcompareTo
.
See the method comments for more details on edge cases and exception handling.
The string comparison functions must take time proportional to the number of characters needed to resolve the comparison.
Autocomplete API
The Autocomplete
interface consists of a single method.
public List<Term> allMatches(String prefix)
: Returns all terms that start with the given prefix, in descending order of weight.
You will develop two classes that implement the Autocomplete
interface: a slow but easier-to-verify solution called LinearRangeSearch
and a much faster but harder-to-verify solution called BinaryRangeSearch
.
LinearRangeSearch
Develop a class LinearRangeSearch
that implements Autocomplete
. It should have the following constructor and allMatches
method.
public LinearRangeSearch(Collection<Term> terms)
: Stores the terms for future autocompletion by theallMatches
method.public List<Term> allMatches(String prefix)
: Returns all terms that start with the given prefix, in descending order of weight.
When allMatches
is called with the given prefix, LinearRangeSearch
scans the entire collection of terms and collects all of the terms that start with the given prefix
. To return the terms in descending order of weight, you can use the Java Collections.sort
or List.sort
methods together with Term.byReverseWeightOrder()
.
There are no particular runtime requirements for LinearRangeSearch
.
This class should not take more than half an hour to implement. Remember that you can search the online Java documentation for helpful methods in the String
class, for example. You can also borrow small snippets of code from other online sources so long as they’re cited as comments in your code.
BinaryRangeSearch
Develop a class BinaryRangeSearch
that implements Autocomplete
with a much faster but harder-to-verify binary search algorithm. The gist of the algorithm is to:
- Sort the terms in lexicographic order.
- Find all query strings that start with the given prefix. Run two separate binary searches: one to find the first query with the prefix, and another to find the last.
- Sort the matching terms in descending order by weight.
- Tip
- Since you’ll be writing code to find the first and last of a sequence of items, your search method won’t be able to return early and will need to search all the way until the search interval is empty.
As with LinearRangeSearch
, your class should have the following constructor and allMatches
method.
public BinaryRangeSearch(Collection<Term> terms)
: Stores the terms for future autocompletion by theallMatches
method.public List<Term> allMatches(String prefix)
: Returns all terms that start with the given prefix, in descending order of weight.
Your implementation must achieve each of the following runtime requirements.
- The constructor must make O(N log N) compares, where N is the number of terms.
- The
allMatches
method must make O(log N + M log M) compares, where M is the number of matching terms.
To interpret these runtime requirements, recall that we studied the runtime of merge sort in lecture. Java’s sorting methods use an optimized variant of merge sort that takes O(N log N) time to sort a list of N terms.
Testing
Write tests in the BinaryRangeSearchTest
class. These tests won’t be graded, but you should still practice writing some since the autograder tests for this assignment are all randomized—this means that they’ll be hard to interpret.
After writing a few unit tests, also try writing a randomized test. You may have noticed a setUp
method that reads in the cities.txt
dataset. If we assume that our LinearRangeSearch
implementation is correct, and our BinaryRangeSearch
returns the same results on thousands of inputs, then we can assume that BinaryRangeSearch
is also probably correct. We can generate a randomized test in a few steps:
- Randomly choose a word from the collection of
terms
. Say we happen to choose the city name “Seattle”. - Randomly pick a prefix length between, say, 1 and 10. Suppose we happen to pick the number 4.
- Substring the city name to get “Seat”.
- Compute
linearAuto.allMatches("Seat")
andbinaryAuto.allMatches("Seat")
. - Call
assertEquals
to check that the results are the same.
We can repeat this process for thousands of iterations, simulating a new autocompletion query each time.
Interactive GUI
The AutocompleteGUI
class provides a GUI for the user to enter queries. It presents the top K matching terms in real time. When the user selects a term, the GUI opens up the results from a Google search for that term in a browser.
Optional Extension: Tries
Implement the Autocomplete
interface with tries. Tries are efficient search tree structures for implementing string sets and string maps. The textbook contains implementations for TrieST.java
(R-way trie) and TST.java
(ternary search trie).
Submission
Commit and push your changes to GitLab before submitting your homework to Gradescope.
Kevin Wayne. 2016. Autocomplete-Me. In Nifty Assignments 2016. http://nifty.stanford.edu/2016/wayne-autocomplete-me/
Matthew Drabick and Kevin Wayne. 2019. Autocomplete. In COS 226: Algorithms and Data Structures, Spring 2019. https://www.cs.princeton.edu/courses/archive/spring19/cos226/assignments/autocomplete/specification.php ↩