Autocomplete

Find all terms beginning with a given prefix, sorted by weight.¹

Learning Goals

Develop Java programs in IntelliJ independently.
One of the themes for the following assignments is not only writing your own tests, but also writing your own programs. In previous homeworks, we provided significant guidance to help you identify what subproblems to solve and how to solve them. In this homework, we will fade away some of this scaffolding and encourage greater independence.
Increase confidence in code correctness by testing programs with simulated datasets.
How do we know a program is truly correct? The answer is that we don’t, usually. This is a hard problem in computer science because bugs can come up in the most unsuspect places. In this course, we will use a software testing technique which attempts to cover most of the possible edge cases by simulating thousands or millions of real world inputs. The tricky test case in ArrayDeque from HW 2 was actually generated through one of these simulations.
Formally introduce asymptotic runtime requirements and exception handling.
A program can be correct on all inputs, but too slow to be useful. In previous homeworks, we informally introduced timing and memory tests as a means of providing feedback. For this assignment, we will enforce runtime requirements since our implementation needs to be fast if we want to use it as part of HuskyMaps. Additionally, as our programs begin to take real world inputs, they also need to handle malformed data.

Getting Started
Introduction
Term
Autocomplete API
LinearRangeSearch
BinaryRangeSearch
Testing
Interactive GUI
Submission

Getting Started

Pull the skeleton repository to get the autocomplete assignment.

Introduction

Write a program to implement autocomplete for a given set of terms, where a term is a query string and an associated non-negative weight. That is, given a prefix, find all queries that start with the given prefix, in descending order of weight.

Autocomplete is pervasive in modern applications. As the user types, the program predicts the complete query (typically a word or phrase) that the user intends to type. IMDB uses it to display the names of movies as the user types; search engines use it to display suggestions as the user enters web search queries; cell phones use it to speed up text input.

In these examples, the application predicts how likely it is that the user is typing each query and presents to the user a list of the top-matching queries, in descending order of weight. These weights are determined by historical data, such as box office revenue for movies, frequencies of search queries from other Google users, or the typing history of a cell phone user. For the purposes of this assignment, you will have access to a set of all possible queries and associated weights (and these queries and weights will not change).

The performance of autocomplete functionality is critical in many systems. For example, consider a search engine which runs an autocomplete application on a server farm. According to one study, the application has only about 50ms to return a list of suggestions for it to be useful to the user. Moreover, in principle, it must perform this computation for every keystroke typed into the search bar and for every user!

In this assignment, you will implement autocomplete by sorting the terms by query string; binary searching to find all query strings that start with a given prefix; and sorting the matching terms by weight.

Note: This homework continues the theme of reduced autograder feedback. While you can still submit as many times as you like, feedback will be limited to large-scale tests on the full dataset. This tells you whether your program is fully correct, but might not directly help with developing your program.

Term

Implement a class Term that represents an autocomplete term—a query string and an associated integer weight. You must implement the following API, which supports comparing terms by three different orders:

Lexicographic order (dictionary order) by query string (the natural order).
In descending order by weight (an alternate order).
Lexicographic order by query string but using only the first r characters (a family of alternate orderings).

The last order may seem a bit odd, but you will use it to later find all query strings that start with a given autocomplete query prefix of length r.

public Term(String query, long weight): Initializes a term with the given query string and weight.
public int compareTo(Term that): Compares the two terms in lexicographic order by query. (See Java’s documentation for more details on compareTo methods.)
public int compareToByReverseWeightOrder(Term that): Compares the two terms in descending order by weight.
public int compareToByPrefixOrder(Term that, int r): Compares the two terms in lexicographic order, but using only the first r characters of each query.
public String query(): Returns this term’s query.
public String queryPrefix(int r): Returns the first r characters of this query.
public long weight(): Returns this term’s weight.

See the class’s method comments for more details on edge cases and exception handling.

Hint: The Java standard library has some methods that may help. Look for methods called compare.

For testing purposes, it will also be helpful to write toString and equals methods. IntelliJ can automatically generate these methods for you through the code generation dialog (Code > Generate…), though you’ll need to first declare the class’s instance variables. Note that the generated code might not pass style checks: in this case, add additional braces or clean-up the code so that it’s presentable.

The string comparison functions must take time proportional to the number of characters needed to resolve the comparison.

Autocomplete API

The Autocomplete interface consists of a single method.

public Term[] allMatches(String prefix): Returns all terms that start with the given prefix, in descending order of weight.

You will develop two classes that implement the Autocomplete interface: a slow but easier-to-verify solution called LinearRangeSearch and a much faster but harder-to-verify solution called BinaryRangeSearch. Part of the learning experience of this portion of the assignment is getting comfortable with writing testing code that simulates real-world workloads. This will be reinforced in future homework assignments when autograder access is further restricted. However, your tests won’t be graded.

LinearRangeSearch

Develop a class LinearRangeSearch that implements Autocomplete. It should have the following constructor and allMatches method.

public LinearRangeSearch(Term[] terms): Stores the terms for future autocompletion by the allMatches method.
public Term[] allMatches(String prefix): Returns all terms that start with the given prefix, in descending order of weight.

When allMatches is called with the given prefix, LinearRangeSearch scans the entire array of terms and collects all of the terms that start with the given prefix. To return the terms in descending order of weight, you can use the Java Arrays.sort or List.sort methods together with TermComparators.byReverseWeightOrder().

There are no particular runtime requirements for LinearRangeSearch.

This class should not take much longer than two hours to implement. Remember that you can search the online Java documentation for helpful methods in the String class, for example. You can also borrow small snippets of code from other online sources so long as they’re cited as comments in your code.

BinaryRangeSearch

Develop a class BinaryRangeSearch that implements Autocomplete with a much faster but harder-to-verify binary search algorithm. The gist of the algorithm is to:

Sort the terms in lexicographic order.
Find all query strings that start with the given prefix. You can do this by running two separate binary searches: one to find the first query with the prefix, and another to find the last.
Sort the matching terms in descending order by weight.

Tip: If you don’t remember how to write code for a binary search, feel free to consult other resources (e.g., past coursework, internet). Since you’ll be writing code to find the first and last of a sequence of items, your search method won’t be able to return early and will need to search all the way until the search interval is empty.

As with LinearRangeSearch, your class should have the following constructor and allMatches method.

public BinaryRangeSearch(Term[] terms): Stores the terms for future autocompletion by the allMatches method.
public Term[] allMatches(String prefix): Returns all terms that start with the given prefix, in descending order of weight.

Your implementation must achieve each of the following runtime requirements.

The constructor must make O(N log N) compares, where N is the number of terms.
The allMatches method must make O(log N + M log M) compares, where M is the number of matching terms.

To interpret these runtime requirements, recall that we studied the runtime of merge sort in lecture. Java’s sorting methods use an optimized variant of merge sort that takes O(N log N) time to sort a list of N terms.

Testing

Write tests in the BinaryRangeSearchTest class. These tests won’t be graded (although you do still need to submit them to the autograder), but you should still practice writing some since the autograder tests for this assignment are all randomized—this means that they’ll be hard to interpret. Write your own unit tests to test and debug your own code; you can follow the provided example test, testSimpleExample.

After writing a few unit tests, also try writing a randomized test. You may have noticed a setUp method that reads in the cities.txt dataset. If we assume that our LinearRangeSearch implementation is correct, and our BinaryRangeSearch returns the same results on thousands of inputs, then we can assume that BinaryRangeSearch is also probably correct. We can randomly generate a test in a few steps:

Randomly choose a word from the array of terms. Say we happen to choose the city name “Seattle”.
Randomly pick a prefix length between 1 and 10. (10 is just an arbitrary choice.) Say we happen to pick the number 4.
Substring the city name to get “Seat”.
Compute linearAuto.allMatches("Seat") and binaryAuto.allMatches("Seat").
Call assertTermsEqual to check that the results are the same.

We can repeat this process for thousands of iterations, simulating a new autocompletion query each time.

You may not share tests with any other students—all code in your possession, including tests, should be your own work. You can discuss different edge cases and share ideas about test cases, but test code falls under the same academic honesty policy as any other homework code.

Interactive GUI

The AutocompleteGUI class provides a GUI for the user to enter queries. It presents the top K matching terms in real time. When the user selects a term, the GUI opens up the results from a Google search for that term in a browser.

AutocompleteGUI Demo

Submission

Commit and push your changes to GitLab before submitting your homework to Gradescope.

Kevin Wayne. 2016. Autocomplete-Me. In Nifty Assignments 2016. http://nifty.stanford.edu/2016/wayne-autocomplete-me/
Matthew Drabick and Kevin Wayne. 2019. Autocomplete. In COS 226: Algorithms and Data Structures, Spring 2019. https://www.cs.princeton.edu/courses/archive/spring19/cos226/assignments/autocomplete/specification.php ↩