Autocomplete
Designing and analyzing autocomplete.
Running Husky Maps
Husky Maps is a web app for mapping the world, searching for places, and navigating around Seattle. This project is the first of three that will help implement it (MapServer
)!
Please make sure that IntelliJ has the correct settings to run on
IntelliJ IDEA
and notGradle
.Go to IntelliJ settings ->
Build, Execution and Deployment
->Build Tools
->Gradle
and change BOTHbuild and run
using andrun tests
using to theIntelliJ IDEA
option instead ofGradle
.
We have made a major update earlier to the project that didn’t allow MapServer
to run correctly before, let’s fix MapServer
to allow it to run. Please move the resources
folder to be inside src/main
, but still be outside of src/main/java
. Now run MapServer
. If everything is successful, you’ll see this flurry of messages appear in the Run tool window indicating that the app has launched.
[main] INFO io.javalin.Javalin - Starting Javalin ...
[main] INFO org.eclipse.jetty.server.Server - jetty-11.0.13; built: 2022-12-07T20:47:15.149Z; git: a04bd1ccf844cf9bebc12129335d7493111cbff6; jvm 11.0.16+8-post-Debian-1deb11u1
[main] INFO org.eclipse.jetty.server.session.DefaultSessionIdManager - Session workerName=node0
[main] INFO org.eclipse.jetty.server.handler.ContextHandler - Started i.j.j.@683dbc2c{/,null,AVAILABLE}
[main] INFO org.eclipse.jetty.server.AbstractConnector - Started ServerConnector@2b6856dd{HTTP/1.1, (http/1.1)}{0.0.0.0:8080}
[main] INFO org.eclipse.jetty.server.Server - Started Server@3c72f59f{STARTING}[11.0.13,sto=0] @4349ms
[main] INFO io.javalin.Javalin -
__ ___ ______
/ /___ __ ______ _/ (_)___ / ____/
__ / / __ `/ | / / __ `/ / / __ \ /___ \
/ /_/ / /_/ /| |/ / /_/ / / / / / / ____/ /
\____/\__,_/ |___/\__,_/_/_/_/ /_/ /_____/
https://javalin.io/documentation
[main] INFO io.javalin.Javalin - Listening on http://localhost:8080/
[main] INFO io.javalin.Javalin - You are running Javalin 5.6.1 (released June 22, 2023).
[main] INFO io.javalin.Javalin - Javalin started in 309ms \o/
You’re done! All the app features work because we’ve provided reference implementations for each interface that we’ll learn in this class. The goal of this approach is to enable you to compare different ways to implement the same functionality, each with their own trade-offs. Through studying interfaces and implementations, we’ll gain a deeper understanding about why programs are designed the way they are. You can now visit localhost:8080 to try the web app for yourself, but the map images won’t load without the optional steps below.
How do I enable map images in Husky Maps?
To see the map images, sign up for a free MapBox account to get an access token. Once you have your access token, in the IntelliJ toolbar, select the “MapServer” dropdown, Edit Configurations…, under Environment variables write TOKEN=
and then paste your token. Re-run the MapServer
class to launch the web app and enjoy the “Ice Cream” map style by Maya Gao.
How do I let other people visit my Husky Maps?
Running Husky Maps in IntelliJ will only allow you (or whomever is using your computer) to access the app. In order to allow anyone on the internet to use your app, we’ll need to deploy it to the web. Optionally, you may follow the Deployment instructions in the project README.md to learn how to deploy the app to the web for free.
Autocomplete
Autocomplete is a feature that helps a user select valid search results by showing possible inputs as they type. For example, in a map app, autocomplete might allow the user to enter a prefix such as Sea and automatically suggest the city, Seattle.
In addition to autocompleting names, places, or things, autocomplete can also be a useful abstraction for implementing DNA subsequence search. Instead of indexing a list of all the city names or places, a DNA data structure can index all the suffixes of a very long DNA sequence. Autocompleting the DNA suffixes enables efficient search across all the DNA substrings for medical applications, genomics, and forensics.
In this project, we will compare 4 implementations (described later) and 2 applications (city search and DNA search) of autocomplete. By the end of this project, students will be able to:
- Design and implement tree-based and array-based search data structures.
- Analyze and compare runtimes using asymptotic and experimental analysis.
Can I work with someone else on this project?
Although this project requires an individual submission, you may find it useful to discuss overarching ideas with your classmates. Our primary rule is that we ask that you do not claim to be responsible for work that is not yours. If you get some help from someone else or from an online resource, cite it. I believe that there is a lot of value in learning from others so long as you do not deprive yourself (or others) of the opportunity to learn.
We are comfortable doing this because each submission in this class comes in the form of a video that you record. Your video is a demonstration of everything that you learned throughout the process of working on an assignment. Our goal is for students to support each other and find community through this course. The real advantage of taking a course on-campus at a university is to be able to connect with others who share common interests in learning.
What am I submitting at the end of this project?
Satisfactory completion of the project requires one Canvas submission including: a video-recorded individual presentation, some kind of visually-organizing structure, such as slides or a document and a link to your src/main/java/autocomplete
folder on GitLab that addresses all the green callouts.
The project instructions contain a lot of details to provide context, clarify common confusions, and help students get started. Your Canvas submissions only need to address tasks that are described in green callouts like this one.
Your video presentation should meet the following requirements:
- Your presentation should not be much longer than 8 minutes (9 minutes maximum) and should include your voiceover. (Your video is appreciated but not necessary.)
- Your video presentation should include some kind of visually-organizing structure, such as slides or a document.
- After submitting to Canvas, add a submission comment linking to your slides or document, a link to your
src/main/java/autocomplete
folder on GitLab, and a list of your citations and references used. Remember tocommit
andpush
to Gitlab and check that the associated status for your latest Pipeline is a green checkmark!
Students are expected to follow Washington state law on the Student Conduct Code for the University of Washington. In this course, students must:
- Indicate on assignment submissions any assistance received, including materials distributed in this course.
- Not receive, generate, or otherwise acquire any substantial portion or walkthrough for an assignment.
- Not aid, assist, attempt, or tolerate prohibited academic conduct in others.
In a canvas comment turned in with your video presentation, you must cite any sources you used, including lecture materials, section materials, and collaboration with peers. If you used any kind of computer technology to help prepare your project deliverable, include the queries and/or prompts.
Submitted work that is not consistent with sources may be subject to the student conduct process.
Autocomplete interface
Implementations of Autocomplete
must provide the following methods:
void addAll(Collection<? extends CharSequence> terms)
- Adds all the terms to the autocompletion dataset. Each term represents a possible autocompletion search result. Behavior is not defined if duplicate terms are added to the dataset.
Collection
- The parent interface to lists and sets in Java. Using
Collection
rather thanList
lets clients use any list or set or other collection that they’ve already created in their program.CharSequence
- An interface that generalizes the concept of a
String
of characters. UsingCharSequence
rather thanString
lets clients define specialized implementations for long strings like DNA.Collection<? extends CharSequence>
- The type of the parameter, read: a
Collection
of any type of elements that extendCharSequence
. The? extends
lets clients call the method with aCollection<String>
or aCollection<SuffixSequence>
instead of having to strictly use aCollection<CharSequence>
.
List<CharSequence> allMatches(CharSequence prefix)
- Returns a new list of all terms that begin with the same characters as the given prefix. Note that this should not be a view of the list.
In Java, a view is a clever way of working with a part of a data structure without making a copy of it. For example, the
ArrayList
class has asubList
method with the following method signature.public List<E> subList(int fromIndex, int toIndex)
subList
returns anotherList
. But instead of constructing a newArrayList
and copying over all the elements from thefromIndex
to thetoIndex
, the Java developers defined aSubList
class that provides a slice of the data structure using the following fields (some details omitted).private static class SubList<E> implements List<E> { private final ArrayList<E> root; private final int offset; private int size; }
The
SubList
class keeps track of itsArrayList root
, anint offset
representing the start of the sublist, and theint size
of the sublist. The sublist serves as an intermediary that implementsget(index)
by checking that the index is in the sublist before returning the offset index.public E get(int index) { if (index < 0 || index >= size) { throw new IndexOutOfBoundsException(); } return root.elementData[offset + index]; }
Given the terms [alpha, delta, do, cats, dodgy, pilot, dog], allMatches("do")
should return [do, dodgy, dog] in any order. Try this example yourself by writing a new test case in the AutocompleteTests
class. You can write additional test cases like this to assist in debugging.
@Test
void compareSimple() {
List<CharSequence> terms = List.of(
"alpha", "delta", "do", "cats", "dodgy", "pilot", "dog"
);
CharSequence prefix = "do";
List<CharSequence> expected = List.of("do", "dodgy", "dog");
Autocomplete testing = createAutocomplete();
testing.addAll(terms);
List<CharSequence> actual = testing.allMatches(prefix);
assertEquals(expected.size(), actual.size());
assertTrue(expected.containsAll(actual));
assertTrue(actual.containsAll(expected));
}
Reference implementation
The project code includes a fully functional TreeSetAutocomplete
implementation that stores all the terms in a TreeSet
. The class contains a single field for storing the underlying TreeSet
of terms. Rather than declare the field as a Set
, we’ve chosen to use the more specialized subtype NavigableSet
because it includes helpful methods that can be used to find the first term that matches the prefix.
private final NavigableSet<CharSequence> elements;
The constructor assigns a new TreeSet
collection to this field. In Java, TreeSet
is implemented using a red-black tree, a type of balanced search tree where access to individual elements are worst-case logarithmic time with respect to the size of the set. CharSequence::compare
tells the TreeSet
to use the natural dictionary order when comparing any two elements.
public TreeSetAutocomplete() {
elements = new TreeSet<>(CharSequence::compare);
}
If you’ve ever used a TreeSet<String>
, you might be surprised to see the argument CharSequence::compare
. This is not necessary for TreeSet<String>
, but it is necessary for TreeSet<CharSequence>
because CharSequence
does not implement Comparable<CharSequence>
. You can read more in the Java developers mailing list.
The addAll
method calls TreeSet.addAll
to add all the terms to the underlying TreeSet
.
@Override
public void addAll(Collection<? extends CharSequence> terms) {
elements.addAll(terms);
}
The allMatches
method:
- Ensures the prefix is valid. If the prefix is
null
or empty, returns an empty list. - Finds the first matching term by calling
TreeSet.ceiling
, which returns “the least element in this set greater than or equal to the given element, ornull
if there is no such element.” - Collects the remaining matching terms by iterating over the
TreeSet.tailSet
, which is “a view of the portion of this set whose elements are greater than or equal tofromElement
.” - If we reach a term that no longer matches the
prefix
, returns the list of results.
@Override
public List<CharSequence> allMatches(CharSequence prefix) {
List<CharSequence> result = new ArrayList<>();
if (prefix == null || prefix.length() == 0) {
return result;
}
CharSequence start = elements.ceiling(prefix);
if (start == null) {
return result;
}
for (CharSequence term : elements.tailSet(start)) {
if (Autocomplete.isPrefixOf(prefix, term)) {
result.add(term);
} else {
return result;
}
}
return result;
}
Design and implement
Design and implement 3 implementations of the Autocomplete
interface.
Make sure to pass all the tests
for SequentialSearchAutocomplete
, BinarySearchAutocomplete
, and TernarySearchTreeAutocomplete
, push your code to Gitlab, and check that the associated status for your latest Pipeline is passing. If the implementations do not pass all the test cases, explain in your video what you think could be causing the problem.
SequentialSearchAutocomplete
Terms are added to an ArrayList
in any order. Because there elements are not stored in any sorted order, the allMatches
method must scan across the entire list and check every term to see if it matches the prefix
.
BinarySearchAutocomplete
Terms are added to a sorted ArrayList
. When additional terms are added, the entire list is re-sorted using Collections.sort
. Since the terms are in a list sorted according to natural dictionary order, all matches must be located in a contiguous sublist. Collections.binarySearch
can find the start index for the first match. After the first match is found, we can collect all remaining matching terms by iterating to the right until it no longer matches the prefix.
List<CharSequence> elements = new ArrayList<>();
elements.add("alpha");
elements.add("delta");
elements.add("do");
elements.add("cats");
System.out.println("before: " + elements);
Collections.sort(elements, CharSequence::compare);
System.out.println(" after: " + elements);
CharSequence prefix = "bay";
System.out.println("prefix: " + prefix);
int i = Collections.binarySearch(elements, prefix, CharSequence::compare);
System.out.println(" i: " + i);
This program produces the following output.
before: [alpha, delta, do, cats] after: [alpha, cats, delta, do] prefix: bay i: -2
The index
i
is negative becauseCollections.binarySearch
returns a negative value to report that an exact match for the prefix was not found in the sorted list.
- Returns
- the index of the search key, if it is contained in the list; otherwise,
(-(insertion point) - 1)
. The insertion point is defined as the point at which the key would be inserted into the list: the index of the first element greater than the key, orlist.size()
if all elements in the list are less than the specified key. Note that this guarantees that the return value will be>= 0
if and only if the key is found.Since the prefix often will not exactly match an element in the list, we can use algebra to recover the insertion point. The
start
value represents the index of the first term that could match the prefix.int start = i; if (i < 0) { start = -(start + 1); }
Add the compareSimple
test shown above during the description of the Autocomplete interface, into AutocompleteTests
. Then, trace through the test compareSimple
for the BinarySearchAutocomplete
implementation with a focus on justifying how your code for allMatches
produces the correct behavior. You can justify allMatches
by tracing the code for that method once you reach that method in your trace of compareSimple
.
TernarySearchTreeAutocomplete
Terms are added to a ternary search tree using the TST.java class as a reference.
- Skim the
TST
class (linked above). What do you notice will work forAutocomplete
? What needs to change? - Identify methods in the
TST
class that are most similar toAutocomplete
. - Adapt the code to implement the
Autocomplete
interface.
Don’t copy and paste code! Most of the time, we will need to make many changes, and we might introduce subtle bugs when we copy code that we don’t fully understand. Instead, rewrite the code in your own words after making sense of the purpose of each line. We often don’t need all the lines of code, and the code can be rewritten in ways that are more suitable for the problem at hand.
It’s okay if your TernarySearchTreeAutocomplete
throws a StackOverflowError
when running the DNASearch
class. This is caused by Java’s built-in limit on recursive depth. There are different ways to work around this limit, but it’s not relevant to this project.
Trace through the test compareSimple
for the TernarySearchTreeAutocomplete
implementation with a focus on justifying how your code for allMatches
produces the correct behavior. You can justify allMatches
by tracing the code for that method once you reach that method in your trace of compareSimple
.
Analyze and compare
Asymptotic analysis
Give a big-theta bound for the worst-case runtime of the addAll
and allMatches
methods for each implementation, including TreeSetAutocomplete
, with respect to N, the total number of terms already stored in the data structure. Explain the worst case and the runtime of each implementation in a couple sentences while referencing the code. Please use the assumptions we provide below for the runtime analysis.
As you perform your asymptotic analysis, make sure to carefully read through and keep in mind the assumptions and hints given below.
What does the underlying data structure look like in the worst case? How are terms organized? Based on that worst case, analyze the runtime of operations performed on that data structure.
addAll
- Assume a constant number of terms are added to a dataset that already contains N terms.
- Assume that arrays can accommodate all the new terms without resizing.
Collections.sort
uses Timsort, an optimized version of merge sort with runtime in O(N log N) where N is the size or length of the collection or array. - Assume that arrays can accommodate all the new terms without resizing.
allMatches
- Consider the relationship between the added terms and the prefix. How many matches will we have if all the terms in the dataset begin with A and the prefix is A? How many matches will we have if all the terms in the data set begin with B and the prefix is A?
Assume all strings have a constant length. TreeSet
is implemented using a red-black tree, which has the same asymptotic runtime as a left-leaning red-black tree or a 2-3 tree.
Experimental analysis
Compare the runtimes across all 4 implementations, including TreeSetAutocomplete
. Are certain algorithms faster than others?
Are there any disagreements between the runtimes you hypothesized in asymptotic analysis and the runtimes you observed in your experimental graphs? Describe how differences between the theoretical assumptions made for asymptotic analysis and the actual settings in RuntimeExperiments
might explain those disagreements, as well as how the experimental analysis being a more average case may affect this.
For allMatches
, describe how the default prefix affects the experimental analysis. Try experimenting and changing the prefix
in RuntimeExperiments
to figure this out. What kind of variables for the prefix affect the runtime of finding matches?
To let RuntimeExperiments
run, make sure to comment out the @Disabled
line before it. Make sure to keep @Nested
. Once you are done running RuntimeExperiments
, please uncomment the @Disabled
line to ensure that our pipeline to check your tests will work!
Run the provided RuntimeExperiments
in AutocompleteTests
to compare the real-world runtime of each implementation. For each implementation, RuntimeExperiments
constructs an empty instance and records the number of seconds to add N terms to the dataset and then compute all matches for the prefix
(such as Sea).
- The first column denotes N, the total number of terms.
- The second column denotes the average runtime for
addAll
in seconds. - The third column denotes the average runtime for
allMatches
in seconds.
Copy-paste the text into plotting software such as Desmos. Plot the runtimes of all 4 implementations on addAll
and allMatches
.