Prefix Operations and Tries

A workload-centered analysis of tries, applying data structures to optimize tries, and designing prefix operation algorithms.
Kevin Lin, with thanks to many others.
Ask questions anonymously on Piazza. Look for the pinned Lecture Questions thread.

Feedback from the Reading Quiz
The reading introduced us to a missing component of our runtime analysis. Specialized data structures, like DataIndexedCharMap, can be simpler and faster than more general but also more complex data structures like hash tables.

Tries: A Specialized Data Structure
Tries are a character-by-character string set representation.
[Figure: the strings sad, same, sap, awls, a, and sam stored three ways: in a binary search tree, in a hash table with buckets 0 to 3, and in a trie.]

Searching in Tries
contains(“sam”):	true, blue. hit.
contains(“sa”):		false, white. miss.
contains(“a”):		true, blue. hit.
contains(“saq”):	false, fell off. miss.

Two ways to have a search miss.
If the final node is white.
If we fall off the tree.
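
The search rules above can be sketched in Java. This is a minimal sketch rather than the course's TrieSet: the class name ArrayTrieSet, the add method, and the plain Node[] array standing in for DataIndexedCharMap are assumptions made for illustration, and keys are assumed to be ASCII.

```java
// Hypothetical sketch: a plain array stands in for DataIndexedCharMap.
class ArrayTrieSet {
    private static final int R = 128; // ASCII, as in the slides

    private static class Node {
        boolean isKey;             // blue (true) or white (false)
        Node[] next = new Node[R]; // one link per possible character
    }

    private final Node root = new Node();

    /** Adds key, creating one node per character as needed. Assumes ASCII. */
    void add(String key) {
        Node n = root;
        for (int i = 0; i < key.length(); i++) {
            char c = key.charAt(i);
            if (n.next[c] == null) {
                n.next[c] = new Node();
            }
            n = n.next[c];
        }
        n.isKey = true;
    }

    /** Follows one link per character of the key. */
    boolean contains(String key) {
        Node n = root;
        for (int i = 0; i < key.length(); i++) {
            char c = key.charAt(i);
            if (c >= R || n.next[c] == null) {
                return false; // fell off the tree: miss
            }
            n = n.next[c];
        }
        return n.isKey; // white final node: miss; blue final node: hit
    }
}
```

With the keys a, awls, sad, sam, same, and sap added, this sketch reproduces the four lookups above. The loop follows at most one link per character of the key, no matter how many keys are stored.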
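[Figure: abstract trie containing the keys a, awls, sad, sam, same, sap. Nodes that end a key are blue; all other nodes are white.]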

Given a trie with N keys, what is the runtime for contains(key) for a key of length L?

Trie: Design 1
public class TrieSet {
  private static final int R = 128; // ASCII
  private Node root;

  private static class Node {
    private char ch;
    private boolean isKey;
    private DataIndexedCharMap<Node> next;
    private Node(char c, boolean b, int R) {
      ch = c; isKey = b;
      next = new DataIndexedCharMap<Node>(R);
    }
  }
}
?: What’s inefficient about this implementation?

Trie Node Implementation
[Figure: the nodes for a and w, each drawn with its full array of links.]
Each node's DataIndexedCharMap holds 128 links, mostly null.
Trie: Design 1
?: What information is redundant in this data structure?

Trie: Design 1.5
public class TrieSet {
  private static final int R = 128; // ASCII
  private Node root;

  private static class Node {
    // No char field: a node's character is the index used to reach it in its parent's map.
    private boolean isKey;
    private DataIndexedCharMap<Node> next;
    private Node(boolean b, int R) {
      isKey = b;
      next = new DataIndexedCharMap<Node>(R);
    }
  }
}
We can remove the character ch from the node because we’ll know which character we’re on when we index into the DataIndexedCharMap next.

Does the structure of a trie depend on the order in which strings are inserted?

Trie Runtime
When our keys are strings, Tries give us slightly better performance on contains and add.
However, DataIndexedCharMap wastes a ton of memory storing R links per node.
Typical runtime when treating length of keys as a constant.

                     Key type     contains(x)   add(x)
Balanced BST         comparable   Θ(log N)      Θ(log N)
Hash Table           hashable     Θ(1)*         Θ(1)*†
Data-indexed array   char         Θ(1)          Θ(1)
Trie: Design 1.5     string       Θ(1)          Θ(1)

* :  Assuming items are evenly spread.
† :  Indicates “on average”.
?: How might we address the memory usage problem? What ideas can we borrow from other data structures?

Dealing with Sparsity

v1.5: DataIndexedCharMap
[Figure: an abstract trie on the characters a, d, and c, next to its DataIndexedCharMap representation. Each node holds an array indexed by character code (..., 97, 98, 99, 100, ...) and an isKey flag.]

v2.0: Hash-Table-Based Trie
[Figure: the same abstract trie, with each node's links stored in a small hash table (buckets 0 to 3) alongside its isKey flag.]
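
One way to picture v2.0 in code: each node keeps a small hash table from character to child, so only links that actually exist take up space. The sketch below uses java.util.HashMap as the per-node table. The class name HashTrieSet and its methods, including the rootLinks helper, are assumptions made for illustration and are not part of the course code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a trie whose nodes store links in a hash table.
class HashTrieSet {
    private static class Node {
        boolean isKey;
        // Only characters that actually occur get an entry.
        Map<Character, Node> next = new HashMap<>();
    }

    private final Node root = new Node();

    void add(String key) {
        Node n = root;
        for (int i = 0; i < key.length(); i++) {
            // Create the child for this character only if it is missing.
            n = n.next.computeIfAbsent(key.charAt(i), c -> new Node());
        }
        n.isKey = true;
    }

    boolean contains(String key) {
        Node n = root;
        for (int i = 0; i < key.length(); i++) {
            n = n.next.get(key.charAt(i));
            if (n == null) {
                return false; // no link for this character
            }
        }
        return n.isKey;
    }

    /** Number of child links stored at the root (illustration only). */
    int rootLinks() {
        return root.next.size();
    }
}
```

After adding "ad" and "c", the root in this sketch stores two links instead of 128.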

v3.0: BST-Based Trie
[Figure: the same abstract trie, with each node's links stored in a binary search tree on the characters ‘c’, ‘a’, and ‘d’ alongside its isKey flag.]
Each trie node keeps track of its own BST.

v4.0: Ternary Search Trie (TST)
[Figure: the abstract trie on a, d, and c next to the equivalent ternary search trie.]
Integrate internal BSTs into main structure.
?: How do you look up the string “ad” in the ternary search trie? The string “c”?




?: How is the ternary search trie different from the abstract trie? From the BST-based trie?

Which value is associated with the key “CAC”?
[Figure: a ternary search trie whose nodes hold the letters A, C, G, and Q, with the values 1 through 6 attached to key nodes. From Tries in COS 226 (Sedgewick, Wayne/Princeton).]
If you’re not sure where to start, look back at the previous example.

Search in a TST
Follow links corresponding to each character in the key.
If less, take left link; if greater, take right link.
If equal, take the middle link and move to the next key character.

Search hit. Final node is blue (isKey == true).
Search miss. Reach a null link or final node is white (isKey == false).
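
The search rules above can be written as a short loop. This is a minimal sketch under stated assumptions: the class name TernaryTrieSet and the recursive add helper are made up for illustration, keys are non-empty, and the structure stores a set with an isKey flag rather than the key-value map shown in the COS 226 figure.

```java
// Hypothetical sketch of a ternary search trie storing a set of strings.
class TernaryTrieSet {
    private static class Node {
        char ch;               // character stored at this node
        boolean isKey;         // true if a key ends here
        Node left, mid, right; // less, equal, greater
        Node(char c) { ch = c; }
    }

    private Node root;

    /** Adds a non-empty key. */
    void add(String key) {
        if (key.isEmpty()) {
            throw new IllegalArgumentException("empty key");
        }
        root = add(root, key, 0);
    }

    private Node add(Node n, String key, int i) {
        char c = key.charAt(i);
        if (n == null) {
            n = new Node(c);
        }
        if (c < n.ch) {
            n.left = add(n.left, key, i);
        } else if (c > n.ch) {
            n.right = add(n.right, key, i);
        } else if (i < key.length() - 1) {
            n.mid = add(n.mid, key, i + 1); // matched: move to next character
        } else {
            n.isKey = true;                 // matched the last character
        }
        return n;
    }

    boolean contains(String key) {
        if (key.isEmpty()) {
            return false;
        }
        Node n = root;
        int i = 0;
        while (n != null) {
            char c = key.charAt(i);
            if (c < n.ch) {
                n = n.left;
            } else if (c > n.ch) {
                n = n.right;
            } else if (i < key.length() - 1) {
                n = n.mid;
                i++;
            } else {
                return n.isKey; // final node: hit only if it ends a key
            }
        }
        return false; // reached a null link
    }
}
```

Only the middle link consumes a character of the key; the left and right links compare the same character against a different node.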

Does the structure of a TST depend on the order in which strings are inserted?

Prefix Operations
Unfortunately, since TSTs behave like (unbalanced) binary search trees, the runtime depends on the insertion order. The exact runtime of TSTs falls outside the scope of this course, but it’s useful to know that they’re about as fast as hash tables in practice and can be made faster with some simple optimizations.

String-Specific Operations
Theoretical asymptotic speed improvement is nice.
But the main appeal of tries is their efficient prefix matching.
Prefix match. keysWithPrefix("sa")
Longest prefix. longestPrefixOf("sample")

In this section, we’ll use the abstract trie representation.
[Figure: abstract trie containing the keys a, awls, sad, sam, same, sap.]
All of this also applies to any other trie representation.

Collecting Trie Keys
Describe in English an algorithm to collect all the keys in a trie.
collect(): ["a","awls","sad","sam","same","sap"]
Create an empty list of results x.
For character c in root.next.keys():
Call colHelp("" + c, x, root.next.get(c)).
Return x.

colHelp(String s, List<String> x, Node n)
If n.isKey, then x.add(s).
For character c in n.next.keys():
Call colHelp(s + c, x, n.next.get(c)).
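
The two procedures above translate almost line for line into Java. In this minimal sketch, the class name CollectTrie and the add method are assumptions made for illustration, and a java.util.TreeMap stands in for next so that iterating over its keys visits characters in sorted order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a trie set that can list all of its keys.
class CollectTrie {
    private static class Node {
        boolean isKey;
        // TreeMap iterates its keys in sorted order.
        Map<Character, Node> next = new TreeMap<>();
    }

    private final Node root = new Node();

    void add(String key) {
        Node n = root;
        for (int i = 0; i < key.length(); i++) {
            n = n.next.computeIfAbsent(key.charAt(i), c -> new Node());
        }
        n.isKey = true;
    }

    List<String> collect() {
        List<String> x = new ArrayList<>();
        for (char c : root.next.keySet()) {
            // Pass the character itself as the starting string.
            colHelp(String.valueOf(c), x, root.next.get(c));
        }
        return x;
    }

    /** s is the string spelled by the path from the root to n. */
    private void colHelp(String s, List<String> x, Node n) {
        if (n.isKey) {
            x.add(s);
        }
        for (char c : n.next.keySet()) {
            colHelp(s + c, x, n.next.get(c));
        }
    }
}
```

Because each node's children are visited in character order and a node is reported before its descendants, the keys come out in sorted order regardless of the order in which they were added.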
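Example call chain for the key awls. The third argument, drawn as a picture in the slides, is the node reached by spelling s from the root:
colHelp("a", x, node for a)
colHelp("aw", x, node for w)
colHelp("awl", x, node for l)
colHelp("awls", x, node for s)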

Collecting Trie Keys
[Figure: abstract trie containing the keys a, awls, sad, sam, same, sap.]
The result list x grows as the traversal proceeds:
collect(): []
collect(): ["a"]
collect(): ["a", "awls"]
collect(): ["a", "awls", "sad", "sam", "same", "sap"]

Prefix Operations with Tries
Describe in English an algorithm for keysWithPrefix.
keysWithPrefix("sa"): ["sad","sam","same","sap"]
Find the node α corresponding to the string (in pink).
Create an empty list x.
For character c in α.next.keys():
Call colHelp("sa" + c, x, α.next.get(c)).
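
A minimal Java sketch of these steps follows. The class name PrefixTrie, the add method, and the use of java.util.TreeMap for next are assumptions made for illustration. One deliberate difference from the steps above: the sketch calls colHelp on α itself rather than on α's children, so a prefix that is also a key is included in the result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: prefix matching on a trie set.
class PrefixTrie {
    private static class Node {
        boolean isKey;
        Map<Character, Node> next = new TreeMap<>();
    }

    private final Node root = new Node();

    void add(String key) {
        Node n = root;
        for (int i = 0; i < key.length(); i++) {
            n = n.next.computeIfAbsent(key.charAt(i), c -> new Node());
        }
        n.isKey = true;
    }

    List<String> keysWithPrefix(String prefix) {
        List<String> x = new ArrayList<>();
        // Step 1: walk down to the node alpha that corresponds to the prefix.
        Node alpha = root;
        for (int i = 0; i < prefix.length() && alpha != null; i++) {
            alpha = alpha.next.get(prefix.charAt(i));
        }
        // Step 2: collect every key in alpha's subtree (none if we fell off).
        if (alpha != null) {
            colHelp(prefix, x, alpha);
        }
        return x;
    }

    private void colHelp(String s, List<String> x, Node n) {
        if (n.isKey) {
            x.add(s);
        }
        for (char c : n.next.keySet()) {
            colHelp(s + c, x, n.next.get(c));
        }
    }
}
```

The work done is the length of the prefix plus the size of the subtree below α, so keys that do not share the prefix are never visited.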
[Figure: abstract trie containing the keys a, awls, sad, sam, same, sap, with the prefix sa highlighted in pink.]
Autocomplete with Tries
Autocomplete should return the most relevant results.

One way: a Trie-based Map<String, Relevance>.
When a user types in a string "hello",
Call keysWithPrefix("hello").
Return the 10 strings with the highest relevance.
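
The straightforward approach above can be sketched as follows. The class name NaiveAutocomplete, the methods put and topMatches, and the choice of Integer for relevance are assumptions made for illustration. The sketch collects every key under the prefix and then sorts them by relevance.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: collect every match, then sort by relevance.
class NaiveAutocomplete {
    private static class Node {
        Integer value; // relevance; null if no key ends here
        Map<Character, Node> next = new TreeMap<>();
    }

    private final Node root = new Node();

    void put(String key, int relevance) {
        Node n = root;
        for (int i = 0; i < key.length(); i++) {
            n = n.next.computeIfAbsent(key.charAt(i), c -> new Node());
        }
        n.value = relevance;
    }

    List<String> topMatches(String prefix, int k) {
        Node n = root;
        for (int i = 0; i < prefix.length() && n != null; i++) {
            n = n.next.get(prefix.charAt(i));
        }
        // Gather every (key, relevance) pair below the prefix node.
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        if (n != null) {
            collect(prefix, n, all);
        }
        // Highest relevance first.
        all.sort((p, q) -> Integer.compare(q.getValue(), p.getValue()));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, all.size()); i++) {
            top.add(all.get(i).getKey());
        }
        return top;
    }

    private void collect(String s, Node n, List<Map.Entry<String, Integer>> out) {
        if (n.value != null) {
            out.add(new AbstractMap.SimpleEntry<>(s, n.value));
        }
        for (char c : n.next.keySet()) {
            collect(s + c, n.next.get(c), out);
        }
    }
}
```

Every key under the prefix is visited and sorted even when only a handful of results are returned.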

Top-3 Matches for "s"
Call keysWithPrefix("s").
sad, smog, spit, spite, spy
Return the 3 keys with highest value.
spit, spite, sad

This algorithm is slow. Why?
[Figure: abstract trie whose keys carry relevance values: buck = 10, sad = 12, smog = 5, spit = 15, spite = 20, spy = 7.]

Improving Autocomplete
Very short queries, e.g. "s", will require checking billions of results.
But we only need to keep the top 10.

Prune the search space. Each node stores its own relevance as well as the max relevance of its descendants.
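
One possible way to use the stored maximum is a best-first search driven by a priority queue. This is a sketch of one approach, not necessarily the intended solution. The class name PrunedAutocomplete, the Item helper, and the methods put and topMatches are assumptions made for illustration, and the sketch assumes each key is inserted only once so that best never goes stale.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

// Hypothetical sketch: best-first search guided by each subtree's max relevance.
class PrunedAutocomplete {
    private static class Node {
        Integer value;                // relevance of the key ending here, or null
        int best = Integer.MIN_VALUE; // max relevance of this node and its descendants
        Map<Character, Node> next = new TreeMap<>();
    }

    // A queue item is a subtree still to explore (node != null) or a finished key.
    private static class Item {
        final String s;
        final Node node;
        final int priority;
        Item(String s, Node node, int priority) {
            this.s = s;
            this.node = node;
            this.priority = priority;
        }
    }

    private final Node root = new Node();

    /** Assumes each key is inserted once. */
    void put(String key, int relevance) {
        Node n = root;
        n.best = Math.max(n.best, relevance);
        for (int i = 0; i < key.length(); i++) {
            n = n.next.computeIfAbsent(key.charAt(i), c -> new Node());
            n.best = Math.max(n.best, relevance);
        }
        n.value = relevance;
    }

    List<String> topMatches(String prefix, int k) {
        List<String> top = new ArrayList<>();
        Node n = root;
        for (int i = 0; i < prefix.length() && n != null; i++) {
            n = n.next.get(prefix.charAt(i));
        }
        if (n == null) {
            return top;
        }
        // Largest priority first.
        PriorityQueue<Item> pq = new PriorityQueue<>(
            Comparator.comparingInt((Item a) -> a.priority).reversed());
        pq.add(new Item(prefix, n, n.best));
        while (!pq.isEmpty() && top.size() < k) {
            Item it = pq.poll();
            if (it.node == null) {
                top.add(it.s); // nothing left in the queue can beat this key
                continue;
            }
            if (it.node.value != null) {
                pq.add(new Item(it.s, null, it.node.value));
            }
            for (Map.Entry<Character, Node> e : it.node.next.entrySet()) {
                pq.add(new Item(it.s + e.getKey(), e.getValue(), e.getValue().best));
            }
        }
        return top;
    }
}
```

A subtree is expanded only when its best value reaches the front of the queue, so subtrees whose best relevance is lower than the k-th result are never explored.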
[Figure: the same trie with each node labeled by its own value and by best, the maximum value among the node and its descendants. For example, the root has value = None and best = 20, and the node that ends spit has value = 15 and best = 20.]
Challenge: design an algorithm that can collect the top K results from a trie containing N total results. Assume K can also be very large.

Summary
When your key is a string, you can use a Trie.
Typically better real-world performance than hash table or search tree.
Need to decide on a mapping from letter to node. Ternary Search Trie is an elegant solution.

Most importantly, tries enable efficient prefix operations like keysWithPrefix.
Optimal implementation of autocomplete involves further optimizations!

Bottom line: Data structures interact in beautiful and important ways!