CSE373 — Data Structures and Algorithms for Nonmajors
Programming Assignment #4
Due: Friday, 2/29/08, 10:00 PM
Introduction:
You’ve probably noticed
that many of your favorite applications and websites, whether on a computer or
on a cellphone, do auto-completions or suggestions. For this assignment, we will write code that
could be used as part of a user interface to provide completions for movie
titles specified by some number of initial characters (from here on called
prefixes).
This assignment is a bit
different, in that most of the functionality is already working. You can query the number of movies starting
with a prefix and you can query the list of movies starting with a prefix. The catch is that the provided
implementations are very slow (O(n), where n is the number of movie titles), whereas
by being clever and storing some extra state, we can perform the first query in
O(log n) and the second query in O(m+log n),
where n is the number of movie titles and m is the number of titles returned.
One fascinating thing about algorithms is that a single algorithm can be used to solve many problems, but often there are better algorithms for specific problems. Three generic problems to solve are: “Does a
solution exist?” “How many solutions exist?” “What are all the solutions?” Clearly if we can solve the last of these,
the first two are trivial. However,
often much simpler algorithms exist for the first two. In the case of this project, we can cleverly make the query for the number of strings with a prefix faster than the query for the list of strings with that prefix. We
already did the work for you of designing the algorithms. Your job is to interpret our descriptions of
them and implement them as code.
Additionally the system will incorporate an LRU
cache (optionally enabled) so that if query results have been cached, they can
be returned directly in O(1) rather than recomputed. These sorts of caches sit on the front ends
of many large web services, because 90% of the traffic asks for 10% of the
data, and the other 10% of the traffic asks for the rest. (You might also hear terms such as Poisson
distribution or long-tail distribution).
You will be modifying code in MoviePrefixSearcher.java and RangeAvlTree.java. Go ahead and run MoviePrefixSearcher’s
main to get an idea for what it does. Read the comments in main to get an idea of
what the speed improvements are from the improved algorithms.
About the Data:
The IMDB movie database has over a million
titles in it. The list provided for you (about
half a million) omits individual episodes of TV series (irrelevant) and straight-to-video titles (mostly pornographic). This reduced data should fit in a heap under 100MB, whereas the original data may require running java with the -X memory options.
For those curious, the original data came from:
ftp://ftp.fu-berlin.de/pub/misc/movies/database/movies.list.gz
The data is loaded similarly to how the weblinks zip data was loaded in the last project.
Why not just use a sorted list?
Just considering the two operations numWithPrefix and listWithPrefix, a sorted list seems
like the ideal data structure. The first
is easily computed in O(log n) by calculating the difference
between the indices returned from binary search on the lower bound of the
prefix interval and binary search on the upper bound. For the second, obtain the indices for the
two bounds just mentioned O(log n) and
get all the items between the two bounds O(m),
for a total of O(m + log n).
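The two binary searches can be sketched as follows. This is only an illustration of the sorted-list approach, not the provided code; the names numWithPrefix, nextPrefix, and lowerIndex are hypothetical.

```java
import java.util.Arrays;

// Sketch: counting titles with a given prefix in a sorted array using
// two binary searches. Names here are illustrative, not the provided API.
public class SortedPrefixDemo {
    // Smallest string greater than every string with this prefix: bump the
    // last character by one (assumes it is not Character.MAX_VALUE).
    static String nextPrefix(String prefix) {
        int last = prefix.length() - 1;
        return prefix.substring(0, last) + (char) (prefix.charAt(last) + 1);
    }

    // Insertion index of key in the sorted array, i.e. the lower bound.
    static int lowerIndex(String[] sorted, String key) {
        int i = Arrays.binarySearch(sorted, key);
        return i >= 0 ? i : -(i + 1);
    }

    // O(log n): the difference between the two bounding indices.
    public static int numWithPrefix(String[] sorted, String prefix) {
        return lowerIndex(sorted, nextPrefix(prefix)) - lowerIndex(sorted, prefix);
    }

    public static void main(String[] args) {
        String[] movies = { "Alien", "Aliens", "Amadeus", "Blade Runner", "Brazil" };
        System.out.println(numWithPrefix(movies, "Ali"));  // 2
        System.out.println(numWithPrefix(movies, "B"));    // 2
    }
}
```

The list version would do the same two searches and then copy the m items between the two indices, for O(m + log n).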
However, the problem is that the list of movies always changes. Dozens of new movies come out every week, and there needs to be a way to remove an entry from the database should an errant entry be added. If we used a sorted list, adding and removing
from the list would require O(n) work due to
the shifting needed either to make a hole in the middle of the array or fill in
a hole while preserving sorted order.
Using a balanced binary tree, we get O(log n) inserts and removes, and the
costs for numWithPrefix and listWithPrefix remain O(log n) and O(m + log n) respectively. The
balanced binary tree sounds like a win-win; the only catches are that the algorithms involved are trickier and that the tree uses more space by a constant factor.
Importance of a cheap numWithPrefix
If we had a listWithPrefix routine, we could implement numWithPrefix simply by calling .size() on the list returned by the former method (this is the default implementation I provide). However, for several reasons it is useful to have a method that returns the count as cheaply as possible in both time and space.
· A user interface doesn’t want to display 1000+ possible matches if the user has just typed in the first character. With a quick query of the count of prefix matches, the list can be displayed only if it will be small.
· It can be computationally too expensive to perform a listWithPrefix whose result is too large.
· If these queries were part of a web service, a query on listWithPrefix that has huge time/space expenses would be an example of a denial-of-service attack (whether intentional or accidental), where little effort by the client causes a lot of work by the server, potentially denying the service to other clients. However, if the query first must go through numWithPrefix, then it is cheap to decide whether the server should block the expensive request.
· (This is pretty tangential for this course, but still really cool stuff.) Regarding DoS attacks, having a cache will help reduce the server load, but a smart attacker could flood the cache with infrequent requests to generate cache misses. It’s still essential to have a fast way of rejecting costly server requests. Ideally, a separate system detects attackers who are flooding the cache and blocks them.
Algorithmic Details
An AVL tree needs to calculate heights at every
node to see if the balance is off and a rotation must be done. Recursively, the height of a node is 1+max(height(node.left),height(node.right)), with null having
height -1. The problem is that calculating this height recursively on a subtree of n nodes takes O(n) work, which completely negates the efficiency gained from a balanced tree. Instead, the
strategy is to store the height of a node in the node’s state. The challenge is then updating the height
whenever it changes (insert, remove, rotate).
The code provided to you has an AVL tree implementation that does all
the bookkeeping for these heights.
For this project, we’ll also need to calculate the
size of every subtree. Recursively, the size of a subtree rooted by a node is 1+size(node.left)+size(node.right), with null having a size of 0. Likewise, we want to use this
size to improve runtime, so we cannot recursively evaluate it in O(n) time; we need to store the sizes as
state in the nodes (I have already added a size field to the node class). And likewise, the sizes need to get updated
whenever the subtrees change, similarly to the AVL
tree’s heights. Note that the height(Node) and size(Node) methods provided in the code are
not recursive – they simply handle the null pointer cases for you for
convenience. A private method called checkSizes compares the slow
recursive size to the size stored in the node.
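The bookkeeping idea can be sketched as follows. This is a minimal hypothetical node class, not the provided RangeAvlNode: both cached fields are recomputed in O(1) from the children whenever the subtree changes.

```java
// Sketch of the stored-height/stored-size bookkeeping described above.
// Not the provided RangeAvlNode; names are illustrative.
public class SizeNodeDemo {
    public static class Node {
        public String element;
        public Node left, right;
        public int height;    // height of subtree rooted here; null subtree = -1
        public int size = 1;  // nodes in subtree rooted here; null subtree = 0
        public Node(String e) { element = e; }
    }

    // Null-safe accessors, like the height(Node)/size(Node) helpers provided.
    public static int height(Node t) { return t == null ? -1 : t.height; }
    public static int size(Node t)   { return t == null ? 0  : t.size; }

    // O(1) update, assuming the children's fields are already correct.
    public static void update(Node t) {
        t.height = 1 + Math.max(height(t.left), height(t.right));
        t.size = 1 + size(t.left) + size(t.right);
    }

    public static void main(String[] args) {
        Node root = new Node("M");
        root.left = new Node("F");
        root.right = new Node("S");
        update(root.left);   // children first, so the parent reads correct fields
        update(root.right);
        update(root);
        System.out.println(root.height + " " + root.size);  // 1 3
    }
}
```

The key discipline is updating children before parents, so each update really is O(1).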
In the following two algorithms, you will be
comparing a node’s element against a range [a,b), i.e. including a and everything up to but excluding b. There are three cases:
being smaller than a, being inside
the range, and being at least b. Each of these cases can be evaluated using
calls to compareTo; see the default code for
itemsInRange that I provided for a
comparison that checks if a node is inside a range.
The first algorithm for you to implement is itemsInRange, in a runtime of O(m+log n).
Regardless of how efficient we try to be, we always have to copy m items into the list to be returned. So if m
is small, the runtime is governed by the height of the tree, whereas when m is
large, the height of the tree is insignificant.
The idea is that when we are not copying items (meaning that we are
outside the range), we are as cheap as possible. The default implementation I provide does an
in-order traversal in O(n) and copies only when the item is
within the range [a,b). However, there is much waste going on. Recall the definition of a binary search
tree. Every node descending to the left
of a node is smaller than it, and every node descending to the right of a node
is larger than it. So if we’re at a node that is smaller than a, a right descendant of it may be larger than a, but it is impossible for a left descendant to be larger than a. Hence, if we’re at a node smaller than a, we do not need to recurse to the left, because it is impossible for any such descendants to be in the range [a,b). A similar argument applies to nodes that are at least b.
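The pruned traversal can be sketched like this, on a plain BST node with hypothetical names (your solution must fit the provided RangeAvlTree skeleton instead):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of itemsInRange in O(m + log n): an in-order traversal that
// skips subtrees which cannot intersect [a, b). Illustrative names only.
public class ItemsInRangeDemo {
    public static class Node {
        public String element;
        public Node left, right;
        public Node(String e) { element = e; }
    }

    public static Node node(String e, Node l, Node r) {
        Node t = new Node(e); t.left = l; t.right = r; return t;
    }

    public static void itemsInRange(Node t, String a, String b, List<String> out) {
        if (t == null) return;
        // If t < a, everything left of t is also < a: prune the left subtree.
        if (t.element.compareTo(a) >= 0) itemsInRange(t.left, a, b, out);
        // Copy t itself only if it is inside [a, b).
        if (t.element.compareTo(a) >= 0 && t.element.compareTo(b) < 0)
            out.add(t.element);
        // If t >= b, everything right of t is also >= b: prune the right subtree.
        if (t.element.compareTo(b) < 0) itemsInRange(t.right, a, b, out);
    }

    public static void main(String[] args) {
        Node root = node("M",
            node("F", node("C", null, null), node("J", null, null)),
            node("S", node("Q", null, null), node("W", null, null)));
        List<String> out = new ArrayList<>();
        itemsInRange(root, "F", "S", out);
        System.out.println(out);  // [F, J, M, Q]
    }
}
```

In the example, the traversal never visits W: S is at least b, so its right subtree is pruned.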
The second algorithm for you to implement is sizeOfRange, in a runtime of O(log n).
We will build on the ideas from the first algorithm. We wish to count the number of items in the range [a,b). If a node is less than a, then certainly none of its left descendants will contribute to the count. We can go one step further and account for subtrees all of whose nodes are in the desired range: in such a case, instead of recursing, we can just read the node’s size field. Suppose that node is within the range [a,b), meaning that we count it. If node.left is smaller than the lower bound a, then all of node.left’s left descendants are definitely outside the range (stop recursing), and some of its right descendants may be in the range (keep recursing). The other case is that node.left is at least a. Then its left descendants may be in the range (keep recursing), and all of its right descendants must be inside the range! They are all inside the range because they must be smaller than node and bigger than node.left, by binary-search order. A similar argument follows with respect to the upper bound b. In your implementation, beware of null pointers due to empty subtrees!
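One way to realize this counting argument is with a pair of one-sided helpers that read the stored size fields; this decomposition is a sketch with hypothetical names, equivalent to (but not the same shape as) the provided RangeAvlTree skeleton:

```java
// Sketch of sizeOfRange in O(log n) using stored subtree sizes.
// Illustrative names only; not the provided implementation.
public class SizeOfRangeDemo {
    public static class Node {
        public String element;
        public Node left, right;
        public int size = 1;
        public Node(String e) { element = e; }
    }

    public static int size(Node t) { return t == null ? 0 : t.size; }

    // Builder that fills in a correct size field (handy for tests).
    public static Node node(String e, Node l, Node r) {
        Node t = new Node(e);
        t.left = l; t.right = r;
        t.size = 1 + size(l) + size(r);
        return t;
    }

    // Count nodes >= a: when t >= a, t and its entire right subtree qualify,
    // so read size(t.right) instead of recursing into it. One path: O(log n).
    static int countAtLeast(Node t, String a) {
        if (t == null) return 0;
        if (t.element.compareTo(a) < 0) return countAtLeast(t.right, a);
        return 1 + size(t.right) + countAtLeast(t.left, a);
    }

    // Count nodes < b, symmetrically.
    static int countBelow(Node t, String b) {
        if (t == null) return 0;
        if (t.element.compareTo(b) >= 0) return countBelow(t.left, b);
        return 1 + size(t.left) + countBelow(t.right, b);
    }

    public static int sizeOfRange(Node t, String a, String b) {
        if (t == null) return 0;
        if (t.element.compareTo(a) < 0)  return sizeOfRange(t.right, a, b);  // t and its left side < a
        if (t.element.compareTo(b) >= 0) return sizeOfRange(t.left, a, b);   // t and its right side >= b
        // t is inside [a, b): count it plus the in-range parts of each side.
        return 1 + countAtLeast(t.left, a) + countBelow(t.right, b);
    }

    public static void main(String[] args) {
        Node root = node("M",
            node("F", node("C", null, null), node("J", null, null)),
            node("S", node("Q", null, null), node("W", null, null)));
        System.out.println(sizeOfRange(root, "F", "S"));  // 4: F, J, M, Q
    }
}
```

Each helper descends a single root-to-leaf path, reading size fields along the way, which is what makes the whole query O(log n).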
I’d recommend drawing some simple trees and
sketching out these ideas before trying to code anything.
Testing and Debugging
The database of movies is interesting and fun
because it is large. Its size makes it good for verifying that your algorithm is indeed running fast,
but what’s the point of a fast algorithm that doesn’t work? To test and debug your code, I recommend
using simple trees that you could draw by hand.
You can replace the main code in RangeAvlTree to do some of this
testing. The smaller your test cases are, the faster they will run, and the easier it
will be to interpret the results. You
may also want to write a toString method for RangeAvlNode that prints out
pertinent information for a node, such as its element and its size. Build your code with a solid foundation. Your code for 5) below could be giving
incorrect results because there is a bug in it or because the sizes maintained
in 2) and 3) are wrong. Similarly, it’s
easier to test 2) first without invoking rotations, and once you know 2) is working, get 3) going. Isolate
bugs into as small and precise a location as possible. Use a slow but simple version of a method to compare
against a faster but more complicated version.
Programming (70 points):
You should
use good style, whitespace, naming conventions, etc., and comment your code.
You may not add any other imports or class or instance variables, or change the signatures of any existing methods. You may add other private methods. We should still be able to test your code by
running something similar to the original MoviePrefixSearcher
main. You may modify any of the mains,
but take advantage of the comments showing the expected output that I provided.
1) Write code to store the CACHE_SIZE most recent results from listWithPrefix in an LRU cache, using
the prefixes as the keys. If the cache
is disabled, do not look in the cache or add any results to it. If the cache is enabled and the cache misses,
query the tree and add the result to the cache.
If the cache contains the prefix, return the cached result instead of
querying the tree. This code goes in the
block with the UnsupportedOperationException and should not be more
than a few lines. Use the SizeBoundedLinkedHashMap cache already provided for you.
The last bit of code in main tests the runtimes with the caching; the times from my
computer are shown. The code in main does not test the correctness of the results returned; you
must test that yourself. I recommend
doing the caching first because it’s a warmup and the
differences in the runtimes are more dramatic than when the other methods are
optimized. The examples show a period of
100, where the cache is just large enough, and a period of 121, where old items
are cycled out before they are accessed again.
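For intuition, a size-bounded LRU map like this can be built on java.util.LinkedHashMap. The class below is an illustration, not the provided SizeBoundedLinkedHashMap; your solution must use the provided class.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a size-bounded LRU cache on top of LinkedHashMap.
// accessOrder = true makes iteration (and eviction) follow least-recent access.
public class LruCacheDemo<K, V> extends LinkedHashMap<K, V> {
    private final int maxSize;

    public LruCacheDemo(int maxSize) {
        super(16, 0.75f, true);  // third argument: access order, i.e. LRU
        this.maxSize = maxSize;
    }

    // LinkedHashMap calls this after every put; returning true evicts the
    // least recently accessed entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxSize;
    }

    public static void main(String[] args) {
        LruCacheDemo<String, Integer> cache = new LruCacheDemo<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");      // touch "a" so "b" becomes least recent
        cache.put("c", 3);   // evicts "b"
        System.out.println(cache.keySet());  // [a, c]
    }
}
```

Note how a get counts as a use: that is exactly why a query period larger than the cache size (like 121 above) cycles entries out before they are reused.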
2) Write code in the private insert method to update the size field for the parameter t. You should handle
this code similarly to how the height is being handled. Use the size(RangeAvlNode) method for
convenience. There is a private method
called checkSizes() that I recommend running after an insert to check that you
are updating the sizes properly. Note
that you should not run checkSizes in your final/production
version because it is slow.
The size update in insert is just part of the puzzle.
The sizes also need to be updated in the rotation code, just as the
heights are updated there. It’s best to
do one thing at a time though, so you can build a test binary tree that is
balanced and needs no rotating, for example, by inserting the following strings
in this order:
“M” “F” “S” “C” “J” “Q” “W” “B” “D” “H” “K” “O” “R” “U” “Y” “A” “E” “G” “I” “L” “N” “P” “T” “V” “X” “Z”
3) Write the code in the private rotateWithLeftChild and rotateWithRightChild methods to update the size fields for the nodes being rotated, similarly to how the height is handled. Now your sizes should be fully updated, and you can run checkSizes after arbitrary inserts that trigger rotations.
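The rotation bookkeeping can be sketched as follows (hypothetical names; your changes go inside the provided rotation methods). Only the two relinked nodes change subtree membership, so two O(1) updates suffice, child before new root:

```java
// Sketch of a rotation that keeps stored sizes correct. Illustrative
// names only; not the provided RangeAvlTree code.
public class RotateDemo {
    public static class Node {
        public String element;
        public Node left, right;
        public int size = 1;
        public Node(String e) { element = e; }
    }

    public static int size(Node t) { return t == null ? 0 : t.size; }

    public static void updateSize(Node t) { t.size = 1 + size(t.left) + size(t.right); }

    // Rotate with left child: k1 becomes the new root of this subtree.
    public static Node rotateWithLeftChild(Node k2) {
        Node k1 = k2.left;
        k2.left = k1.right;
        k1.right = k2;
        updateSize(k2);  // k2 first: it is now a child of k1
        updateSize(k1);  // then the new subtree root
        return k1;
    }

    public static void main(String[] args) {
        // A left-leaning chain C -> B -> A, with pre-rotation sizes filled in.
        Node k2 = new Node("C");
        k2.left = new Node("B");
        k2.left.left = new Node("A");
        k2.left.size = 2;
        k2.size = 3;
        Node root = rotateWithLeftChild(k2);
        System.out.println(root.element + " " + root.size);  // B 3
    }
}
```

The mirror-image update goes in rotateWithRightChild; the double rotations then come for free, since they are composed of single rotations.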
4) Write and use a recursive helper method for itemsInRange that returns a list of items
within the interval in O(m+log n) time (m items returned, n items
in tree). When you are testing, you can see
if this list is equal to the list generated by the slow helper method that I provided,
but for your final/production version, just call your fast helper. See the algorithmic details section for hints.
5) Write and use a recursive helper method for sizeOfRange that returns the number
of items within the interval in O(log n) time. When you are testing, you can see if the count
is equal to the count generated by the slow way that I provided, but for your final/production
version, just call your fast helper. See
the algorithmic details section for hints.
Turn in
your work electronically from the “assignments” link on the class webpage. If you did not receive an email receipt, you
did not turn in your project.