CSE373—Data Structures and Algorithms for Nonmajors

Programming Assignment #4

due: Friday, 2/29/08, 10:00 PM

Introduction:

You’ve probably noticed that many of your favorite applications and websites, whether on a computer or on a cellphone, offer auto-completion or suggestions.  For this assignment, we will write code that could be used as part of a user interface to provide completions for movie titles specified by some number of initial characters (from here on called prefixes).

This assignment is a bit different, in that most of the functionality is already working.  You can query the number of movies starting with a prefix, and you can query the list of movies starting with a prefix.  The catch is that the provided implementations are very slow (O(n), where n is the number of movie titles), whereas by being clever and storing some extra state, we can perform the first query in O(log n) and the second query in O(m + log n), where n is the number of movie titles and m is the number of titles returned.

One fascinating thing about algorithms is that one algorithm can be used to solve many problems, but often there are better algorithms for specific problems.  Three generic problems to solve are: “Does a solution exist?” “How many solutions exist?” “What are all the solutions?”  Clearly if we can solve the last of these, the first two are trivial.  However, often much simpler algorithms exist for the first two.  In the case of this project, we can cleverly make the query for the number of strings with a prefix faster than the query for the list of strings with a prefix.  We already did the work of designing the algorithms for you.  Your job is to interpret our descriptions of them and implement them as code.

Additionally, the system will incorporate an LRU cache (optionally enabled) so that if query results have been cached, they can be returned directly in O(1) rather than recomputed.  These sorts of caches sit at the front ends of many large web services, because 90% of the traffic asks for 10% of the data, and the other 10% of the traffic asks for the rest.  (You might also hear terms such as Poisson distribution or long-tail distribution.)

You will be modifying code in MoviePrefixSearcher.java and RangeAvlTree.java.  Go ahead and run MoviePrefixSearcher’s main to get an idea for what it does.  Read the comments in main to get an idea of what the speed improvements are from the improved algorithms.

About the Data:

The IMDB movie database has over a million titles in it.  The list provided for you (about half a million titles) omits individual episodes of TV series (irrelevant) and straight-to-video titles (mostly pornographic).  This reduced data should fit in a heap under 100MB, whereas the original data may require running Java with the -X memory options.

For those curious, the original data came from:
ftp://ftp.fu-berlin.de/pub/misc/movies/database/movies.list.gz

The data is loaded similarly to how the weblinks zip data was loaded in the last project.

Why not just use a sorted list?

Just considering the two operations numWithPrefix and listWithPrefix, a sorted list seems like the ideal data structure.  The first is easily computed in O(log n) by taking the difference between the indices returned from a binary search on the lower bound of the prefix interval and a binary search on the upper bound.  For the second, obtain the indices for the two bounds just mentioned (O(log n)) and collect all the items between them (O(m)), for a total of O(m + log n).
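The sorted-list approach can be sketched as follows.  This is an illustrative demo, not part of the assignment's code; the class and method names here are made up, and the upper bound of the prefix interval is approximated by appending Character.MAX_VALUE to the prefix.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of numWithPrefix/listWithPrefix over a sorted array,
// using two binary searches on the prefix interval's bounds.
public class SortedPrefixDemo {
    // Index of the first element >= key (lower bound), via binary search: O(log n).
    static int lowerBound(String[] a, String key) {
        int lo = 0, hi = a.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid].compareTo(key) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    // A prefix p matches exactly the half-open range [p, upperKey(p));
    // appending Character.MAX_VALUE gives a key past every string with prefix p.
    static String upperKey(String prefix) {
        return prefix + Character.MAX_VALUE;
    }

    // numWithPrefix in O(log n): the difference of the two bound indices.
    static int numWithPrefix(String[] sorted, String prefix) {
        return lowerBound(sorted, upperKey(prefix)) - lowerBound(sorted, prefix);
    }

    // listWithPrefix in O(m + log n): copy everything between the bounds.
    static List<String> listWithPrefix(String[] sorted, String prefix) {
        int lo = lowerBound(sorted, prefix);
        int hi = lowerBound(sorted, upperKey(prefix));
        return Arrays.asList(Arrays.copyOfRange(sorted, lo, hi));
    }
}
```

For example, with the sorted array {"Alien", "Aliens", "Amadeus", "Blade Runner", "Brazil"}, numWithPrefix(movies, "Ali") returns 2.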

However, the problem is that the list of movies is always changing.  Dozens of new movies come out every week, and there needs to be a way to remove an entry from the database should an errant entry be added.  If we used a sorted list, adding to and removing from the list would require O(n) work, due to the shifting needed either to make a hole in the middle of the array or to fill in a hole while preserving sorted order.

Using a balanced binary tree, we get O(log n) inserts and removes, and the costs for numWithPrefix and listWithPrefix remain O(log n) and O(m + log n) respectively.  The balanced binary tree sounds like a win-win; the catch is that the algorithms involved are trickier and it uses more space by a constant factor.

Importance of a cheap numWithPrefix

If we had a listWithPrefix routine, we could implement numWithPrefix simply by calling .size() on the list returned by the former method (this is the default implementation I provide).  However for several reasons it is useful to have a method that returns the count that is as cheap as possible in both time and space. 

·         A user interface doesn’t want to display 1000+ possible matches if the user has just typed in the first character.  With a quick query of the count of prefix matches, the list is displayed only when it will be small.

·         It can be computationally too expensive to perform a listWithPrefix that is too large.

·         If these queries were part of a web service, a listWithPrefix query with huge time/space expense would be an example of a denial-of-service attack (whether intentional or accidental), where little effort by the client causes a lot of work for the server, potentially denying the service to other clients.  However, if the query first must go through numWithPrefix, then it is cheap to decide whether the server should block the expensive request.

·         (This is pretty tangential for this course, but still really cool stuff.)  Regarding DoS attacks, having a cache will help reduce the server load, but a smart attacker could flood the cache with infrequent requests to generate cache misses.  It’s still essential to have a fast way of rejecting costly server requests.  Ideally, a separate system detects attackers who are flooding the cache and blocks them.

Algorithmic Details

An AVL tree needs to calculate heights at every node to see if the balance is off and a rotation must be done.  Recursively, the height of a node is 1+max(height(node.left),height(node.right)), with null having height -1.  The problem is calculating this height recursively on a subtree of n nodes takes O(n) work, which completely negates the point of gaining efficiency from a balanced tree.  Instead the strategy is to store the height of a node in the node’s state.  The challenge is then updating the height whenever it changes (insert, remove, rotate).  The code provided to you has an AVL tree implementation that does all the bookkeeping for these heights.

For this project, we’ll also need to calculate the size of every subtree.  Recursively, the size of a subtree rooted by a node is 1+size(node.left)+size(node.right), with null having a size of 0.  Likewise, we want to use this size to improve runtime, so we cannot recursively evaluate it in O(n) time; we need to store the sizes as state in the nodes (I have already added a size field to the node class).  And likewise, the sizes need to get updated whenever the subtrees change, similarly to the AVL tree’s heights.  Note that the height(Node) and size(Node) methods provided in the code are not recursive – they simply handle the null pointer cases for you for convenience.  A private method called checkSizes compares the slow recursive size to the size stored in the node.
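The bookkeeping idea for both height and size can be sketched like this.  This is a simplified illustration, not the assignment's RangeAvlNode: the point is that after a child pointer changes, the parent's cached fields can be refreshed in O(1) from the children's cached fields.

```java
// Illustrative sketch only; the assignment's node class and helpers differ.
// Height and size are stored as node state and refreshed in O(1).
public class SizedNodeDemo {
    static class Node {
        String element;
        Node left, right;
        int height;   // height of the subtree rooted here (leaf = 0)
        int size = 1; // number of nodes in the subtree rooted here

        Node(String element) { this.element = element; }
    }

    // Null-safe accessors, mirroring the provided height(Node)/size(Node)
    // helpers: no recursion, they just handle the null case.
    static int height(Node t) { return t == null ? -1 : t.height; }
    static int size(Node t)   { return t == null ? 0  : t.size; }

    // After an insert/remove/rotation changes t's children, refresh t's
    // cached fields from the children's cached fields in O(1).
    static void update(Node t) {
        t.height = 1 + Math.max(height(t.left), height(t.right));
        t.size = 1 + size(t.left) + size(t.right);
    }

    // Slow O(n) recursive size, useful only for a checkSizes-style sanity test.
    static int slowSize(Node t) {
        return t == null ? 0 : 1 + slowSize(t.left) + slowSize(t.right);
    }
}
```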

In the following two algorithms, you will be comparing a node’s element against a range [a,b), i.e. including a and everything up to but excluding b.  There are three cases: being smaller than a, being inside the range, and being at least b.  Each of these cases can be evaluated using calls to compareTo; see the default code for itemsInRange that I provided for a comparison that checks if a node is inside a range.
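The three cases can be written as one small helper built on compareTo; this snippet is illustrative (the name classify is made up) rather than code from the assignment.

```java
// Hedged sketch of the three-way test against a half-open range [a, b).
public class RangeCompareDemo {
    // Returns -1 if x is below the range (x < a), 0 if x is inside [a, b)
    // (a is included), and +1 if x is at or above b (b is excluded).
    static int classify(String x, String a, String b) {
        if (x.compareTo(a) < 0) return -1;
        if (x.compareTo(b) >= 0) return 1;
        return 0;
    }
}
```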

The first algorithm for you to implement is itemsInRange, in a runtime of O(m+log n).  Regardless of how efficient we try to be, we always have to copy m items into the list to be returned.  So if m is small, the runtime is governed by the height of the tree, whereas when m is large, the height of the tree is insignificant.  The idea is that when we are not copying items (meaning that we are outside the range), we do as little work as possible.  The default implementation I provide does an in-order traversal in O(n) and copies only when the item is within the range [a,b).  However, there is a lot of wasted work going on.  Recall the definition of a binary search tree: every node descending to the left of a node is smaller than it, and every node descending to the right of a node is larger than it.  So if we’re at a node that is smaller than a, a right descendant of it may be larger than a, but it is impossible for a left descendant to be larger than a.  Hence if we’re at a node smaller than a, we do not need to recurse to the left, because it is impossible for any such descendants to be in the range [a,b).  A similar argument applies to nodes at or above b.
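The pruning described above can be sketched as follows.  This uses a simplified node class and made-up names rather than the assignment's RangeAvlTree code; treat it as one possible shape of the recursion, not the required implementation.

```java
import java.util.List;

// Sketch of the pruned in-order traversal for itemsInRange over [a, b),
// achieving O(m + log n) by never descending into subtrees that cannot
// intersect the range.
public class ItemsInRangeDemo {
    static class Node {
        String element;
        Node left, right;
        Node(String e, Node l, Node r) { element = e; left = l; right = r; }
    }

    static void itemsInRange(Node t, String a, String b, List<String> out) {
        if (t == null) return;
        boolean aboveLower = t.element.compareTo(a) >= 0;
        boolean belowUpper = t.element.compareTo(b) < 0;
        // Left descendants are all smaller than t; skip them if t < a already.
        if (aboveLower) itemsInRange(t.left, a, b, out);
        // Copy t itself only when it lies inside [a, b); this is the
        // in-order position, so out stays sorted.
        if (aboveLower && belowUpper) out.add(t.element);
        // Right descendants are all larger than t; skip them if t >= b already.
        if (belowUpper) itemsInRange(t.right, a, b, out);
    }
}
```

On the small balanced tree with root "M", children "F" and "S", and leaves "C", "J", "Q", "W", the range ["F", "R") yields the list ["F", "J", "M", "Q"] while never visiting below "C" or to the right of "S".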

The second algorithm for you to implement is sizeOfRange, in a runtime of O(log n).  We will build on the ideas from the first algorithm.  We wish to count the number of items in the range [a,b).  If a node is less than a, then certainly none of its left descendants are going to contribute to the count.  We can go one step further and account for subtrees all of whose nodes lie in the desired range: in such a case, instead of recursing, we can just read the subtree root’s size field.  Suppose that node is within the range [a,b), meaning that we count it.  If node.left is smaller than the lower bound a, then all of node.left’s left descendants are definitely outside the range (stop recursing), and some of node.left’s right descendants may be in the range (keep recursing).  The other case is if node.left is at least a.  Then node.left’s left descendants may be in the range (keep recursing), and all of node.left’s right descendants must be inside the range!  They are all inside the range because they must be smaller than node and bigger than node.left, by binary-search order.  A similar argument follows with respect to the upper bound b.  In your implementation, beware of null pointers due to empty subtrees!
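One way to realize the O(log n) bound is sketched below, with a simplified node class rather than the assignment's RangeAvlNode.  Note this uses a two-pass decomposition (count the items at least a, subtract the items at least b), which is slightly different from the single-pass description above, but it relies on the same insight: read a cached size field whenever an entire subtree is known to be in range, instead of recursing into it.

```java
// Sketch of counting items in [a, b) in O(log n) via cached subtree sizes.
// Hypothetical names; not the assignment's code.
public class SizeOfRangeDemo {
    static class Node {
        String element;
        Node left, right;
        int size; // cached subtree size, assumed correctly maintained

        Node(String e, Node l, Node r) {
            element = e; left = l; right = r;
            size = 1 + size(l) + size(r);
        }
    }

    static int size(Node t) { return t == null ? 0 : t.size; }

    // Count nodes x in subtree t with x >= a.  O(log n): at each level we
    // recurse into at most one child and read the other child's cached size.
    static int countAtLeast(Node t, String a) {
        if (t == null) return 0;
        if (t.element.compareTo(a) < 0) {
            // t and its whole left subtree are below a; only the right matters.
            return countAtLeast(t.right, a);
        }
        // t and all of its right descendants are >= a: read the cached size.
        return 1 + size(t.right) + countAtLeast(t.left, a);
    }

    // Items in [a, b) = (items >= a) - (items >= b).
    static int sizeOfRange(Node t, String a, String b) {
        return countAtLeast(t, a) - countAtLeast(t, b);
    }
}
```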

I’d recommend drawing some simple trees and sketching out these ideas before trying to code anything.

Testing and Debugging

The database of movies is interesting and fun because it is large.  Its size makes it good for verifying that your algorithm is indeed running fast, but what’s the point of a fast algorithm that doesn’t work?  To test and debug your code, I recommend using simple trees that you could draw by hand.  You can replace the main code in RangeAvlTree to do some of this testing.  The smaller your test cases are, the faster they will run, and the easier it will be to interpret the results.  You may also want to write a toString method for RangeAvlNode that prints out pertinent information for a node, such as its element and its size.  Build your code on a solid foundation.  Your code for 5) below could be giving incorrect results because there is a bug in it or because the sizes maintained in 2) and 3) are wrong.  Similarly, it’s easier to test 2) first without invoking rotations, and once you know 2) is working, get 3) going.  Isolate bugs into as small and precise a location as possible.  Use a slow but simple version of a method to compare against a faster but more complicated version.

Programming (70 points):

You should use good style, whitespace, naming conventions, etc., and comment your code.

You may not add any other imports, class or instance variables, or change the signatures for any existing method.  You may add other private methods.  We should still be able to test your code by running something similar to the original MoviePrefixSearcher main.  You may modify any of the mains, but take advantage of the comments showing the expected output that I provided.

1) Write code to store the CACHE_SIZE most recent results from listWithPrefix in an LRU cache, using the prefixes as the keys.  If the cache is disabled, do not look in the cache or add any results to it.  If the cache is enabled and the cache misses, query the tree and add the result to the cache.  If the cache contains the prefix, return the cached result instead of querying the tree.  This code goes in the block with the UnsupportedOperationException and should not be more than a few lines.  Use the SizeBoundedLinkedHashMap cache already provided for you.

The last bit of code in main tests the runtimes with the caching; the times from my computer are shown.  The code in main does not test the correctness of the results returned; you must test that yourself.  I recommend doing the caching first because it’s a warmup and the differences in the runtimes are more dramatic than when the other methods are optimized.  The examples show a period of 100, where the cache is just large enough, and a period of 121, where old items are cycled out before they are accessed again.
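For reference, the general technique behind a size-bounded LRU cache in Java looks like the following; the provided SizeBoundedLinkedHashMap presumably works along these lines, but this demo class is a sketch, not the assignment's code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A LinkedHashMap in access order that evicts its eldest (least-recently-used)
// entry once a size bound is exceeded.
public class LruCacheDemo<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCacheDemo(int maxEntries) {
        // accessOrder = true makes get() move an entry to the "newest" end.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict the least-recently-used entry
    }
}
```

Typical use: on a lookup, check the cache first; on a miss, compute the result and put it in the cache, letting the map evict old entries automatically.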

2) Write code in the private insert method to update the size field for the parameter t.  You should handle this code similarly to how the height is being handled.  Use the size(RangeAvlNode) method for convenience.  There is a private method called checkSizes() that I recommend running after an insert to check that you are updating the sizes properly.  Note that you should not run checkSizes in your final/production version because it is slow.

The size update in insert is just part of the puzzle.  The sizes also need to be updated in the rotation code, just as the heights are updated there.  It’s best to do one thing at a time though, so you can build a test binary tree that is balanced and needs no rotating, for example, by inserting the following strings in this order:
“M” “F” “S” “C” “J” “Q” “W” “B” “D” “H” “K” “O” “R” “U” “Y” “A” “E” “G” “I” “L” “N” “P” “T” “V” “X” “Z”

3) Write the code in the private rotateWithLeftChild and rotateWithRightChild to update the size fields for the nodes being rotated, similarly to how height is being handled.  Now your sizes should be fully updated, and you can test checkSizes after arbitrary inserts that trigger rotations.

4) Write and use a recursive helper method for itemsInRange that returns a list of items within the interval in O(m+log n) time (m items returned, n items in tree).  When you are testing, you can see if this list is equal to the list generated by the slow helper method that I provided, but for your final/production version, just call your fast helper.  See the algorithmic details section for hints.

5) Write and use a recursive helper method for sizeOfRange that returns the number of items within the interval in O(log n) time.  When you are testing, you can see if the count is equal to the count generated by the slow way that I provided, but for your final/production version, just call your fast helper.  See the algorithmic details section for hints.

Turn in your work electronically from the “assignments” link on the class webpage.  If you did not receive an email receipt, you did not turn in your project.