CSE143X Notes for Monday, 12/7/20

I first spent some time showing how to write a generic version of the binary search tree class. I pointed out that in a previous lecture we wrote an add method for the IntTree class that would allow you to build a binary search tree of integers. You might want to build other kinds of binary search trees with different kinds of data and you wouldn't want to make many copies of essentially the same code (one for integers, one for Strings, etc). Instead, you want to write the code in a more generic way.

I showed a client program for a new version of the class that I call SearchTree that will work for any class that implements what is known as the Comparable interface. The new class is a generic class, so it would be better to describe it as SearchTree<E> (for some element type E). The client code constructs a SearchTree<String> that puts words into alphabetical order and a SearchTree<Integer> that puts numbers into numerical order.

I then went over some of the details that come up in converting the IntTree binary search tree code into the generic SearchTree code. I mentioned that programming generic classes can be rather tricky. I'm showing this example so that you can see how it's done, but I wouldn't expect you to implement a generic class on your own. You should, however, know how to make use of a generic class or interface. For example, we might ask you to use a LinkedList<String> or we might ask you to implement the Comparable<T> interface, but you won't have to write a generic class from scratch.

We started by writing a node class for the tree. We found it was very tedious because we had to include the <E> notation in so many different places:

        public class SearchTreeNode<E> {
            public E data;
            public SearchTreeNode<E> left;
            public SearchTreeNode<E> right;
        
            public SearchTreeNode(E data) {
                this(data, null, null);
            }
        
            public SearchTreeNode(E data, SearchTreeNode<E> left,
                                  SearchTreeNode<E> right) {
                this.data = data;
                this.left = left;
                this.right = right;
            }
        }
Then I asked about the SearchTree class. Like our IntTree, it should have a single field to store a reference to the overall root, so we wrote the following:

        public class SearchTree<E> {
            private SearchTreeNode<E> overallRoot;

            ...
        }
We then looked at how to convert the IntTree add method into a corresponding generic method. The syntax makes it look fairly complicated, but in fact, it's not that different from the original code. Remember that our IntTree add looks like this:

        public void add(int value) {
            overallRoot = add(overallRoot, value);
        }
    
        private IntTreeNode add(IntTreeNode root, int value) {
            if (root == null) {
                root = new IntTreeNode(value);
            } else if (value <= root.data) {
                root.left = add(root.left, value);
            } else {
                root.right = add(root.right, value);
            }
            return root;
        }
If we just replace "int" with "E" and replace "IntTreeNode" with "SearchTreeNode<E>", we almost get the right answer:

        public void add(E value) {
            overallRoot = add(overallRoot, value);
        }
    
        private SearchTreeNode<E> add(SearchTreeNode<E> root, E value) {
            if (root == null) {
                root = new SearchTreeNode<>(value);
            } else if (value <= root.data) {
                root.left = add(root.left, value);
            } else {
                root.right = add(root.right, value);
            }
            return root;
        }
The problem is that we can no longer perform the test in this line of code:

        } else if (value <= root.data) {
Instead, we have to use a call on the compareTo method:

        } else if (value.compareTo(root.data) <= 0) {
We have to make one more change as well. All that Java would know about these data values is that they are of some generic type E. That means that as far as Java is concerned, the only role it knows they can fill is the Object role. Unfortunately, the Object role does not include a compareTo method because not all classes implement the Comparable interface. We could fix this with a cast and that is what you'll find in most of the code written by Sun:

        } else if (((Comparable<E>) value).compareTo(root.data) <= 0) {
Another approach is to modify the class header to include this information. We want to add the constraint that the class E implements the Comparable interface. We specify that by saying:

        public class SearchTree<E extends Comparable<E>> {
            ...
        }
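Putting the pieces above together, a minimal sketch of the generic class might look like the following. The toSortedList method and the SearchTreeClient class are hypothetical additions for illustration only (they are not the client program from lecture), and the node class is trimmed to a single constructor to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;

// Node of a generic binary search tree (trimmed version of SearchTreeNode<E>).
class SearchTreeNode<E> {
    public E data;
    public SearchTreeNode<E> left;
    public SearchTreeNode<E> right;

    public SearchTreeNode(E data) {
        this.data = data;
    }
}

// Generic binary search tree; E must be comparable to other E values.
class SearchTree<E extends Comparable<E>> {
    private SearchTreeNode<E> overallRoot;

    // Adds the value so that the binary search tree property is preserved.
    public void add(E value) {
        overallRoot = add(overallRoot, value);
    }

    private SearchTreeNode<E> add(SearchTreeNode<E> root, E value) {
        if (root == null) {
            root = new SearchTreeNode<>(value);
        } else if (value.compareTo(root.data) <= 0) {
            root.left = add(root.left, value);   // duplicates go to the left
        } else {
            root.right = add(root.right, value);
        }
        return root;
    }

    // Hypothetical helper: returns the values in sorted (inorder) order.
    public List<E> toSortedList() {
        List<E> result = new ArrayList<>();
        inorder(overallRoot, result);
        return result;
    }

    private void inorder(SearchTreeNode<E> root, List<E> result) {
        if (root != null) {
            inorder(root.left, result);
            result.add(root.data);
            inorder(root.right, result);
        }
    }
}

// Hypothetical client: one tree of Strings, one tree of Integers.
class SearchTreeClient {
    public static void main(String[] args) {
        SearchTree<String> words = new SearchTree<>();
        for (String s : new String[] {"pear", "apple", "fig"}) {
            words.add(s);
        }
        System.out.println(words.toSortedList());    // [apple, fig, pear]

        SearchTree<Integer> numbers = new SearchTree<>();
        for (int n : new int[] {42, 8, 3, 8}) {
            numbers.add(n);
        }
        System.out.println(numbers.toSortedList());  // [3, 8, 8, 42]
    }
}
```

Because the constraint is part of the class header, a declaration like SearchTree<Object> is rejected at compile time rather than failing later with a bad cast.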
It's odd that Java has us use the keyword "extends" because we want it to implement the interface, but that's how generics work in Java. If we are defining a class, we make a distinction between when it extends another class versus when it implements an interface. But in generic declarations, we have just the word "extends", so we use it for both kinds of extension.

Then I discussed the idea of binary search. This is an important algorithm to understand. The idea behind binary search is that if you have a sorted list of numbers, then you can quickly locate any given value in the list. The approach is similar to the one used in the guessing game program from CSE142. In that game the user is supposed to guess a number between 1 and 100 given clues about whether the answer is higher or lower than their guess. So you begin by guessing the middle value, 50. That way you've eliminated half of the possible values once you know whether the answer is higher or lower than that guess. You can continue that process, always cutting your range of possible answers in half by guessing the middle value.

The number of steps involved will depend on how many times you have to cut the set of possible answers in half. If you start with n answers, eventually you will get down to just one. The number of steps, then, can be computed as:

        n / 2 / 2 / 2 ... / 2 = 1
        n / 2^? = 1
        n = 2^?
        ? = log2(n)
I contrasted this with the way that we wrote the indexOf method for both the ArrayIntList and the LinkedIntList. In both cases we iterated through the list from beginning to end until we either found the value or ran out of values. This is an approach known as linear search.
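
To make the contrast concrete, here is a small sketch of both searches on a sorted array of ints. The class and method names are hypothetical, and this is not the indexOf code from ArrayIntList or LinkedIntList, just the same two ideas applied to a plain array.

```java
class SearchSketch {
    // Linear search: examine the values from front to back until found.
    static int linearSearch(int[] list, int value) {
        for (int i = 0; i < list.length; i++) {
            if (list[i] == value) {
                return i;
            }
        }
        return -1;  // not found after looking at all n values
    }

    // Binary search: assumes the array is sorted in nondecreasing order.
    // Each pass cuts the range low..high of possible positions in half.
    static int binarySearch(int[] sorted, int value) {
        int low = 0;
        int high = sorted.length - 1;
        while (low <= high) {
            int mid = low + (high - low) / 2;  // middle of the current range
            if (sorted[mid] == value) {
                return mid;
            } else if (sorted[mid] < value) {
                low = mid + 1;   // answer must be in the upper half
            } else {
                high = mid - 1;  // answer must be in the lower half
            }
        }
        return -1;  // range is empty, so the value is not present
    }

    public static void main(String[] args) {
        int[] data = {2, 3, 5, 7, 11, 13, 17, 19};
        System.out.println(linearSearch(data, 13));  // 5
        System.out.println(binarySearch(data, 13));  // 5
        System.out.println(binarySearch(data, 4));   // -1
    }
}
```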

I said that clearly the binary search is faster than the linear search, but how much faster? The linear algorithm potentially examines all n values. If a value is not found, it has to examine all of them. Even if it is found, it will tend to look at half of them on average if the value appears once in the list. So we are comparing an algorithm that takes n steps versus an algorithm that takes log2(n) steps. That is a huge difference. I mentioned that an approximation I often use to reason about this is that 2^10 is about equal to 10^3 (the actual values are 1024 versus 1000).

Using this approximation, we can say that for one thousand values to search, we are talking about an algorithm that takes 10 steps versus an algorithm that takes one thousand steps. For a million values, it's 20 steps versus a million. For a billion values, it's 30 steps versus a billion. For a trillion it is 40 steps versus a trillion. And so on.
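
These step counts can be checked with a few lines of code. The sketch below (hypothetical class and method names) counts how many times you have to double 1 before reaching n, which is the same as the number of halvings needed to get from n possibilities down to one.

```java
class HalvingSteps {
    // Returns the number of doublings of 1 needed to reach at least n,
    // which is log2(n) rounded up for n >= 1.
    static int steps(long n) {
        int count = 0;
        long reach = 1;
        while (reach < n) {
            reach *= 2;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(steps(1_000L));              // 10
        System.out.println(steps(1_000_000L));          // 20
        System.out.println(steps(1_000_000_000L));      // 30
        System.out.println(steps(1_000_000_000_000L));  // 40
    }
}
```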

Computer scientists characterize differences like these by talking about the complexity of an algorithm.

The word "complexity" can be interpreted in many ways. It sounds like a measure of how complex or how complicated a program is. For example, jGRASP has a tool under the File menu that allows you to create a "Complexity Profile Graph" of your code, which is a software engineering concept that is somewhat similar to this. But that's not how computer scientists use the term most often. When we refer to the complexity of an algorithm or a code fragment, we most often are referring to the resources that it requires to execute. The two resources that we are generally most interested in are time (how long it takes to run) and space (how much memory it needs).

We'll find that a common result is that these two primary resources can often be traded off. We can generally make a program work with less memory if we're willing to have it take more time to run. We can also generally get programs to run faster if we're willing to allocate some extra memory to the task.

Of these two, the resource that computer scientists most often refer to when talking about complexity is time. In particular, we are interested in the growth rate as the input size increases. We begin by deciding on some way to measure the size of the input (e.g., the number of names to sort, the number of numbers to examine, etc) and call this "n". We are interested in what happens when we change n. For example, if it takes time "t" to execute for n items, how much time does it take to execute for 2n items?

I pointed out that this is one of the few places where computer science is actually like a science. Some instructors ask their students to collect empirical timing data for different input sizes and have them plot these values to see if the plot matches the prediction. Unfortunately, these experiments are more difficult to perform on modern computers because features like cache memory skew the results. The important thing is that the predictions hold for large values of n.

I then mentioned a simple rule of thumb that you can apply to Java programs to figure out the complexity of a code segment. I mentioned that it "almost" works. The idea is to find the line of code that is executed most often. In thinking about this, you have to be careful how you count. For example, with a for loop, we'd count the loop itself as executing just once, but the statements controlled by the loop might be executed many times. Of course, a for loop can be inside a for loop in which case the inner loop is executed multiple times. But think in terms of how many times you enter the loop when counting the number of executions of the line of code that begins with "for". You also have to consider how many times various lines of code get executed that you didn't write. There is a method called Arrays.sort that will sort an array and you can't count it as one line of code being executed when you call it. You would have to count how many times its lines of code are executed as well. Programmers use tools known as profilers to give them data about how many times each line of code is executed.

I pointed out that I see a lot of undergraduates who obsess about efficiency and I think that in general it's a waste of their time. Many computer scientists have commented that premature optimization is counterproductive. Don Knuth has said that "Premature optimization is the root of all evil." The key is to focus your attention where you can get real benefit. The analogy I'd make is that if you found that a jet airplane weighed too much, you might decide to put the pilot on a diet. While it's true that in some sense every little bit helps, you're not going to make much progress by trying to slim down the pilot when the plane and its passengers and cargo weigh so much more than the pilot. I see lots of undergraduates putting the pilot on a diet while ignoring much more important details.

In terms of the growth rate of different algorithms, I mentioned that some of your intuitions from calculus will be helpful. You've probably been asked to solve problems like figuring out what the limit is as you approach infinity of an expression like this:

        n^3 - 18 n^2 + 385 n + 708
        --------------------------
        0.005 n^4 - 13 n^2 + 73842
When you solve a limit like this, you ignore things like coefficients and you ignore small terms. What matters here is that you basically have:

        n^3
        ---
        n^4
The rest is noise. So this is something that you know is going to approach 0 because eventually the n^4 will dominate the n^3 no matter what the coefficients and lower-order terms are. We use similar reasoning with complexity. We ignore constant multipliers and we ignore lower order terms to focus on the main term.
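
As a quick numerical check of this reasoning, the sketch below (hypothetical names) evaluates the full expression for a few large values of n and compares it with what the main terms alone predict, n^3 / (0.005 n^4), which simplifies to 200/n.

```java
class LimitSketch {
    // The full expression from the example above.
    static double ratio(double n) {
        double top = n * n * n - 18 * n * n + 385 * n + 708;
        double bottom = 0.005 * n * n * n * n - 13 * n * n + 73842;
        return top / bottom;
    }

    // What we get if we keep only the main terms: n^3 / (0.005 n^4).
    static double mainTermsOnly(double n) {
        return 200 / n;
    }

    public static void main(String[] args) {
        for (double n : new double[] {1e3, 1e6, 1e9}) {
            System.out.println("n = " + n
                    + ", full = " + ratio(n)
                    + ", main terms = " + mainTermsOnly(n));
        }
    }
}
```

The two columns get closer together as n grows, and both head toward 0. Note that the 0.005 coefficient still affects the value for any particular n; it just doesn't change which term wins in the long run.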

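I said that different algorithms naturally fall into different complexity classes, such as constant O(1), logarithmic O(log n), linear O(n), O(n log n), quadratic O(n^2), cubic O(n^3), and exponential O(2^n).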

I mentioned that chapter 13 discusses complexity in more detail and showed the following table from the book:

The following table presents several hypothetical algorithm runtimes as an input size N grows, assuming that each algorithm required 100ms to process 100 elements. Notice that even if they all start at the same runtime for a small input size, the ones in higher complexity classes become so slow as to be impractical.

        input size (N)  O(1)    O(log N)  O(N)     O(N log N)  O(N^2)          O(N^3)          O(2^N)
        100             100 ms  100 ms    100 ms   100 ms      100 ms          100 ms          100 ms
        200             100 ms  115 ms    200 ms   240 ms      400 ms          800 ms          32.7 sec
        400             100 ms  130 ms    400 ms   550 ms      1.6 sec         6.4 sec         12.4 days
        800             100 ms  145 ms    800 ms   1.2 sec     6.4 sec         51.2 sec        36.5 million years
        1600            100 ms  160 ms    1.6 sec  2.7 sec     25.6 sec        6 min 49.6 sec  42.1 * 10^24 years
        3200            100 ms  175 ms    3.2 sec  6 sec       1 min 42.4 sec  54 min 36 sec   5.6 * 10^61 years

Then I talked about a specific problem. The idea is that we have a list of integers, both positive and negative, and we want to find the subsequence that has the highest sum. If there weren't any negative integers, you'd always include all of the numbers. But because some of them can be negative, it might be the case that some portion of the list has a sum that is greater than any other sequence from the list. The subsequences always involve taking a contiguous chunk of the list. This particular problem has often been used by Microsoft as an interview question, probably because there are different ways to solve it, some of which are much faster than others.

As an example, suppose you have an array that stores these values:

          [0]   [1]   [2]   [3]   [4]   [5]   [6]   [7]   [8]   [9]
        +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
        |  14 |  8  | -23 |  4  |  6  |  10 | -18 |  5  |  5  |  11 |
        +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
The maximum sum is obtained by adding up the values from index [3] through [9]:

        4 + 6 + 10 + -18 + 5 + 5 + 11 = 23
It might seem odd that you include -18, but it's because including that allows you to include the three numbers that come before, which add up to 20 (a net gain of 2 for the overall sum).

There is a simple way to solve this that involves picking each possible subsequence. We can have one for loop that generates each possible starting point and another for loop that generates each possible stopping point:

        for (int start = 0; start < list.length; start++) {
            for (int stop = start; stop < list.length; stop++) {
                look at the numbers from start to stop
            }
        }
So how do we "look at the numbers from start to stop"? We can write a loop that adds up each of those numbers:

        int sum = 0;
        for (int i = start; i <= stop; i++) {
            sum += list[i];
        }
And once we have that sum, we can compare it against the maximum sum we've seen so far and reset the maximum if this sum is better:

        if (sum > max) {
            max = sum;
        }
Putting these pieces together and including some initialization outside the loop, we end up with the following code:

        int max = list[0];
        int maxStart = 0;
        int maxStop = 0;
        for (int start = 0; start < list.length; start++) {
            for (int stop = start; stop < list.length; stop++) {
                int sum = 0;
                for (int i = start; i <= stop; i++) {
                    sum += list[i];
                }
                if (sum > max) {
                    max = sum;
                    maxStart = start;
                    maxStop = stop;
                }
            }
        }
That's the first approach. The line that is executed most often in this approach is the "sum += ..." line inside the innermost for loop (the "i" loop that adds up the list). It is nested inside three different loops, each of which executes on the order of n times. So we would predict that this code would be an O(n^3) algorithm.

Then I asked how the algorithm could be improved. How can we do this faster? The bottleneck is the line that is adding up individual numbers and the key to improving the algorithm is noticing how we're doing a lot of duplicate work. Think about what happens the first time through the outer loop when "start" is equal to 0. We go through the inner loop for all possible values of "stop". So suppose the list is 2000 long. We're going to compute:

        the sum from 0 to 0
        the sum from 0 to 1
        the sum from 0 to 2
        the sum from 0 to 3
        the sum from 0 to 4
        ...
        the sum from 0 to 1999
Those are all the possibilities that start with 0. We have to explore each of these possibilities, but think about how we're computing the sums. We have an inner "i" loop that is computing the sum from scratch each time. For example, suppose that we just finished computing the sum from 0 to 1000. That was a lot of work. We next compute the sum from 0 to 1001, but we throw away the sum we just computed and start from the very beginning, having i go through all of the values 0 through 1001. Why start back at the beginning? If you know what the values from 0 to 1000 add up to, then just add the value at position 1001 to get the sum from 0 to 1001.

So the key is to eliminate the inner "i" loop by keeping a running sum. This requires us to move the initialization of sum from the inner loop to the outer loop so that we don't forget the work we've already done.

        int max = list[0];
        int maxStart = 0;
        int maxStop = 0;
        for (int start = 0; start < list.length; start++) {
            int sum = 0;
            for (int stop = start; stop < list.length; stop++) {
                sum += list[stop];
                if (sum > max) {
                    max = sum;
                    maxStart = start;
                    maxStop = stop;
                }
            }
        }
In this code the most frequently executed statements are inside the for loop for "stop" (the line that begins "sum +=" and the if). There are other lines of code that you could argue also tie with this, like the for loop test and for loop increment. Each of these lines of code is inside an outer loop that iterates n times and an inner loop that iterates an average of n/2 times. So this is an O(n^2) algorithm.
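
One way to convince yourself of these predictions is to count directly how many times the "sum +=" line runs in each of the first two approaches. The sketch below (hypothetical names) keeps only the loop structure and replaces the work with a counter. The closed forms mentioned in the comments, n(n+1)(n+2)/6 and n(n+1)/2, come from summing the loop bounds and are not from the lecture.

```java
class LoopCounts {
    // Executions of "sum += list[i]" in the three-loop approach.
    static long countFirstApproach(int n) {
        long count = 0;
        for (int start = 0; start < n; start++) {
            for (int stop = start; stop < n; stop++) {
                for (int i = start; i <= stop; i++) {
                    count++;
                }
            }
        }
        return count;  // equals n(n+1)(n+2)/6, which grows like n^3
    }

    // Executions of "sum += list[stop]" in the running-sum approach.
    static long countSecondApproach(int n) {
        long count = 0;
        for (int start = 0; start < n; start++) {
            for (int stop = start; stop < n; stop++) {
                count++;
            }
        }
        return count;  // equals n(n+1)/2, which grows like n^2
    }

    public static void main(String[] args) {
        for (int n : new int[] {100, 200, 400}) {
            System.out.println("n = " + n
                    + ", first = " + countFirstApproach(n)
                    + ", second = " + countSecondApproach(n));
        }
    }
}
```

Each time n doubles, the first count goes up by a factor of roughly 8 and the second by a factor of roughly 4. The constant divisors (6 and 2) are exactly the kind of detail that the complexity class ignores.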

I mentioned that there is a third algorithm, although I wouldn't have time to discuss it in detail and it is the most difficult to understand, so I did not attempt to prove its correctness. I did, however, try to explain the basic idea. The key is to avoid computing all of the sums. We want to have some heuristic that would allow us to ignore certain possibilities. We do that with a single loop that computes for each value of i the highest possible sum you can form that ends with list[i].

For an i of 0, there is only one sequence that ends with list[0] and it's just list[0] itself, which becomes the max:

          [0]   [1]   [2]   [3]   [4]   [5]   [6]   [7]   [8]   [9]
        +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
        |  14 |  8  | -23 |  4  |  6  |  10 | -18 |  5  |  5  |  11 |
        +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
          max
          14
There are two sequences ending with list[1] and we get a better max when we use the one that includes both list[0] and list[1]:

          [0]   [1]   [2]   [3]   [4]   [5]   [6]   [7]   [8]   [9]
        +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
        |  14 |  8  | -23 |  4  |  6  |  10 | -18 |  5  |  5  |  11 |
        +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
          max   max
          14    22
There are three sequences ending in list[2] and the max is achieved by including all three list values:

          [0]   [1]   [2]   [3]   [4]   [5]   [6]   [7]   [8]   [9]
        +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
        |  14 |  8  | -23 |  4  |  6  |  10 | -18 |  5  |  5  |  11 |
        +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
          max   max   max
          14    22    -1
Something very important happens next. There are four different sequences that end with list[3], but we really have only two choices to consider: extend the best sequence that ends with list[2], or start fresh with list[3]. We know that the best you can do with a sequence that ends with list[2] is to get a sum of -1. That means that if we include that sequence, then we will end up with a smaller sum than if we were to exclude it. So the best sum can be achieved with just list[3] itself, without including the values that come before:

          [0]   [1]   [2]   [3]   [4]   [5]   [6]   [7]   [8]   [9]
        +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
        |  14 |  8  | -23 |  4  |  6  |  10 | -18 |  5  |  5  |  11 |
        +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
          max   max   max   max
          14    22    -1     4
Compare this to what happens later when we are considering what to do for list[7]:

          [0]   [1]   [2]   [3]   [4]   [5]   [6]   [7]   [8]   [9]
        +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
        |  14 |  8  | -23 |  4  |  6  |  10 | -18 |  5  |  5  |  11 |
        +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
          max   max   max   max   max   max   max
          14    22    -1     4    10    20     2
Even though the value that comes before list[7] is a negative number, the maximum sum that can be achieved that ends with list[6] is a positive number (2), so we include it in our answer for the best sum that can be achieved that ends with list[7]:

          [0]   [1]   [2]   [3]   [4]   [5]   [6]   [7]   [8]   [9]
        +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
        |  14 |  8  | -23 |  4  |  6  |  10 | -18 |  5  |  5  |  11 |
        +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
          max   max   max   max   max   max   max   max
          14    22    -1     4    10    20     2     7
The max is 7, not 5. Continuing this process, we discover the correct sequence that has a sum of 23:

          [0]   [1]   [2]   [3]   [4]   [5]   [6]   [7]   [8]   [9]
        +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
        |  14 |  8  | -23 |  4  |  6  |  10 | -18 |  5  |  5  |  11 |
        +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
          max   max   max   max   max   max   max   max   max   max
          14    22    -1     4    10    20     2     7    12    23
The key, then, for each value of i is to look to see if it is better to include the values that came before or whether to start fresh with list[i] only. The answer is that we include the values that came before if the sum that can be achieved with them is not negative. This approach is known as Kadane's algorithm. Here is the code for this approach:

        int max = list[0];
        int maxStart = 0;
        int maxStop = 0;
        int start = 0;
        int sum = 0;
        for (int i = 0; i < list.length; i++) {
            if (sum < 0) {
                start = i;
                sum = 0;
            }
            sum += list[i];
            if (sum > max) {
                max = sum;
                maxStart = start;
                maxStop = i;
            }
        }
The most frequently executed statements are those in the body of the for loop (the two if's and the "sum +="). They are each executed n times, so we would expect this to be an O(n) algorithm.
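
To try this on the example array, the fragment can be wrapped in a method. The sketch below does that; the class name, the method name, and the choice to return the three results in an int array are assumptions made for illustration, and the method assumes the list is not empty.

```java
import java.util.Arrays;

class MaxSumSketch {
    // Returns {max, maxStart, maxStop} for a nonempty list, using the
    // single-loop approach described above.
    static int[] maxSum(int[] list) {
        int max = list[0];
        int maxStart = 0;
        int maxStop = 0;
        int start = 0;
        int sum = 0;
        for (int i = 0; i < list.length; i++) {
            if (sum < 0) {
                // the best sum ending just before i is negative: start fresh
                start = i;
                sum = 0;
            }
            sum += list[i];  // sum is now the best sum that ends with list[i]
            if (sum > max) {
                max = sum;
                maxStart = start;
                maxStop = i;
            }
        }
        return new int[] {max, maxStart, maxStop};
    }

    public static void main(String[] args) {
        int[] list = {14, 8, -23, 4, 6, 10, -18, 5, 5, 11};
        System.out.println(Arrays.toString(maxSum(list)));  // [23, 3, 9]
    }
}
```

The result matches the trace in the diagrams: a maximum sum of 23 from index 3 through index 9.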

I then switched to a program that has all three of these algorithms. It allows you to pick a base value of n and it runs the chosen algorithm for inputs of size n, 2n and 3n, reporting the time each took to execute and the ratio of the various times. We ran it for the first algorithm and asked it to use a base input size of 1500. We got output like this:

        How many numbers do you want to use? 1500
        Which algorithm do you want to use? 1

        for n = 1500, time = 0.913
        
        for n = 3000, time = 7.233
        
        for n = 4500, time = 24.198
        
        Double/single ratio = 7.92223439211391
        Triple/single ratio = 26.50383351588171
We're interested in how long it took to execute for each different input size and the ratios reported at the end.

The first algorithm is an O(n^3) algorithm, so we were expecting that in doubling the list, the time would increase by a factor of about 8 and in tripling the list, the time would increase by a factor of about 27. The empirical data is fairly close to the prediction.

We found that the second algorithm was much faster than the first. It is an O(n^2) algorithm, so we were able to use much larger lists. We ran this second algorithm with a base size of 30,000 and got output like the following.

        How many numbers do you want to use? 30000
        Which algorithm do you want to use? 2

        for n = 30000, time = 0.98
        
        for n = 60000, time = 3.852
        
        for n = 90000, time = 8.642
        
        Double/single ratio = 3.9306122448979592
        Triple/single ratio = 8.818367346938775
Again we want to pay attention to the times for the different list sizes and the ratios at the end. With this O(n^2) algorithm, we would predict that time would increase by a factor of 4 when we double the input and by a factor of 9 when we triple it, and the empirical results are fairly close.
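
The predicted factors come from a simple calculation. If we assume the running time is roughly c * n^p for some constant c, then growing the input by a factor of k changes the time by (k*n)^p / n^p = k^p, and the constant cancels. The sketch below (hypothetical names) compares that prediction with the times reported in the two runs above.

```java
class GrowthRatio {
    // Predicted time ratio when the input grows by a factor of k,
    // assuming the running time is roughly proportional to n^p.
    static double predicted(double k, int p) {
        return Math.pow(k, p);
    }

    public static void main(String[] args) {
        // times (in seconds) copied from the sample runs above
        double[] first = {0.913, 7.233, 24.198};   // algorithm 1, p = 3
        double[] second = {0.98, 3.852, 8.642};    // algorithm 2, p = 2

        System.out.println("algorithm 1, double: predicted "
                + predicted(2, 3) + ", measured " + first[1] / first[0]);
        System.out.println("algorithm 1, triple: predicted "
                + predicted(3, 3) + ", measured " + first[2] / first[0]);
        System.out.println("algorithm 2, double: predicted "
                + predicted(2, 2) + ", measured " + second[1] / second[0]);
        System.out.println("algorithm 2, triple: predicted "
                + predicted(3, 2) + ", measured " + second[2] / second[0]);
    }
}
```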

The third algorithm runs so fast that it almost can't be effectively timed, but here are the results for a run with 5 million elements:

        How many numbers do you want to use? 5000000
        Which algorithm do you want to use? 3

        for n = 5000000, time = 0.017
        
        for n = 10000000, time = 0.033
        
        for n = 15000000, time = 0.046
        
        Double/single ratio = 1.9411764705882353
        Triple/single ratio = 2.705882352941176
This is an O(n) algorithm, so we expect that doubling the input will double the time and tripling the input will triple the time. The empirical results aren't bad given that the algorithm runs so fast.
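
For anyone who wants to experiment without downloading the full program, here is a rough sketch of how such a timing experiment could be set up. It is not the program used in lecture: the class name, the method names, the use of random values between -1000 and 1000, and the base size of 5000 are all assumptions, and it times only the second (running-sum) algorithm.

```java
import java.util.Random;

class TimingSketch {
    static int lastResult;  // stored so the computed answer is not discarded

    // The O(n^2) running-sum approach, returning just the maximum sum.
    static int maxSumQuadratic(int[] list) {
        int max = list[0];
        for (int start = 0; start < list.length; start++) {
            int sum = 0;
            for (int stop = start; stop < list.length; stop++) {
                sum += list[stop];
                if (sum > max) {
                    max = sum;
                }
            }
        }
        return max;
    }

    // Builds a list of n pseudo-random values in the range -1000 to 1000.
    static int[] randomList(int n, long seed) {
        Random rand = new Random(seed);
        int[] list = new int[n];
        for (int i = 0; i < n; i++) {
            list[i] = rand.nextInt(2001) - 1000;
        }
        return list;
    }

    // Returns the elapsed time in seconds for one run on the given list.
    static double timeSeconds(int[] list) {
        long begin = System.nanoTime();
        lastResult = maxSumQuadratic(list);
        long end = System.nanoTime();
        return (end - begin) / 1e9;
    }

    public static void main(String[] args) {
        int n = 5000;  // base input size (an arbitrary choice)
        double single = timeSeconds(randomList(n, 1));
        double twice = timeSeconds(randomList(2 * n, 2));
        double triple = timeSeconds(randomList(3 * n, 3));
        System.out.println("for n = " + n + ", time = " + single);
        System.out.println("for n = " + 2 * n + ", time = " + twice);
        System.out.println("for n = " + 3 * n + ", time = " + triple);
        System.out.println("Double/single ratio = " + twice / single);
        System.out.println("Triple/single ratio = " + triple / single);
    }
}
```

Expect the ratios to bounce around from run to run, especially for small n, because of effects like just-in-time compilation and caching. Larger inputs and repeated runs generally give cleaner numbers.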

One of the striking things to notice is that for algorithm 1, we had to limit ourselves to a base size of just 1500 elements because it was so slow, while with algorithm 3, we could barely time it even with 5 million values. The moral of the story is that choosing the right algorithm can make a huge difference, particularly if you can choose an algorithm from a better complexity class.

Earlier in the notes I mentioned that counting the most frequently executed line of code "almost" works for predicting the time complexity. There is a notable exception that you have seen. We have often written lines of code like the following:

        int[] list = new int[n];
We have seen that when an array is constructed, Java auto-initializes all of the values to the zero-equivalent of the type. That will require n steps, which means that this one line of code is O(n). It is one of the few examples in Java where a single line of code can require time proportional to n.

The three algorithms are included in handout 6 and anyone who wants the complete program along with the timing code can download it from the calendar page.


Stuart Reges
Last modified: Mon Dec 7 12:44:35 PST 2020