CSE143 Notes for Monday, 4/16/12

I began by discussing the idea of binary search. This is an important algorithm to understand. The idea behind binary search is that if you have a sorted list of numbers, then you can quickly locate any given value in the list. The approach is similar to the one used in the guessing game program from CSE142. In that game the user is supposed to guess a number between 1 and 100 given clues about whether the answer is higher or lower than their guess. So you begin by guessing the middle value, 50. That way you've eliminated half of the possible values once you know whether the answer is higher or lower than that guess. You can continue that process, always cutting your range of possible answers in half by guessing the middle value.
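
As a rough sketch of the halving idea described above (this is not code from lecture, and the class and method names are made up for illustration), a binary search over a sorted array of ints might look something like this:

```java
class BinarySearchSketch {
    // Returns an index where target appears in the sorted array, or -1 if it is absent.
    // Assumes the array is sorted in nondecreasing order.
    static int binarySearch(int[] sorted, int target) {
        int low = 0;
        int high = sorted.length - 1;
        while (low <= high) {
            // "Guess" the middle of the remaining range (written this way to avoid overflow).
            int mid = low + (high - low) / 2;
            if (sorted[mid] == target) {
                return mid;
            } else if (sorted[mid] < target) {
                low = mid + 1;   // the answer is higher, so discard the lower half
            } else {
                high = mid - 1;  // the answer is lower, so discard the upper half
            }
        }
        return -1;  // the range is empty, so the value is not in the list
    }

    public static void main(String[] args) {
        int[] data = {-4, 2, 7, 10, 15, 20, 22, 25, 30, 36};
        System.out.println(binarySearch(data, 22));  // prints 6
        System.out.println(binarySearch(data, 8));   // prints -1
    }
}
```

Each pass through the loop cuts the remaining range roughly in half, just like each guess in the guessing game.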

I contrasted this with the way that we wrote the indexOf method for both the ArrayIntList and the LinkedIntList. In both cases we iterated through the list from beginning to end until we either found the value or ran out of values. This is an approach known as linear search.

I said that clearly the binary search is faster than the linear search, but how much faster? Computer scientists answer questions like these by talking about the complexity of an algorithm.

The word "complexity" can be interpreted in many ways. It sounds like a measure of how complex or how complicated a program is. jGRASP has a tool under the File menu that allows you to create a "Complexity Profile Graph" of your code, which is a software engineering concept that is somewhat similar to this. But that's not how computer scientists use the term most often. When we refer to the complexity of an algorithm or a code fragment, we most often are referring to the resources that it requires to execute. The two resources that we are generally most interested in are:

        time (how long it takes to run)
        space (how much memory it requires)

We'll find that a common result is that these two primary resources can often be traded off. We can generally make a program work with less memory if we're willing to have it take more time to run. We can also generally get programs to run faster if we're willing to allocate some extra memory to the task.

Of these two, the resource that computer scientists most often refer to when talking about complexity is time. In particular, we are interested in the growth rate as the input size increases. We begin by deciding on some way to measure the size of the input (e.g., the number of names to sort, the number of numbers to examine, etc.) and call this "n". We are interested in what happens when we change n. For example, if it takes time "t" to execute for n items, how much time does it take to execute for 2n items?

I pointed out that this is one of the few places where computer science is actually like a science. Some instructors ask their students to collect empirical timing data for different input sizes and have them plot these values to see if the plot matches the prediction. Unfortunately, these experiments are more difficult to perform on modern computers because features like cache memory skew the results. The important thing is that the predictions hold for large values of n.

I then mentioned a simple rule of thumb that you can apply to Java programs to figure out the complexity of a code segment. I mentioned that it "almost" works. The idea is to find the line of code that is executed most often. In thinking about this, you have to be careful how you count. For example, with a for loop, we'd count the loop itself as executing just once, but the statements controlled by the loop might be executed many times. Of course, a for loop can be inside a for loop in which case the inner loop is executed multiple times. But think in terms of how many times you enter the loop when counting the number of executions of the line of code that begins with "for".
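
To make the counting rule concrete, here is a small sketch (not from lecture; the class and method names are hypothetical) that counts how often the innermost statement of a doubly nested loop actually runs:

```java
class MostFrequentLineSketch {
    // Counts how many times the innermost statement executes for an input of size n.
    static long countInnerExecutions(int n) {
        long count = 0;
        for (int i = 0; i < n; i++) {        // this "for" line is entered once
            for (int j = 0; j < n; j++) {    // this "for" line is entered n times
                count++;                     // this statement executes n * n times
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countInnerExecutions(1000));  // prints 1000000
        System.out.println(countInnerExecutions(2000));  // prints 4000000
    }
}
```

The statement executed most often is the one inside both loops. Doubling n quadruples its count, which is the kind of growth the rule of thumb is meant to expose.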

I pointed out that I see a lot of undergraduates who obsess about efficiency and I think that in general it's a waste of their time. Many computer scientists have commented that premature optimization is counterproductive. Don Knuth has said that "Premature optimization is the root of all evil." The key is to focus your attention where you can get real benefit. The analogy I'd make is that if you found that a jet airplane weighed too much, you might decide to put the pilot on a diet. While it's true that in some sense every little bit helps, you're not going to make much progress by trying to slim down the pilot when the plane and its passengers and cargo weigh so much more than the pilot. I see lots of undergraduates putting the pilot on a diet while ignoring much more important details.

In terms of the growth rate of different algorithms, I mentioned that some of your intuitions from calculus will be helpful. You've probably been asked to solve problems like finding the limit, as n approaches infinity, of an expression like this:

        n^3 - 18 n^2 + 385 n + 708
        --------------------------
        0.005 n^4 - 13 n^2 + 73842
When you solve a limit like this, you ignore things like coefficients and you ignore small terms. What matters here is that you basically have:

        n^3
        ---
        n^4
The rest is noise. So this is something that you know is going to approach 0 because eventually the n^4 will dominate the n^3 no matter what the coefficients and lower-order terms are. We use similar reasoning with complexity. We ignore constant multipliers and we ignore lower order terms to focus on the main term.
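
If you want to convince yourself numerically, a quick sketch like the following (hypothetical class and method names, not something shown in lecture) evaluates the full expression for increasingly large n and compares it with the simplified n^3 / (0.005 n^4), which reduces to 200 / n:

```java
class LimitSketch {
    // The full expression from the limit problem above.
    static double ratio(double n) {
        double numerator = n * n * n - 18 * n * n + 385 * n + 708;
        double denominator = 0.005 * n * n * n * n - 13 * n * n + 73842;
        return numerator / denominator;
    }

    // What is left after dropping the lower-order terms: n^3 / (0.005 n^4) = 200 / n.
    static double dominantTermsOnly(double n) {
        return 200.0 / n;
    }

    public static void main(String[] args) {
        for (double n = 1e3; n <= 1e9; n *= 1e3) {
            System.out.println("n = " + n + ": full = " + ratio(n)
                               + ", dominant terms only = " + dominantTermsOnly(n));
        }
    }
}
```

For large n the two values should be nearly identical, and both shrink toward 0 as n grows.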

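I said that different algorithms naturally fall into different complexity classes, such as:

        constant: O(1)
        logarithmic: O(log n)
        linear: O(n)
        n log n: O(n log n)
        quadratic: O(n^2)
        cubic: O(n^3)
        exponential: O(2^n)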

I mentioned that chapter 13 discusses complexity in more detail and showed the following table from the book:

The following table presents several hypothetical algorithm runtimes as an input size N grows, assuming that each algorithm required 100ms to process 100 elements. Notice that even if they all start at the same runtime for a small input size, the ones in higher complexity classes become so slow as to be impractical.

        input size (N)  O(1)    O(log N)  O(N)     O(N log N)  O(N^2)          O(N^3)          O(2^N)
        100             100 ms  100 ms    100 ms   100 ms      100 ms          100 ms          100 ms
        200             100 ms  115 ms    200 ms   240 ms      400 ms          800 ms          32.7 sec
        400             100 ms  130 ms    400 ms   550 ms      1.6 sec         6.4 sec         12.4 days
        800             100 ms  145 ms    800 ms   1.2 sec     6.4 sec         51.2 sec        36.5 million years
        1600            100 ms  160 ms    1.6 sec  2.7 sec     25.6 sec        6 min 49.6 sec  42.1 * 10^24 years
        3200            100 ms  175 ms    3.2 sec  6 sec       1 min 42.4 sec  54 min 36 sec   5.6 * 10^61 years
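
The polynomial columns of a table like this can be approximated with a one-line calculation. Here is a hedged sketch (the class and method names are made up, and it only covers classes of the form O(N^k)): if a run of size baseN took baseMillis, then a run of size newN should take roughly baseMillis times (newN / baseN)^k.

```java
class GrowthPredictionSketch {
    // Predicted runtime in milliseconds at size newN for an O(N^exponent) algorithm,
    // given that it took baseMillis at size baseN. This ignores constant overhead and
    // lower-order terms, so it is only a rough estimate.
    static double predictMillis(double baseMillis, int baseN, int newN, int exponent) {
        double growthFactor = (double) newN / baseN;
        return baseMillis * Math.pow(growthFactor, exponent);
    }

    public static void main(String[] args) {
        System.out.println(predictMillis(100, 100, 200, 1));  // O(N):   200.0 ms
        System.out.println(predictMillis(100, 100, 200, 2));  // O(N^2): 400.0 ms
        System.out.println(predictMillis(100, 100, 800, 3));  // O(N^3): 51200.0 ms (51.2 sec)
    }
}
```

These three results agree with the corresponding entries in the table.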

Then I talked about a specific problem. The idea is that we have a list of integers, both positive and negative, and we want to find the subsequence that has the highest sum. If there weren't any negative integers, you'd always include all of the numbers. But because some of them can be negative, it might be the case that some portion of the list has a sum that is greater than any other sequence from the list. The subsequences always involve taking a contiguous chunk of the list. This particular problem has often been used by Microsoft as an interview question, probably because there are different ways to solve it, some of which are much faster than others.

As an example, suppose you have an array that stores these values:

          [0]   [1]   [2]   [3]   [4]   [5]   [6]   [7]   [8]   [9]
        +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
        |  14 |  8  | -23 |  4  |  6  |  10 | -18 |  5  |  5  |  11 |
        +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
The maximum sum is obtained by adding up the values from index [3] through [9]:

        4 + 6 + 10 + -18 + 5 + 5 + 11 = 23
It might seem odd that you include -18, but it's because including that allows you to include the three numbers that come before, which add up to 20 (a net gain of 2 for the overall sum).

There is a simple way to solve this that involves picking each possible subsequence. We can have one for loop that generates each possible starting point and another for loop that generates each possible stopping point:

        for (int start = 0; start < list.length; start++) {
            for (int stop = start; stop < list.length; stop++) {
                look at the numbers from start to stop
            }
        }
So how do we "look at the numbers from start to stop"? We can write a loop that adds up each of those numbers:

        int sum = 0;
        for (int i = start; i <= stop; i++) {
            sum += list[i];
        }
And once we have that sum, we can compare it against the maximum sum we've seen so far and reset the maximum if this sum is better:

        if (sum > max) {
            max = sum;
        }
Putting these pieces together and including some initialization outside the loop, we end up with the following code:

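        int max = list[0];      // best sum seen so far
        int maxStart = 0;       // where the best subsequence starts
        int maxStop = 0;        // where the best subsequence stops
        for (int start = 0; start < list.length; start++) {
            for (int stop = start; stop < list.length; stop++) {
                // add up the values from start to stop, from scratch
                int sum = 0;
                for (int i = start; i <= stop; i++) {
                    sum += list[i];
                }
                if (sum > max) {
                    max = sum;
                    maxStart = start;
                    maxStop = stop;
                }
            }
        }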
That's the first approach. The line that is executed most often in this approach is the "sum += ..." line inside the innermost for loop (the "i" loop that adds up the list). It is nested inside three different loops, each of which executes on the order of n times. So we would predict that this code would be an O(n^3) algorithm.

Then I asked how the algorithm could be improved. How can we do this faster? The bottleneck is the line that is adding up individual numbers and the key to improving the algorithm is noticing how we're doing a lot of duplicate work. Think about what happens the first time through the outer loop when "start" is equal to 0. We go through the inner loop for all possible values of "stop". So suppose the list is 2000 long. We're going to compute:

        the sum from 0 to 0
        the sum from 0 to 1
        the sum from 0 to 2
        the sum from 0 to 3
        the sum from 0 to 4
        ...
        the sum from 0 to 1999
Those are all the possibilities that start with 0. We have to explore each of these possibilities, but think about how we're computing the sums. We have an inner "i" loop that is computing the sum from scratch each time. For example, suppose that we just finished computing the sum from 0 to 1000. That was a lot of work. We next compute the sum from 0 to 1001, but we throw away the sum we just computed and start from the very beginning, having i go through all of the values 0 through 1001. Why start back at the beginning? If you know what the values from 0 to 1000 add up to, then just add the value at position 1001 to get the sum from 0 to 1001.

So the key is to eliminate the inner "i" loop by keeping a running sum. This requires us to move the initialization of sum from the inner loop to the outer loop so that we don't forget the work we've already done.

        int max = list[0];
        int maxStart = 0;
        int maxStop = 0;
        for (int start = 0; start < list.length; start++) {
            int sum = 0;
            for (int stop = start; stop < list.length; stop++) {
                sum += list[stop];
                if (sum > max) {
                    max = sum;
                    maxStart = start;
                    maxStop = stop;
                }
            }
        }
In this code the most frequently executed statements are inside the for loop for "stop" (the line that begins "sum +=" and the if). There are other lines of code that you could argue also tie with this, like the for loop test and for loop increment. Each of these lines of code is inside an outer loop that iterates n times and an inner loop that iterates an average of n/2 times. So this is an O(n^2) algorithm.

I mentioned that there is a third algorithm, although I wouldn't have time to discuss it in detail and it is the most difficult to understand, so I did not attempt to prove its correctness. I did, however, try to explain the basic idea. The key is to avoid computing all of the sums. We want to have some heuristic that would allow us to ignore certain possibilities. We do that with a single loop and by keeping track of the highest possible sum you can form that ends with list[i].

So suppose that we are considering the i-th value in the list for some i greater than 0. Let's say i is 10. Think about subsequences that end with list[10]. Some of them begin with list[10] and others begin earlier, including list[9] and potentially other values that appear before list[10]. Under what circumstances would we get a higher sum by starting with list[10] versus including these earlier values? That's the key question.

The answer is that if the best subsequence you can find ending in list[9] adds up to a positive number, then that subsequence is worth including. If it adds up to a negative number, then it takes away from the sum we are trying to generate. In other words, if those earlier values add up to a negative number, then we can get a higher sum by excluding them and starting our sequence with list[10].

Here is the code for the third approach:

        int max = list[0];
        int maxStart = 0;
        int maxStop = 0;
        int start = 0;
        int sum = 0;
        for (int i = 0; i < list.length; i++) {
            if (sum < 0) {
                start = i;
                sum = 0;
            }
            sum += list[i];
            if (sum > max) {
                max = sum;
                maxStart = start;
                maxStop = i;
            }
        }
The most frequently executed statements are those in the body of the for loop (the two if's and the "sum +="). They are each executed n times, so we would expect this to be an O(n) algorithm.
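
Since I didn't prove that the third approach is correct, one sanity check is to run it on the example array from earlier and compare it against the second approach. The sketch below wraps both fragments in methods; the class name, method names, and the choice to return {max, maxStart, maxStop} as an int array are assumptions made for illustration and are not how the handout packages the code.

```java
import java.util.Arrays;

class MaxSumSketch {
    // Third (linear) approach. Returns {max, maxStart, maxStop}.
    static int[] linearMaxSum(int[] list) {
        int max = list[0];
        int maxStart = 0;
        int maxStop = 0;
        int start = 0;
        int sum = 0;
        for (int i = 0; i < list.length; i++) {
            if (sum < 0) {   // earlier values only hurt, so start fresh at i
                start = i;
                sum = 0;
            }
            sum += list[i];
            if (sum > max) {
                max = sum;
                maxStart = start;
                maxStop = i;
            }
        }
        return new int[] {max, maxStart, maxStop};
    }

    // Second (quadratic) approach, used here only as a cross-check.
    static int[] quadraticMaxSum(int[] list) {
        int max = list[0];
        int maxStart = 0;
        int maxStop = 0;
        for (int start = 0; start < list.length; start++) {
            int sum = 0;
            for (int stop = start; stop < list.length; stop++) {
                sum += list[stop];
                if (sum > max) {
                    max = sum;
                    maxStart = start;
                    maxStop = stop;
                }
            }
        }
        return new int[] {max, maxStart, maxStop};
    }

    public static void main(String[] args) {
        int[] list = {14, 8, -23, 4, 6, 10, -18, 5, 5, 11};
        System.out.println(Arrays.toString(linearMaxSum(list)));     // prints [23, 3, 9]
        System.out.println(Arrays.toString(quadraticMaxSum(list)));  // prints [23, 3, 9]
    }
}
```

Both methods report a maximum sum of 23 running from index 3 through index 9, which matches the hand computation for the example. Agreement on one array isn't a proof, of course, but it is a useful check.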

I then switched to a program that has all three of the algorithms I'm going to discuss. It allows you to pick a base value of n and it runs the algorithm for inputs of size n, 2n and 3n, reporting the time each took to execute and the ratio of the various times. We ran it for the first algorithm and asked it to use a base input size of 1500. We got output like this:

        How many numbers do you want to use? 1500
        Which algorithm do you want to use? 1

        for n = 1500, time = 0.913
        
        for n = 3000, time = 7.233
        
        for n = 4500, time = 24.198
        
        Double/single ratio = 7.92223439211391
        Triple/single ratio = 26.50383351588171
We're interested in how long it took to execute for each different input size and the ratios reported at the end.

The first algorithm is an O(n^3) algorithm, so we were expecting that in doubling the list, the time would increase by a factor of about 8 and in tripling the list, the time would increase by a factor of about 27. The empirical data is fairly close to the prediction.
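
For anyone who wants to try this kind of experiment without the full program, here is a minimal sketch of a timing harness. It is not the program I ran in lecture: the class name, the method names, the random value range, and the small base size are all assumptions, and it times only the first algorithm.

```java
import java.util.Random;

class TimingSketch {
    // First (cubic) approach, reduced to returning just the maximum sum.
    static int cubicMaxSum(int[] list) {
        int max = list[0];
        for (int start = 0; start < list.length; start++) {
            for (int stop = start; stop < list.length; stop++) {
                int sum = 0;
                for (int i = start; i <= stop; i++) {
                    sum += list[i];
                }
                if (sum > max) {
                    max = sum;
                }
            }
        }
        return max;
    }

    // Fills a list of n random values in the range -1000 to 1000 (an arbitrary choice)
    // and returns how many seconds one run of the algorithm takes on it.
    static double timeSeconds(int n, Random rand) {
        int[] list = new int[n];
        for (int i = 0; i < n; i++) {
            list[i] = rand.nextInt(2001) - 1000;
        }
        long begin = System.nanoTime();
        int result = cubicMaxSum(list);
        long end = System.nanoTime();
        double seconds = (end - begin) / 1e9;
        System.out.println("for n = " + n + ", max = " + result + ", time = " + seconds);
        return seconds;
    }

    // Predicted slowdown for an O(n^exponent) algorithm when the input grows by factor.
    static double predictedRatio(int factor, int exponent) {
        return Math.pow(factor, exponent);
    }

    public static void main(String[] args) {
        Random rand = new Random(42);
        int n = 400;  // assumed base size; increase it if the times are too small to measure
        double single = timeSeconds(n, rand);
        double twice = timeSeconds(2 * n, rand);
        double triple = timeSeconds(3 * n, rand);
        System.out.println("Double/single ratio = " + twice / single
                           + " (predicted " + predictedRatio(2, 3) + ")");
        System.out.println("Triple/single ratio = " + triple / single
                           + " (predicted " + predictedRatio(3, 3) + ")");
    }
}
```

Expect the measured ratios to wander around the predictions, especially for small n, because things like just-in-time compilation and caching affect individual runs.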

We found that the second algorithm was much faster than the first. It is an O(n^2) algorithm, so we were able to use much larger lists. We ran this second algorithm with a base size of 30,000 and got output like the following.

        How many numbers do you want to use? 30000
        Which algorithm do you want to use? 2

        for n = 30000, time = 0.98
        
        for n = 60000, time = 3.852
        
        for n = 90000, time = 8.642
        
        Double/single ratio = 3.9306122448979592
        Triple/single ratio = 8.818367346938775
Again we want to pay attention to the times for the different list sizes and the ratios at the end. For this O(n^2) algorithm, we would predict that the time would increase by a factor of 4 when we double the input and by a factor of 9 when we triple it, and the empirical results are fairly close.

The third algorithm runs so fast that it almost can't be effectively timed, but here are the results for a run with 5 million elements:

        How many numbers do you want to use? 5000000
        Which algorithm do you want to use? 3

        for n = 5000000, time = 0.017
        
        for n = 10000000, time = 0.033
        
        for n = 15000000, time = 0.046
        
        Double/single ratio = 1.9411764705882353
        Triple/single ratio = 2.705882352941176
This is an O(n) algorithm, so we expect that doubling the input will double the time and tripling the input will triple the time. The empirical results aren't bad given that the algorithm runs so fast.

One of the striking things to notice is that for algorithm 1, we had to limit ourselves to a base size of just 1,500 elements because it was so slow, while with algorithm 3, we could barely time it even with five million values. The moral of the story is that choosing the right algorithm can make a huge difference, particularly if you can choose an algorithm from a better complexity class.

The three algorithms are included in handout 9 and anyone who wants the complete program along with the timing code can download it from the calendar page.


Stuart Reges
Last modified: Mon Apr 16 11:42:23 PDT 2012