CSE143 Notes for Friday, 1/27/06

I include below notes from the Fall version of the course since we had a guest lecture from our head TA Travis.

To explore complexity, I said that we would discuss a particular problem. The idea is that we have a list of integers, both positive and negative, and we want to find the subsequence that has the highest sum. If there weren't any negative integers, you'd always include all of the numbers. But because some of them can be negative, it might be the case that some portion of the list has a sum that is greater than the sum of any other subsequence of the list. The subsequences always involve taking a contiguous chunk of the list. This particular problem has often been used by Microsoft as an interview question, probably because there are different ways to solve it, some of which are much faster than others.

As an example, suppose the list stores the values (5000, -45, -8000, 4000, 2000). The 5000 at the front of the list is the single largest value, so you'd think we want to include it. But it turns out that the last two numbers add up to something larger (6000). Trying to include both the 5000 at the front and the 4000 and 2000 at the end would require us to include the -8000 in the middle, which also leads to a smaller sum (5000 - 45 - 8000 + 4000 + 2000 = 2955). So you get the maximum sum with the last two values (4000, 2000).

There is a simple way to solve this that involves picking each possible subsequence. We can have one for loop that generates each possible starting point and another for loop that generates each possible stopping point:

        for (int start = 0; start < list.length; start++) {
            for (int stop = start; stop < list.length; stop++) {
                // look at the numbers from start to stop
            }
        }
So how do we "look at the numbers from start to stop"? We can write a loop that adds up each of those numbers:

        int sum = 0;
        for (int i = start; i <= stop; i++) {
            sum += list[i];
        }
And once we have that sum, we can compare it against the maximum sum we've seen so far and reset the maximum if this sum is better:

        if (sum > max) {
            max = sum;
        }
Putting these pieces together, we end up with the following code:

        int max = 0;    // max sum seen so far (assuming the empty subsequence, with sum 0, counts)
        for (int start = 0; start < list.length; start++) {
            for (int stop = start; stop < list.length; stop++) {
                int sum = 0;
                for (int i = start; i <= stop; i++) {
                    sum += list[i];
                }
                if (sum > max) {
                    max = sum;
                }
            }
        }
That's the first approach. The line that is executed most often in this approach is the "sum += ..." line inside the innermost for loop (the "i" loop that adds up the list).
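
That "sum += ..." line runs once for every combination of start, stop and i with start <= i <= stop. This count isn't worked out in the notes, but for a list of n elements each subsequence of length L appears (n - L + 1) times and contributes L additions, so the total comes to:

        total additions = 1*n + 2*(n-1) + 3*(n-2) + ... + n*1
                        = n*(n+1)*(n+2)/6
which grows like n^3, and that matches the timing behavior we measure later.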

Then I asked how the algorithm could be improved. How can we do this faster? The bottleneck is the line that is adding up individual numbers, and the key to improving the algorithm is noticing that we're doing a lot of duplicate work. Think about what happens the first time through the outer loop when "start" is equal to 0. We go through the inner loop for all possible values of "stop". So suppose the list is 2000 elements long. We're going to compute:

        the sum from 0 to 0
        the sum from 0 to 1
        the sum from 0 to 2
        the sum from 0 to 3
        the sum from 0 to 4
        ...
        the sum from 0 to 1999
Those are all the possibilities that start with 0. We have to explore each of these possibilities, but think about how we're computing the sums. We have an inner "i" loop that is computing the sum from scratch each time. For example, suppose that we just finished computing the sum from 0 to 6. We next compute the sum from 0 to 7. But we start from the very beginning and have i go through all of the values 0 through 7, even though we've just computed the sum from 0 to 6.

This becomes even more obvious when you think about larger subsequences. For example, suppose that you just added up all of the values from 0 to 1000. That was a lot of work. Then you throw away that sum and start from scratch to add up the values from 0 to 1001. But why start back at the beginning? If you know what the values from 0 to 1000 add up to, then just add the value at position 1001 to get the sum from 0 to 1001.

So the key is to eliminate the inner "i" loop by keeping a running sum. This requires us to move the initialization of sum from the inner loop to the outer loop so that we don't forget the work we've already done.

        int max = 0;    // max sum seen so far (again assuming the empty subsequence counts)
        for (int start = 0; start < list.length; start++) {
            int sum = 0;
            for (int stop = start; stop < list.length; stop++) {
                sum += list[stop];
                if (sum > max) {
                    max = sum;
                }
            }
        }
In this code the most frequently executed statements are inside the for loop for "stop" (the line that begins "sum +=" and the if).
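
Those two statements run once for every (start, stop) pair with start <= stop. This count isn't in the original notes either, but for a list of n elements it is:

        n + (n-1) + (n-2) + ... + 1  =  n*(n+1)/2
which grows like n^2, an observation that matches the timing results we see later in the lecture.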

I mentioned that there is a third algorithm, although I wouldn't have time to discuss it in detail and it is the most difficult to understand, so I did not attempt to prove its correctness. I did, however, try to explain the basic idea. The key is to avoid computing all of the sums. We want some heuristic that allows us to ignore certain possibilities. We do that with a single loop and by keeping track of the highest possible sum you can form that ends with list[i].

So suppose that we are considering the i-th value in the list for some i greater than 0. Let's say i is 10. Think about subsequences that include list[10]. Some of them begin with list[10] and others begin earlier, including list[9] and potentially other values that appear before list[10]. Under what circumstances would we get a higher sum by starting with list[10] versus including these earlier values? That's the key question.

The answer is that if the best subsequence you can find ending in list[9] adds up to a positive number, then that sequence is worth including. If it adds up to a negative number, then it is taking away from the sum we are trying to generate. In other words, if those earlier values add up to a negative number, then we can get a higher sum by excluding them and starting our sequence with list[10].

This algorithm is somewhat tricky, so it's not essential that you understand exactly why it works. By running all three algorithms together, we can compare their results and get at least some evidence that the third algorithm produces the same answers as the other two.
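
For those who want to see it, here is a sketch of what the single-loop idea can look like in code. This is a reconstruction of the idea described above, not necessarily the exact findMax3 from handout #12, and like the code above it assumes the empty subsequence (with sum 0) is allowed:

        // sketch of the linear algorithm: keep the best sum of a subsequence
        // that ends at the current position, resetting whenever it goes negative
        int max = 0;             // best sum seen so far (0 = empty subsequence)
        int endingHere = 0;      // best sum of a subsequence ending at position i
        for (int i = 0; i < list.length; i++) {
            if (endingHere < 0) {
                // a negative running sum only hurts us, so start fresh at list[i]
                endingHere = 0;
            }
            endingHere += list[i];
            if (endingHere > max) {
                max = endingHere;
            }
        }
On the earlier example (5000, -45, -8000, 4000, 2000) the running sum drops to -3045 after the -8000, gets reset to 0, and then climbs to 6000 over the last two values, which is exactly the answer we expect.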

I then switched to the computer and I showed people a program I had written to explore different algorithms for this problem (handout #12).

The program includes a DEBUGGING constant that allows me to turn debugging on and off. When it's on, the code prints the overall list and it prints the best subsequence that it finds. This is useful when you're dealing with short lists and you want to verify that the code is working. But if you're dealing with thousands of elements in your list, you'd want to turn this off.

I pointed out the general structure of method main. It makes various calls on the method System.currentTimeMillis() to get the clock reading in milliseconds. I compute the time that elapses between calls on three different methods: findMax1, findMax2 and findMax3.
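
The handout itself isn't reproduced in these notes, but the timing pattern looks roughly like the sketch below. The names DEBUGGING, findMax1, findMax2 and findMax3 come from the lecture; the exact output and surrounding code are my assumptions:

        // sketch of how one algorithm is timed; main repeats this pattern
        // for findMax1, findMax2 and findMax3
        long start = System.currentTimeMillis();
        int answer = findMax1(list);
        long stop = System.currentTimeMillis();
        System.out.println("max sum = " + answer);
        System.out.println("elapsed time = " + (stop - start) + " milliseconds");
        if (DEBUGGING) {
            // only useful for short lists
            System.out.println("list = " + java.util.Arrays.toString(list));
        }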

We started by exploring calls on findMax1. We ran the program for 500 elements, 1000, 2000 and 3000 and set up an Excel spreadsheet with the times for each execution and the line counts for each execution. Then we computed some ratios. For example, we explored what happens when you double the input by looking at the values for 1000 divided by the values for 500 and by looking at the values for 2000 divided by the values for 1000. We also looked at what happens when you triple the size of the input by computing the values for 3000 divided by the values for 1000.

We found several things. First of all, we found that the growth of line count was a pretty good predictor of the growth of the time (in other words, the ratios were similar). That's good because our theory was that the line count growth rate predicts the time growth rate. We also noticed that these values got closer for larger values of n. This is a common occurrence. With small values of n, other factors can interfere with the timing. But for large values of n, we'll find that the line count becomes a great predictor of the time. Finally, we noticed that the growth rate seems to be n^3. When we doubled the input, we got an increase of about 8 in time and line count (2^3). When we tripled, we got an increase of about 27 in time and line count (3^3).
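
The reasoning behind those ratios isn't spelled out above, but it is easy to see: if the line count is approximately c*n^3 for some constant c, then

        count(2n) / count(n)  =  c*(2n)^3 / (c*n^3)  =  2^3  =  8
        count(3n) / count(n)  =  c*(3n)^3 / (c*n^3)  =  3^3  =  27
The same calculation with n^2 in place of n^3 predicts the factors of 4 and 9 that we see for the second algorithm below.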

Next we commented out the call on the first method. That's because it's so slow that we can't explore the other ones if we include the call on the first one. For the second algorithm we looked at similar values increased by a factor of 10. In other words, we looked at times and line counts for n of 5000, 10000, 20000 and 30000. We again found that the ratios between the times and the line counts were similar, with the ratios getting closer for larger values of n. In contrast to the first algorithm, this one had a growth rate of n^2. As we doubled the input size, it took about 4 times longer (2^2) and the line count increased by 4. When we tripled the input size, it took about 9 times longer (3^2) and the line count increased by 9.

I didn't have time to discuss the third algorithm in detail or to test its running time, but we did manage to run one test for 100,000 values that took less than 0.1 seconds to run. The final algorithm is linear, meaning that the growth function is n.

I had written the code to report the time the algorithm took as well as a count of how many times the most frequently executed statement was executed.

The moral of the story is that choosing the right algorithm can make a huge difference, particularly if you can choose an algorithm from a better complexity class.

I pointed out that I see a lot of undergraduates who obsess about efficiency and I think that in general it's a waste of their time. Many computer scientists have commented that premature optimization is counterproductive. Don Knuth has said that "Premature optimization is the root of all evil." The key is to focus your attention where you can get real benefit. The analogy I'd make is that if you found that a jet airplane weighed too much, you might decide to put the pilot on a diet. While it's true that in some sense every little bit helps, you're not going to make much progress by trying to slim down the pilot when the plane and its passengers and cargo weigh so much more than the pilot. I see lots of undergraduates putting the pilot on a diet while ignoring much more important details.

In the case of our three algorithms from Monday, I think some people would choose the first or second approach and would torture themselves to get it to run as quickly as possible. But the real breakthrough comes from choosing the third approach that had a much better growth rate than the other two.

We'll pick up with this example in the next lecture when we discuss more about complexity.


Stuart Reges
Last modified: Tue Jan 31 17:40:08 PST 2006