CSE143 Notes for Monday, 4/7/08

I began by talking about the idea of an iterator. Iterators are appropriate when you want to examine every value in a structure from first to last. For example, suppose that you want to print each value in an ArrayIntList one per line. You could say:

        for (int i = 0; i < list.size(); i++) {
            System.out.println(list.get(i));
        }

This code works, but it relies on a "get" method that can quickly access any element of the array. This is known as random access. If you knew that for the rest of your life, you'd always be working with arrays, then you'd have little use for iterators. You'd just call the get method because with arrays you get fast random access.

But many of the other data structures we will be looking at don't have this kind of quick access. Think of how a DVD works, quickly jumping to any scene, versus how a VHS tape works, requiring you to fast forward through every scene until you get to the one you want. For those other structures we will study, iterators will make a lot more sense. So iterators will seem a bit silly when tied to an array-based structure, but we'll eventually see much more interesting examples of iterators.

In general, we think of an iterator as having three basic operations:

a "has next" operation that tells you whether or not there are any values left
a "get next" operation that lets you see what the next value is
a "move to next" operation that moves the iterator to the next value

Sun adopted the convention early on that the second and third steps would be combined into one operation known as "next" that does two different things: it returns the next value and it advances the iterator to the next value. So in Java there are two fundamental operations:

a "hasNext" method that tells you whether or not there are any values left
a "next" method that returns the next value and advances to the one beyond

The loop we wrote for printing the values in the list can be rewritten with an iterator as follows:

        ArrayIntListIterator i = list.iterator();
        while (i.hasNext()) {
            System.out.println(i.next());
        }

This involves a new kind of object of type ArrayIntListIterator. We get one by calling a special method in the list class. Once we have our iterator, we can use it to go through the values in the list one at a time.
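
These notes don't show how ArrayIntListIterator is written. As a rough sketch only, an iterator over an array-based list could keep track of a position in the array. The class name IntArrayIterator and its fields are made up for this illustration and are not the course's actual code:

```java
import java.util.NoSuchElementException;

// Hypothetical sketch of an iterator over an array-based list of ints.
class IntArrayIterator {
    private final int[] data;   // the list's underlying array (assumed)
    private final int size;     // number of array elements actually in use (assumed)
    private int position;       // index of the next value to return

    public IntArrayIterator(int[] data, int size) {
        this.data = data;
        this.size = size;
        this.position = 0;
    }

    // "has next": are there any values left?
    public boolean hasNext() {
        return position < size;
    }

    // "next": returns the next value and advances past it
    public int next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        int result = data[position];
        position++;
        return result;
    }

    public static void main(String[] args) {
        int[] values = {3, 8, 5, 0, 0};   // capacity 5, only the first 3 in use
        IntArrayIterator i = new IntArrayIterator(values, 3);
        while (i.hasNext()) {
            System.out.println(i.next());
        }
    }
}
```

Notice that next does both jobs described above: it hands back the current value and moves the position forward by one.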

I also briefly mentioned that iterators often support a method called remove that allows you to remove the value that you most recently got from a call on next. For example, this variation of the code prints each value and removes any occurrences of the value 3:

        ArrayIntListIterator i = list.iterator();
        while (i.hasNext()) {
            int n = i.next();
            System.out.println(n);
            if (n == 3) {
                i.remove();
            }
        }

This code examines each value in the list and removes all the occurrences of 3. We also looked at "tricky" cases for remove. What could cause it to fail? Someone mentioned that removing something twice might be a problem, as in:

        while (i.hasNext()) {
            int n = i.next();
            if (n == 3) {
                i.remove();
                i.remove();
            }
        }

It would also be a problem to try removing before next has been called.
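
These notes don't say how ArrayIntListIterator reports these errors. The iterators for the built-in java.util collections throw an IllegalStateException in both cases, and the following sketch checks that using the built-in ArrayList class (discussed next). The class and method names here are made up for the illustration:

```java
import java.util.ArrayList;
import java.util.Iterator;

class RemoveFailureDemo {
    // Returns true if calling remove before any call on next throws IllegalStateException.
    public static boolean removeBeforeNextFails() {
        ArrayList<Integer> list = new ArrayList<Integer>();
        list.add(3);
        Iterator<Integer> i = list.iterator();
        try {
            i.remove();               // no call on next yet
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // Returns true if a second remove with no call on next in between throws IllegalStateException.
    public static boolean removeTwiceFails() {
        ArrayList<Integer> list = new ArrayList<Integer>();
        list.add(3);
        list.add(7);
        Iterator<Integer> i = list.iterator();
        i.next();
        i.remove();                   // fine: removes the 3
        try {
            i.remove();               // nothing new has been returned by next
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(removeBeforeNextFails());
        System.out.println(removeTwiceFails());
    }
}
```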

I then spent some time talking about the built-in ArrayList class. Remember that we're studying the ArrayIntList class as a way to understand the built-in ArrayList class. I first had to discuss a relatively new feature of Java known as "generics." We know that for arrays, it is possible to construct arrays that store different types of data:

an int[] to store an array of int values
a double[] to store an array of double values
a String[] to store an array of references to Strings

But arrays are a special case in Java. If they didn't already exist, we couldn't easily add them to Java. Instead, Java now allows you to declare generic classes and generic interfaces. For example, the ArrayList class is similar to an array. Instead of declaring ordinary ArrayList objects, we declare ArrayList<E> where E is some type (think of E as being short for "Element type"). The "E" is a type parameter that can be filled in with the name of any class.

For example, suppose we want an ArrayList of Strings. We describe the type as:

        ArrayList<String>

When we construct an ArrayIntList, we say:

        ArrayIntList lst = new ArrayIntList();

Imagine replacing both occurrences of "ArrayIntList" with "ArrayList<String>" and you'll see how to construct an ArrayList<String>:

        ArrayList<String> lst = new ArrayList<String>();

And in the same way that you would declare a method header for manipulating an ArrayIntList object:

        public void doSomethingCool(ArrayIntList lst) {
            ...
        }

You can use ArrayList<String> in place of ArrayIntList to declare a method that takes an ArrayList<String> as a parameter:

        public void doSomethingCool(ArrayList<String> list) {
            ...
        }

It can even be used as a return type if you want to have the method return an ArrayList:

        public ArrayList<String> doSomethingCool(ArrayList<String> list) {
            ...
        }

Once you have declared an ArrayList<String>, you can manipulate it with the kinds of calls we have made on our ArrayIntList but using Strings instead of ints:

        ArrayList<String> list = new ArrayList<String>();
        list.add("hello");
        list.add("there");
        list.add(0, "fun");
        System.out.println(list);

which produces this output:

        [fun, hello, there]

All of the methods we have seen with ArrayIntList are defined for ArrayList: the appending add, add at an index, remove, size, get, etc. So we could write the following loop to print each String from an ArrayList<String>:

        for (int i = 0; i < lst.size(); i++) {
            System.out.println(lst.get(i));
        }
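
To tie these pieces together, here is a small sketch of a method that takes an ArrayList<String> as a parameter and returns a new ArrayList<String>. The method toUpperCase and the class name are invented for this example:

```java
import java.util.ArrayList;

class UpperCaseDemo {
    // Hypothetical helper: returns a new list holding the upper-case version of each string.
    // The list passed in is not changed.
    public static ArrayList<String> toUpperCase(ArrayList<String> list) {
        ArrayList<String> result = new ArrayList<String>();
        for (int i = 0; i < list.size(); i++) {
            result.add(list.get(i).toUpperCase());
        }
        return result;
    }

    public static void main(String[] args) {
        ArrayList<String> list = new ArrayList<String>();
        list.add("hello");
        list.add("there");
        list.add(0, "fun");
        System.out.println(toUpperCase(list));   // [FUN, HELLO, THERE]
        System.out.println(list);                // [fun, hello, there]
    }
}
```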

I then spent a little time discussing the issue of primitive data versus objects. Even though we can construct an ArrayList<E> for any class E, we can't construct an ArrayList<int> because int is a primitive type, not a class. To get around this problem, Java has a set of classes known as "wrapper" classes that "wrap up" primitive values like ints to turn them into objects. It's very much like taking a candy and putting a wrapper around it. For the case of ints, there is a class known as Integer that can be used to store an individual int. Each Integer object has a single data field: the int that is wrapped up inside.

Java 5 also has quite a bit of support that makes a lot of this invisible to programmers. If you want to put int values into an ArrayList, you have to remember to use the type ArrayList<Integer> rather than ArrayList<int>, but otherwise Java does a lot of things for you. For example, you can construct such a list and add simple int values to it:

        ArrayList<Integer> list = new ArrayList<Integer>();
        list.add(18);
        list.add(34);

In the two calls on add, we are passing simple ints as arguments to something that really requires an Integer. This is okay because Java will automatically "box" the ints for us (i.e., wrap them up in Integer objects). We can also refer to elements of this list and treat them as simple ints, as in:

        int product = list.get(0) * list.get(1);

The calls on list.get return references to Integer objects and normally you wouldn't be allowed to multiply two objects together. In this case Java automatically "unboxes" the values for you, unwrapping the Integer objects and giving you the ints that are contained inside.

Every primitive type has a corresponding wrapper class: Integer for int, Double for double, Character for char, Boolean for boolean, and so on.
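
To make the boxing and unboxing visible, the following sketch writes the conversions out by hand. It is roughly what Java arranges for us automatically, shown with the Integer methods valueOf and intValue. The class and method names are invented for the example:

```java
import java.util.ArrayList;

class BoxingDemo {
    // Multiplies the first two values, spelling out the unboxing that
    // list.get(0) * list.get(1) would do for us automatically.
    public static int productOfFirstTwo(ArrayList<Integer> list) {
        Integer first = list.get(0);     // get returns a reference to an Integer
        Integer second = list.get(1);
        return first.intValue() * second.intValue();   // explicit unboxing
    }

    public static void main(String[] args) {
        ArrayList<Integer> list = new ArrayList<Integer>();
        list.add(Integer.valueOf(18));   // explicit boxing
        list.add(34);                    // automatic boxing of the int 34
        System.out.println(productOfFirstTwo(list));   // 612
    }
}
```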

Then I mentioned that I hoped people are aware of the array initializer syntax where you can use curly braces to specify a set of values to use for initializing an array:

        int[] data = {8, 27, 93, 4, 5, 15, 206};

This is a great way to define data to use for a testing program. I asked people how we'd find the product of the values in this array and people suggested the standard approach that uses an int to index the array:

        int product = 1;
        for (int i = 0; i < data.length; i++) {
            product *= data[i];
        }

This approach works, but there is a simpler way to do this. If all you want to do is to iterate over the values of an array one at a time, you can use what is called a for-each loop:

        int product = 1;
        for (int n : data) {
            product *= n;
        }

We generally read the for loop header as, "For each int n in data...". The choice of "n" is arbitrary. It defines a local variable for the loop. I could just as easily have called it "x" or "foo" or "value". Notice that in the for-each loop, I don't have to use any bracket notation. Instead, each time through the loop Java sets the variable n to the next value from the array. I also don't need to test for the length of the array. Java does that for you when you use a for-each loop.

There are some limitations of for-each loops. You can't use them to change the contents of the array. If you assign a value to the variable n, you are just changing a local variable inside the loop. It has no effect on the array itself.
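
A short sketch makes this limitation concrete. The two method names are made up for the example. The first tries to double every element with a for-each loop and leaves the array untouched. The second uses an index and really does change the array:

```java
import java.util.Arrays;

class ForEachLimitDemo {
    // Tries to double each element with a for-each loop; the array is left unchanged.
    public static void doubleWithForEach(int[] data) {
        for (int n : data) {
            n = n * 2;          // changes only the local variable n
        }
    }

    // Doubles each element using an index, which does change the array.
    public static void doubleWithIndex(int[] data) {
        for (int i = 0; i < data.length; i++) {
            data[i] = data[i] * 2;
        }
    }

    public static void main(String[] args) {
        int[] data = {8, 27, 93};
        doubleWithForEach(data);
        System.out.println(Arrays.toString(data));   // [8, 27, 93]
        doubleWithIndex(data);
        System.out.println(Arrays.toString(data));   // [16, 54, 186]
    }
}
```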

As with arrays, we can use a for-each loop for ArrayLists, so we could say:

        String[] data2 = {"four", "score", "and", "seven", "years", "ago"};
        ArrayList<String> lst = new ArrayList<String>();
        for (String s : data2) {
            lst.add(s);
        }
        System.out.println(lst);

which produces this output:

        [four, score, and, seven, years, ago]
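
The loop above iterates over an array while filling an ArrayList. As a sketch of the other direction, a for-each loop can iterate directly over the ArrayList itself. The method totalLength is invented for this example:

```java
import java.util.ArrayList;

class ForEachListDemo {
    // Adds up the lengths of all the strings in the list using a for-each loop.
    public static int totalLength(ArrayList<String> list) {
        int total = 0;
        for (String s : list) {     // "for each String s in list..."
            total += s.length();
        }
        return total;
    }

    public static void main(String[] args) {
        String[] data2 = {"four", "score", "and", "seven", "years", "ago"};
        ArrayList<String> lst = new ArrayList<String>();
        for (String s : data2) {
            lst.add(s);
        }
        System.out.println(totalLength(lst));   // 25
    }
}
```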

I also mentioned that with the next programming assignment, we are asking you to start using more features from the Java class libraries. In particular, for this next programming assignment we are going to use a collection known as a SortedMap.

As an example, I asked people how we could write a program that would count all of the occurrences of various words in an input file. I had a copy of the text of Moby Dick that we looked at to think about this. I showed some starter code that constructs a Scanner object tied to a file:

        import java.util.*;
        import java.io.*;
        
        public class WordCount {
            public static void main(String[] args) throws FileNotFoundException {
                Scanner console = new Scanner(System.in);
                System.out.print("What is the name of the text file? ");
                String fileName = console.nextLine();
                Scanner input = new Scanner(new File(fileName));

                while (input.hasNext()) {
                    String next = input.next();
                    // process next
                }
            }
        }

Notice that in the loop we use input.next() to read individual words and we have this in a while loop testing against input.hasNext(). I pointed out that we'll have trouble with things like capitalization and punctuation. I said that we should at least turn the string to all lowercase letters so that we don't count Strings like "The" and "the" as different words:

        while (input.hasNext()) {
            String next = input.next().toLowerCase();
            // process next
        }

But I said that dealing with punctuation was more than I wanted to attempt in this program, so I decided that we'd live with the fact that Strings like "the" and "the," and "the." would be considered different words. We're looking for a fairly simple example here, so I didn't want to worry too much about punctuation.

To flesh out this code, we had to think about what kind of data structure to use to keep track of words and their frequencies. One person suggested that we use a hashtable. I said that this is related to the data abstraction known as a map.

The idea behind a map is that it keeps track of key/value pairs. In our case, we want to keep track of word/count pairs (what is the count for each different word). We often store data this way. For example, in the US we often use a person's social security number as a key to get information about them. I would expect that if I talked to the university registrar, they probably have the ability to look up students based on social security number to find their transcript.

In a map, there is only one value for any given key. If you look up a social security number and get three different student transcripts, that would be a problem. With the Java map objects, if you already have an entry in your map for a particular key, then any attempt to put a new key/value pair with that same key into the map will overwrite the old mapping.

We looked at an interface in the Java class libraries called Map that is a generic interface. That means that we have to supply type information. Its formal description is Map<K, V>. This is different from the Queue interface in that it has two different types. That's because the map has to know what type of keys you have and what type of values you have. In our case, we have some words (Strings) that we want to associate with some counters (ints). We can't actually use type int because it is a primitive type, but we can use type Integer.

We are going to use a slight variation of Map known as SortedMap. A SortedMap is one that keeps its keys in sorted order. For us, that would mean that the words from the file will be kept in sorted order, which is a nice feature to have. More importantly, you'll need to use a SortedMap for your homework assignment, so we want to practice using that one.

So our map would be of type SortedMap<String, Integer>. In other words, it's a map that keeps track of String/Integer pairs (this String goes to this Integer). SortedMap is the name of the interface, but it's not an actual implementation. The implementation we will use is TreeMap. So we can construct a map called "count" to keep track of our counts by saying:

        SortedMap<String, Integer> count = new TreeMap<String, Integer>();

There are only a few methods that we'll be using from the SortedMap interface. The most basic allow you to put something into the map (an operation called put) and to ask the map for the current value of something (an operation called get).
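
Here is a small sketch of put and get in action, using made-up keys and counts. It also shows two behaviors worth knowing: putting a key that is already present replaces its old value, and asking for a key that isn't in the map gives back null:

```java
import java.util.SortedMap;
import java.util.TreeMap;

class PutGetDemo {
    // Builds a tiny example map; the words and counts are made up.
    public static SortedMap<String, Integer> buildExample() {
        SortedMap<String, Integer> count = new TreeMap<String, Integer>();
        count.put("whale", 1);
        count.put("ahab", 5);
        count.put("whale", 2);    // same key again: replaces the old value 1
        return count;
    }

    public static void main(String[] args) {
        SortedMap<String, Integer> count = buildExample();
        System.out.println(count.get("whale"));   // 2
        System.out.println(count.get("ship"));    // null, because "ship" is not a key
        System.out.println(count);                // {ahab=5, whale=2}, keys in sorted order
    }
}
```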

I asked what code we need to record the word in our map. Someone suggested using the put method to assign it to a count of 1. So our loop becomes:

        SortedMap<String, Integer> wordCounts = new TreeMap<String, Integer>();
        while (input.hasNext()) {
            String next = input.next().toLowerCase();
            wordCounts.put(next, 1);
        }

This doesn't quite work, but it's getting closer. Each time we encounter a word, it adds it to our map, associating it with a count of 1. This will figure out what the unique words are, but it won't have the right counts for them.

I asked people to think about what to do if a word has been seen before. In that case, we want to increase its count by 1. That means we have to get the old value of the count and add 1 to it:

        wordCounts.get(next) + 1

and make this the new value of the counter:

        wordCounts.put(next, wordCounts.get(next) + 1);

So we have two different calls on put. We want to call the first one when the word is first seen and call the second one if it's already been seen. Someone suggested using an if/else for this. The only question is what test to use. The SortedMap includes a method called containsKey that tests whether or not a certain value is a key stored in the map. Using this method, we modified our code to be:

        SortedMap<String, Integer> wordCounts = new TreeMap<String, Integer>();
        while (input.hasNext()) {
            String next = input.next().toLowerCase();
            if (!wordCounts.containsKey(next)) {
                wordCounts.put(next, 1);
            } else {
                wordCounts.put(next, wordCounts.get(next) + 1);
            }
        }

The first time we see a word, we call the put method and say that the map should associate the word with a count of 1. Later we call put again with a higher count. And we keep calling put every time the count goes up. What happens to the old values that we had put in the map previously? The way the map works, each key is associated with only one value. So when you call put a second or third time, you are wiping out the old association. The new key/value pair replaces the old key/value pair in the map.

Then we talked about how to print the results. Clearly we need to iterate over the entries in the map. One way to do this is to request what is known as the "key set". The key set is the set of all keys contained in the map. The Java documentation says that it will be of type Set. We don't have to really worry about this if we use a for-each loop. Remember that a for-each loop iterates over all of the values in a given collection. So we can say:

        for (String word : wordCounts.keySet()) {
            // process word
        }

We would read this as, "for each String word that is in wordCounts.keySet()..." To process the word, we simply print it out along with its count. How do we get its count? By calling the get method of the map:

        for (String word : wordCounts.keySet()) {
            System.out.println(wordCounts.get(word) + "\t" + word);
        }

I didn't try to print all of the words in Moby Dick because it would have produced too much output. Instead, I had it show me the counts of words in the program itself. Obviously for large files we want some mechanism to limit the output. At that point I passed out the handout with my commented solution. In that version, I include some extra code that asks for a minimum frequency to use. We ran that on Moby Dick and saw this list of words that occur at least 500 times:

        What is the name of the text file? moby.txt
        Minimum number of occurrences for printing? 500
        4571    a
        1354    all
        587     an
        6182    and
        563     are
        1701    as
        1289    at
        973     be
        1691    but
        1133    by
        1522    for
        1067    from
        754     had
        741     have
        1686    he
        552     him
        2459    his
        1746    i
        3992    in
        512     into
        1555    is
        1754    it
        562     like
        578     my
        1073    not
        506     now
        6408    of
        933     on
        775     one
        675     or
        882     so
        599     some
        2729    that
        14092   the
        602     their
        506     there
        627     they
        1239    this
        4448    to
        551     upon
        1567    was
        644     were
        500     whale
        552     when
        547     which
        1672    with
        774     you

Although I show the output here as being lined up, it didn't look that way in jGRASP. For some reason jGRASP is handling tab characters badly in output.
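
The handout's solution is not reproduced in these notes. As a rough sketch of the minimum-frequency idea, and not the handout's actual code, the printing loop can simply skip any word whose count is below the minimum. The method names countWords and report are made up, and the Scanner here reads from a short String instead of a file so that the example is self-contained:

```java
import java.util.Scanner;
import java.util.SortedMap;
import java.util.TreeMap;

class MinFrequencyDemo {
    // Counts the lower-cased, whitespace-separated tokens that the Scanner produces.
    public static SortedMap<String, Integer> countWords(Scanner input) {
        SortedMap<String, Integer> wordCounts = new TreeMap<String, Integer>();
        while (input.hasNext()) {
            String next = input.next().toLowerCase();
            if (!wordCounts.containsKey(next)) {
                wordCounts.put(next, 1);
            } else {
                wordCounts.put(next, wordCounts.get(next) + 1);
            }
        }
        return wordCounts;
    }

    // Hypothetical helper: builds one "count<TAB>word" line for each word
    // that occurs at least min times, in sorted order by word.
    public static String report(SortedMap<String, Integer> wordCounts, int min) {
        StringBuilder result = new StringBuilder();
        for (String word : wordCounts.keySet()) {
            int count = wordCounts.get(word);
            if (count >= min) {
                result.append(count).append("\t").append(word).append("\n");
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        Scanner input = new Scanner("the whale and the ship and the sea");
        System.out.print(report(countWords(input), 2));
    }
}
```

With a minimum of 2, only "and" (2 occurrences) and "the" (3 occurrences) are reported.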

One final point I made about the SortedMap interface is that you can associate just about anything with just about anything. In the word counting program, we associated strings with integers. You could also associate strings with strings. One thing you can't do is to have multiple associations in a single map. For example, if you decide to associate strings with strings, then any given string can be associated with just a single string. But there's no reason that you can't have the second value be structured in some way. You can associate strings with arrays or strings with ArrayLists.
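
As a sketch of associating strings with ArrayLists, the following made-up example groups words by their first letter. Each key still has just one value, but that value is a whole list of words. The method name is invented, and the code assumes that none of the words is empty:

```java
import java.util.ArrayList;
import java.util.SortedMap;
import java.util.TreeMap;

class GroupByLetterDemo {
    // Hypothetical example: maps each first letter to the list of words that start with it.
    // Assumes every word has at least one character.
    public static SortedMap<String, ArrayList<String>> groupByFirstLetter(String[] words) {
        SortedMap<String, ArrayList<String>> groups =
            new TreeMap<String, ArrayList<String>>();
        for (String word : words) {
            String letter = word.substring(0, 1);
            if (!groups.containsKey(letter)) {
                groups.put(letter, new ArrayList<String>());   // first word for this letter
            }
            groups.get(letter).add(word);   // add to the list already stored in the map
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] words = {"four", "score", "and", "seven", "years", "ago"};
        System.out.println(groupByFirstLetter(words));
        // {a=[and, ago], f=[four], s=[score, seven], y=[years]}
    }
}
```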

Stuart Reges

Last modified: Mon Apr 7 16:02:47 PDT 2008