CSE143 Notes for Wednesday, 4/6/11

With the next programming assignment, we are asking you to start using more features from the Java class libraries. In particular, for this next programming assignment we are going to use a kind of collection known as a Map.

As an example, I asked people how we could write a program that would count all of the occurrences of various words in an input file. I had a copy of the text of Moby Dick that we looked at to think about this. I showed some starter code that constructs a Scanner object tied to a file:

        import java.util.*;
        import java.io.*;
        
        public class WordCount {
            public static void main(String[] args) throws FileNotFoundException {
                Scanner console = new Scanner(System.in);
                System.out.print("What is the name of the text file? ");
                String fileName = console.nextLine();
                Scanner input = new Scanner(new File(fileName));

                while (input.hasNext()) {
                    String next = input.next();
                    // process next
                }
            }
        }

Notice that in the loop we use input.next() to read individual words and we have this in a while loop testing against input.hasNext(). I pointed out that we'll have trouble with things like capitalization and punctuation. I said that we should at least turn the string to all lowercase letters so that we don't count Strings like "The" and "the" as different words:

        while (input.hasNext()) {
            String next = input.next().toLowerCase();
            // process next
        }

But I said that dealing with punctuation was more than I wanted to attempt in this program, so I decided that we'd live with the fact that Strings like "the" and "the," and "the." would be considered different words. We're looking for a fairly simple example here.

To flesh out this code, we had to think about what kind of data structure to use to keep track of words and their frequencies. One person suggested that we might use arrays or ArrayLists. For example, we could have an ArrayList of words and an ArrayList of counts where element "i" in one corresponds to element "i" in the other. This approach is often described as "parallel arrays." It's not a very object-oriented approach because we really want to associate each word with its count rather than have one structure that puts all the words together and another that puts all the counts together. Someone suggested that we could make a class for a word/count combination and then have an ArrayList of that. That's true, but Java gives us a better alternative. The collections framework provides a data abstraction known as a map.
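
To make the parallel-lists idea concrete, here is a minimal sketch of what it might look like. This is not code from lecture; the class name ParallelListsDemo and the method name record are made up for illustration. Notice that every lookup has to search the list of words and then use the matching index in the list of counts.

```java
import java.util.*;

public class ParallelListsDemo {
    // words.get(i) and counts.get(i) describe the same word.
    public static void record(List<String> words, List<Integer> counts, String word) {
        int i = words.indexOf(word);   // linear search through the words seen so far
        if (i < 0) {
            words.add(word);           // first occurrence: add to both lists
            counts.add(1);
        } else {
            counts.set(i, counts.get(i) + 1);
        }
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<String>();
        List<Integer> counts = new ArrayList<Integer>();
        for (String w : new String[] {"the", "whale", "the"}) {
            record(words, counts, w);
        }
        System.out.println(words + " " + counts);   // [the, whale] [2, 1]
    }
}
```

The two lists stay in step only because record always updates both of them, which is exactly the bookkeeping a map takes off our hands.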

The idea behind a map is that it keeps track of key/value pairs. In our case, we want to keep track of word/count pairs (what is the count for each different word). We often store data this way. For example, in the US we often use a person's social security number as a key to get information about them. I would expect that if I talked to the university registrar, they probably have the ability to look up students based on social security number to find their transcript.

In a map, there is only one value for any given key. If you look up a social security number and get three different student transcripts, that would be a problem. With the Java map objects, if you already have an entry in your map for a particular key, then any attempt to put a new key/value pair with that same key into the map will overwrite the old mapping.
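
A small sketch of this overwrite behavior, using the TreeMap implementation from java.util. The class name PutOverwriteDemo, the ID "000-00-0000", and the transcript strings are all made up for illustration.

```java
import java.util.*;

public class PutOverwriteDemo {
    // Builds a map by calling put twice with the same key.
    public static Map<String, String> build() {
        Map<String, String> transcripts = new TreeMap<String, String>();
        transcripts.put("000-00-0000", "first transcript");
        // Same key again: the old value is replaced, not kept alongside the new one.
        transcripts.put("000-00-0000", "second transcript");
        return transcripts;
    }

    public static void main(String[] args) {
        Map<String, String> transcripts = build();
        System.out.println(transcripts.size());               // 1
        System.out.println(transcripts.get("000-00-0000"));   // second transcript
    }
}
```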

We looked at an interface in the Java class libraries called Map that is a generic interface. That means that we have to supply type information. Its formal description is Map<K, V>. This is different from the List and Set interfaces in that it has two different types. That's because the map has to know what type of keys you have and what type of values you have. In our case, we have some words (Strings) that we want to associate with some counters (ints). We can't actually use type int because it is a primitive type, but we can use type Integer.
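
In practice Java converts between int and Integer automatically (boxing and unboxing), so the wrapper type rarely gets in the way. The sketch below illustrates this; the class name BoxingDemo and the method name incremented are made up for illustration. One caveat worth knowing: get returns null for a key that is not in the map, and unboxing null into an int throws a NullPointerException.

```java
import java.util.*;

public class BoxingDemo {
    // Returns the stored count plus one. Assumes word is already a key in the map;
    // otherwise get returns null and the unboxing throws NullPointerException.
    public static int incremented(Map<String, Integer> count, String word) {
        int old = count.get(word);   // Integer is unboxed to int here
        return old + 1;
    }

    public static void main(String[] args) {
        Map<String, Integer> count = new TreeMap<String, Integer>();
        count.put("whale", 1);                            // the int 1 is boxed to an Integer
        System.out.println(incremented(count, "whale"));  // 2
        System.out.println(count.get("ahab"));            // null (no such key)
    }
}
```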

So our map would be of type Map<String, Integer>. In other words, it's a map that keeps track of String/Integer pairs (this String goes to this Integer). Map is the name of the interface, but it's not an actual implementation. The implementation we will use is TreeMap. So we can construct a map called "count" to keep track of our counts by saying:

        Map<String, Integer> count = new TreeMap<String, Integer>();

There are only a few methods that we'll be using from the Map interface. The most basic allow you to put something into the map (an operation called put) and to ask the map for the current value of something (an operation called get).

I asked what code we need to record the word in our map. Someone suggested using the put method to assign it to a count of 1. So our loop becomes:

        Map<String, Integer> count = new TreeMap<String, Integer>();
        while (input.hasNext()) {
            String next = input.next().toLowerCase();
            count.put(next, 1);
        }

This doesn't quite work, but it's getting closer. Each time we encounter a word, it adds it to our map, associating it with a count of 1. This will figure out what the unique words are, but it won't have the right counts for them.
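
To see the problem without needing a file, here is a sketch that runs the same loop over a short String (Scanner can read from a String as well as from a File). The class name AlwaysOneDemo and the sample text are made up for illustration.

```java
import java.util.*;

public class AlwaysOneDemo {
    // Same loop as above, but reading words from a String instead of a file.
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> count = new TreeMap<String, Integer>();
        Scanner input = new Scanner(text);
        while (input.hasNext()) {
            String next = input.next().toLowerCase();
            count.put(next, 1);   // stores 1 every time, even for a repeated word
        }
        return count;
    }

    public static void main(String[] args) {
        // "the" appears twice but is still reported with a count of 1
        System.out.println(count("the whale the sea"));   // {sea=1, the=1, whale=1}
    }
}
```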

I asked people to think about what to do if a word has been seen before. In that case, we want to increase its count by 1. That means we have to get the old value of the count and add 1 to it:

        count.get(next) + 1

and make this the new value of the counter:

        count.put(next, count.get(next) + 1);

So we have two different calls on put. We want to call the first one when the word is first seen and call the second one if it's already been seen. Someone suggested using an if/else for this. The only question is what test to use. The Map includes a method called containsKey that tests whether or not a certain value is a key stored in the map. Using this method, we modified our code to be:

        Map<String, Integer> count = new TreeMap<String, Integer>();
        while (input.hasNext()) {
            String next = input.next().toLowerCase();
            if (!count.containsKey(next)) {
                count.put(next, 1);
            } else {
                count.put(next, count.get(next) + 1);
            }
        }

The first time we see a word, we call the put method and say that the map should associate the word with a count of 1. Later we call put again with a higher count. And we keep calling put every time the count goes up. What happens to the old values that we had put in the map previously? The way the map works, each key is associated with only one value. So when you call put a second or third time, you are wiping out the old association. The new key/value pair replaces the old key/value pair in the map.
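
Here is the same containsKey logic as a small self-contained sketch that reads from a String, so it can be run without a text file. The class name CountWordsDemo, the method name countWords, and the sample sentence are made up for illustration.

```java
import java.util.*;

public class CountWordsDemo {
    // Counts occurrences of each lowercased word in the given text.
    public static Map<String, Integer> countWords(String text) {
        Map<String, Integer> count = new TreeMap<String, Integer>();
        Scanner input = new Scanner(text);
        while (input.hasNext()) {
            String next = input.next().toLowerCase();
            if (!count.containsKey(next)) {
                count.put(next, 1);                     // first time we see the word
            } else {
                count.put(next, count.get(next) + 1);   // replaces the old count
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("The whale and the sea and the sky"));
        // {and=2, sea=1, sky=1, the=3, whale=1}
    }
}
```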

Then we talked about how to print the results. Clearly we need to iterate over the entries in the map. One way to do this is to request what is known as the "key set". The key set is the set of all keys contained in the map. The Java documentation says that it will be of type Set. We don't have to really worry about this if we use a for-each loop. Remember that a for-each loop iterates over all of the values in a given collection. So we can say:

        for (String word : count.keySet()) {
            // process word
        }

We would read this as, "for each String word that is in count.keySet()..." To process the word, we simply print it out along with its count. How do we get its count? By calling the get method of the map:

        for (String word : count.keySet()) {
            System.out.println(count.get(word) + "\t" + word);
        }
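
One detail worth noting is that a TreeMap keeps its keys in sorted order, so the key set of a TreeMap<String, Integer> is visited alphabetically no matter what order the words were added in. The sketch below illustrates this; the class name KeySetOrderDemo, the method name lines, and the counts are made up for illustration.

```java
import java.util.*;

public class KeySetOrderDemo {
    // Builds one "count<TAB>word" line per key, in the key set's iteration order.
    public static List<String> lines(Map<String, Integer> count) {
        List<String> result = new ArrayList<String>();
        for (String word : count.keySet()) {
            result.add(count.get(word) + "\t" + word);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> count = new TreeMap<String, Integer>();
        count.put("whale", 2);   // added out of alphabetical order on purpose
        count.put("a", 3);
        count.put("moby", 1);
        for (String line : lines(count)) {
            System.out.println(line);   // prints a, then moby, then whale
        }
    }
}
```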

I didn't try to print all of the words in Moby Dick because it would have produced too much output. Instead, I had it show me the counts of words in the program itself. Obviously for large files we want some mechanism to limit the output. At that point I passed out the handout with my commented solution. In that version, I include some extra code that asks for a minimum frequency to use. We ran that on Moby Dick and saw this list of words that occur at least 500 times:

        What is the name of the text file? moby.txt
        Minimum number of occurrences for printing? 500
        4571    a
        1354    all
        587     an
        6182    and
        563     are
        1701    as
        1289    at
        973     be
        1691    but
        1133    by
        1522    for
        1067    from
        754     had
        741     have
        1686    he
        552     him
        2459    his
        1746    i
        3992    in
        512     into
        1555    is
        1754    it
        562     like
        578     my
        1073    not
        506     now
        6408    of
        933     on
        775     one
        675     or
        882     so
        599     some
        2729    that
        14092   the
        602     their
        506     there
        627     they
        1239    this
        4448    to
        551     upon
        1567    was
        644     were
        500     whale
        552     when
        547     which
        1672    with
        774     you
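
The handout's code is not reproduced in these notes. As a rough sketch of how a minimum-frequency cutoff might work, the loop over the key set only needs one extra test. The class name MinFrequencyDemo and the method name atLeast are hypothetical, and the three counts are copied from the output above.

```java
import java.util.*;

public class MinFrequencyDemo {
    // Returns "count<TAB>word" lines for every word that occurs at least min times.
    public static List<String> atLeast(Map<String, Integer> count, int min) {
        List<String> result = new ArrayList<String>();
        for (String word : count.keySet()) {
            int n = count.get(word);
            if (n >= min) {
                result.add(n + "\t" + word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> count = new TreeMap<String, Integer>();
        count.put("the", 14092);
        count.put("upon", 551);
        count.put("whale", 500);
        // With a cutoff of 600, only "the" survives the filter.
        for (String line : atLeast(count, 600)) {
            System.out.println(line);
        }
    }
}
```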

One final point I made about the Map interface is that you can associate just about anything with just about anything. In the word counting program, we associated strings with integers. You could also associate strings with strings. One thing you can't do is to have multiple associations for a single key in a map. For example, if you decide to associate strings with strings, then any given string can be associated with just a single string. But there's no reason that you can't have the second value be structured in some way. You can associate strings with arrays or strings with ArrayLists.
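
As a sketch of the "structured value" idea, here is a map from a String to a list of Strings. Each key still has exactly one value, but that value is a list that can hold as many items as we like. The class name ListValuesDemo, the method name add, and the student/course data are made up for illustration.

```java
import java.util.*;

public class ListValuesDemo {
    // Adds value to the list stored under key, creating the list on first use.
    public static void add(Map<String, List<String>> map, String key, String value) {
        if (!map.containsKey(key)) {
            map.put(key, new ArrayList<String>());
        }
        map.get(key).add(value);
    }

    public static void main(String[] args) {
        Map<String, List<String>> courses = new TreeMap<String, List<String>>();
        add(courses, "Ashley", "CSE142");
        add(courses, "Jacob", "CSE143");
        add(courses, "Ashley", "CSE143");
        System.out.println(courses);   // {Ashley=[CSE142, CSE143], Jacob=[CSE143]}
    }
}
```

This follows the same pattern as the word counter: test containsKey, put an initial value the first time, and update the existing value after that.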

Then I mentioned that I wanted to explore a sample program that will constitute a medium hint for the programming assignment. We will begin looking at the program in this lecture and finish it up in the next lecture.

The sample program involves keeping track of friendships. You could think of it as keeping track of Facebook friends. One of the first questions that comes up is how do we represent friendships? For example, are friendships bidirectional? If person A is friends with person B, does that mean that person B is friends with person A? For our purposes, we will assume the answer is yes. If we were trying to represent something like "is attracted to", then we'd come to a different conclusion, but for friends, just like on Facebook and other social networking sites, friendship goes both ways.
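
One possible way to capture "friendship goes both ways" in code is to record every friendship twice, once under each person. This is only a sketch under that assumption and not necessarily the representation the sample program uses; the class name TwoWayFriendsDemo and the method name addFriendship are made up for illustration.

```java
import java.util.*;

public class TwoWayFriendsDemo {
    // Records the friendship in both directions: b under a, and a under b.
    public static void addFriendship(Map<String, Set<String>> friends, String a, String b) {
        if (!friends.containsKey(a)) {
            friends.put(a, new TreeSet<String>());
        }
        if (!friends.containsKey(b)) {
            friends.put(b, new TreeSet<String>());
        }
        friends.get(a).add(b);
        friends.get(b).add(a);
    }

    public static void main(String[] args) {
        Map<String, Set<String>> friends = new TreeMap<String, Set<String>>();
        addFriendship(friends, "Amanda", "Emily");
        addFriendship(friends, "Amanda", "Megan");
        System.out.println(friends);
        // {Amanda=[Emily, Megan], Emily=[Amanda], Megan=[Amanda]}
    }
}
```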

I said that a good way to visualize friendships is to draw a graph in which each person is represented with a node (an oval) and each friendship is represented by an edge connecting two nodes (a line drawn between two ovals). I am using a program called Graphviz, which is an open-source graph viewer. For example, here is a sample friendship graph:

This information is stored in a file in which each line lists one pair of friends, as in:

        graph {
            Amanda -- Emily
            Amanda -- Megan
            Amanda -- Rachel
            Ashley -- Christopher
            Ashley -- Matthew
            Ashley -- Michael
            Jacob -- Tyler
            Jessica -- Christopher
            Jessica -- Samantha
            Megan -- Christopher
            Michael -- Joshua
            Rachel -- Andrew
            Rachel -- Michael
            Samantha -- Christopher
            Samantha -- Matthew
            Sarah -- Andrew
        }

Generating a sample file like this to work with is itself an interesting task that I solved using the Java collections. I began with the following program:

        import java.io.*;
        import java.util.*;
        
        public class FriendsData {
            public static void main(String[] args) throws FileNotFoundException {
                String[] names = {"Jessica", "Ashley", "Sarah", "Amanda", "Samantha",
                                  "Emily", "Rachel", "Megan", "Michael", "Jacob",
                                  "Tyler", "Joshua", "Christopher", "Andrew",
                                  "Matthew", "Kyle"};
        
                PrintStream output = new PrintStream(new File("friends.dot"));
                output.println("graph {");
                Random r = new Random();
                for (int count = 0; count < names.length; count++) {
                    int i = r.nextInt(names.length);
                    int j = r.nextInt(names.length);
                    output.println("    " + names[i] + " -- " + names[j]);
                }
                output.println("}");
            }
        }

I am using the 8 most popular names given to girls and the 8 most popular names given to boys who were born in the state of Washington in 1992. The program creates a list of friendships that is just as long as the list of names. Since friendships go both directions, this will give each person an average of two friendships. The code uses a random number generator to randomly pick two people and list them as friends in the output. Unfortunately, this program has several flaws. It allows people to be friends with themselves and it allows the same friendship to be listed more than once. So when we run this program and open the file in Graphviz, we get graphs like this:

Joshua is friends with himself and friendships like Samantha/Tyler are listed more than once. So we have to figure out how to stop these errors from happening. Someone mentioned that we can fix the problem of people being listed as friends of themselves by making sure that in our loop, we reject any result where i and j are equal.
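
A minimal sketch of that suggestion: draw i once, then keep redrawing j until it differs from i. The class name NoSelfFriendDemo and the method name pickPair are made up for illustration, and the sketch assumes there are at least two names (with only one name the loop would never end).

```java
import java.util.*;

public class NoSelfFriendDemo {
    // Picks two different indexes in the range 0 to n-1. Assumes n >= 2.
    public static int[] pickPair(Random r, int n) {
        int i = r.nextInt(n);
        int j = r.nextInt(n);
        while (j == i) {
            j = r.nextInt(n);   // reject the self-friendship and draw again
        }
        return new int[] {i, j};
    }

    public static void main(String[] args) {
        String[] names = {"Jessica", "Ashley", "Sarah", "Amanda"};
        Random r = new Random();
        int[] pair = pickPair(r, names.length);
        // The two names printed are always different people.
        System.out.println("    " + names[pair[0]] + " -- " + names[pair[1]]);
    }
}
```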

This still leaves several issues to resolve. For example, how do we easily recognize friendship pairs that are reversed? These two lines of output look different, but in fact they represent a duplicate:

        Samantha -- Tyler
        Tyler -- Samantha
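
One possible way to recognize such duplicates, offered only as a sketch and not necessarily the approach taken in the next lecture, is to always write the two names in alphabetical order before recording the pair in a Set. Both orderings then produce the same String, and the Set refuses the second copy. The class name CanonicalPairDemo and the method names key and addFriendship are made up for illustration.

```java
import java.util.*;

public class CanonicalPairDemo {
    // Builds the same String for (a, b) and (b, a) by putting the smaller name first.
    public static String key(String a, String b) {
        if (a.compareTo(b) <= 0) {
            return a + " -- " + b;
        } else {
            return b + " -- " + a;
        }
    }

    // Returns true if the friendship is new, false if it was already recorded.
    public static boolean addFriendship(Set<String> seen, String a, String b) {
        return seen.add(key(a, b));   // Set.add returns false for a duplicate
    }

    public static void main(String[] args) {
        Set<String> seen = new TreeSet<String>();
        System.out.println(addFriendship(seen, "Samantha", "Tyler"));   // true
        System.out.println(addFriendship(seen, "Tyler", "Samantha"));   // false
        System.out.println(seen);   // [Samantha -- Tyler]
    }
}
```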

At that point we were out of time, so I said we would finish this in the next lecture.

Stuart Reges

Last modified: Fri Apr 8 08:25:56 PDT 2011