As an example, I asked people how we could write a program that would count all of the occurrences of various words in an input file. I had a copy of the text of Moby Dick that we looked at to think about this. I showed some starter code that constructs a Scanner object tied to a file:
import java.util.*;
import java.io.*;

public class WordCount {
    public static void main(String[] args) throws FileNotFoundException {
        Scanner console = new Scanner(System.in);
        System.out.print("What is the name of the text file? ");
        String fileName = console.nextLine();
        Scanner input = new Scanner(new File(fileName));
        while (input.hasNext()) {
            String next = input.next();
            // process next
        }
    }
}

Notice that in the loop we use input.next() to read individual words and we have this in a while loop testing against input.hasNext(). I pointed out that we'll have trouble with things like capitalization and punctuation. I said that we should at least turn the string to all lowercase letters so that we don't count Strings like "The" and "the" as different words:
while (input.hasNext()) {
    String next = input.next().toLowerCase();
    // process next
}

But I said that dealing with punctuation was more than I wanted to attempt in this program, so I decided that we'd live with the fact that Strings like "the" and "the," and "the." would be considered different words. We're looking for a fairly simple example here, so I didn't want to worry too much about punctuation.
To flesh out this code, we had to think about what kind of data structure to use to keep track of words and their frequencies. One person suggested that we might use arrays or ArrayLists. For example, we could have an ArrayList of words and an ArrayList of counts where element "i" in one corresponds to element "i" in the other. This approach is often described as "parallel arrays." It's not a very object-oriented approach because we really want to associate each word with its count rather than have a structure that puts all the words together and another that puts all the counts together. Someone suggested that we could make a class for a word/count combination and then have an ArrayList of that. That's true, but Java gives us a better alternative. The collections framework provides a data abstraction known as a map.
The idea behind a map is that it keeps track of key/value pairs. In our case, we want to keep track of word/count pairs (what is the count for each different word). We often store data this way. For example, in the US we often use a person's social security number as a key to get information about them. I would expect that if I talked to the university registrar, they probably have the ability to look up students based on social security number to find their transcript.
In a map, there is only one value for any given key. If you look up a social security number and get three different student transcripts, that would be a problem. With the Java map objects, if you already have an entry in your map for a particular key, then any attempt to put a new key/value pair into the map will overwrite the old mapping.
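This overwrite behavior is easy to see in a small sketch (the class and variable names here are just for illustration):

```java
import java.util.*;

public class PutOverwrite {
    public static void main(String[] args) {
        Map<String, Integer> m = new TreeMap<String, Integer>();
        m.put("the", 1);
        m.put("the", 5);                  // replaces the old mapping for "the"
        System.out.println(m.get("the")); // prints 5
        System.out.println(m.size());     // prints 1: still only one entry
    }
}
```

The second call to put does not add a second entry; it wipes out the old value for that key.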
We looked at an interface in the Java class libraries called Map that is a generic interface. That means that we have to supply type information. Its formal description is Map<K, V>. This is different from the Queue interface in that it has two different types. That's because the map has to know what type of keys you have and what type of values you have. In our case, we have some words (Strings) that we want to associate with some counters (ints). We can't actually use type int because it is a primitive type, but we can use type Integer.
We are going to use a slight variation of Map known as SortedMap. A SortedMap is one that keeps its keys in sorted order. For us, that would mean that the words from the file will be kept in sorted order, which is a nice feature to have. More importantly, you'll need to use a SortedMap for your homework assignment, so we want to practice using that one.
So our map would be of type SortedMap<String, Integer>. In other words, it's a map that keeps track of String/Integer pairs (this String goes to this Integer). SortedMap is the name of the interface, but it's not an actual implementation. The implementation we will use is TreeMap. So we can construct a map called "count" to keep track of our counts by saying:
SortedMap<String, Integer> count = new TreeMap<String, Integer>();

There are only a few methods that we'll be using from the SortedMap interface. The most basic allow you to put something into the map (an operation called put) and to ask the map for the current value of something (an operation called get).
I asked what code we need to record the word in our map. Someone suggested using the put method to assign it to a count of 1. So our loop becomes:
SortedMap<String, Integer> count = new TreeMap<String, Integer>();
while (input.hasNext()) {
    String next = input.next().toLowerCase();
    count.put(next, 1);
}

This doesn't quite work, but it's getting closer. Each time we encounter a word, it adds it to our map, associating it with a count of 1. This will figure out what the unique words are, but it won't have the right counts for them.
I asked people to think about what to do if a word has been seen before. In that case, we want to increase its count by 1. That means we have to get the old value of the count and add 1 to it:
count.get(next) + 1

and make this the new value of the counter:

count.put(next, count.get(next) + 1);

So we have two different calls on put. We want to call the first one when the word is first seen and call the second one if it's already been seen. Someone suggested using an if/else for this. The only question is what test to use. The SortedMap interface includes a method called containsKey that tests whether or not a given value is one of the keys stored in the map. Using this method, we modified our code to be:
SortedMap<String, Integer> count = new TreeMap<String, Integer>();
while (input.hasNext()) {
    String next = input.next().toLowerCase();
    if (!count.containsKey(next)) {
        count.put(next, 1);
    } else {
        count.put(next, count.get(next) + 1);
    }
}

The first time we see a word, we call the put method and say that the map should associate the word with a count of 1. Later we call put again with a higher count. And we keep calling put every time the count goes up. What happens to the old values that we had put in the map previously? The way the map works, each key is associated with only one value. So when you call put a second or third time, you are wiping out the old association. The new key/value pair replaces the old key/value pair in the map.
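To see the whole idea working end to end, here is a self-contained variant of the loop above that counts words from a String instead of a file, so it can be run without a copy of Moby Dick on hand (the sample text and class name are made up for illustration; Scanner happily tokenizes a String the same way it tokenizes a file):

```java
import java.util.*;

public class CountDemo {
    public static void main(String[] args) {
        String text = "The cat saw the dog and the dog saw the cat";
        SortedMap<String, Integer> count = new TreeMap<String, Integer>();
        Scanner input = new Scanner(text);
        while (input.hasNext()) {
            String next = input.next().toLowerCase();
            if (!count.containsKey(next)) {
                count.put(next, 1);                    // first occurrence
            } else {
                count.put(next, count.get(next) + 1);  // replace the old count
            }
        }
        // TreeMap keeps keys sorted, so this prints:
        // {and=1, cat=2, dog=2, saw=2, the=4}
        System.out.println(count);
    }
}
```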
Then we talked about how to print the results. Clearly we need to iterate over
the entries in the map. One way to do this is to request what is known as the
"key set". The key set is the set of all keys contained in the map. The Java
documentation says that it will be of type Set
One final point I made about the SortedMap interface is that you can associate
just about anything with just about anything. In the word counting program, we
associated strings with integers. You could also associate strings with
strings. One thing you can't do is to have multiple associations in a single
map. For example, if you decide to associate strings with strings, then any
given string can be associated with just a single string. But there's no
reason that you can't have the second value be structured in some way. You can
associate strings with arrays or strings with ArrayLists.
Then I switched to talking about grammars. We are going to use an approach to
describing grammars that is known as a "production system". It is well known
to those who study formal linguistics. Computer scientists know a lot about
them because we design our own languages like Java. This particular style of
production is known as BNF (short for
Backus-Naur Form). Each production describes the rules for a particular
nonterminal symbol. The nonterminal appears first followed by the symbol
"::=" which is usually read as "is composed of". On the right-hand side of the
"::=" we have a series of rules separated by the vertical bar character which
we read as "or". The idea is that the nonterminal symbol can be replaced by
any of the sequences of symbols appearing between vertical bar characters.
We can describe the basic structure of an English sentence as follows:
I pointed out that you can draw a diagram of how to derive a sentence from the
BNF grammar. Wikipedia has an example of this under the entry for parse tree.
Then we "drilled down" a bit into what a noun phrase might look like. I
suggested that the simplest form of noun phrase would be a proper noun, which I
expressed this way:
I pointed out that it is important to realize that the input is "tokenized"
using white space. For example, the text "Han Solo" is broken up into
two separate tokens. So it's not a single terminal, it's actually two
different terminals.
At this point I mentioned the fact that we're going to use a slight variation
of BNF notation. To keep things simple, we'll use just a simple colon in place
of the "::=" in the rules above. So our three rules became:
I also pointed out that these are not proper sentences because they contain the
nonterminal symbol <vp>. That's because we never finished our grammar. We
haven't yet defined what a verb phrase looks like. Notice that the program
doesn't care about whether or not something is enclosed in the less-than and
greater-than characters, as in "<vp>". That's a convention that is often
followed in describing grammar, but that's not how our program is
distinguishing between terminals and nonterminals. As mentioned earlier,
anything that appears to the left of a colon is considered a nonterminal and
every other token is considered a terminal.
Then I said that there are other kinds of noun phrases than just proper nouns.
We might use a word like "the" or "a" followed by a noun. I asked what those
words are called and someone said they are determiners. So we added a new rule
to the grammar:
We asked the program to generate 5 <np> and we got something like this:
When we ran the program again, we started by asking for 5 adjective phrases and
got a result like this:
This produced even more interesting sentences, as in the following 10:
for (String word : count.keySet()) {
// process word
}
We would read this as, "for each String word that is in count.keySet()..."
To process the word, we simply print it out along with its count. How do we
get its count? By calling the get method of the map:
for (String word : count.keySet()) {
System.out.println(count.get(word) + "\t" + word);
}
I didn't try to print all of the words in Moby Dick because it would
have produced too much output. Instead, I had it show me the counts of words
in the program itself. Obviously for large files we want some mechanism to
limit the output. At that point I passed out the handout with my commented
solution. In that version, I include some extra code that asks for a minimum
frequency to use. We ran that on Moby Dick and saw this list of words
that occur at least 500 times:
What is the name of the text file? moby.txt
Minimum number of occurrences for printing? 500
4571 a
1354 all
587 an
6182 and
563 are
1701 as
1289 at
973 be
1691 but
1133 by
1522 for
1067 from
754 had
741 have
1686 he
552 him
2459 his
1746 i
3992 in
512 into
1555 is
1754 it
562 like
578 my
1073 not
506 now
6408 of
933 on
775 one
675 or
882 so
599 some
2729 that
14092 the
602 their
506 there
627 they
1239 this
4448 to
551 upon
1567 was
644 were
500 whale
552 when
547 which
1672 with
774 you
Although I show the output here as being lined up, it didn't look that way in
jGRASP. For some reason jGRASP is handling tab characters badly in output.
<s> ::= <np> <vp>
We would read this as, "A sentence (<s>) is composed of a noun phrase (<np>)
followed by a verb phrase (<vp>)." The symbols <s>, <np> and <vp> are known as
"nonterminals" in the grammar. That means that we don't expect them to appear
in the actual sentences that we form from the grammar.
<np> ::= <pn>
So then I asked people for examples of proper nouns and we ended up with this
rule:
<pn> ::= Matt | Han Solo | New York | Trogdor | Pikachu | Michael Jackson
Notice that the vertical bar character is being used to separate different
possibilities. In other words, we're saying that "a proper noun is either Matt
or Han Solo or New York or Trogdor..." These values on the right-hand side are
examples of "terminals". In other words, we expect these to be part of the
actual sentences that are formed.
<s>: <np> <vp>
<np>: <pn>
<pn>: Matt | Han Solo | New York | Trogdor | Pikachu | Michael Jackson
I saved this file and ran the program. It read the file and began by
saying:
Available symbols to generate are:
[<np>, <pn>, <s>]
What do you want generated (return to quit)?
I pointed out that we are defining a nonterminal to be any symbol that appears
to the left of a colon in one of our productions. The input file has three
productions and that is why the program is showing three nonterminals that can
be generated by the grammar. I began by asking for it to generate 5 of the
"<pn>" nonterminal symbol and got something like this:
Michael Jackson
New York
Trogdor
Michael Jackson
Matt
In this case, it is simply choosing at random among the various choices for a
proper noun. Then I asked it for five of the "<s>" nonterminal symbol and got
something like this:
Michael Jackson <vp>
Michael Jackson <vp>
Trogdor <vp>
Michael Jackson <vp>
Pikachu <vp>
In this case, it is generating 5 random sentences that involve choosing 5
random proper nouns. So far the program isn't doing anything very interesting,
but it's good to understand the basics of how it works.
<det>: the | a | an | some | this
Using this, we changed our rule for <np>:
<np>: <pn> | <det> <n>
Notice how the vertical bar character is used to indicate that a noun phrase is
either a proper noun or it's a determiner followed by a noun. This required
the addition of a new rule for nouns and I again asked for suggestions from the
audience:
<n>: kite | ball | house | tornado | bulldozer | mat | narwhal | phaser | laser cat
At this point the overall grammar looked like this:
<s>: <np> <vp>
<np>: <pn> | <det> <n>
<pn>: Matt | Han Solo | New York | Trogdor | Pikachu | Michael Jackson
<det>: the | a | an | some | this
<n>: kite | ball | house | tornado | bulldozer | mat | narwhal | phaser | laser cat
We saved the file and ran the program again. Because there are five rules in
the grammar, it offered five nonterminals to choose from:
Available symbols to generate are:
[<det>, <n>, <np>, <pn>, <s>]
What do you want generated (return to quit)?
Notice that the nonterminals are in alphabetical order, not in the order in
which they appear in the file. That's because they are stored as the keys of a
SortedMap that keeps the keys in sorted order.
this bulldozer
an house
New York
this house
this narwhal
In this case, it is randomly choosing between the "proper noun" rule and the
other rule that involves a determiner and a noun. It is also then filling in
the noun or proper noun to form a string of all terminal symbols. I also asked
for five of the nonterminal symbol <s> and got something like this:
New York <vp>
some kite <vp>
the kite <vp>
Han Solo <vp>
some laser cat <vp>
This is getting better, but we obviously need to include something for verb
phrases. We discussed the difference between transitive verbs that take an
object (a noun phrase) and intransitive verbs that don't. This led us to add
the following new rules:
<vp>: <tv> <np> | <iv> | <adv> <vp>
<tv>: hit | hugged | defenstrated | grokked | laughed at | spooned | smoked
<iv>: died | exploded | imploded | wept | leveled up | evolved
We saved the file and ran the program again and each of these three showed up
as choices to generate:
Available symbols to generate are:
[<det>, <iv>, <n>, <np>, <pn>, <s>, <tv>, <vp>]
What do you want generated (return to quit)?
Now when we asked for 10 sentences (10 of the nonterminal <s>), we got more
interesting results like these:
an tornado frolicked
Pikachu kicked some bulldozer
some phaser stole a tornado
the phaser wept
New York kicked this house
this phaser frolicked
Pikachu compiled Han Solo
this house ignited
New York touched this narwhal
a kite ignited
Then we decided to spice up the grammar a bit by adding adjectives. We added a
new rule for individual adjectives:
<adj>: furry | moist | nauseous | shiny | warm | nautical | delicious | superfluous
Then we talked about how to modify our rule for noun phrases. We kept our old
combination of a determiner and a noun, but added a new one for an article and
a noun with an adjective in the middle:
<np>: <pn> | <det> <n> | <det> <adj> <n>
But you might want to have more than one adjective. So we introduced a new
nonterminal for an adjective phrase:
<np>: <pn> | <det> <n> | <det> <adjp> <n>
Then we just had to write a production for <adjp>. We want to allow one
adjective or two or three, so we could say:
<adjp>: <adj> | <adj> <adj> | <adj> <adj> <adj>
This is tedious and it doesn't allow four adjectives or five or six. This is a
good place to use recursion:
<adjp>: <adj> | <adj> <adjp>
We are saying that in the simple case or base case, you have one adjective.
Otherwise it is an adjective followed by an adjective phrase. This recursive
definition is simple, but it allows you to include as many adjectives as you
want.
warm warm furry moist
delicious
superfluous
nautical nauseous delicious shiny
nautical
Notice that sometimes we get just one adjective ("delicious") and sometimes we
get several because it chooses randomly between the two different rules we
introduced for adjective phrase.
New York evolved
an superfluous mat spooned Trogdor
the moist kite imploded
some bulldozer hit an nauseous nauseous laser cat
an kite exploded
the shiny narwhal exploded
the nauseous phaser died
Michael Jackson evolved
Pikachu grokked the ball
the kite hit an phaser
We made one last set of changes to the grammar to include adverbs and ended up
with this final version of the grammar:
<s>: <np> <vp>
<np>: <pn> | <det> <n> | <det> <adjp> <n>
<pn>: Matt | Han Solo | New York | Trogdor | Pikachu | Michael Jackson
<det>: the | a | an | some | this
<n>: kite | ball | house | tornado | bulldozer | mat | narwhal | phaser | laser cat
<adj>: furry | moist | nauseous | shiny | warm | nautical | delicious | superfluous
<adjp>: <adj> | <adj> <adjp>
<vp>: <tv> <np> | <iv> | <adv> <vp>
<tv>: hit | hugged | defenstrated | grokked | laughed at | spooned | smoked
<iv>: died | exploded | imploded | wept | leveled up | evolved
<adv>: slowly | hungrily | viciously | dangerously
Below are 25 sample sentences generated by the grammar:
an bulldozer grokked Pikachu
an nauseous tornado hungrily spooned Han Solo
an superfluous furry nauseous house died
this mat defenstrated New York
a mat slowly dangerously evolved
the kite hungrily smoked this kite
this narwhal died
this mat leveled up
Matt hungrily defenstrated the furry phaser
the furry mat defenstrated this mat
a superfluous nauseous narwhal imploded
this superfluous ball grokked Pikachu
an superfluous nauseous warm furry delicious furry phaser hit some bulldozer
an nauseous nautical warm narwhal slowly slowly died
an moist moist superfluous narwhal leveled up
the nauseous superfluous delicious shiny kite wept
Trogdor hit an house
an superfluous superfluous phaser leveled up
Matt spooned some delicious laser cat
a warm delicious mat dangerously hungrily smoked the narwhal
Han Solo exploded
Michael Jackson evolved
Michael Jackson laughed at the phaser
this nauseous nauseous phaser smoked Matt
some nauseous warm ball spooned a kite
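The generation process described in these notes can be sketched in a few lines of Java. This is not the actual assignment code; the storage scheme (a SortedMap from each nonterminal to its rules) and the names here are assumptions made for illustration. It shows the recursive idea: a terminal comes back unchanged, while a nonterminal picks one of its rules at random and expands each symbol in that rule.

```java
import java.util.*;

public class GrammarSketch {
    // Hypothetical storage: nonterminal -> array of rules (alternatives).
    private static final SortedMap<String, String[]> rules = new TreeMap<String, String[]>();
    private static final Random rand = new Random();

    // Recursively expand a symbol into a string of all terminals.
    public static String generate(String symbol) {
        if (!rules.containsKey(symbol)) {
            return symbol;  // base case: a terminal generates itself
        }
        String[] options = rules.get(symbol);
        String choice = options[rand.nextInt(options.length)].trim();
        StringBuilder result = new StringBuilder();
        for (String part : choice.split("\\s+")) {  // tokenize on white space
            if (result.length() > 0) {
                result.append(" ");
            }
            result.append(generate(part));
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // A tiny grammar in the notation from the notes, split on "|".
        rules.put("<s>", "<np> <vp>".split("\\|"));
        rules.put("<np>", "<pn> | <det> <n>".split("\\|"));
        rules.put("<pn>", "Matt | Han Solo".split("\\|"));
        rules.put("<det>", "the | a".split("\\|"));
        rules.put("<n>", "kite | narwhal".split("\\|"));
        rules.put("<vp>", "died | wept".split("\\|"));
        // TreeMap keys come out in sorted order, just as in the notes:
        // Symbols: [<det>, <n>, <np>, <pn>, <s>, <vp>]
        System.out.println("Symbols: " + rules.keySet());
        System.out.println(generate("<s>"));
    }
}
```

Note that because the rule text is tokenized on white space, "Han Solo" is expanded as two separate terminals, which is exactly the tokenization point made earlier in the notes.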
Stuart Reges
Last modified: Sun May 2 19:01:53 PDT 2010