CSE143 Notes for Friday, 11/7/08

I discussed the next programming assignment. The program searches for all ways to form anagrams of a particular phrase. For example, if the phrase is "george bush", then some combinations of words that constitute anagrams are [bee, go, shrug] or [bugs, go, here] or [go, he, bus, erg]. An anagram is a rearrangement of the letters. Each of these word combinations has the same set of letters as "george bush". The program uses a dictionary of words to find all combinations that are anagrams of the given phrase.

In solving this problem, one of the questions your program will have to consider is whether a particular word is relevant to the problem you're trying to solve. We used the "george bush" example and looked at words from the short testing dictionary called dict1.txt. The first in the list is "abash". That word is not relevant to the problem of finding anagrams of "george bush" because there are two a's in "abash" but no a's in "george bush". So that's not a word that matters to us. The same is true of the words "aura", "bar" and "barb" because they all have a letter "a" in them and "george bush" has no occurrences of the letter "a".

The first word that we find is relevant is the word "bee". All of the letters of "bee" appear in "george bush". So that means that we need to explore this possibility. We do that by taking the letters from "bee" away from the letters for "george bush". That leads to a new set of letters that we could use to continue the process.

So how does this relate to backtracking? The potential solution space is the set of all combinations of words from the dictionary. We can think of this as a decision tree by thinking in terms of picking a first word, then picking a second word, then picking a third word, and so on. So at the top of our decision tree, the choice we are making is what word should come first and the various possibilities are the dictionary of words we've been given. If we use the words from dict1.txt, it looks like this:

                         first word?
                              |
      +--------+-------+------+-------+------+------+ ...
      |        |       |      |       |      |      |
    "abash"  "aura"  "bar"  "barb"  "bee"  "beg"  "blush"  ...

Obviously we only want to follow paths that make sense. So when we see that words like "abash" and "aura" don't make sense for the phrase "george bush", we don't follow them. The first one we'd really explore is the one that involves choosing "bee". Once we've chosen "bee" as the first word, at the next level of the tree, we consider all possible choices for a second word:

                         first word?
                              |
      +--------+-------+------+-------+------+------+ ...
      |        |       |      |       |      |      |
    "abash"  "aura"  "bar"  "barb"  "bee"  "beg"  "blush"  ...
                                      |
                                 second word?
                                      |
              +--------+-------+------+-------+------+------+ ...
              |        |       |      |       |      |      |
           "abash"  "aura"  "bar"  "barb"  "bee"  "beg"  "blush"  ...

At each level, we go through all the words. Remember that backtracking involves an exhaustive search, so you just try all possibilities that make sense. Of course, at this second level, we wouldn't be dealing with the same set of letters as before. At the first level we were looking for words that could be used to form the phrase "george bush". At the second level, we've already accounted for the letters in the first word ("bee"), so now we are searching for something that matches what's left ("gorg ush").

I mentioned that the low-level details for this problem involve keeping track of how many of each letter we have, subtracting the letters for a word from the letters for the phrase, figuring out whether that subtraction is going to work, and so on. All of these details are handled by the LetterInventory class that we wrote for assignment #2. The inventory objects keep track of how many of each letter you have and the "subtract" method sees whether you can subtract one set of letters from another set of letters.

So let's revisit some of this in terms of LetterInventory objects. Imagine making a LetterInventory object for "george bush". If we printed it, we'd get as output [beegghorsu]. We'd find that when we try to subtract the inventory for words like "abash" and "aura", it fails (subtract returns a null). The first call that succeeds is the call that subtracts the inventory for the word "bee". If we printed the new inventory, we'd get [gghorsu]. As we explore various words we could subtract from this, we find that we can successfully subtract the inventory for "go" to get an inventory that prints as [ghrsu]. And from that we can subtract the inventory for "shrug" to get an inventory that prints as [] (in other words, an empty inventory).

For the anagram problem, finding an empty LetterInventory is like getting to column 9 in the 8 queens problem. It means we've found a solution. It means that in our various explorations, we've found a sequence of words that we can successfully subtract from the original to get down to an empty inventory. That means we've accounted for every letter of the original with the current combination of words that we're exploring, which means this is a solution that we'd want to report.

That's how the anagram problem can be solved with backtracking. Each level of the decision tree (which means each invocation of the recursive method) involves choosing one word. Which choices do we pursue? Only those where we can successfully subtract the word's inventory from the current inventory. And as we proceed to lower levels in the tree, we use the new inventories that the subtract method gives us as the problem to solve at that level. In other words, the overall problem involves a phrase like "george bush", but as we make specific choices, that leads to new smaller problems that involve a smaller inventory of letters because as we choose words, we account for more and more of the letters of the original phrase. And whenever we get to an empty inventory in our recursive backtracking, we know we've found a solution that should be printed.

I mentioned that in some sense this can be the easiest assignment of the quarter. The solution is short (mine was 45 lines long) and we're being very specific about how to approach it. Recursive backtracking is a very specific technique and we're telling you exactly how to apply it to this problem. So you could, theoretically, be done in an hour. Most likely it will take a lot longer because this is your first exposure to backtracking and it will take you a while to figure out how all of this works.

I advised people to make sure that you understand how the 8 queens solution works, because if you understand it, you are in a much better position to write the anagram code. The anagram backtracking is very similar. Most recursive backtracking solutions look very similar when you come to understand the technique and how the recursive code works.

I spent the last few minutes pointing out two optimizations that I am requiring you to include. First, it's clear that we will often be needing the LetterInventory for a particular word. There is no reason to do that more than once for any given word, so I'm asking you to "preprocess" the dictionary to compute a LetterInventory for each word. How would we store these results? For each word we end up with a LetterInventory and we want to be able to quickly access the LetterInventory for any particular word. This is a perfect application of a Map. Remember that a Map is composed of key/value pairs, associating each key with a specific value. Here the keys are the words and the values are the LetterInventory objects.

Another optimization I've asked you to perform is to prune the dictionary before you begin the recursive backtracking. In the examples above I was working with the entire dictionary for the backtracking, but you can cut this down considerably by going through the dictionary before you begin the recursion and picking out just the words that are relevant. For example, if we're working with the phrase "george bush", we'd throw out words like "abash" and "aura" right away because they could never be part of a solution.

Stuart Reges

Last modified: Thu Nov 20 09:49:41 PST 2008