From: Martha Mercaldi (mercaldi@cs.washington.edu)
Date: Sat Dec 04 2004 - 22:56:08 PST
PROVERB: The Probabilistic Cruciverbalist
Greg A. Keim, Noam M. Shazeer, Michael L. Littman
Summary:
This paper presents the design, architecture, and evaluation of PROVERB,
a crossword-puzzle-solving system.
Two primary ideas:
Most of the words in this paper were spent describing the architecture
and various ways software might try to solve a crossword (approximating
solver experience via a database, classifying clue types, identifying
and describing several solution techniques). I presume this was the
first paper on the project, and that is the reason for such detail.
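To make the database idea concrete, here is a toy sketch of my own (not
code from the paper) of an "expert" module that looks a clue up in a
store of previously seen clue/answer pairs and returns length-matched
candidates weighted by frequency; the clue_db contents are invented for
illustration.

from collections import Counter

# Invented toy database of previously seen clue/answer pairs.
clue_db = {
    "nome is here": Counter({"ALASKA": 5, "THELASTFRONTIER": 1}),
    "capital of italy": Counter({"ROME": 9}),
}

def database_expert(clue, length):
    """Return (answer, probability) pairs for answers of the required length."""
    counts = clue_db.get(clue.lower().strip(), Counter())
    matches = {a: c for a, c in counts.items() if len(a) == length}
    total = sum(matches.values())
    return [(a, c / total) for a, c in matches.items()] if total else []

print(database_expert("Nome is here", 6))   # [('ALASKA', 1.0)]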
The second half of the paper evaluated the solver’s performance on a
set of “normal” puzzles as well as in tournament play. The authors
identify some clear successes, in particular the use of probabilistic
reasoning when merging proposed targets to fill the grid. I was
disappointed that this analysis didn’t go further: in addition to
discussing the clues the program solved correctly, it could also have
investigated the clues that stumped it.
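As an illustration of what probabilistic merging can mean, here is a
simplified sketch of my own (not the paper's algorithm): each slot gets
a weighted candidate list from the expert modules, and we search for the
joint fill that satisfies the crossing constraints while maximizing the
product of candidate probabilities. The slots, words, and probabilities
are invented for the example.

from itertools import product

# Toy puzzle: two across slots and one down slot that crosses both.
candidates = {
    "1A": [("ALASKA", 0.6), ("ATTICA", 0.4)],
    "5A": [("NOME", 0.7), ("ROME", 0.3)],
    "1D": [("AN", 0.5), ("AT", 0.5)],
}
# crossings[(a, b)] = (i, j): slot a's i-th letter must equal slot b's j-th.
crossings = {
    ("1A", "1D"): (0, 0),
    ("5A", "1D"): (0, 1),
}

def consistent(fill):
    """True when every crossing constraint is satisfied."""
    return all(fill[a][i] == fill[b][j] for (a, b), (i, j) in crossings.items())

def merge(candidates):
    """Exhaustively score every consistent fill; fine for a toy grid."""
    slots = list(candidates)
    best, best_score = None, 0.0
    for choice in product(*(candidates[s] for s in slots)):
        fill = {s: word for s, (word, _) in zip(slots, choice)}
        score = 1.0
        for _, p in choice:
            score *= p
        if consistent(fill) and score > best_score:
            best, best_score = fill, score
    return best, best_score

print(merge(candidates))   # best fill is ALASKA / NOME / AN, score ~0.21

The paper's own merger is surely more sophisticated than brute-force
enumeration; the sketch is only meant to show the flavor of the problem:
crossing constraints let strong candidates in one slot reinforce or veto
candidates in another.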
One or two largest flaws:
A friend of mine who contributes crosswords to the NYT and has read
this paper explained that PROVERB’s poor performance at the tournament
was due primarily to one puzzle on which it failed miserably. In that
puzzle, all of the clues were “Spoonerized” (e.g. “Home is near” was
given instead of “Nome is here” as a clue for “Alaska”). Humans
(extremely clever ones, anyway) can detect this, but the software could
not. I think this is very interesting both anecdotally and in that it
reveals something about the flaws of the system. The common opinion is
that if the software had competed in any other year it would have made
a better showing. This leads to a criticism I had of the paper even
before learning the full story. I had been looking forward to an
investigation of what types of clues tripped up the program and why the
authors felt it failed.
This is a smaller complaint, but I personally would have liked to see
where in Table 2 (characterizing NYT puzzle difficulty by day) the
tournament puzzles fit. The assumption is that they were even more
difficult than the Saturday puzzles at the NYT, but it would have been
nice to show this.
Open research questions:
As my earlier comments suggest, I am very interested in the
“failure analysis” for this program. I think studying the clues that
stumped the existing system would, in the least interesting case, reveal
a shortcoming of the program and, in the most interesting case, reveal
something about how humans solve puzzles. This might produce a meaty
challenge for the AI community: incorporating some other aspect of
human thinking.
This paper talks about “expert” modules, and it would be interesting to
consult the real experts (the humans who beat PROVERB at the tournament)
to see how they classify and solve clues in their minds. In the current
architecture the clue-answering and merging steps are separated, but I
know that when I solve these puzzles the steps are interleaved, with
each one providing information that helps with the other. Perhaps
looping between the two steps would improve performance.
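As a rough sketch of the kind of loop I have in mind (again my own toy
code, not the paper's architecture), one could alternate between
committing to the most confident slot and re-scoring the remaining
candidate lists against the letters that commitment fixes, so the
merging step feeds information back into clue answering. This reuses
the toy candidates and crossings dictionaries from the merging sketch
above.

def interleaved_solve(candidates, crossings):
    """Greedy fill that feeds committed grid letters back into candidate lists."""
    filled = {}
    remaining = {slot: list(cands) for slot, cands in candidates.items()}
    while remaining:
        # Drop slots whose candidates were all pruned away; a real solver
        # would backtrack here rather than give up on them.
        remaining = {s: c for s, c in remaining.items() if c}
        if not remaining:
            break
        # Commit to the slot whose best candidate is most probable.
        slot = max(remaining, key=lambda s: max(p for _, p in remaining[s]))
        word = max(remaining[slot], key=lambda wp: wp[1])[0]
        filled[slot] = word
        del remaining[slot]
        # Feedback step: prune crossing candidates that clash with the
        # letters just committed to the grid.
        for (a, b), (i, j) in crossings.items():
            if a == slot and b in remaining:
                remaining[b] = [(w, p) for w, p in remaining[b] if w[j] == word[i]]
            elif b == slot and a in remaining:
                remaining[a] = [(w, p) for w, p in remaining[a] if w[i] == word[j]]
    return filled

print(interleaved_solve(candidates, crossings))
# {'5A': 'NOME', '1A': 'ALASKA', '1D': 'AN'}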