Suppose you are building a software agent that is supposed to support UW students in pursuing their education. You are competing to produce the most intelligent agent of this type, and the winning entry will be named the "Husky Helper". The intent of the contest is to produce the "most intelligent" tool.

What characteristics might qualify a tool as "intelligent"? Some possibilities are discussed below. What features do you feel would be most important to have in order to win the contest, and how could the judges measure these features?
However, some part of the tool's change could come from learning from all users. In general, the more data the tool has, the better able it will be to make correct learning decisions (i.e., decisions about how to change itself). The part of the tool's change that differs per user should be the part on which different users have different preferences or needs. For everything else, the common needs, the tool might appear to be changing in the same way for all students -- this would be legitimate.
Here's an example: a speech recognition system might improve
if it had more words in its dictionary. It could get new words from
anywhere (e.g., from a product upgrade, or each user might be asked to
help by entering words the system had trouble with) and then make them
available to all users. But it should adapt to each user's accent and
style of speaking separately, as sketched below.
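A minimal Python sketch of this split (the class, method names, and the
usage below are hypothetical, not any real speech API): the dictionary
lives at the class level and grows for everyone, while the accent
profile lives on each user's own instance.

    # Hypothetical sketch: vocabulary is shared by all users, accent
    # adaptation is kept separate for each user.
    class SpeechRecognizer:
        shared_dictionary = set()      # one copy; new words benefit every user

        def __init__(self, user_id):
            self.user_id = user_id
            self.accent_profile = {}   # per-user: tuned to this speaker only

        @classmethod
        def add_word(cls, word):
            # Words can come from anywhere (product upgrade, user entry)
            # and become immediately available to all users.
            cls.shared_dictionary.add(word.lower())

        def adapt_to_speaker(self, word, pronunciation):
            # Accent and speaking style are learned per user.
            self.accent_profile[word] = pronunciation

    # A word added in one user's session is visible to another user's,
    # but accent adaptation is not shared.
    alice, bob = SpeechRecognizer("alice"), SpeechRecognizer("bob")
    SpeechRecognizer.add_word("Suzzallo")
    alice.adapt_to_speaker("Suzzallo", "suh-ZAH-low")
    assert "suzzallo" in bob.shared_dictionary
    assert "Suzzallo" not in bob.accent_profile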
Recovering from failures is an important topic in "dialog systems"
that use natural language.
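One common recovery strategy, sketched below in Python (the recognize()
function, its return values, and the thresholds are all assumptions, not
from any particular system): act only on confident interpretations,
confirm doubtful ones, and ask the user to rephrase otherwise.

    # Hypothetical sketch of failure recovery in a dialog system.
    def handle_utterance(utterance, recognize, threshold=0.7):
        # recognize() is an assumed function returning (intent, confidence).
        intent, confidence = recognize(utterance)
        if confidence >= threshold:
            return f"OK: {intent}"               # confident: act on it
        if confidence >= threshold / 2:
            return f"Did you mean '{intent}'?"   # doubtful: confirm first
        return "Sorry, I didn't catch that. Could you rephrase?"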
Safety is a very significant issue for agents that can make changes,
not just gather information.
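One way to make that concrete (the action names and callbacks here are
hypothetical): let information-gathering actions run freely, but require
explicit confirmation before any action that changes state.

    # Hypothetical sketch: queries run freely; state-changing actions
    # (registering for or dropping a course) require confirmation.
    READ_ONLY_ACTIONS = {"lookup_course", "check_schedule"}

    def execute(action, args, perform, confirm):
        # perform() and confirm() are assumed callbacks supplied by the agent.
        if action in READ_ONLY_ACTIONS:
            return perform(action, args)     # safe: only gathers information
        if confirm(f"About to run {action}({args}). Proceed?"):
            return perform(action, args)     # change approved by the user
        return "Cancelled."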
Much of the knowledge the tool might need will be in common among all
users.
And if the tool were fairly "raw" to begin with and needed to be trained,
it would be appropriate to do a lot of training before releasing the tool
(or before submitting it for judging).
A caution: we may not be able to distinguish whether the tool is figuring out how to do these things, or whether the tool designer just wrote them into the tool. One possible way to tell these apart: we don't expect the tool designer to have thought of everything, so a tool whose behavior was simply written in would have boundaries it couldn't go beyond, while a tool that learns could eventually cross them.
A problem in rating tools that perform different tasks is:
How do we take into account the difficulty of the task that they're trying
to perform? Here, we might need to rely on experts in user interfaces,
who would know the relative difficulty of the tasks. For instance,
a natural language interface that uses speech is harder than one that uses
typed sentences. A tool that has a small, fixed set of "domains"
for which it can recognize requests, and always picks one of them, is simpler
than one that tries to determine when the user is talking about a domain
it doesn't know about. And it would be considerably harder for the
tool to try to add that new domain to its repertoire!
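A sketch of the difference (the scoring functions and threshold are
assumptions): the simpler tool always returns the best-scoring domain,
while the harder one also reports when no known domain fits -- the first
step toward noticing a domain it doesn't know about.

    # Hypothetical sketch: instead of always forcing a request into one
    # of a fixed set of domains, report "unknown" when no domain scores
    # well enough.
    def classify_domain(request, domain_scorers, threshold=0.5):
        # domain_scorers: assumed mapping of domain name -> scoring function.
        scores = {name: score(request) for name, score in domain_scorers.items()}
        best = max(scores, key=scores.get)
        if scores[best] < threshold:
            return "unknown"   # the simpler tool would just return best here
        return best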