Lecture 1: introduction; types of measures; types of studies

This class discusses the appropriate way to evaluate a technique or tool.
 * example technique: code inspections/reviews
 * example tool: visualization
Sometimes the best way is via a user study, sometimes not. The key questions to ask are:
 * What to measure?
 * How to measure it?

External vs. internal measures

Architecture: Every (computer) system can be viewed as a collection of interconnected, communicating parts. The box-and-arrow diagram that shows these parts and how they communicate is called the architecture diagram. We can open up any of the black boxes to find that, under the hood, it is itself a system of smaller communicating parts.

When you evaluate a system, it is good to take both external and internal measurements.
 * The external measurements are more important: if the system doesn't do what it is supposed to, then the system's composition and the behavior of its components don't matter.
 * The internal measurements are useful in understanding why the system behaves well or badly. As a simple example, if you have a system that is implemented as a series of filters, then it is very valuable to know the relative contribution of each filter to the result. This helps you decide how to improve the system, and it helps you understand which parts are most important.
 * The ultimate external measurement is utility to users! It's tempting to get tied up in the technology you have created and focus on (external) measurements of it that are ultimately internal measurements with respect to the real world.

There are many valuable types of evidence. Choose one that is appropriate to the situation. (Ask yourself, "In what respect am I or my community most skeptical?", and work on that first. Make sure that your results are *actionable*.) Often this will not be a user study, but sometimes it is. Beware of "obvious" facts that are not true or are not supported by appropriate evidence.
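The "series of filters" example above can be sketched as a small measurement harness. Everything here is hypothetical (the filters, the data, and the scoring function); the point is the pattern of taking one external measurement of the whole pipeline and then internal measurements by re-running with each filter ablated.

```python
# Hypothetical two-stage filter pipeline over a list of words.
def remove_short(words):       # filter 1: drop very short words
    return [w for w in words if len(w) > 2]

def remove_stopwords(words):   # filter 2: drop common stopwords
    return [w for w in words if w not in {"the", "and", "for"}]

def score(words):              # stand-in for an external quality measure
    return len(set(words))

pipeline = [("remove_short", remove_short),
            ("remove_stopwords", remove_stopwords)]

data = ["the", "cat", "sat", "on", "on", "a",
        "mat", "and", "the", "cat"]

# External measurement: run the whole pipeline and score the result.
result = data
for _, f in pipeline:
    result = f(result)
print("full pipeline score:", score(result))

# Internal measurement: re-run with each filter ablated, to see
# that filter's relative contribution to the final score.
for i, (name, _) in enumerate(pipeline):
    result = data
    for j, (_, f) in enumerate(pipeline):
        if j != i:
            result = f(result)
    print(f"without {name}: score {score(result)}")
```

Comparing the full-pipeline score against each ablated score is exactly the kind of internal measurement that tells you which component to improve next.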
Types of user studies

Here is a useful matrix of types of user studies. (Most references state that the three main types of empirical evaluations are surveys, case studies, and controlled experiments.)

                    How to obtain evidence
  Goal        qualitative          quantitative
              (few datapoints)     (many datapoints)
  Ask         interview            survey
  Observe     case study           controlled experiment

An interview is asking an informant for verbal/textual information; it is often open-ended, though some questions may be set.

A survey is asking a group of people the same questions, to get a sense for common beliefs of a group of people.

A case study is observing someone in a new situation, such as using a new methodology or tool. It is often performed when you don't know what the outcome will be or the ways in which the users will react. For example, you put a tool in the hands of some users and see what happens. Ethnographic studies are a type of long-term case study, in which a sociologist lives with a group for an extended time in order to understand (the causes of) behaviors that would not be apparent from a brief observation. Longitudinal studies are also a type of long-term case study, in which replication occurs with the same subjects over time.

A controlled experiment is observing the same task being done multiple times in different conditions. The control condition emulates the state of the practice. The treatment condition makes some change, such as administering a medicine or displaying a visualization. The experimenter compares the outcomes of the trials in order to infer the effect of the treatment.

The boundaries of the four quadrants of the matrix are a bit fuzzy, because most user studies that you will perform will provide you with both qualitative and quantitative data. For example, suppose that you perform a survey. You are likely to include a few open-ended questions. Even if you do not, you may be able to make generalizations from the data about groups of people or correlated survey questions.
Likewise, you can "code" answers given in an interview (that is, put the responses into categories) and obtain some numeric data. Nonetheless, the matrix is helpful in understanding the distinct types of user studies, and you should think about what your primary goals are. A common progression through the quadrants for a project is: ask+qual, ask+quant, observe+qual, observe+quant.

I have a bias toward studies that observe, over those that ask a person's opinion. A person's opinion may be mistaken, or the person may not understand their own beliefs. Here are two examples of the latter.

1. John F. Kennedy won the presidential election by a very small margin. When people were polled just after the election, 50% of them said they had voted for Kennedy and 50% for Nixon. Two years later, in 1962, 60% of people said they had voted for Kennedy. The next year, after Kennedy's assassination in 1963, 2/3 of people said they had voted for him.

2. Cialdini et al. asked homeowners what argument would motivate them best to save water. They responded, in order:
     1. It protects the environment.
     2. It benefits society.
     3. It saves money.
     4. A lot of other people are doing it.
   Then, the researchers put placards on doorknobs and observed the homeowners' behavior. They found that the most effective argument (by a lot) was "A lot of other people are doing it", which people had predicted would be least effective.

At the end of the day, facts and actions matter more than beliefs and opinions. So, you should do your best to get to the "observe" row as quickly as possible. Asking people does have its place: it is very inexpensive, and it is an effective way to learn about situations for which you have very limited information.

Observational techniques require that you have designed, and probably implemented/documented/whatever, the treatment being evaluated. There are ways around this. Paper prototypes enable you to do a case study of a GUI without having implemented it.
There is a story of testing early ATM cash machines by having a person inside a box push money out a slot.

The appropriate user study (or none!) depends on the maturity of the research project. A common progression is:
 * show something is possible: the researcher does a task
 * show that it can be useful: replicate once via a case study
 * show that it is generally useful and usable: replicate multiple times via experiments

What the class is really about

The work we will do in this class is not specific to software engineering, although we will often choose examples from that domain. The ideas are generally applicable regardless of your object of investigation.

The work we are doing is real computer science. Many computer scientists shy away from squishy subjects that involve people. They are hard to understand, manage, and abstract, but they are among the most important topics in computer science.

The hidden curriculum of this class is to teach you how to do research. That's why there is a project rather than a set of assignments, for example. Even if you don't care about user studies, you will become a better researcher. Even if you don't care about software engineering, you will become better at doing user studies.