===========================================================================
A practical guide to controlled experiments of software engineering tools with human participants
Andrew J. Ko and Thomas D. LaToza and Margaret M. Burnett
Empirical Software Engineering journal

Controlled experiments with human participants are rare, appearing in 2% of papers. The paper claims that there is "still a need for more". What would the right percentage be?

Establishing your research questions is more important than most of the issues in this paper, but is out of scope for it.

"Improved understanding" of a phenomenon is a non-goal for researchers, and "improved understanding" of a program is a non-goal for programmers. In general, scientific research should have an actionable outcome.

The paper says the key tradeoff is control vs. artificiality. I view narrowness and cost as other important tradeoffs: controlled experiments are narrow and often more costly.

Human Subjects Board (HSB) or Institutional Review Board (IRB) approval: We'll skip over this because there is a blanket exception for coursework. Just know that long delays can be a pain.
* You still need to get informed consent from your subjects, permit them to opt out of tasks, etc.
* I will act as your IRB.

The paper underemphasizes the importance of doing pilot studies. It merely recommends them, but you should *always* do them, and you should not report their results in your paper unless your design was perfect and you made no changes afterward. Likewise, you should always do a "sandbox pilot" beforehand, in which you the experimenter try the tasks. Again, this should not be reported; it is just assumed that you did it.

Ceiling effects occur when participants run out of time to complete their task. If this happens, then timings become compressed (many end up at the same value) and are less useful -- and you may not be able to discriminate based on how many people succeeded at the task either.
* I feel you should usually aim for most people to complete the task. This is the opposite of the paper's advice.

Within-subjects counterbalancing: This is when each participant performs multiple tasks. That makes it possible to estimate the skill of each individual, which removes the single biggest source of variance in most tasks and thus makes the experiment more sensitive to the effect of the treatment. Note that learning/fatigue matters: subjects usually do better on their second task, but sometimes worse, so randomize the order in which tasks are done too.

Training: Subjects should always be trained on the tool, and the training should include using the tool on the kind of tasks being assessed. If you skip this, then you are essentially testing the tool's learnability by a rank novice rather than its usefulness, which is much more important.

Bias: Experimenters, when designing the experiment, tend to bias the tasks toward what the tool is good at.

End-to-end tasks are better than focused ones that ask a person to do some small part of a real task; they also demonstrate practical importance. For example, saving a few seconds on a task that is done rarely is not very important and would not show up as statistically significant in an end-to-end task. I prefer "found" tasks, such as fixing a real bug that was reported in an issue tracker. Be sure to describe the scope of your tasks, much as you do with research questions. There will inherently be unrealism in your task. What types of unrealism are most important?

Precision and recall are the standard measures for an information retrieval task. The F-measure is their harmonic mean, which can be thought of as a kind of average of the two.
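As a concrete illustration, here is a minimal Python sketch of precision, recall, and F1 for a tool that reports a set of candidate items (say, suspected fault locations) scored against a known ground truth. The function name and the example sets are made up for this note, not taken from the paper.

    # Hypothetical illustration: precision, recall, and F-measure (F1) for a tool
    # that reports a set of items, scored against a ground-truth set.
    def precision_recall_f1(reported, relevant):
        reported, relevant = set(reported), set(relevant)
        true_positives = len(reported & relevant)
        precision = true_positives / len(reported) if reported else 0.0
        recall = true_positives / len(relevant) if relevant else 0.0
        if precision + recall == 0:
            return precision, recall, 0.0
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
        return precision, recall, f1

    # The tool flags 4 locations; 3 of them are among the 6 real faults.
    print(precision_recall_f1({"a", "b", "c", "d"}, {"a", "b", "c", "e", "f", "g"}))
    # -> (0.75, 0.5, 0.6)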
Never trust the subject: always evaluate and measure what the subject did rather than taking their word for it, even if some subjectivity is involved.

With regard to software development, we generally care about
* correctness
* performance
* cost (= mostly human development time, but don't forget maintenance costs!)

===========================================================================
The Truth, the Whole Truth, and Nothing but the Truth: A Pragmatic Guide to Assessing Empirical Evaluations
by Stephen M. Blackburn, Amer Diwan, Matthias Hauswirth, Peter F. Sweeney, et al.
ACM Transactions on Programming Languages and Systems

This paper isn't as concrete and helpful as the previous one. It identifies three sins of reasoning (think of them as threats to validity) and two sins of exposition (think of them as bad writing). The advice is focused on run-time systems, such as the performance of compiler optimizations.

Much of the advice is common sense, such as not throwing out data points. But an example is given where showing the best of 30 runs paints a very different story than showing the mean and a confidence interval (and that graph would be less misleading if its y-axis went all the way down to 0 rather than running only from 9 to 12.5); see the sketch below. There aren't a lot of examples, perhaps to avoid giving offense. One example is that the size of the environment (i.e., the number of environment variables) can have a big impact on the speedup achieved by an optimizing compiler. Nonetheless, it's good to be reminded of these things so that you don't fall prey to them.
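To make the best-run-versus-mean point concrete, here is a minimal sketch (all timings are invented) of reporting the mean and a rough 95% confidence interval instead of the best of N runs:

    # Hypothetical timings (seconds) for 10 runs of the same benchmark; all numbers invented.
    import math
    import statistics

    runs = [11.8, 12.1, 9.4, 11.6, 12.3, 11.9, 12.0, 11.7, 12.2, 11.5]

    best = min(runs)
    mean = statistics.mean(runs)
    stdev = statistics.stdev(runs)  # sample standard deviation
    # Rough 95% confidence interval using the normal approximation;
    # for a sample this small, a t-interval would be more appropriate.
    half_width = 1.96 * stdev / math.sqrt(len(runs))

    print(f"best of {len(runs)} runs: {best:.2f} s")
    print(f"mean of {len(runs)} runs: {mean:.2f} s +/- {half_width:.2f} s (95% CI)")

The best run (9.4 s) looks much faster than the typical run, which is exactly how a "best of N" plot can mislead.
===========================================================================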