===========================================================================
A practical guide to controlled experiments of software engineering tools with human participants
Andrew J. Ko and Thomas D. LaToza and Margaret M. Burnett
Empirical Software Engineering journal

Controlled experiments with human participants are rare, appearing in 2% of papers. The paper claims that there is "still a need for more". What would the right percentage be?

Establishing your research questions is more important than most of the issues in this paper, but is out of scope for it.

"Improved understanding" of a phenomenon is a non-goal for researchers, and "improved understanding" of a program is a non-goal for programmers. In general, scientific research should have an actionable outcome.

The paper says the key tradeoff is control vs. artificiality. I view narrowness and cost as other important tradeoffs: controlled experiments are narrow and often more costly.

Human Subjects Board (HSB) or Institutional Review Board (IRB) approval: We'll skip over this because there is a blanket exception for coursework. Just know that long delays can be a pain.
* You still need to get informed consent from your subjects, permit them to opt out of tasks, etc.
* I will act as your IRB.

The paper underemphasizes the importance of doing pilot studies. It merely recommends them, but you should *always* do them, and you should not report their results in your paper unless your design was perfect and you made no changes afterward. Likewise, you should always do a "sandbox pilot" beforehand, in which you the experimenter try the tasks. Again, this should not be reported; it is just assumed that you did it.

Ceiling effects occur when participants run out of time to complete their task. If this happens, then timings become compressed (many end up at the same value) and are less useful -- and you may not be able to discriminate based on how many people succeeded at the task either.
* I feel you should usually aim for most people to complete the task. This is the opposite of the paper's advice.

Within-subjects counterbalancing: This is when each participant performs multiple tasks. That makes it possible to estimate the skill of each individual, which removes the single biggest source of variance in most tasks and thus makes the experiment more sensitive to the effect of the treatment. Note that learning/fatigue matters: subjects usually do better on their second task, but sometimes worse, so randomize the order in which tasks are done too.

Training: Subjects should always be trained on the tool, and the training should include using the tool on the kind of tasks being assessed. If you skip this, then you are essentially testing the tool's learnability by a rank novice rather than its usefulness, which is much more important.

Bias: Experimenters, when designing the experiment, tend to bias the tasks toward what the tool is good at.

End-to-end tasks are better than focused ones that ask a person to do some small part of a real task; they also demonstrate practical importance. For example, saving a few seconds on a task that is done rarely is not very important and would not show up as statistically significant in an end-to-end task. I prefer "found" tasks, such as fixing a real bug that was reported in an issue tracker. Be sure to describe the scope of your tasks, much as you do with research questions. There will inherently be unrealism in your task. What types of unrealism are most important?

Precision and recall are the standard measures for an information retrieval task. The F-measure is their harmonic mean, which can be thought of as a kind of average of the two.
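As a concrete illustration, here is a minimal Python sketch of precision, recall, and F1 for a tool that reports a set of candidate items (say, suspected fault locations) scored against a known ground truth. The function name and the example sets are made up for this note, not taken from the paper.

    # Hypothetical illustration: precision, recall, and F-measure (F1) for a tool
    # that reports a set of items, scored against a ground-truth set.
    def precision_recall_f1(reported, relevant):
        reported, relevant = set(reported), set(relevant)
        true_positives = len(reported & relevant)
        precision = true_positives / len(reported) if reported else 0.0
        recall = true_positives / len(relevant) if relevant else 0.0
        if precision + recall == 0:
            return precision, recall, 0.0
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
        return precision, recall, f1

    # The tool flags 4 locations; 3 of them are among the 6 real faults.
    print(precision_recall_f1({"a", "b", "c", "d"}, {"a", "b", "c", "e", "f", "g"}))
    # -> (0.75, 0.5, 0.6)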
Never trust the subject: always evaluate and measure what the subject did rather than taking their word for it, even if some subjectivity is involved.

With regard to software development, we generally care about
* correctness
* performance
* cost (= mostly human development time, but don't forget maintenance costs!)

===========================================================================
The Truth, the Whole Truth, and Nothing but the Truth: A Pragmatic Guide to Assessing Empirical Evaluations
by Stephen M. Blackburn, Amer Diwan, Matthias Hauswirth, Peter F. Sweeney, et al.
ACM Transactions on Programming Languages and Systems

This paper isn't as concrete and helpful as the previous one. It identifies three sins of reasoning (think of them as threats to validity) and two sins of exposition (think of them as bad writing). The advice is focused on run-time systems, such as the performance of compiler optimizations.

Much of the advice is common sense, such as not throwing out data points. But an example is given where showing the best of 30 runs paints a very different story than showing the mean and a confidence interval (and that graph would be less misleading if its y-axis went all the way down to 0 rather than running only from 9 to 12.5); see the sketch below. There aren't a lot of examples, perhaps to avoid giving offense. One example is that the size of the environment (i.e., the number of environment variables) can have a big impact on the speedup achieved by an optimizing compiler. Nonetheless, it's good to be reminded of these things so that you don't fall prey to them.
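To make the best-run-versus-mean point concrete, here is a minimal sketch (all timings are invented) of reporting the mean and a rough 95% confidence interval instead of the best of N runs:

    # Hypothetical timings (seconds) for 10 runs of the same benchmark; all numbers invented.
    import math
    import statistics

    runs = [11.8, 12.1, 9.4, 11.6, 12.3, 11.9, 12.0, 11.7, 12.2, 11.5]

    best = min(runs)
    mean = statistics.mean(runs)
    stdev = statistics.stdev(runs)  # sample standard deviation
    # Rough 95% confidence interval using the normal approximation;
    # for a sample this small, a t-interval would be more appropriate.
    half_width = 1.96 * stdev / math.sqrt(len(runs))

    print(f"best of {len(runs)} runs: {best:.2f} s")
    print(f"mean of {len(runs)} runs: {mean:.2f} s +/- {half_width:.2f} s (95% CI)")

The best run (9.4 s) looks much faster than the typical run, which is exactly how a "best of N" plot can mislead.
===========================================================================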