===========================================================================

Invariant Inference for Static Checking: An Empirical Evaluation

Possible points of discussion:

Size of programs:
* These programs are tiny -- 60-120 lines of code including the ADT and client. The paper claims that verification is modular, so results on small programs should translate to larger ones that will be verified class-by-class. Do we buy that?
* It took 45+ minutes to prove these tiny programs free of null pointer exceptions and array bounds errors. Is the verification technique so impractical as to be uninteresting?
* No success on the QueueAr program means the experimenters totally screwed up with their pilot experiments. Maybe they should have spent more than 5 minutes writing the test suite.
* The experimenters probably should have used test suites that differed more from one another, rather than ones whose differences turned out to be statistically insignificant. If a great test suite would have reduced the time spent to 0, then the researchers should have been able to find a test suite that would have reduced the time spent in half.

The demographics table should have been for 33, not 39, participants.

Are the artificially constrained test suites compelling? (That is the main research question.) Should participants just have spent the 30 minutes to create a better test suite?

Design:
* Randomized partial factorial design (also known as fractional factorial design). This introduces possible confounds between a main effect and an interaction effect. Shouldn't the paper have told us the actual fractional design?
* "Participants could be assigned the same treatment on both trials.": Is that a good or a bad choice?

Statistical significance:
* Many differences that are not statistically significant are not reported -- not even means, which may be (correctly or misleadingly) suggestive. (We get a few details dribbled out elsewhere, such as section 4.4 saying that Daikon recall was 83% but Houdini recall was 100%, yet these were not statistically significant.) Is that a good or a bad choice?
* p = .10 significance level. There's nothing magical about p = .05, but it is the standard. Why use p = .10?

The key measure is success at the task. For users who are not successful, do you buy the usefulness of precision and recall, and their definitions? (One possible formulation is sketched at the end of this section.)

This study took up to 3 hours of each participant's time. That's long! (No compensation either.)

Were the research questions clear?
* Research questions are (implicitly) given at the beginning of each section that describes the results, which motivates why each metric is interesting.
* In general, a controlled experiment is necessary if there is a controversy in the research community. Absent a controversy, there isn't much need. The context of this paper is the deep skepticism of some researchers that an unsound, incomplete tool could be useful for a task, such as formal verification, that requires perfection: any incorrectness in the user-written annotations means that the program cannot be verified. The researchers' hypothesis was that people are resilient to being shown proposals that are incomplete and parts of which are incorrect.
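
For concreteness, here is a minimal sketch (in Python, with hypothetical annotation names) of one way precision and recall over annotation sets could be defined: precision as the fraction of the annotations a participant ended up with that the checker actually needs, and recall as the fraction of the needed annotations that the participant ended up with. The paper's exact definitions may differ; this is only to anchor the discussion.

    # Hypothetical precision/recall computation over annotation sets.
    def precision_recall(written, needed):
        """written: annotations the participant ended up with;
        needed: annotations the checker requires for verification."""
        correct = written & needed
        precision = len(correct) / len(written) if written else 0.0
        recall = len(correct) / len(needed) if needed else 0.0
        return precision, recall

    # Example: 5 of the 6 written annotations are needed, and 5 of the 6
    # needed annotations were written, so precision = recall = 5/6 ~ 0.83.
    p, r = precision_recall(
        written={"inv1", "inv2", "inv3", "inv4", "inv5", "spurious"},
        needed={"inv1", "inv2", "inv3", "inv4", "inv5", "missing"},
    )

Note that for a participant who never achieves verification, the "needed" set is itself a judgment call, which is part of why the usefulness of these metrics for unsuccessful users is debatable.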
===========================================================================

An Experimental Evaluation of Continuous Testing During Development

Possible points of discussion:

Motivation: running tests has a cost: remembering to run them, waiting for them, and the context switch out of and back into the task. Previous work showed that the more quickly you learn of a defect after its creation, the easier the bug is to fix. Hypothesis: faster notification leads to less wasted time. But faster notification may also lead to annoying distractions. (Popup windows everywhere would surely reduce productivity, and some researchers criticized the notion of continuous testing on the grounds that it would be more distracting than useful.)

Research questions: Does faster notification of test failures lead to faster completion times? To what extent is such an improvement due to continuous compilation, and to what extent is it due to benefits of continuous testing beyond those of continuous compilation? (A minimal sketch of the continuous-testing idea appears after this section.)

Quasi-experiment: "We could control neither time worked nor whether students finished their assignments, but we measured both quantities via monitoring logs and class records."

"A few students in the treatment groups chose to ignore the tools and thus gained no benefit from them.": should they have been removed from the study and the statistics?

Threats to validity:
* Rank novices: largely new to Java, in their first Java programming course.
* All tests were provided (does this simulate test-first development?).

Statistics:
* 20 variables as predictors: is this too many? Is this shopping around for statistically significant effects?

No effect on time worked, though the main hypothesis was that students would work less time, because previous work had predicted that 10-15% time savings were possible. (The means refute the hypothesis, but they differ by only 5% and are not statistically significant.) Students seem to set a time budget for their homework.
* "I completed the assignment faster" got 4.6/7 (Likert scale) for continuous compilation and 5.5/7 for continuous testing. Yet this belief was wrong for both.

===========================================================================
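
For concreteness, a minimal sketch of the continuous-testing idea discussed above: a background loop that reruns the test suite whenever a source file changes, so the developer learns of a failure moments after introducing it. This is not the authors' tool; the polling approach, the "*.java" pattern, and the Maven test command are placeholders for whatever a real project and editor integration would use.

    import subprocess
    import time
    from pathlib import Path

    def snapshot(root):
        """Map each source file to its last-modified time."""
        return {p: p.stat().st_mtime for p in Path(root).rglob("*.java")}

    def watch(root=".", test_cmd=("mvn", "-q", "test"), poll_secs=2):
        """Rerun test_cmd whenever a file under root is added, removed, or saved."""
        seen = snapshot(root)
        while True:
            time.sleep(poll_secs)
            current = snapshot(root)
            if current != seen:
                seen = current
                result = subprocess.run(test_cmd)
                status = "PASS" if result.returncode == 0 else "FAIL"
                print(f"[continuous testing] test suite: {status}")

    # watch()  # blocks; run in a terminal alongside the editor

The cost/benefit question in the discussion is visible even in this sketch: the sooner the loop reports a failure, the sooner the developer can act on it, but every report is also a potential interruption.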