===========================================================================

Invariant Inference for Static Checking: An Empirical Evaluation

Possible points of discussion:

Size of programs:
* These programs are tiny -- 60-120 lines of code including the ADT and client. The paper claims that verification is modular, so results on small programs should translate to larger ones that will be verified class-by-class. Do we buy that?
* It took 45+ minutes to prove these tiny programs free of null pointer exceptions and array bounds errors. Is the verification technique so impractical as to be uninteresting?
* No success on the QueueAr program means the experimenters totally screwed up with their pilot experiments. Maybe they should have spent more than 5 minutes writing the test suite.
* The experimenters probably should have used test suites that differed more from one another, rather than ones whose differences turned out to be statistically insignificant. If a great test suite would have reduced the time spent to 0, then the researchers should have been able to find a test suite that would have reduced the time spent in half.

The demographics table should have been for 33, not 39, participants.

Are the artificially constrained test suites compelling? (That is the main research question.) Should participants just have spent the 30 minutes to create a better test suite?

Design:
* Randomized partial factorial design (also known as fractional factorial design). This introduces possible confounds between a main effect and an interaction effect. Shouldn't the paper have told us the actual fractional design?
* "Participants could be assigned the same treatment on both trials.": Is that a good or a bad choice?

Statistical significance:
* Many differences that are not statistically significant are not reported -- not even means, which may be (correctly or misleadingly) suggestive. (We get a few details dribbled out elsewhere, such as section 4.4 saying that Daikon recall was 83% but Houdini recall was 100%, yet these were not statistically significant.) Is that a good or a bad choice?
* p = .10 significance level. There's nothing magical about p = .05, but it is the standard. Why use p = .10?

The key measure is success at the task. For users who are not successful, do you buy the usefulness of precision and recall, and their definitions? (One possible formulation is sketched at the end of this section.)

This study took up to 3 hours of each participant's time. That's long! (No compensation either.)

Were the research questions clear?
* Research questions are (implicitly) given at the beginning of each section that describes the results, which motivates why each metric is interesting.
* In general, a controlled experiment is necessary if there is a controversy in the research community. Absent a controversy, there isn't much need. The context of this paper is the deep skepticism of some researchers that an unsound, incomplete tool could be useful for a task, such as formal verification, that requires perfection: any incorrectness in the user-written annotations means that the program cannot be verified. The researchers' hypothesis was that people are resilient to being shown proposals that are incomplete and parts of which are incorrect.
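
For concreteness, here is a minimal sketch (in Python, with hypothetical annotation names) of one way precision and recall over annotation sets could be defined: precision as the fraction of the annotations a participant ended up with that the checker actually needs, and recall as the fraction of the needed annotations that the participant ended up with. The paper's exact definitions may differ; this is only to anchor the discussion.

    # Hypothetical precision/recall computation over annotation sets.
    def precision_recall(written, needed):
        """written: annotations the participant ended up with;
        needed: annotations the checker requires for verification."""
        correct = written & needed
        precision = len(correct) / len(written) if written else 0.0
        recall = len(correct) / len(needed) if needed else 0.0
        return precision, recall

    # Example: 5 of the 6 written annotations are needed, and 5 of the 6
    # needed annotations were written, so precision = recall = 5/6 ~ 0.83.
    p, r = precision_recall(
        written={"inv1", "inv2", "inv3", "inv4", "inv5", "spurious"},
        needed={"inv1", "inv2", "inv3", "inv4", "inv5", "missing"},
    )

Note that for a participant who never achieves verification, the "needed" set is itself a judgment call, which is part of why the usefulness of these metrics for unsuccessful users is debatable.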
===========================================================================

An Experimental Evaluation of Continuous Testing During Development

Possible points of discussion:

Motivation: running tests has a cost: remembering to run them, waiting for them, and the context switch out of and back into the task. Previous work showed that the more quickly you learn of a defect after its creation, the easier the bug is to fix. Hypothesis: faster notification leads to less wasted time. But faster notification may also lead to annoying distractions. (Popup windows everywhere would surely reduce productivity, and some researchers criticized the notion of continuous testing on the grounds that it would be more distracting than useful.)

Research questions: Does faster notification of test failures lead to faster completion times? To what extent is such an improvement due to continuous compilation, and to what extent is it due to benefits of continuous testing beyond those of continuous compilation? (A minimal sketch of the continuous-testing idea appears after this section.)

Quasi-experiment: "We could control neither time worked nor whether students finished their assignments, but we measured both quantities via monitoring logs and class records."

"A few students in the treatment groups chose to ignore the tools and thus gained no benefit from them.": should they have been removed from the study and the statistics?

Threats to validity:
* Rank novices: largely new to Java, in their first Java programming course.
* All tests were provided (does this simulate test-first development?).

Statistics:
* 20 variables as predictors: is this too many? Is this shopping around for statistically significant effects?

No effect on time worked, though the main hypothesis was that students would work less time, because previous work had predicted that 10-15% time savings were possible. (The means refute the hypothesis, but they differ by only 5% and are not statistically significant.) Students seem to set a time budget for their homework.
* "I completed the assignment faster" got 4.6/7 (Likert scale) for continuous compilation and 5.5/7 for continuous testing. Yet this belief was wrong for both.

===========================================================================
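
For concreteness, a minimal sketch of the continuous-testing idea discussed above: a background loop that reruns the test suite whenever a source file changes, so the developer learns of a failure moments after introducing it. This is not the authors' tool; the polling approach, the "*.java" pattern, and the Maven test command are placeholders for whatever a real project and editor integration would use.

    import subprocess
    import time
    from pathlib import Path

    def snapshot(root):
        """Map each source file to its last-modified time."""
        return {p: p.stat().st_mtime for p in Path(root).rglob("*.java")}

    def watch(root=".", test_cmd=("mvn", "-q", "test"), poll_secs=2):
        """Rerun test_cmd whenever a file under root is added, removed, or saved."""
        seen = snapshot(root)
        while True:
            time.sleep(poll_secs)
            current = snapshot(root)
            if current != seen:
                seen = current
                result = subprocess.run(test_cmd)
                status = "PASS" if result.returncode == 0 else "FAIL"
                print(f"[continuous testing] test suite: {status}")

    # watch()  # blocks; run in a terminal alongside the editor

The cost/benefit question in the discussion is visible even in this sketch: the sooner the loop reports a failure, the sooner the developer can act on it, but every report is also a potential interruption.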