Pros and cons of studies with students

If your study is on students, do your results generalize to professionals?  We discussed ways in which students are different from and similar to professional programmers.  The key takeaway is that a user evaluation usually compares a control group to a treatment group, in order to determine the effect of the treatment on the subjects.  [We spent too much time on this and should have reserved more time for example experimental designs at the end of class.]  We tied this into threats to validity.

Whether the student/professional difference matters depends on your research question:
* quality of university education (the treatment is experience, and the question is whether it helps -- whether there are things the university wasn't able to provide)
* effectiveness of a tool

===========================================================================

Writing a proposal

Think of this as a draft of a paper, with parts not yet filled in.

Parts to include:
* research question and/or hypothesis
* what is most novel or interesting about the thing you have proposed?
* how to isolate that research question
* what is the control group?  Is it explicit or implicit?
* use realistic tasks, and argue that your tasks are realistic in relevant ways
* related work
* why the results are actionable

Be concrete!
* Methodology section: explain what you will do.
* Results section: include (empty) tables and charts (see the sketch just after this list).
  * You must know what you plan to measure, or you will not successfully measure it!
  * You must know what outcomes you expect from the evaluation, or you won't collect useful data.
* Give the approach in as much detail as possible.
  * This permits evaluation.
  * Just writing it down will help you see flaws in it!
* For reproducibility: you must give enough information for someone else to do exactly what you have done.
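To make "include (empty) tables and charts" concrete, here is a minimal sketch, in Python, of the skeleton of a results table for a hypothetical tool-vs-control study.  The measures and conditions are invented placeholders, not taken from any study discussed in these notes; substitute whatever you actually plan to measure.

    # Skeleton of a results table for a hypothetical tool-vs-control study.
    # The measures and conditions are placeholders for illustration only.
    import pandas as pd

    measures = ["task completion time (min)", "tasks completed", "defects introduced"]
    conditions = ["control (no tool)", "treatment (with tool)"]

    # One row per measure, one column per condition; the cells stay empty
    # until the study has been run.
    results = pd.DataFrame(index=measures, columns=conditions)
    print(results)

If you cannot name the rows and columns of such a table before running the study, you do not yet know what you plan to measure.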
Parts can be bullet points at this point, as long as the key parts are clear and complete.  Polished text is not important, but ambiguity is not acceptable.

Later, you will want to set a schedule, perform a pilot study, gather more information, etc.  You don't need that information at this point; just describe the end goal.

When writing the proposal, it may be helpful to think about the Heilmeier catechism.
* Parts of it are relevant to a user evaluation, but other parts aren't, because they are focused on the bigger picture.
  * Your evaluation would be just part of a larger overall project.
* The contrast to previous ways of doing things is important for isolating the question and having a control.

===========================================================================

Here are some example questions that you could study.  We considered about 4 of these, going through the process of sketching an experiment: research questions/hypotheses, isolating them, actionability, etc.

UIs:
* WYSIWYG vs. explicit markup
* smartphone vs. computer efficiency
* effect of errors on user behavior -- how many is too many?
* using a tool within vs. outside an IDE
* effect of quick fixes

Programming languages:
* typed vs. untyped programming languages
* functional vs. procedural vs. OO design
* for students, OO-first vs. procedural-first instruction
* Do programmers understand the concepts underlying their tools, or are they just blindly trying to eliminate errors?  What are ways to encourage understanding?

Testing:
* TDD (test-driven development) vs. a test-last strategy
* Are automatically-generated tests helpful?
* Artificially slow down test suites; how does programmer behavior change?

Debugging:
* Do fault localization tools help programmers?
* different tools for eliminating NPEs (null pointer exceptions)
* incremental vs. all-at-once approaches to annotating source code for verification

Maintenance:
* different techniques for getting up to speed on a project
* performing tasks with and without documentation/specifications/assertions

----------------

Try to be quantitative.  It is less compelling to me to just:
* give it to a user and watch them (this can be really useful! but the methodology matters less)

===========================================================================

Example user evaluations

I received only meager suggestions for papers about user evaluations, so we'll focus on papers I know about.

===========================================================================

Students vs. professionals

Most experiments are done on college undergrads.  This limits generalizability, but it is easy and cheap, and the population is homogeneous and comparable to that of other experiments.

Class-generated differences:

CS students:
* younger
* willing to do studies
* fewer demands on their time
* want experience
* eager to learn
* more formal education
* more theoretical style
* fewer existing habits

Programmers:
* more years of experience
* different kinds of experience:
  * bigger projects
  * bigger teams
  * full lifecycle (maintenance, planning)
  * incomplete requirements
* better work ethic
* more unpredictable and constrained schedule

===========================================================================

Comments on some papers about the experimental methodology of using students vs. professionals

In experiments, there is generally a control group and an experimental treatment group, and the purpose of the experiment is to determine how the treatment affects the performance of the subjects.  The absolute performance of the subjects generally does not matter -- the only thing that matters is the relative performance of the control and treatment groups.  Therefore, whether students perform the same as professional programmers is not relevant.  The only thing that matters is whether students respond the same to experimental treatments.  For example, if a methodology or tool improves students' performance, does it also improve professionals' performance?  If so, then use of students as subjects is valid, whether or not the students perform as well as the professionals on an absolute scale.

Unfortunately, some papers that compare students to professional programmers compare their absolute performance rather than, or in addition to, their improvement, when the improvement is what really matters.  Comparing the absolute performance of students to that of programmers seems useful only for evaluating the quality of the students' university training.
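In statistical terms, the question is whether there is a treatment-by-subject-type interaction, not whether there is a main effect of subject type.  Here is a minimal sketch of how that interaction could be tested; the file name and column names are hypothetical and are not taken from any of the papers below.

    # Sketch: test whether students and professionals respond *differently*
    # to a treatment, rather than whether they differ in absolute performance.
    # The file name and column names are hypothetical.
    import pandas as pd
    import statsmodels.formula.api as smf

    # Expected columns: score (a performance measure), treated (0/1),
    # professional (1 = professional, 0 = student).
    data = pd.read_csv("experiment.csv")

    # "treated * professional" expands to both main effects plus their
    # interaction; the treated:professional coefficient estimates how much
    # the treatment effect differs between professionals and students.
    model = smf.ols("score ~ treated * professional", data=data).fit()
    print(model.summary())

A treated:professional coefficient near zero means the two populations respond similarly to the treatment, which is what justifies using students as subjects; the professional coefficient by itself only reflects the absolute difference between the groups, which is usually beside the point.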
@Article{HostRW2000,
  author   = "H{\"o}st, Martin and Regnell, Bj{\"o}rn and Wohlin, Claes",
  title    = "Using Students As Subjects---A Comparative Study of Students and Professionals in Lead-Time Impact Assessment",
  journal  = JEmpiricalSE,
  year     = 2000,
  volume   = 5,
  number   = 3,
  pages    = "201--214",
  month    = nov,
  abstract = "In many studies in software engineering students are used instead of professional software developers, although the objective is to draw conclusions valid for professional software developers. This paper presents a study where the difference between the two groups is evaluated. People from the two groups have individually carried out a non-trivial software engineering judgement task involving the assessment of how ten different factors affect the lead-time of software development projects. It is found that the differences are only minor, and it is concluded that software engineering students may be used instead of professional software developers under certain conditions. These conditions are identified and described based on generally accepted criteria for validity evaluation of empirical studies.",
}

The abstract says, "It is found that the differences are only minor, and it is concluded that software engineering students may be used instead of professional software developers under certain conditions."  I don't see the "certain conditions" stated in the paper (even though the abstract promises them), though the paper does have a "threats to validity" section.

This study compares the capabilities of students to those of professional product managers, which doesn't matter if one is performing studies of the effect of a methodology or tool.  The task is estimating which factors affect a product's time-to-completion.  (When the paper says the groups carried out "a non-trivial software engineering task", it means they filled out a survey.)  Example factors are competence of the development team, product complexity, requirements stability, etc.  The students had not had any training in this task, but it is a core task for product managers.  (Actually, estimating schedules is the core task for product managers, but this experiment didn't concern itself with that task, only with the factors that might affect it.)

The authors determine the ground truth by looking at the 8 products that the professional subjects worked on.  The ground truth is bogus, since for only 6 of the 10 factors is the correlation as expected.  For example, the ground truth strongly indicates that the more competent the team, the longer the project takes to complete.  Probably there are other hidden factors at work, such as better programmers being given harder tasks, so we should generally ignore the paper's comparisons to the ground truth.

Participants were 26 students and 18 professionals; one outlier from each group was thrown out, leaving data for 25 students and 17 professionals.  Their estimates were very close: the groups' rankings of any of the 10 factors always differed by at most two places, and the Spearman correlation was .9 with a p-value of .007.  Given that the student and professional rankings were so similar, they were about equally "correct" -- that is, equally similar to the "ground truth".
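For reference, the Spearman correlation reported above is just the correlation of the two groups' rank orderings of the ten factors.  A minimal sketch, using made-up rankings rather than the paper's data:

    # Sketch: Spearman rank correlation between two rankings of the same
    # ten factors.  The rankings below are invented for illustration; they
    # are not the data from the Host et al. study.
    from scipy.stats import spearmanr

    # Rank that each group assigned to factors 1..10 (1 = strongest effect
    # on lead time).
    student_ranks      = [1, 2, 3, 5, 4, 6, 8, 7, 9, 10]
    professional_ranks = [2, 1, 3, 4, 5, 6, 7, 8, 10, 9]

    rho, p_value = spearmanr(student_ranks, professional_ranks)
    print(f"Spearman rho = {rho:.2f}, p = {p_value:.4f}")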
@InProceedings{Runeson2003,
  author    = "Runeson, Per",
  title     = "Using students as experiment subjects---An analysis on graduate and freshmen student data",
  booktitle = "Proceedings of the 7th International Conference on Empirical Assessment in Software Engineering",
  year      = 2003,
  pages     = "95--102",
  month     = apr # "~8--10,",
  address   = "Keele, UK",
  abstract  = "The question whether students can be used as subjects in software engineering experiments is debated. In order to investigate the feasibility of using students as subjects, a study is conducted in the context of the Personal Software Process (PSP) in which the performance of freshmen students and graduate students are compared and also related to another study in an industrial setting. The hypothesis is that graduate students perform similarly to industry personnel, while freshmen student's performance differ. A quantitative analysis compares the freshmen' and graduate students. The improvement trends are also compared to industry data, although limited data access does not allow a full comparison. It can be concluded that very much the same improvement trends can be identified for the three groups. However, the dispersion is larger in the freshmen group. The absolute levels of the measured characteristics are significantly different between the student groups primarily with respect to time, i.e. graduate students do the tasks in shorter time. The data does not give a sufficient answer to the hypothesis, but is a basis for further studies on the issue.",
}

Freshmen, graduate students, and professionals were given training in the Personal Software Process (PSP).  The paper's main hypothesis -- that grad students (at the end of their studies) are similar to professionals but dissimilar to undergrads (at the beginning of their studies) in terms of absolute performance -- is not confirmed.  The freshmen spent 47% more time than the graduate students, though overall there isn't enough data to confirm or refute the hypothesis.

For people wondering whether students are good experimental subjects, the measurement of improvement, or *relative* performance -- the effect of PSP on the participants -- is more interesting and relevant than *absolute* performance.  "The improvements are very much the same for the three groups."

"The number of defects does not differ between the groups."  The authors speculate that this is because the freshmen did not report all their defects.  (Why didn't the authors just test the code that the subjects produced, or examine the subjects' version control repositories?)

Metrics about code size are suspect, because the undergraduate students were allowed to use a list package (the paper says this affects tasks 1A, 4A, and 6A).

@InProceedings{SalmanTMJ2015,
  author    = "Iflaah Salman and Ay{\c{s}}e {Tosun Misirli} and Natalia Juristo",
  title     = "Are Students Representatives of Professionals in Software Engineering Experiments?",
  booktitle = ICSE2015,
  year      = 2015,
  pages     = "666--676",
  month     = ICSE2015date,
  address   = ICSE2015addr,
}

The paper's answer is: yes, use of students yields representative results when doing software engineering experiments.  The paper's conclusions may be true, but its evidence for them is not convincing.

Main conclusion: "Except for minor differences, neither of the subject groups is better than the other. Both subject groups perform similarly when they apply a new approach [namely test-driven development] for the first time."  Elsewhere, the paper's conclusion is stated differently: when using the familiar, standard test-last approach, professionals do better, but "students and professionals perform similarly when applying a development approach in which they are inexperienced."

Table I reports related work (6 papers).  It claims all the studies are experiments other than [9], which is a survey, but it seems to me that [5] is also a survey of opinions.  Two [6,10] found a difference between professionals and students -- both studies were about code inspections/reviews.  From the table, it seems that only HostRW2000 and Runeson2003 compare improvements, whereas the other 4 compare absolute performance.

In 15 studies of TDD (test-driven development), "13 studies reported no difference between TDD and the test-last approach in terms of code quality."  This paper measures static source code metrics, but it ignores more important issues such as correctness: "Whether the developed code was complete and correct is outside the scope of our study."

Whereas the paper repeatedly says "We measured the code quality", the authors really measured source code metrics, including number of lines of code, number of lines per method, cyclomatic complexity, and number of parameters, along with some things that might be coupled with actual quality, such as test coverage and lack-of-cohesion metrics.  The paper later calls these "internal code quality", which is a misnomer.
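To make concrete what kind of measurements such source code metrics are (and how far they are from measuring correctness), here is a minimal sketch that computes a few of them for Python source using the standard ast module.  This is purely illustrative: it is not the paper's metric suite or tooling, and the file name is a placeholder.

    # Sketch: a few "internal code quality" metrics of the kind the paper
    # reports, computed for Python source with the standard ast module.
    # This is illustrative only; it is not the paper's metric suite.
    import ast

    BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp)

    def source_metrics(source):
        tree = ast.parse(source)
        funcs = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
        return {
            "lines of code": len(source.splitlines()),
            "number of methods": len(funcs),
            "parameters per method": [len(f.args.args) for f in funcs],
            "lines per method": [f.end_lineno - f.lineno + 1 for f in funcs],
            # Rough cyclomatic complexity: 1 + the number of branching
            # constructs in the file.
            "cyclomatic complexity": 1 + sum(
                isinstance(n, BRANCH_NODES) for n in ast.walk(tree)),
        }

    # "example.py" is a placeholder file name.
    with open("example.py") as f:
        print(source_metrics(f.read()))

None of these numbers says whether the code behaves correctly, which is exactly the limitation the paper itself acknowledges.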
The experiment involved 21 professionals and 14 students.  Each subject first did one project using TLD (traditional test-last design); then the professionals did 2 projects using TDD and the students did 3 projects using TDD.  Each task was done on a different day, over the course of a week-long TDD training.  The first, TLD task could be considered a training task.

There are many differences between the experimental protocols for the professionals and the students, including which projects they used for TLD and TDD.  The assignment of projects was fixed for the professionals and random for the students; 3 of the projects were toy and 1 was difficult, so this is a serious concern that makes comparisons, especially of code size, less meaningful.

For TLD, the main differences between students and professionals are related to code size.  The paper repeatedly mentions that the professionals produced bigger methods with more operators and operands; but the professionals all worked on the same toy problem for TLD, whereas the students worked on 4 different tasks, 3 of which were toy and one of which was difficult.

The authors analyze the data separately for the TDD1 and TDD2 cases (the first and second TDD tasks); I'm not sure why.  For TDD1, the authors say that students and professionals "appear to act similarly during their first implementation" (when first exposed to TDD).

===========================================================================