Pros and cons of studies with students

If your study is on students, do your results generalize to professionals?  We discussed ways in which students are different from and similar to professional programmers.  The key takeaway is that a user evaluation usually compares a control group to a treatment group, in order to determine the effect of the treatment on the subjects.  [We spent too much time on this and should have reserved more time for example experimental designs at the end of class.]  We tied this into threats to validity.

Whether the student/professional difference matters depends on your research question:
* quality of university education (the treatment is experience, and the question is whether it helps -- whether there are things the university wasn't able to provide)
* effectiveness of a tool

===========================================================================

Writing a proposal

Think of this as a draft of a paper, with parts not yet filled in.

Parts to include:
* research question and/or hypothesis
* what is most novel or interesting about the thing you have proposed?
* how to isolate that research question
* what is the control group?  Is it explicit or implicit?
* use realistic tasks, and argue that your tasks are realistic in relevant ways
* related work
* why the results are actionable

Be concrete!
* Methodology section: explain what you will do.
* Results section: include (empty) tables and charts (see the sketch just after this list).
  * You must know what you plan to measure, or you will not successfully measure it!
  * You must know what outcomes you expect from the evaluation, or you won't collect useful data.
* Give the approach in as much detail as possible.
  * This permits evaluation.
  * Just writing it down will help you see flaws in it!
* For reproducibility: you must give enough information for someone else to do exactly what you have done.
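To make "include (empty) tables and charts" concrete, here is a minimal sketch, in Python, of the skeleton of a results table for a hypothetical tool-vs-control study.  The measures and conditions are invented placeholders, not taken from any study discussed in these notes; substitute whatever you actually plan to measure.

    # Skeleton of a results table for a hypothetical tool-vs-control study.
    # The measures and conditions are placeholders for illustration only.
    import pandas as pd

    measures = ["task completion time (min)", "tasks completed", "defects introduced"]
    conditions = ["control (no tool)", "treatment (with tool)"]

    # One row per measure, one column per condition; the cells stay empty
    # until the study has been run.
    results = pd.DataFrame(index=measures, columns=conditions)
    print(results)

If you cannot name the rows and columns of such a table before running the study, you do not yet know what you plan to measure.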
Parts can be bullet points at this point, as long as the key parts are clear and complete.  Polished text is not important, but ambiguity is not acceptable.

Later, you will want to set a schedule, perform a pilot study, gather more information, etc.  You don't need that information at this point; just describe the end goal.

When writing the proposal, it may be helpful to think about the Heilmeier catechism.
* Parts of it are relevant to a user evaluation, but other parts aren't, because they are focused on the bigger picture.
  * Your evaluation would be just part of a larger overall project.
* The contrast to previous ways of doing things is important for isolating the question and having a control.

===========================================================================

Here are some example questions that you could study.  We considered about 4 of these, going through the process of sketching an experiment: research questions/hypotheses, isolating them, actionability, etc.

UIs:
* WYSIWYG vs. explicit markup
* smartphone vs. computer efficiency
* effect of errors on user behavior -- how many is too many?
* using a tool within vs. outside an IDE
* effect of quick fixes

Programming languages:
* typed vs. untyped programming languages
* functional vs. procedural vs. OO design
* for students, OO-first vs. procedural-first instruction
* Do programmers understand the concepts underlying their tools, or are they just blindly trying to eliminate errors?  What are ways to encourage understanding?

Testing:
* TDD (test-driven development) vs. a test-last strategy
* Are automatically-generated tests helpful?
* Artificially slow down test suites; how does programmer behavior change?

Debugging:
* Do fault localization tools help programmers?
* different tools for eliminating NPEs (null pointer exceptions)
* incremental vs. all-at-once approaches to annotating source code for verification

Maintenance:
* different techniques for getting up to speed on a project
* performing tasks with and without documentation/specifications/assertions

----------------

Try to be quantitative.  It is less compelling to me to just:
* give it to a user and watch them (this can be really useful! but the methodology matters less)

===========================================================================

Example user evaluations

I received only meager suggestions for papers about user evaluations, so we'll focus on papers I know about.

===========================================================================

Students vs. professionals

Most experiments are done on college undergrads.  This limits generalizability, but it is easy and cheap, and the population is homogeneous and comparable to that of other experiments.

Class-generated differences:

CS students:
* younger
* willing to do studies
* fewer demands on their time
* want experience
* eager to learn
* more formal education
* more theoretical style
* fewer existing habits

Programmers:
* more years of experience
* different kinds of experience:
  * bigger projects
  * bigger teams
  * full lifecycle (maintenance, planning)
  * incomplete requirements
* better work ethic
* more unpredictable and constrained schedule

===========================================================================

Comments on some papers about the experimental methodology of using students vs. professionals

In experiments, there is generally a control group and an experimental treatment group, and the purpose of the experiment is to determine how the treatment affects the performance of the subjects.  The absolute performance of the subjects generally does not matter -- the only thing that matters is the relative performance of the control and treatment groups.  Therefore, whether students perform the same as professional programmers is not relevant.  The only thing that matters is whether students respond the same to experimental treatments.  For example, if a methodology or tool improves students' performance, does it also improve professionals' performance?  If so, then use of students as subjects is valid, whether or not the students perform as well as the professionals on an absolute scale.

Unfortunately, some papers that compare students to professional programmers compare their absolute performance rather than, or in addition to, their improvement, when the improvement is what really matters.  Comparing the absolute performance of students to that of programmers seems useful only for evaluating the quality of the students' university training.
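In statistical terms, the question is whether there is a treatment-by-subject-type interaction, not whether there is a main effect of subject type.  Here is a minimal sketch of how that interaction could be tested; the file name and column names are hypothetical and are not taken from any of the papers below.

    # Sketch: test whether students and professionals respond *differently*
    # to a treatment, rather than whether they differ in absolute performance.
    # The file name and column names are hypothetical.
    import pandas as pd
    import statsmodels.formula.api as smf

    # Expected columns: score (a performance measure), treated (0/1),
    # professional (1 = professional, 0 = student).
    data = pd.read_csv("experiment.csv")

    # "treated * professional" expands to both main effects plus their
    # interaction; the treated:professional coefficient estimates how much
    # the treatment effect differs between professionals and students.
    model = smf.ols("score ~ treated * professional", data=data).fit()
    print(model.summary())

A treated:professional coefficient near zero means the two populations respond similarly to the treatment, which is what justifies using students as subjects; the professional coefficient by itself only reflects the absolute difference between the groups, which is usually beside the point.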
@Article{HostRW2000,
  author   = "H{\"o}st, Martin and Regnell, Bj{\"o}rn and Wohlin, Claes",
  title    = "Using Students As Subjects---A Comparative Study of Students and Professionals in Lead-Time Impact Assessment",
  journal  = JEmpiricalSE,
  year     = 2000,
  volume   = 5,
  number   = 3,
  pages    = "201--214",
  month    = nov,
  abstract = "In many studies in software engineering students are used instead of professional software developers, although the objective is to draw conclusions valid for professional software developers. This paper presents a study where the difference between the two groups is evaluated. People from the two groups have individually carried out a non-trivial software engineering judgement task involving the assessment of how ten different factors affect the lead-time of software development projects. It is found that the differences are only minor, and it is concluded that software engineering students may be used instead of professional software developers under certain conditions. These conditions are identified and described based on generally accepted criteria for validity evaluation of empirical studies.",
}

The abstract says, "It is found that the differences are only minor, and it is concluded that software engineering students may be used instead of professional software developers under certain conditions."  I don't see the "certain conditions" stated in the paper (even though the abstract promises them), though the paper does have a "threats to validity" section.

This study compares the capabilities of students to those of professional product managers, which doesn't matter if one is performing studies of the effect of a methodology or tool.  The task is estimating which factors affect a product's time-to-completion.  (When the paper says the groups carried out "a non-trivial software engineering task", it means they filled out a survey.)  Example factors are competence of the development team, product complexity, requirements stability, etc.  The students had not had any training in this task, but it is a core task for product managers.  (Actually, estimating schedules is the core task for product managers, but this experiment didn't concern itself with that task, only with the factors that might affect it.)

The authors determine the ground truth by looking at the 8 products that the professional subjects worked on.  The ground truth is bogus, since for only 6 of the 10 factors is the correlation as expected.  For example, the ground truth strongly indicates that the more competent the team, the longer the project takes to complete.  Probably there are other hidden factors at work, such as better programmers being given harder tasks, so we should generally ignore the paper's comparisons to the ground truth.

Participants were 26 students and 18 professionals; one outlier from each group was thrown out, leaving data for 25 students and 17 professionals.  Their estimates were very close: the groups' rankings of any of the 10 factors always differed by at most two places, and the Spearman correlation was .9 with a p-value of .007.  Given that the student and professional rankings were so similar, they were about equally "correct" -- that is, equally similar to the "ground truth".
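For reference, the Spearman correlation reported above is just the correlation of the two groups' rank orderings of the ten factors.  A minimal sketch, using made-up rankings rather than the paper's data:

    # Sketch: Spearman rank correlation between two rankings of the same
    # ten factors.  The rankings below are invented for illustration; they
    # are not the data from the Host et al. study.
    from scipy.stats import spearmanr

    # Rank that each group assigned to factors 1..10 (1 = strongest effect
    # on lead time).
    student_ranks      = [1, 2, 3, 5, 4, 6, 8, 7, 9, 10]
    professional_ranks = [2, 1, 3, 4, 5, 6, 7, 8, 10, 9]

    rho, p_value = spearmanr(student_ranks, professional_ranks)
    print(f"Spearman rho = {rho:.2f}, p = {p_value:.4f}")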
@InProceedings{Runeson2003,
  author    = "Runeson, Per",
  title     = "Using students as experiment subjects---An analysis on graduate and freshmen student data",
  booktitle = "Proceedings of the 7th International Conference on Empirical Assessment in Software Engineering",
  year      = 2003,
  pages     = "95--102",
  month     = apr # "~8--10,",
  address   = "Keele, UK",
  abstract  = "The question whether students can be used as subjects in software engineering experiments is debated. In order to investigate the feasibility of using students as subjects, a study is conducted in the context of the Personal Software Process (PSP) in which the performance of freshmen students and graduate students are compared and also related to another study in an industrial setting. The hypothesis is that graduate students perform similarly to industry personnel, while freshmen student's performance differ. A quantitative analysis compares the freshmen' and graduate students. The improvement trends are also compared to industry data, although limited data access does not allow a full comparison. It can be concluded that very much the same improvement trends can be identified for the three groups. However, the dispersion is larger in the freshmen group. The absolute levels of the measured characteristics are significantly different between the student groups primarily with respect to time, i.e. graduate students do the tasks in shorter time. The data does not give a sufficient answer to the hypothesis, but is a basis for further studies on the issue.",
}

Freshmen, graduate students, and professionals were given training in the Personal Software Process (PSP).  The paper's main hypothesis -- that grad students (at the end of their studies) are similar to professionals but dissimilar to undergrads (at the beginning of their studies) in terms of absolute performance -- is not confirmed.  The freshmen spent 47% more time than the graduate students, though overall there isn't enough data to confirm or refute the hypothesis.

For people wondering whether students are good experimental subjects, the measurement of improvement, or *relative* performance -- the effect of PSP on the participants -- is more interesting and relevant than *absolute* performance.  "The improvements are very much the same for the three groups."

"The number of defects does not differ between the groups."  The authors speculate that this is because the freshmen did not report all their defects.  (Why didn't the authors just test the code that the subjects produced, or examine the subjects' version control repositories?)

Metrics about code size are suspect, because the undergraduate students were allowed to use a list package (the paper says this affects tasks 1A, 4A, and 6A).

@InProceedings{SalmanTMJ2015,
  author    = "Iflaah Salman and Ay{\c{s}}e {Tosun Misirli} and Natalia Juristo",
  title     = "Are Students Representatives of Professionals in Software Engineering Experiments?",
  booktitle = ICSE2015,
  year      = 2015,
  pages     = "666--676",
  month     = ICSE2015date,
  address   = ICSE2015addr,
}

The paper's answer is: yes, use of students yields representative results when doing software engineering experiments.  The paper's conclusions may be true, but its evidence for them is not convincing.

Main conclusion: "Except for minor differences, neither of the subject groups is better than the other. Both subject groups perform similarly when they apply a new approach [namely test-driven development] for the first time."  Elsewhere, the paper's conclusion is stated differently: when using the familiar, standard test-last approach, professionals do better, but "students and professionals perform similarly when applying a development approach in which they are inexperienced."

Table I reports related work (6 papers).  It claims all the studies are experiments other than [9], which is a survey, but it seems to me that [5] is also a survey of opinions.  Two [6,10] found a difference between professionals and students -- both studies were about code inspections/reviews.  From the table, it seems that only HostRW2000 and Runeson2003 compare improvements, whereas the other 4 compare absolute performance.

In 15 studies of TDD (test-driven development), "13 studies reported no difference between TDD and the test-last approach in terms of code quality."  This paper measures static source code metrics, but it ignores more important issues such as correctness: "Whether the developed code was complete and correct is outside the scope of our study."

Whereas the paper repeatedly says "We measured the code quality", the authors really measured source code metrics, including number of lines of code, number of lines per method, cyclomatic complexity, and number of parameters, along with some things that might be coupled with actual quality, such as test coverage and lack-of-cohesion metrics.  The paper later calls these "internal code quality", which is a misnomer.
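To make concrete what kind of measurements such source code metrics are (and how far they are from measuring correctness), here is a minimal sketch that computes a few of them for Python source using the standard ast module.  This is purely illustrative: it is not the paper's metric suite or tooling, and the file name is a placeholder.

    # Sketch: a few "internal code quality" metrics of the kind the paper
    # reports, computed for Python source with the standard ast module.
    # This is illustrative only; it is not the paper's metric suite.
    import ast

    BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp)

    def source_metrics(source):
        tree = ast.parse(source)
        funcs = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
        return {
            "lines of code": len(source.splitlines()),
            "number of methods": len(funcs),
            "parameters per method": [len(f.args.args) for f in funcs],
            "lines per method": [f.end_lineno - f.lineno + 1 for f in funcs],
            # Rough cyclomatic complexity: 1 + the number of branching
            # constructs in the file.
            "cyclomatic complexity": 1 + sum(
                isinstance(n, BRANCH_NODES) for n in ast.walk(tree)),
        }

    # "example.py" is a placeholder file name.
    with open("example.py") as f:
        print(source_metrics(f.read()))

None of these numbers says whether the code behaves correctly, which is exactly the limitation the paper itself acknowledges.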
The experiment involved 21 professionals and 14 students.  Each subject first did one project using TLD (traditional test-last design); then the professionals did 2 projects using TDD and the students did 3 projects using TDD.  Each task was done on a different day, over the course of a week-long TDD training.  The first, TLD task could be considered a training task.

There are many differences between the experimental protocols for the professionals and the students, including which projects they used for TLD and TDD.  The assignment of projects was fixed for the professionals and random for the students; 3 of the projects were toy and 1 was difficult, so this is a serious concern that makes comparisons, especially of code size, less meaningful.

For TLD, the main differences between students and professionals are related to code size.  The paper repeatedly mentions that the professionals produced bigger methods with more operators and operands; but the professionals all worked on the same toy problem for TLD, whereas the students worked on 4 different tasks, 3 of which were toy and one of which was difficult.

The authors analyze the data separately for the TDD1 and TDD2 cases (the first and second TDD tasks); I'm not sure why.  For TDD1, the authors say that students and professionals "appear to act similarly during their first implementation" (when first exposed to TDD).

===========================================================================