Usability evaluation considered harmful (some of the time)
Saul Greenberg and Bill Buxton
CHI 2008, pages 111-120

This paper is a bit discursive and in places is written in a disorganized fashion, but it makes some good points.

It says that most usability evaluations are performed in an unrealistic setting and therefore show an existence proof that does not generalize to real-world settings: "What most researchers then try to do -- often without being aware of it -- is to create a situation favorable to the new technique. The implicit logic is that they should be able to demonstrate at least one case where the new technique performs better than the old technique; if they cannot, then this technique is likely not worth pursuing. In other words, the usability evaluation is an existence proof."

The community ends up with mostly weak evaluations being published.
 * Initial studies don't tell us much: "first evaluations are typically confirmatory and thus weak."
 * Few replications are published, because reviewers don't consider them novel and don't believe they add much (given the initial published studies).

The paper distinguishes between a sketch and a prototype. It doesn't define "sketch" (but this term is apparently understood -- even if not precisely defined -- in the HCI literature). "By definition, a sketch -- even if implemented as an interactive system -- is a roughed out design. It will have many holes, deficiencies, and undeveloped attributes. In contrast, a prototype aids idea evaluation, either by validating it with clients as they try it out, or through usability testing." The paper defines a prototype as a near-final "approximation of a finished product" that users can use. By contrast, I view a prototype as a platform for experimentation that may be thrown away once the experiment is done. For instance, one might prototype part of a system to determine whether a library's performance is adequate, or whether two libraries are compatible with one another.

Usability vs. usefulness: these are orthogonal -- a product can have either one with or without the other. The paper (wrongly) gives the community a pass on evaluating usefulness: "Usefulness is a very difficult thing to evaluate". The paper does acknowledge that usefulness is more important than usability: "Usability evaluation is predisposed to the world changing by gradual evolution; iterative refinement will produce more usable systems, but not radically new ones." Therefore, it is incumbent on researchers to evaluate usefulness, regardless of what this paper says.

Guidance about whether to do usability evaluation:
 * One should NOT do usability evaluation "in very early design stages, in cases where usefulness overshadows usability, in instances where unpredictable cultural uptake dominates how an innovative system will actually be used."
 * "usability evaluation ... is appropriate for settings with well-known tasks and outcomes." A weakness of usability evaluations is that "they fail to consider how novel engineering innovations and systems will evolve and be adopted by a culture over time." The page of examples of this is not very compelling to me; for example, cars became useful only after there was infrastructure in the form of paved roads, gas stations, etc.
 * "There are many other aspects of user-centered design that are just as important: understanding requirements, considering cultural aspects, developing and showing clients design alternatives, affording new interface possibilities through technical innovations, and so on."

===========================================================================

Deliberate Delays During Robot-to-Human Handovers Improve Compliance With Gaze Communication
Henny Admoni and Anca Dragan and Siddhartha S. Srinivasa and Brian Scassellati
HRI 2014

This paper actually answers the following research question: if a human has already made a plan, can a robot mislead the human into violating the previously-made plan? The technique for doing the misleading is for the robot to look in a direction that is the opposite of where the human should be attending. This question could have been answered with a human confederate taking the place of the robot, and should have been, in order to provide a baseline for the robot results.

The paper is written as if it is answering completely different questions: whether robot communication via gaze is an effective way to communicate in noisy environments where auditory signals are not possible, and how to increase the likelihood that humans notice the robot's gaze. These aren't very interesting questions. The paper claims that "a natural conclusion is to communicate information about where to put the object using eye gaze", but in that case the paper should have compared gaze to other mechanisms for communicating a suggestion, such as shining a light on a bin or projecting an arrow that points at a bin.

When I talked to Sidd Srinivasa, he cast the paper in a third way: it's interesting for the military to create deceptive robots that can fool adversaries regarding their intentions, and this is a first step toward that goal.

This study does not seem relevant to real-world tasks where participants know how to interact with the robot. Of the 32 participants, 14 didn't notice the robot's gaze as a suggestion, even though they had been told that the robot would make a suggestion. So this study could be described as being about learning what a robot's cues mean. If the participants actually interacted with robots frequently and understood how the robot communicates, then the results might be quite different. "Participants were also told that HERB's head would move and that HERB may provide suggestions about how to sort the blocks, but that the final sorting method was up to them." (During the second that the robot was delaying the handover, some participants focused on pulling harder on the block to get it out of the robot's hand.)

In addition to the above concerns about generalizability, the experiment is unrealistic and uncalibrated. "In this task, a robot called HERB hands colored blocks to participants, who sort those blocks into one of two colored boxes according to their personal preference." The participants were told they could sort the blocks however they wanted. They were not asked what their goal was, nor assessed on whether they achieved that goal. It's also possible that the robot's suggestions changed that goal. The participants were not asked whether the robot's gaze affected their goal or their behavior. Therefore, there is no objective data to back up the paper's hypothesis that the robot caused the participants to under-perform their intended plan or to change their plan. It's also possible that the task was so vague that the participants were not taking it very seriously or never had a plan at all. How is this relevant to real-world tasks that have a goal? To answer these questions adequately, the researchers would need to ask the participants what their sorting strategy is.
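
If the paper wanted to back up its compliance claim objectively, a simple between-condition comparison of how often participants followed the counterintuitive suggestion would do. Here is a minimal sketch of such an analysis; the counts below are made up for illustration and are NOT the paper's data.

    # Hypothetical analysis: did the handover delay increase compliance with
    # the robot's counterintuitive suggestion?  All counts are invented.
    from scipy.stats import fisher_exact

    # 2x2 contingency table: rows = condition (delay vs. control),
    # columns = (followed suggestion, did not follow suggestion).
    table = [
        [9, 7],   # delay condition: 9 followed, 7 did not (hypothetical)
        [4, 12],  # control condition: 4 followed, 12 did not (hypothetical)
    ]

    odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
    print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")

Even a test like this would not address the deeper problem that participants' plans were never elicited, but it would at least make the compliance claim falsifiable.
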
It's likely that the researchers didn't want to ask the participants what their plan was, because that would fix it in their minds and they would be less likely to be misled by the robot's gaze. However, the authors should have said this. The experiment could also have been designed to have participants divide identical blocks into two bins, with the experimenters investigating how often the participants followed the robot's suggestions. All this is really a psychological experiment (but not a well-designed one) rather than having anything to do with robotics.

The paper's hypotheses: "We hypothesize that:
 * H1 A handover delay will cause people to pay more attention to HERB's gaze communication, and that
 * H2 Social gaze will lead people to comply more with HERB's counterintuitive suggestion."
H1 is supported and H2 is not supported. The paper's authors give reasons that H2 might be true despite not being supported in their experiment, but they don't spend as much time discussing why it might simply not be true, or why H1 might be false even though it was supported in the experiment. Other factors, such as social gaze (the altitude of the gaze: starting at the human's face ("joint attention") and staying at that level, or starting at the block ("mirrored gaze") and staying at that level), didn't have an effect.

Suppose that the robot gaze did affect participants' choices. Would that be a positive result because the robot has power (the paper casts it that way), or a negative result because the people performed worse on their task when interacting with a robot? (We do expect robots to make mistakes once in a while.) The paper doesn't discuss this.

Here are more details about the experiment. Handover is "the act of transferring an item from one actor to another". The paper uses handover as context but isn't really concerned with it. The question is whether drawing attention to the robot's gaze causes people to follow the suggestion that is implicit in the gaze. Human subjects were given blocks by a robot and asked to put the blocks in either a blue or a yellow bin. The blocks were blue, yellow, half-and-half, or 70% of one color. 32 participants were each handed 5 blocks, in order, by the robot. The robot always looked at one of the two bins. In the control condition, the robot started to turn its head as it let go of the block (when the user grasped the block). In the treatment condition, the robot delayed letting go of the block for 1 second (the amount of time it took to turn its head) after the user grasped the block. The question is whether more users followed the robot's suggestion when the delay had encouraged them to notice the suggestion. The robot's suggestion for the 70-30 colored block was always to put it in the bin of the opposite (minority) color, which is counterintuitive. The robot gripped each block on the minority color and always suggested putting the block in the bin matching the color it was gripping.

===========================================================================

An External Replication on the Effects of Test-driven Development Using a Multi-site Blind Analysis Approach
Davide Fucci and Giuseppe Scanniello and Simone Romano and Martin Shepperd and Boyce Sigweni and Fernando Uyaguari and Burak Turhan and Natalia Juristo and Markku Oivo
ESEM 2016

The question is whether test-first or test-last development is more productive and yields better code. The paper lacks background, such as a discussion of the reasons that people believe TDD (test-driven development, also known as test-first development) might be better.
That could have helped to determine the experimental questions.

This paper replicates a previous experiment. The experimental protocol is the same as in the previous experiment, but the design is different: "A balanced crossover design is a type of repeated measure design -- i.e., the measures are taken several times for the same participant -- in which a participant is randomly assigned to a sequence of treatments rather than a single one." The study has only 21 subjects (graduate students), which seems too low to me for effects to be visible.

"Results: The Kruskal-Wallis tests did not show any significant difference between TDD and TLD in terms of testing effort (p-value = .27), external code quality (p-value = .82), and developers' productivity (p-value = .83). Nevertheless, our data revealed a difference based on the order in which TDD and TLD were applied, though no carry over effect."

For the blind analysis:
 * person A ran the experiment
 * person B removed the labels from the data
 * person C did the statistical analysis
 * person B re-added the labels to the data
The paper says the blind analysis avoids bias but gives no concrete examples of potential bias. A blind analysis can be very useful because it avoids subjective factors. However, in this paper the analyses are all objective. Therefore, the blind analysis seems like a pointless exercise and a waste of energy. The only advantage to a blind analysis may be that the researchers didn't iteratively decide which experimental analyses to run based on the results of previous analyses. They could have achieved the same result by deciding in advance which analyses to run. A disadvantage of blind analysis may be investigating factors that are unlikely to matter, such as the order in which TDD and TLD are applied (that looks like a statistical artifact to me).

The research questions are poorly posed.
"RQ1: Is there a difference between the number of tests written by TDD developers and TLD developers?
RQ2: Is there a difference between the external code quality of TDD developers and TLD developers?
RQ3: Is there a difference between the productivity of TDD developers and TLD developers?"

Regarding RQ1, the number of tests is an irrelevant measure. Programmers care about the quality, not the size, of tests: do the tests catch bugs? The experimenters could have measured and reported this, such as by running each subject's tests against every other subject's program (a sketch of such a measurement appears at the end of these notes), or even via mutation analysis. Even coverage would have been more interesting than the number of tests. They actually measured not the number of tests but the number of assert statements within the tests, which is equally irrelevant. Furthermore, the paper describes the number of assertions as "testing effort", but they didn't measure actual human effort.

Regarding RQ2, quality is measured as the percentage of assertions passed in the experimenters' hidden test suite, over every user story in which any assertion is passed. It would be better to count only the user stories that the developer finished, because we don't want to count against the developer:
 * the last story, which is half-finished
 * any story in which one assertion happens to pass, but which the developer had not been targeting.
The paper gives no indication about whether these scenarios came up. The paper states "the quality of the portions of the tasks that were implemented was acceptable (QLTY = 66.70%)."
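
As I understand it, the QLTY measure is computed roughly as follows. This is my own sketch of the metric as described above, with made-up data; it is not the paper's code.

    # Sketch of the QLTY metric as I understand it: for each user story that a
    # subject "tackled" (at least one hidden-suite assertion passes), compute
    # the fraction of that story's assertions that pass, then average over the
    # tackled stories.  The data below is hypothetical.
    def qlty(assertions_passed: dict[str, int], assertions_total: dict[str, int]) -> float:
        tackled = [s for s, passed in assertions_passed.items() if passed > 0]
        if not tackled:
            return 0.0
        fractions = [assertions_passed[s] / assertions_total[s] for s in tackled]
        return 100.0 * sum(fractions) / len(fractions)

    # Example: one subject's results on three user stories (made-up numbers).
    passed = {"US1": 8, "US2": 3, "US3": 0}   # US3 is untouched, so it is excluded
    total  = {"US1": 10, "US2": 6, "US3": 7}
    print(f"QLTY = {qlty(passed, total):.2f}%")   # (0.8 + 0.5) / 2 = 65.00%

Under a definition like this, a half-finished story with a couple of passing assertions drags the average down, which is exactly the concern above.
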
This extremely low quality (only 2/3 of the experimenters' hidden tests were passed, among stories in which at least one of the experimenters' hidden tests was passed) makes me suspicious of the experiment and whether it is measuring quality software development. I would also be interested in how many stories were 100% correct, since that is the real goal; it is possible that passing 80% of the tests requires less than 80% of the effort.

Given the flaws in the original experiment, it would have been more valuable to improve the experiment rather than to replicate it exactly.

The paper says the authors assessed the two tasks as equally complex, but they didn't use their data (about subject performance on the tasks) to assess whether that assessment was accurate.

Participants were given training and practice in TDD, but not in TLD. In fact, the entire class (that the students were taking) was about TDD. The paper states, "our research questions were not disclosed to [participants]", but those research questions were surely obvious to the participants.

The discussion of threats to validity (section 5) is a usefully comprehensive list of such threats. However, the threats are listed without an explanation of why each one might be a threat. For example: "Selection: the effect of letting volunteers take part in the experiment may influence the results, since they are generally more motivated." Why doesn't this affect both treatments equally, and thus not affect the overall conclusions?
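
Here is the kind of test-quality measurement I had in mind for RQ1 above: run each subject's test suite against every other subject's implementation and record how many of those (presumably buggy or incomplete) implementations the suite rejects. This is a sketch under my own assumptions -- the run_suite() helper, the IMPL_UNDER_TEST variable, and the directory layout are all hypothetical, and the paper did nothing like this.

    # Hypothetical defect-detection measure: a test suite that fails on more of
    # the other subjects' implementations is better at catching bugs than one
    # that merely contains many assert statements.
    import os
    import subprocess
    from pathlib import Path

    def run_suite(suite_dir: Path, impl_dir: Path) -> bool:
        """Return True if the test suite passes against the given implementation."""
        result = subprocess.run(
            ["python", "-m", "pytest", str(suite_dir)],
            env={**os.environ, "IMPL_UNDER_TEST": str(impl_dir)},  # hypothetical hook
            capture_output=True,
        )
        return result.returncode == 0

    def detection_scores(subjects: list[str], root: Path) -> dict[str, float]:
        """For each subject, the fraction of the other subjects' implementations
        that the subject's test suite fails on (i.e., detects as faulty)."""
        scores = {}
        for s in subjects:
            others = [o for o in subjects if o != s]
            rejected = sum(
                not run_suite(root / s / "tests", root / o / "impl") for o in others
            )
            scores[s] = rejected / len(others) if others else 0.0
        return scores

Mutation analysis would give a similar signal without needing the other subjects' programs; either way, the result measures whether the tests catch bugs, which is what RQ1 should have asked.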