Usability evaluation considered harmful (some of the time)
Saul Greenberg and Bill Buxton
CHI 2008, pages 111-120

This paper is a bit discursive and in places is written in a disorganized fashion, but it makes some good points.

It says that most usability evaluations are performed in an unrealistic setting and therefore show an existence proof that does not generalize to real-world settings: "What most researchers then try to do -- often without being aware of it -- is to create a situation favorable to the new technique. The implicit logic is that they should be able to demonstrate at least one case where the new technique performs better than the old technique; if they cannot, then this technique is likely not worth pursuing. In other words, the usability evaluation is an existence proof."

The community ends up with mostly weak evaluations being published.
 * Initial studies don't tell us much: "first evaluations are typically confirmatory and thus weak."
 * Few replications are published, because reviewers don't consider them novel and don't believe they add much (given the initial published studies).

The paper distinguishes between a sketch and a prototype. It doesn't define "sketch" (but this term is apparently understood -- even if not precisely defined -- in the HCI literature). "By definition, a sketch -- even if implemented as an interactive system -- is a roughed out design. It will have many holes, deficiencies, and undeveloped attributes. In contrast, a prototype aids idea evaluation, either by validating it with clients as they try it out, or through usability testing." The paper defines a prototype as a near-final "approximation of a finished product" that users can use. By contrast, I view a prototype as a platform for experimentation that may be thrown away once the experiment is done. For instance, one might prototype part of a system to determine whether a library's performance is adequate, or whether two libraries are compatible with one another.

Usability vs. usefulness: these are orthogonal -- a product can have either one with or without the other. The paper (wrongly) gives the community a pass on evaluating usefulness: "Usefulness is a very difficult thing to evaluate". The paper does acknowledge that usefulness is more important than usability: "Usability evaluation is predisposed to the world changing by gradual evolution; iterative refinement will produce more usable systems, but not radically new ones." Therefore, it is incumbent on researchers to evaluate usefulness, regardless of what this paper says.

Guidance about whether to do usability evaluation:
 * One should NOT do usability evaluation "in very early design stages, in cases where usefulness overshadows usability, in instances where unpredictable cultural uptake dominates how an innovative system will actually be used."
 * "usability evaluation ... is appropriate for settings with well-known tasks and outcomes." A weakness of usability evaluations is that "they fail to consider how novel engineering innovations and systems will evolve and be adopted by a culture over time." The page of examples of this is not very compelling to me; for example, cars became useful only after there was infrastructure in the form of paved roads, gas stations, etc.
 * "There are many other aspects of user-centered design that are just as important: understanding requirements, considering cultural aspects, developing and showing clients design alternatives, affording new interface possibilities through technical innovations, and so on."

===========================================================================

Deliberate Delays During Robot-to-Human Handovers Improve Compliance With Gaze Communication
Henny Admoni and Anca Dragan and Siddhartha S. Srinivasa and Brian Scassellati
HRI 2014

This paper actually answers the following research question: if a human has already made a plan, can a robot mislead the human into violating the previously-made plan? The technique for doing the misleading is for the robot to look in a direction that is the opposite of where the human should be attending. This question could have been answered with a human confederate taking the place of the robot, and should have been, in order to provide a baseline for the robot results.

The paper is written as if it is answering completely different questions: whether robot communication via gaze is an effective way to communicate in noisy environments where auditory signals are not possible, and how to increase the likelihood that humans notice the robot's gaze. These aren't very interesting questions. The paper claims that "a natural conclusion is to communicate information about where to put the object using eye gaze", but in that case the paper should have compared gaze to other mechanisms for communicating a suggestion, such as shining a light on a bin or projecting an arrow that points at a bin.

When I talked to Sidd Srinivasa, he cast the paper in a third way: it's interesting for the military to create deceptive robots that can fool adversaries regarding their intentions, and this is a first step toward that goal.

This study does not seem relevant to real-world tasks where participants know how to interact with the robot. Of the 32 participants, 14 didn't notice the robot's gaze as a suggestion, even though they had been told that the robot would make a suggestion. So this study could be described as being about learning what a robot's cues mean. If the participants actually interacted with robots frequently and understood how the robot communicates, then the results might be quite different. "Participants were also told that HERB's head would move and that HERB may provide suggestions about how to sort the blocks, but that the final sorting method was up to them." (During the second that the robot was delaying the handover, some participants focused on pulling harder on the block to get it out of the robot's hand.)

In addition to the above concerns about generalizability, the experiment is unrealistic and uncalibrated. "In this task, a robot called HERB hands colored blocks to participants, who sort those blocks into one of two colored boxes according to their personal preference." The participants were told they could sort the blocks however they wanted. They were not asked what their goal was, nor assessed on whether they achieved that goal. It's also possible that the robot's suggestions changed that goal. The participants were not asked whether the robot's gaze affected their goal or their behavior. Therefore, there is no objective data to back up the paper's hypothesis that the robot caused the participants to under-perform their intended plan or to change their plan. It's also possible that the task was so vague that the participants were not taking it very seriously or never had a plan at all. How is this relevant to real-world tasks that have a goal? To answer these questions adequately, the researchers would need to ask the participants what their sorting strategy is.
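
If the paper wanted to back up its compliance claim objectively, a simple between-condition comparison of how often participants followed the counterintuitive suggestion would do. Here is a minimal sketch of such an analysis; the counts below are made up for illustration and are NOT the paper's data.

    # Hypothetical analysis: did the handover delay increase compliance with
    # the robot's counterintuitive suggestion?  All counts are invented.
    from scipy.stats import fisher_exact

    # 2x2 contingency table: rows = condition (delay vs. control),
    # columns = (followed suggestion, did not follow suggestion).
    table = [
        [9, 7],   # delay condition: 9 followed, 7 did not (hypothetical)
        [4, 12],  # control condition: 4 followed, 12 did not (hypothetical)
    ]

    odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
    print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")

Even a test like this would not address the deeper problem that participants' plans were never elicited, but it would at least make the compliance claim falsifiable.
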
It's likely that the researchers didn't want to ask the participants what their plan was, because that would fix it in their minds and they would be less likely to be misled by the robot's gaze. However, the authors should have said this. The experiment could also have been designed to have participants divide identical blocks into two bins, with the experimenters investigating how often the participants followed the robot's suggestions. All this is really a psychological experiment (but not a well-designed one) rather than having anything to do with robotics.

The paper's hypotheses: "We hypothesize that:
 * H1 A handover delay will cause people to pay more attention to HERB's gaze communication, and that
 * H2 Social gaze will lead people to comply more with HERB's counterintuitive suggestion."
H1 is supported and H2 is not supported. The paper's authors give reasons that H2 might be true despite not being supported in their experiment, but they don't spend as much time discussing why it might simply not be true, or why H1 might be false even though it was supported in the experiment. Other factors, such as social gaze (the altitude of the gaze: starting at the human's face ("joint attention") and staying at that level, or starting at the block ("mirrored gaze") and staying at that level), didn't have an effect.

Suppose that the robot gaze did affect participants' choices. Would that be a positive result because the robot has power (the paper casts it that way), or a negative result because the people performed worse on their task when interacting with a robot? (We do expect robots to make mistakes once in a while.) The paper doesn't discuss this.

Here are more details about the experiment. Handover is "the act of transferring an item from one actor to another". The paper uses handover as context but isn't really concerned with it. The question is whether drawing attention to the robot's gaze causes people to follow the suggestion that is implicit in the gaze. Human subjects were given blocks by a robot and asked to put the blocks in either a blue or a yellow bin. The blocks were blue, yellow, half-and-half, or 70% of one color. 32 participants were each handed 5 blocks, in order, by the robot. The robot always looked at one of the two bins. In the control condition, the robot started to turn its head as it let go of the block (when the user grasped the block). In the treatment condition, the robot delayed letting go of the block for 1 second (the amount of time it took to turn its head) after the user grasped the block. The question is whether more users followed the robot's suggestion when the delay had encouraged them to notice the suggestion. The robot's suggestion for the 70-30 colored block was always to put it in the bin of the opposite (minority) color, which is counterintuitive. The robot gripped each block on the minority color and always suggested putting the block in the bin matching the color it was gripping.

===========================================================================

An External Replication on the Effects of Test-driven Development Using a Multi-site Blind Analysis Approach
Davide Fucci and Giuseppe Scanniello and Simone Romano and Martin Shepperd and Boyce Sigweni and Fernando Uyaguari and Burak Turhan and Natalia Juristo and Markku Oivo
ESEM 2016

The question is whether test-first or test-last development is more productive and yields better code. The paper lacks background, such as a discussion of the reasons that people believe TDD (test-driven development, also known as test-first development) might be better.
That could have helped to determine the experimental questions.

This paper replicates a previous experiment. The experimental protocol is the same as in the previous experiment, but the design is different: "A balanced crossover design is a type of repeated measure design -- i.e., the measures are taken several times for the same participant -- in which a participant is randomly assigned to a sequence of treatments rather than a single one." The study has only 21 subjects (graduate students), which seems too low to me for effects to be visible.

"Results: The Kruskal-Wallis tests did not show any significant difference between TDD and TLD in terms of testing effort (p-value = .27), external code quality (p-value = .82), and developers' productivity (p-value = .83). Nevertheless, our data revealed a difference based on the order in which TDD and TLD were applied, though no carry over effect."

For the blind analysis:
 * person A ran the experiment
 * person B removed the labels from the data
 * person C did the statistical analysis
 * person B re-added the labels to the data
The paper says the blind analysis avoids bias but gives no concrete examples of potential bias. A blind analysis can be very useful because it avoids subjective factors. However, in this paper the analyses are all objective. Therefore, the blind analysis seems like a pointless exercise and a waste of energy. The only advantage to a blind analysis may be that the researchers didn't iteratively decide which experimental analyses to run based on the results of previous analyses. They could have achieved the same result by deciding in advance which analyses to run. A disadvantage of blind analysis may be investigating factors that are unlikely to matter, such as the order in which TDD and TLD are applied (that looks like a statistical artifact to me).

The research questions are poorly posed.
"RQ1: Is there a difference between the number of tests written by TDD developers and TLD developers?
RQ2: Is there a difference between the external code quality of TDD developers and TLD developers?
RQ3: Is there a difference between the productivity of TDD developers and TLD developers?"

Regarding RQ1, the number of tests is an irrelevant measure. Programmers care about the quality, not the size, of tests: do the tests catch bugs? The experimenters could have measured and reported this, such as by running each subject's tests against every other subject's program (a sketch of such a measurement appears at the end of these notes), or even via mutation analysis. Even coverage would have been more interesting than the number of tests. They actually measured not the number of tests but the number of assert statements within the tests, which is equally irrelevant. Furthermore, the paper describes the number of assertions as "testing effort", but they didn't measure actual human effort.

Regarding RQ2, quality is measured as the percentage of assertions passed in the experimenters' hidden test suite, over every user story in which any assertion is passed. It would be better to count only the user stories that the developer finished, because we don't want to count against the developer:
 * the last story, which is half-finished
 * any story in which one assertion happens to pass, but which the developer had not been targeting.
The paper gives no indication about whether these scenarios came up. The paper states "the quality of the portions of the tasks that were implemented was acceptable (QLTY = 66.70%)."
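
As I understand it, the QLTY measure is computed roughly as follows. This is my own sketch of the metric as described above, with made-up data; it is not the paper's code.

    # Sketch of the QLTY metric as I understand it: for each user story that a
    # subject "tackled" (at least one hidden-suite assertion passes), compute
    # the fraction of that story's assertions that pass, then average over the
    # tackled stories.  The data below is hypothetical.
    def qlty(assertions_passed: dict[str, int], assertions_total: dict[str, int]) -> float:
        tackled = [s for s, passed in assertions_passed.items() if passed > 0]
        if not tackled:
            return 0.0
        fractions = [assertions_passed[s] / assertions_total[s] for s in tackled]
        return 100.0 * sum(fractions) / len(fractions)

    # Example: one subject's results on three user stories (made-up numbers).
    passed = {"US1": 8, "US2": 3, "US3": 0}   # US3 is untouched, so it is excluded
    total  = {"US1": 10, "US2": 6, "US3": 7}
    print(f"QLTY = {qlty(passed, total):.2f}%")   # (0.8 + 0.5) / 2 = 65.00%

Under a definition like this, a half-finished story with a couple of passing assertions drags the average down, which is exactly the concern above.
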
This extremely low quality (only 2/3 of the experimenters' hidden tests were passed, among stories in which at least one of the experimenters' hidden tests was passed) makes me suspicious of the experiment and whether it is measuring quality software development. I would also be interested in how many stories were 100% correct, since that is the real goal; it is possible that passing 80% of the tests requires less than 80% of the effort.

Given the flaws in the original experiment, it would have been more valuable to improve the experiment rather than to replicate it exactly.

The paper says the authors assessed the two tasks as equally complex, but they didn't use their data (about subject performance on the tasks) to assess whether that assessment was accurate.

Participants were given training and practice in TDD, but not in TLD. In fact, the entire class (that the students were taking) was about TDD. The paper states, "our research questions were not disclosed to [participants]", but those research questions were surely obvious to the participants.

The discussion of threats to validity (section 5) is a usefully comprehensive list of such threats. However, the threats are listed without an explanation of why each one might be a threat. For example: "Selection: the effect of letting volunteers take part in the experiment may influence the results, since they are generally more motivated." Why doesn't this affect both treatments equally, and thus not affect the overall conclusions?
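
Here is the kind of test-quality measurement I had in mind for RQ1 above: run each subject's test suite against every other subject's implementation and record how many of those (presumably buggy or incomplete) implementations the suite rejects. This is a sketch under my own assumptions -- the run_suite() helper, the IMPL_UNDER_TEST variable, and the directory layout are all hypothetical, and the paper did nothing like this.

    # Hypothetical defect-detection measure: a test suite that fails on more of
    # the other subjects' implementations is better at catching bugs than one
    # that merely contains many assert statements.
    import os
    import subprocess
    from pathlib import Path

    def run_suite(suite_dir: Path, impl_dir: Path) -> bool:
        """Return True if the test suite passes against the given implementation."""
        result = subprocess.run(
            ["python", "-m", "pytest", str(suite_dir)],
            env={**os.environ, "IMPL_UNDER_TEST": str(impl_dir)},  # hypothetical hook
            capture_output=True,
        )
        return result.returncode == 0

    def detection_scores(subjects: list[str], root: Path) -> dict[str, float]:
        """For each subject, the fraction of the other subjects' implementations
        that the subject's test suite fails on (i.e., detects as faulty)."""
        scores = {}
        for s in subjects:
            others = [o for o in subjects if o != s]
            rejected = sum(
                not run_suite(root / s / "tests", root / o / "impl") for o in others
            )
            scores[s] = rejected / len(others) if others else 0.0
        return scores

Mutation analysis would give a similar signal without needing the other subjects' programs; either way, the result measures whether the tests catch bugs, which is what RQ1 should have asked.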