Lecture 2: more on measures and studies; threats to validity; statistical vs. practical significance

===========================================================================

The Heilmeier Catechism is a good set of questions that you should be able
to answer about your proposed project:
https://en.wikipedia.org/wiki/George_H._Heilmeier#Heilmeier.27s_Catechism

===========================================================================

Theory section:
* an explanation
* a model -- an abstraction

Question formation:
* answer the most important question
* answers must be actionable

Threats to validity; or, research that makes you go "so what?"
* motivation
  * is it a real problem?
  * big enough? / scope
* poor examples
* strawman motivation (need to compare to the state of the art)
* mismatch between experiment and reality
* statistical insignificance
* practical insignificance

Questions to ask of a paper:
* is it right?
* does it matter?

===========================================================================

Wohlin Chapter 2:

Research methods:
* Scientific: create models that fit data
* Engineering: make changes, compare outcomes
* Empirical: create a model, evaluate it empirically (case studies, controlled experiments)
* Analytical: compare a formal theory to empirical observations

Experiments vs.
quasi-experiments:
* experiment: completely controlled, random assignment of subjects to treatments
* quasi-experiment: some correlation between subjects and treatments
* case study: examine one instance, or case

Exploratory vs. explanatory research:
* exploratory: observe objects in their natural setting; qualitative
* explanatory: quantify a relationship, or establish cause and effect; quantitative

Three types of empirical strategies: survey, case study, experiment
* Survey: interviews or questionnaires
  "a snapshot of the situation to capture the current status"
* Case study: observe one instance ("case") in a real-life context
  My interpretation is broader; setting a task for a user, even if it's not
  their real job, can be a case study.  In other words, a case study or
  experiment has to be realistic (context, problem, measurements) but does
  not have to be real (during the course of one's day job).
  "especially when the boundary between phenomenon and context cannot be
  clearly specified"
  "the phenomenon may be hard to clearly distinguish from its environment."
  Emphasizes a case study as comparative, comparing to a known baseline.
  (That could be experiment-like, with some subjects having an experimental
  treatment and others not, or the baseline can be just known a priori.)
  "If the method applies to individual product components, it could be
  applied at random to some components and not to others.  This is very
  similar to an experiment, but since the projects are not drawn at random
  from the population of all projects, it is not an experiment."
  This seems like an uninteresting distinction; even in an experiment, you
  aren't drawing tasks from the population of all tasks.

Confounding factors:
* example: experience of the user
* Note that differences in quantity may become differences in quality.
  Something may work well only on small or only on large projects.

* Experiment: "In human-oriented experiments, humans apply different treatments to objects".
  I think of this differently: the treatments are applied to the humans,
  who then perform some task.
* quasi-experiment: "the assignment of treatments to subjects cannot be
  based on randomization, but emerges from the characteristics of the
  subjects or objects themselves."

All the goals in section 2.4.1 can apply to case studies and even surveys, too.

Section 2.6: Types of replication.  The conclusion is that there isn't
standard terminology.
"Differentiated replications study the same research questions, using
different experimental procedures.  They may also deliberately vary one or
more major conditions in the experiment."
I wouldn't call this a "replication".
Note the rapid evolution of technology (and of education, which changes the
subject population!), which together with the relatively low quality of
studies may mean that exact replication is not desirable.
"close replications sometimes require the same researchers to conduct the
study, as they have tacit knowledge about the experiment procedures which
hardly can be documented".
I strongly disagree with this!  The experimenters have a duty to fully,
publicly document everything about the experiment.

Section 2.7, "Theory in software engineering".  Let's skip this; it has no
clear definitions, and neither does the main paper it cites [72].

Meta-analysis is evaluation of multiple other studies.  A literature review
is one example.

Fig 2.1 on page 25 says to do: survey, then laboratory experiment, then a
case study on real-life projects.  That's different than what lecture 1
suggested.  Why?  In my view, a case study doesn't have to be on a
real-life project (and shouldn't be, initially).

Skip sections 2.9.2-4, on Quality Improvement Paradigm, Experience Factory,
and Goal/Question/Metric Method.

Skim section 2.10 on tech transfer.

Most experiments are performed on college students.  What are the pros and
cons of that choice?
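The experiment/quasi-experiment distinction above comes down to how subjects are assigned to treatments.  A minimal sketch (my own illustration, not from Wohlin; the subject pool and the experience threshold are made up):

```python
import random

random.seed(1)

# Hypothetical subject pool, each with a measured characteristic.
subjects = [f"subject{i}" for i in range(10)]
years_experience = {s: random.randint(0, 15) for s in subjects}

# Experiment: completely controlled, random assignment to treatments.
shuffled = random.sample(subjects, k=len(subjects))
half = len(shuffled) // 2
experiment = {"treatment": shuffled[:half], "control": shuffled[half:]}

# Quasi-experiment: assignment "emerges from the characteristics of the
# subjects" -- here, experienced subjects get the new treatment.  Experience
# is now confounded with the treatment.
quasi = {
    "treatment": [s for s in subjects if years_experience[s] >= 5],
    "control": [s for s in subjects if years_experience[s] < 5],
}
```

In the randomized version, experience is (in expectation) balanced across groups; in the quasi-experiment it is a built-in confound that any analysis has to argue around.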
===========================================================================

Wohlin Chapter 3:

Types of measurement:
* nominal: classification
* ordinal: ranking, but distance between items is not meaningful
* interval: difference between measurements is meaningful, but not the value itself
* ratio: meaningful zero value; ratio between measurements is meaningful

Wohlin defines "direct" as directly measured and "indirect" as computed from
direct measurements.  I usually use these terms differently, in terms of how
the measurement affects the outcome that the end user actually cares about.

"the external attributes are mostly indirect measures and must be derived
from internal attributes of the object."
I would say instead that the external measurements, like quality, can be
measured, and we have to determine how the internal measurements correlate
with them.

===========================================================================

When you do work, you and others want to know:
* Are your results right?  (see threats to validity)
  This assumes you did what you said you would do and didn't make any
  methodological errors or omissions beyond your threats to validity.  If
  those assumptions don't hold, you have much bigger problems!
* Do your results matter?  (statistics, asking the right research questions)

===========================================================================

Threats to the validity of research.  Most often these are clumped into just:
* Internal validity
  Refers specifically to whether an experimental treatment/condition makes a
  difference or not, and whether there is sufficient evidence to support the
  claim.
* External validity
  Refers to the generalizability of the treatment/condition outcomes.

But sometimes the threats are more finely broken down.  Example from
http://www.psych.sjsu.edu/~mvselst/courses/psyc18/lecture/validity/validity.htm
1. Threats to construct validity.  The measured variables may not actually
   measure the conceptual variable.
2.
   Threats to statistical conclusion validity.  A Type I error (mistakenly
   rejecting the null hypothesis) or a Type II error (failing to reject a
   false null hypothesis) may have occurred.
3. Threats to internal validity.  The IV-DV relation may not be directly
   causal (confounds = another variable mixed up with the IV; confounds
   provide alternative interpretations or alternative explanations for the
   results of the experiment).  Internal validity is perfect only when there
   are no confounding influences.  Also: correlation vs. causality.
4. Threats to external validity.  Results may only apply to a limited set of
   circumstances (e.g., specific groups of people, or only some typefaces...).

All of #1-#3 can be viewed as types of internal validity: is it a correct
experiment?  #4 is about generalization.

Another list of types of threats is:
* construct (correct measurements)
* internal (alternative explanations)
* external (generalize beyond subjects)
* reliability (reproduce)

===========================================================================

Statistical significance vs. practical significance

It can be easy to get statistical significance with experiments on
non-humans.  It can be hard to get statistical significance with
experiments on humans.

===========================================================================
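The non-human case can be made concrete with a small simulation (my own sketch; the means, standard deviation, and sample size are invented): with enough automated trials, even a trivially small effect becomes statistically significant.

```python
import math
import random

random.seed(0)

# Two simulated conditions whose true difference is tiny: 100.0 vs. 100.1
# (a 0.1% change), with standard deviation 5.0.  Cheap non-human trials
# let us collect 200,000 observations per condition.
n = 200_000
a = [random.gauss(100.0, 5.0) for _ in range(n)]
b = [random.gauss(100.1, 5.0) for _ in range(n)]

def mean(xs):
    return sum(xs) / len(xs)

def var(xs, m):
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

ma, mb = mean(a), mean(b)
# Two-sample z-test; the normal approximation is fine at this sample size.
se = math.sqrt(var(a, ma) / n + var(b, mb) / n)
z = (mb - ma) / se
p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

print(f"difference = {mb - ma:.3f}, z = {z:.1f}, p = {p:.2g}")
```

The p-value comes out far below 0.05, so the result is statistically significant; but the effect is about 0.1% of the mean, which in most settings is practically insignificant.  With a human study of, say, 20 subjects per group, the same effect would be hopelessly below the noise floor.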