Objectives¶
By the end of this lesson, students will be able to:
- Compute summary statistics using numpy functions
- Conduct an exploratory data analysis with visualizations
- Explain statistical distributions and hypothesis testing
Setting up¶
To follow along with the code examples in this lesson, please download the Jupyter notebook:
Introduction to Statistics¶
The study of statistics is the study of models to gather, understand, and draw conclusions from real-world data. We’ve talked a bit in class about “data science” since our course is a “data programming” course. I think of data science as the intersection of data programming (programs aimed at processing data) and statistics (reasoning about the world through models). There are definitely other things in data science (e.g., data visualization), but it’s a nice mental model.
Summary Statistics¶
One of the fundamental approaches in statistics is to come up with “summary statistics” of your dataset to give descriptions of patterns in the data. You are probably familiar with some of the most common summary statistics used that describe a dataset with a single number. Below, we list the most common ones and how to compute them assuming your dataset is stored as a 1D numpy.array.
- The number of examples
- The mean (or average) of the examples: the sum of all the examples divided by the number of examples.
- The median of the examples: the value in the middle if you sort the examples. This means that 50% of the examples are above and 50% are below the median.
- The standard deviation captures the “spread” of the data. A higher standard deviation means the data is more spread out.
The following snippet shows how to compute these all using numpy for a dataset of how many cartwheels the TAs can do in a row.
import numpy as np
# Each entry shows the number of cartwheels a single TA can do in a row
cartwheels = np.array([2, 3, 4, 4, 2, 0, 2, 0, 1, 3, 2, 5, 3, 1])
print('Data')
print(cartwheels)
print()
print('Number of examples')
print(len(cartwheels))
print()
print('Mean of examples')
print(cartwheels.mean())
print()
print('Median of examples')
print(np.median(cartwheels))
print()
print('Standard Deviation of examples')
print(cartwheels.std())
print()
Histograms and Boxplots¶
Another way to analyze your dataset is to plot it! A histogram is a great way to see how your data is laid out and fits nicely with the summary statistics we just computed above. A histogram shows the number of observations on the y-axis, and a “bucket” of values on the x-axis. Note that the bars are touching, and they span intervals rather than single values. This is what differentiates them from a normal bar plot!
import matplotlib.pyplot as plt
import seaborn as sns
sns.histplot(cartwheels)
plt.xlabel('Number of Cartwheels')
plt.ylabel('Frequency')
plt.title('Histogram of Cartwheels Done in a Row')
We can also use a boxplot to visualize the spread of the data. Sometimes, a boxplot is also called a “box and whisker” plot. The boxplot is good at showing quartiles: the boundaries that split the sorted data into four groups, each containing 25% of the observations. In this particular boxplot, the first quarter of the data falls between 0 and 1.5, which means that 25% of our observations fell between these values. The second quarter falls between 1.5 and 2, meaning that another 25% of our observations fell between these values, and so on!
sns.boxplot(cartwheels, orient='h') # orient='h' for horizontal, 'v' for vertical
plt.xlabel('Number of Cartwheels') # if orient='v', this would be a y-label!
plt.title('Boxplot of Cartwheels Done in a Row')
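If you want the quartile boundaries as numbers instead of reading them off the plot, a quick sketch (not part of the original notebook) is to use numpy’s percentile function. Note that the exact values can differ slightly from the boxplot’s, depending on how each library interpolates between data points.
# Compute the 25th, 50th, and 75th percentiles (the quartile boundaries)
q1, median, q3 = np.percentile(cartwheels, [25, 50, 75])
print('Q1:', q1)
print('Median:', median)
print('Q3:', q3)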
Distributions¶
A distribution describes the likelihood of seeing possible values in your dataset. Here are some commonly-used distributions:
- The uniform distribution - all values are equally likely.
- The normal distribution - values appear like a bell curve where they are centered around some point and spread out on either side.
- The binomial distribution - the number of “successes” depends on a predetermined probability and sample size.
Here, we use numpy to generate 1000 “normal” datapoints as our dataset. The loc and scale parameters define the mean and standard deviation, respectively. The distplot function from seaborn allows us to plot a density curve over a histogram, which shows the approximation of the distribution of our data.
# Try changing the 'size' parameter
# and see how the histogram changes!
data = np.random.normal(loc=0, scale=2, size=1000)
print('Preview of data')
print(data[:5])
sns.distplot(data)
Changing sizes
If you run the above code snippet, you will get slightly different plots each time, but they will all look normal. If you change the number of samples to be much smaller, then you will get wildly different results each time.
We can also run a simulation to show what the uniform distribution looks like. In the uniform distribution, the low and high parameters give the upper and lower bounds for possible data values. Recall that in this distribution, every value is equally likely to be chosen. What do you notice when you change the values of the parameters?
data_unif = np.random.uniform(low=0, high=10, size=1000)
sns.distplot(data_unif)
Here’s our simulation for the binomial distribution. The binomial distribution assumes a fixed number of n data points, and the probability p of observing a “success”. If you try out different values of n, p, and size, what do you notice?
data_binom = np.random.binomial(n=50, p = 0.5, size=1000)
sns.distplot(data_binom)
Hypothesis Testing¶
The Lyft stock example is taken from The Ethical Algorithm by Aaron Roth and Michael Kearns. This book is fantastic and I highly recommend it!
Suppose you open your email one day and see an email from a random person saying “LYFT STOCK WILL GO UP! BUY NOW”. This is obviously spam, so you discard it. However, you’re kind of curious, so at the end of the day you check the stock prices and notice that Lyft did indeed go up. The next day, the same sender sends another email with the subject “LYFT STOCK WILL GO DOWN! SELL NOW” and, as a rational person, you still ignore it. Surprisingly though, you check again at the end of the day and see the person was right again! This still seems like just luck, so you ignore this email too.
This trend continues for 10 days; each day the sender correctly predicts the movement of Lyft’s stock! After the 10th day, the sender asks you to pay for their services, citing the 10 correctly predicted days of stock movements as evidence that you will make money if you let them invest for you. So the question is: has this mystery sender figured out the trick to investing (so you should pay them to make money), or are you on the way to being scammed?
One thing you could do is try to model the world and figure out the likelihood that this sender was able to guess correctly by chance. For simplicity, suppose that Lyft stock goes up/down each day with equal probability and the days are independent (going up one day doesn’t affect the outcome the next). What is the probability that this sender could correctly guess 10 times in a row? This is the same as getting 10 heads in a row in a series of coin flips. With a little math, you can show that the probability of this happening is $0.5^{10} = 0.000976562$. That’s about a 0.1% probability that this happened just randomly! While it is entirely possible that this happened by chance, it is exceedingly unlikely. You could be justified in trusting this anonymous sender now, since it seems very unlikely that this is just happening by chance.
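As a quick sanity check, here is that arithmetic in Python:
# Probability of correctly guessing 10 independent 50/50 movements in a row
prob = 0.5 ** 10
print(prob)        # 0.0009765625
print(prob * 100)  # roughly 0.1 (percent)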
What we have just done is set up a hypothesis test. We saw some data (a series of stock movements) and wanted to test a hypothesis (is this sender legit?). A common way to do hypothesis testing is to come up with a null hypothesis, or a hypothesis that the world is simple (or behaving in an expected way), to analyze the data under. In our example, the null hypothesis is that the sender just got lucky which makes more sense than some magic ability to predict the future. If the specific data we saw looks unlikely to happen under the null hypothesis, we can reject the null hypothesis in favor of our alternative hypothesis (the sender is legit). This doesn’t necessarily prove the alternative hypothesis is correct. All this says is that assuming the null hypothesis (the simple case) is true, it’s very unlikely to see this dataset, so we will use that as evidence to say we don’t believe the null hypothesis is true.
So for hypothesis testing in general, you set up a null hypothesis and compute some measure of the probability for seeing the data, assuming the null hypothesis is true. This probability measure is the p-value. Again, the p-value says “Assuming the null hypothesis is true, this is the probability of seeing this specific dataset”. If that probability is sufficiently low, we reject the null hypothesis as being unlikely. A common technique is to pick some threshold p-value like 0.05 in advance and then reject the null hypothesis if the p-value is below that threshold. If the p-value is above the threshold, we fail to reject the null hypothesis. This does not mean the null hypothesis is true, it just means there is not sufficient evidence to reject it.
Important to highlight: A p-value says NOTHING about effect size! Even if something is statistically significant (e.g., the effect of a gene on the spread of influenza), it’s still entirely possible the effect is so small it doesn’t matter.
Another Example¶
Suppose that I have the distributions of 163 grades for two groups of students: those who took 122 and those who took 160. If you compute the average 163 grade for each group, you might notice a small difference, but the question is: “Is this difference significant enough to suggest the two groups are coming from different distributions, or is the difference just by chance?” A null hypothesis for this type of question is usually something like “The samples come from the same population (e.g., both samples are drawn from the same distribution) and any difference just happens by chance”.
The procedure for testing this hypothesis goes something like this (a code sketch follows the list):
- Compute the probability of seeing these scores assuming they come from one distribution.
- If that probability is low (less than 0.05 usually), then reject the null hypothesis.
- If we reject the null hypothesis, then we are saying it’s not likely the scores come from the same distribution so we’ll use that as evidence to say they come from different distributions.
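As a concrete sketch of this procedure (not part of the original lesson), here is one common way to run this comparison in Python using a two-sample t-test from scipy, a test described later in this lesson. The grade arrays below are made up purely for illustration.
from scipy import stats
import numpy as np

# Hypothetical (made-up) 163 grades for students who previously took 122 or 160
grades_122 = np.array([3.2, 3.5, 2.9, 3.8, 3.1, 3.4, 3.0, 3.6])
grades_160 = np.array([3.4, 3.7, 3.1, 3.9, 3.3, 3.5, 3.2, 3.8])

# Null hypothesis: both groups are drawn from the same distribution of grades.
# A two-sample t-test compares the means of the two groups.
tstat, pvalue = stats.ttest_ind(grades_122, grades_160)
print('p-value:', pvalue)

if pvalue < 0.05:
    print('Reject the null hypothesis: the groups look different.')
else:
    print('Fail to reject the null hypothesis.')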
False Discoveries¶
Notice, this process can make mistakes! It’s entirely possible for our sender to just get 10 guesses correct in a row. However, with a p-value that low, it’s exceedingly unlikely to happen by chance so we use that as evidence against this null hypothesis.
If we reject the null hypothesis when it’s actually true, we have made a false discovery or a Type I error (since we claim that the simple case is not true so there must be something of interest going on). On the other hand, another possible error comes from failing to reject the null hypothesis when it’s indeed false. This is called a Type II error(unfortunately, it doesn’t have a super common easier name to remember).
In fact, if you run the same experiment many times, you will get slightly different p-values for the null hypothesis since your data will look slightly different from experiment to experiment. While mistakes are possible, by setting your threshold for the p-value appropriately you can make them unlikely. When you use a p-value threshold of 0.05, you are saying that you are okay with a 5% chance of a false discovery when the null hypothesis is true (under some assumptions). 0.05 is a common threshold to use in scientific fields, but you could use any threshold you are comfortable with to control your probability of making a false discovery. If you set a threshold of 0.01, then you will have a 1% chance of making a false discovery when the null hypothesis is true. The downside to making the threshold very small is that it becomes much harder to reject the null hypothesis even when it’s false (making a Type II error more likely).
This p-value is not the same as the probability that the null hypothesis is true. It is the probability of seeing data like ours assuming the null hypothesis is true, which is an entirely different concept! Some statisticians (frequentists) think the question “what’s the probability the null hypothesis is true/false” makes no sense. The world either behaves by the null hypothesis or it doesn’t.
A simple example is running a physics experiment and asking “What is the probability that the acceleration due to gravity is 9.8 $m/s^2$?” They argue this question makes no sense since the acceleration due to gravity is just some number; there is no probability of it being one value versus another. A hypothesis test asks the question “Assuming the acceleration due to gravity is 9.8 $m/s^2$, what is the probability of seeing this particular dataset?” If that probability is low, then we reject that hypothesis.
Types of Tests¶
When selecting a statistical test, it’s important to consider the following factors:
- Sample Size (typically, is the sample size greater than or less than 30?)
- Type of Variable (quantitative, categorical, ordinal)
- Number of Variables (one, two, or many)
- Underlying Distribution Known/Unknown
- Normal distribution: Parametric tests (e.g., z-test, t-test) assume data is normally distributed, and that you know other metrics of the data like the mean, standard deviation, or variance (the square of the standard deviation).
- Non-normal distribution: Non-parametric tests (e.g., Mann-Whitney, Kruskal-Wallis) do not require normality.
Why the normal distribution?
According to the Central Limit Theorem (CLT), the distribution of sample means tends toward a normal distribution as the sample size increases, assuming that the samples are independent and identically distributed. The CLT allows us to assume normality for the sample means, even if the original data itself isn’t normally distributed. Since many tests focus on comparing means, we can therefore still use the normality assumption for parametric tests. The short simulation sketched below illustrates this.
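Here is a minimal simulation sketch (not from the original lesson) of the CLT in action: we repeatedly sample from a decidedly non-normal (uniform) population, and the histogram of the sample means still ends up looking roughly bell-shaped.
import numpy as np
import seaborn as sns

# Draw 2000 samples of size 50 from a uniform (non-normal) population,
# then compute the mean of each sample
samples = np.random.uniform(low=0, high=10, size=(2000, 50))
sample_means = samples.mean(axis=1)

# The distribution of the sample means is approximately bell-shaped,
# even though the underlying population is uniform
sns.histplot(sample_means)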
Z-Test¶
A parametric test used to determine whether there is a significant difference between sample and population means when the population variance is known.
- One-tailed Z-test: Tests if a sample mean is greater than or less than the population mean (just one or the other, but not both).
- Two-tailed Z-test: Tests if the sample mean is significantly different from the population mean in either direction (greater or smaller).
- Paired Z-test: Used when comparing two related samples or measurements.
Factors Influencing the Z-Test
- Sample Size: Typically used with “large” samples ($n \geq 30$) where the Central Limit Theorem applies.
- Variable Type: Continuous, interval data.
- Underlying Distribution: Assumes normal distribution of the population.
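As a brief sketch (with made-up data, and an import that is not shown in the original lesson), here is how a one-sample z-test can be run with statsmodels’ ztest function, including the alternative parameter that distinguishes two-tailed from one-tailed tests.
import numpy as np
import statsmodels.stats.weightstats as sm  # ztest lives in statsmodels' weightstats module

sample = np.random.normal(loc=0.3, scale=1, size=50)  # made-up sample data

# Two-tailed: is the sample mean different from 0 in either direction?
stat, p_two = sm.ztest(sample, value=0, alternative='two-sided')
# One-tailed: is the sample mean larger than 0?
stat, p_one = sm.ztest(sample, value=0, alternative='larger')
print('Two-tailed p-value:', p_two)
print('One-tailed p-value:', p_one)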
T-Test¶
A parametric test used when comparing sample means, particularly when the sample size is small and the population variance is unknown.
- One-tailed T-test: Tests if the sample mean is greater than or less than the population mean.
- Two-tailed T-test: Tests if the sample mean is different from the population mean in either direction.
- Paired T-test: Used for comparing two related samples (e.g., before and after treatment).
Factors Influencing the T-Test
- Sample Size: Small sample sizes (n < 30) often require t-tests.
- Variable Type: Continuous data.
- Underlying Distribution: Assumes normality of the data or approximately normal for larger sample sizes.
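Similarly, here is a hedged sketch of a paired t-test using scipy; the before/after measurements below are made up for illustration.
from scipy import stats
import numpy as np

# Hypothetical before/after measurements for the same 8 subjects
before = np.array([5.1, 4.8, 6.0, 5.5, 4.9, 5.7, 5.2, 5.8])
after = np.array([5.6, 5.0, 6.3, 5.9, 5.1, 6.0, 5.5, 6.1])

# Paired t-test: did the measurements change systematically after treatment?
tstat, pvalue = stats.ttest_rel(before, after)
print('p-value:', pvalue)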
Analysis of Variance (ANOVA)¶
ANOVA tests for significant differences between group means.
- One-way ANOVA: Tests differences between means of three or more independent groups based on one independent variable.
- Two-way ANOVA: Analyzes the impact of two independent variables simultaneously on one dependent variable.
- Multi-way ANOVA: Tests the interaction effects between multiple factors.
Factors Influencing ANOVA
- Sample Size: Larger sample sizes help reduce Type I and Type II errors.
- Variable Type: Independent variable(s) must be categorical (nominal or ordinal); dependent variable is continuous.
- Underlying Distribution: Assumes normal distribution within groups and homogeneity of variances (similar variance across groups).
Mann-Whitney U Test¶
A non-parametric test for comparing differences between two independent groups. It’s used when the assumptions of the t-test (normality, homogeneity of variance) are not met.
Factors Influencing the Mann-Whitney U Test
- Sample Size: Can be used with small sample sizes.
- Variable Type: Ordinal or continuous variables.
- Underlying Distribution: No assumptions about the distribution of the data.
Kruskal-Wallis Test¶
A non-parametric version of one-way ANOVA used for comparing more than two independent groups.
Factors Influencing the Kruskal-Wallis Test
- Sample Size: Suitable for small sample sizes.
- Variable Type: Ordinal or continuous variables.
- Underlying Distribution: No assumptions about the distribution of the data.
Chi-Squared Test¶
Used for categorical data to test relationships or goodness of fit.
- Chi-Squared Independence Test: Tests if two categorical variables are independent.
- Chi-Squared Goodness of Fit Test: Tests how well observed data fit an expected distribution.
Factors Influencing the Chi-Squared Test
- Sample Size: Requires sufficiently large sample sizes (expected frequency of at least 5 in each cell).
- Variable Type: Nominal (categorical) variables.
- Underlying Distribution: No assumptions about the distribution of the data, but they must be categorical and (assumed to be) independent.
Coding
We won’t talk much about how to code these other types of tests, but documentation for many of them can be found in the scipy or statsmodels libraries (a few illustrative calls are sketched below)! The important thing is that the choice of statistical test is one more tool in a fuller statistical analysis.
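For a rough sense of what those calls look like, here is a minimal sketch with made-up data; the groups and contingency table below are purely illustrative, and the point is only where these tests live in scipy.
from scipy import stats
import numpy as np

group_a = np.array([2.1, 2.5, 1.9, 2.8, 2.2, 2.6])  # made-up measurements
group_b = np.array([3.0, 2.9, 3.4, 2.7, 3.1, 3.3])
group_c = np.array([2.4, 2.8, 2.6, 2.9, 2.5, 2.7])

# Mann-Whitney U test: two independent groups, no normality assumption
stat, p = stats.mannwhitneyu(group_a, group_b)
print('Mann-Whitney U p-value:', p)

# Kruskal-Wallis test: three or more independent groups, non-parametric
stat, p = stats.kruskal(group_a, group_b, group_c)
print('Kruskal-Wallis p-value:', p)

# One-way ANOVA: three or more independent groups, assumes normality
stat, p = stats.f_oneway(group_a, group_b, group_c)
print('One-way ANOVA p-value:', p)

# Chi-squared independence test on a small contingency table
table = np.array([[20, 15],
                  [30, 35]])
chi2, p, dof, expected = stats.chi2_contingency(table)
print('Chi-squared p-value:', p)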
Case Study: Drosophila¶
Sometimes, it’s hard to conceptualize how these statistical tests and hypotheses play out in practice. Let’s look at a specific case study with fruit flies (Drosophila). In this scenario, we are testing the ability of different genes in fruit flies to inhibit the spread of influenza. The goal is to find a gene that is effective at doing so.
The experiment will consist of isolating a single gene. We can manipulate this gene in a small group of flies (~30 flies) by turning it off and trying to observe a change in behavior. If this gene has an effect on inhibiting influenza, we should observe a difference by toggling this gene!
The way we measure the gene’s ability to inhibit the spread of the disease is to use a special marker on the disease that is luminescent. The experiment setup looks like the following:
- Breed a group of fruit flies that have the one gene of interest “off”.
- Infect the sample with the luminescent virus.
- After some time, measure how bright each fly in the sample is. A brighter sample means the disease spread more, while a lower brightness means less spread.
If the gene was effective in stopping the spread of the disease, turning it off would result in a brighter sample on average (since that gene was off, it was not able to prevent the spread of the disease).
The results are normalized so a value of 0 means the brightness is the same as a sample of flies with no gene editing. A positive value means a brighter sample, while a negative value means a dimmer sample (both compared to the “standard” case). Let’s start with a very rudimentary exploratory data analysis.
# Every point in our dataset here describes a normalized brightness value
# There are 30 data points (so 30 flies in our sample)
data = np.array([1.5, 0.86, 0.34, 0.5, 0.6, -0.56,
0.13, 0.93, 0.56, 0.07, -1.02, 1.89, -0.82, 0.92,
0.32, 0.07, 0.55, 1.05, -1.19, 0.62, -0.19, 0.98,
-0.33, 0.36, 0.41, 0.02, -0.45, 1.37, 0.18, -0.57])
# It's useful to print out some summary statistics
our_std = data.std() # used later
print('mean:', data.mean())
print('std:', our_std)
# We can also plot on a histogram
sns.distplot(data)
Testing Drosophila¶
In this context, we want to determine whether our sample of 30 flies is significantly different from a normal sample of flies. In order to figure out how different our sample is from what we’d get from a normal sample, we need to establish a null and alternative hypothesis.
In this case, let’s say that our null hypothesis (or $H_0$) is that there is no difference between our sample and the normal sample. Our alternative hypothesis (or $H_A$) is that there is some difference.
How do we know whether we can reject the null hypothesis? We can do this by calculating a p-value through a z-test.
Food for thought: What conditions about the experimental setup might have led us to conclude that we needed to use a z-test?
import statsmodels.stats.weightstats as sm  # provides ztest; this import is not shown in the original cell
# value=0 sets the null hypothesis: the true mean brightness is 0 (no difference from unedited flies)
teststat, pvalue = sm.ztest(data, value=0)
print('Test Statistic:', teststat)
print('p-value:', pvalue)
The p-value essentially tells us the probability of observing our data if the null hypothesis is true. This is why it’s really important to specify your null hypothesis before doing any p-value calculations!
Food for thought: Based on this p-value, would you reject or fail to reject the null hypothesis?
Multiple Trials¶
In the above case study, we tested a sample of 30 flies from a larger population of flies. While our p-value allowed us to reject our null hypothesis, we could imagine a world where we happened to sample 30 special flies. This is why it is common to run multiple trials when testing hypotheses, so we can be more confident in our results!
In this cell, we are generating “normal” looking datasets that could model the data we observed. The np.random.normal method allows us to generate a fake dataset which we are assuming is another sample of 30 flies from the same population.
We then run a ztest against this generated data and we plot this in a histogram.
sns.distplot(data, color="blue")  # plot the original sample in blue for comparison
for i in range(20):
    # Generate a fake sample of 30 flies, assuming the population mean is 0.5
    data = np.random.normal(loc=0.5, scale=our_std, size=30)
    print(data.std())  # in the real world, we would have actual measurements!
    # Test against the null hypothesis that the mean is 0
    teststat, pvalue = sm.ztest(data, value=0)
    print('Test Statistic:', teststat)
    print('p-value:', pvalue)
    sns.distplot(data)
Multiple Hypothesis Testing¶
So far, we have only been testing one gene in our experiment. For the following examples, we are going to consider a whopping 13,000 genes that the fruit fly has. Our goal is to test each gene for its effectiveness against influenza. We’ll forgo testing each gene with multiple trials for the sake of this example.
In this cell, we loop over 13000 different experiments. For each experiment, we generate some random normal data similar to before. However, this time, our data is generated from the “true” distribution, i.e. our population is the boring, old world. This essentially assumes that every single sample we are taking is pulled from a normal looking population with no special reaction to the manipulation of the gene.
import pandas as pd  # used below to organize results; this import is not shown in the original cell
GENES = 13000
results = []
for i in range(GENES):
    # Generate 30 samples from the "true" distribution (the null hypothesis is actually true)
    data = np.random.normal(loc=0, scale=1, size=30)
    # Run a hypothesis test against the null hypothesis of a mean-0 normal
    stat, pvalue = sm.ztest(data, value=0)
    results.append((stat, pvalue))
df = pd.DataFrame(results, columns=['TestStat', 'PValue'])
df
Now that we have a dataframe of 13000 different experiments, we can try to filter for the experiments which did have significant results.
num_significant = len(df[df['PValue'] < 0.05])
print(f'{num_significant} significant results ({100 * num_significant / len(df):.2f}%)')
What Happened?¶
Huh, even though our samples were pulled from the boring, old world, we still came up with a large number of genes which appear significant in their effect on inhibiting influenza. These are known as false discoveries, and in this case, we have several hundred of them!
This is because of the sheer fact that we are rolling the dice 13000 times! Even if our genes have no effect on our population, sampling from the same normal population over and over means we will eventually come up with a sample from the tail end of the bell curve. While in our case we reached these conclusions by accident, we might imagine this being done nefariously to manufacture a favored conclusion.
p-hacking¶
Because this process is random (your experiment has randomness in it), you might have noticed there is a way to “beat science” with this approach. If you run your experiment and don’t get a significant result, you can always just repeatedly run it again until you find a trial that has a significant outcome by chance! This is entirely possible since you could just get “lucky” and find experimental results that look significant (low p-value) even if your null hypothesis is true!
This process is called p-hacking and is a HUGE problem in science. While it is only rarely done intentionally as I outlined, there are lots of subtle ways scientists can do this accidentally, which increases the number of false discoveries. A particularly rough example comes from the field of nutritional health: you hear almost every day about a new study that says something like
If you want to live longer and healthier, you should/shouldn’t drink coffee, or should/shouldn’t eat chocolate, or should/shouldn’t do this new diet, etc.
I would guess almost all of these are statistically significant results (usually a minimum bar for publishing) but they are likely the result of p-hacking (most likely unintentionally).
Unintentional p-hacking¶
How is it possible to p-hack unintentionally? Think back to our fruit fly gene example. It’s totally possible for us to set up each experiment (remember, one experiment for each gene) correctly and promise that we will only run each experiment once to avoid malicious p-hacking. The problem comes from the fact that we are running 13,000 experiments, which makes it almost a certainty that we will make false discoveries!
Remember, when using a p-value threshold of 0.05, we are accepting a 5% chance of making a false discovery if the null hypothesis is true. Assuming the majority of those 13,000 genes are not relevant to the spread of influenza (meaning the null hypothesis is true for those genes), that’s a lot of genes where we can possibly make this mistake.
How many false discoveries do you think we’ll make? It turns out we should expect around $0.05 \cdot 13000 = 650$ false discoveries! While each individual experiment is unlikely to be a false discovery, if we union over all experiments where the null hypothesis is true, we will certainly be making false discoveries.
This problem is called multiple-hypothesis testing and is a frequent problem in the world of big data where you can easily run hundreds if not thousands of experiments on a large dataset very quickly. This gives the power to find false discoveries at an alarming rate and the scientific community is still trying to figure out how to best progress given this reality.
Correcting Multiple Hypothesis: The Bonferroni Correction¶
Thankfully, there is a lot of research that has gone into this problem to help us more confidently test these hypotheses while reducing our false discovery rate. There are lots of approaches to solving this problem, but we will just highlight one popular choice due to its simplicity called the Bonferroni correction.
The idea of the Bonferroni correction is simply to use a stricter threshold to account for the fact you are doing many tests. Assume you are doing n hypothesis tests at the 0.05 threshold. The Bonferroni correction says that instead of using a threshold of 0.05 for each individual test, you need to use a threshold of 0.05 / n.
We can try out our new threshold which now comes up with a much lower number of significant p-values.
num_significant = len(df[df['PValue'] < 0.05 / 13000])
print(f'{num_significant} significant results ({100 * num_significant / len(df):.2f}%)')
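As a side note (a hedged sketch, not part of the original lesson), statsmodels provides a helper that applies the Bonferroni correction, among other methods, to a whole array of p-values at once:
from statsmodels.stats.multitest import multipletests

# reject is a boolean array: True where the corrected test is significant
reject, corrected_pvalues, _, _ = multipletests(df['PValue'], alpha=0.05, method='bonferroni')
print('Significant after Bonferroni:', reject.sum())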
A related issue is HARKing, or Hypothesizing After the Results are Known. Essentially, HARKing refers to the practice of formulating hypotheses based on the results of an analysis after seeing the data, rather than hypothesizing beforehand. This can lead to misleading conclusions because the hypothesis is shaped by the outcome of the analysis, rather than being a genuine, theory-driven prediction. Make sure to preregister your hypotheses and determine your thresholds beforehand!
⏸️ Pause and 🧠 Think¶
Take a moment to review the following concepts and reflect on your own understanding. A good temperature check for your understanding is asking yourself whether you might be able to explain these concepts to a friend outside of this class.
Here’s what we covered in this lesson:
- Defining statistics
- Data distributions
- Normal distribution
- Uniform distribution
- Hypothesis testing
- Defining hypotheses
- Interpreting p-values
- Types of tests
- Multiple hypothesis testing
- HARKing
- p-hacking
- Corrections
Here are some other guiding exercises and questions to help you reflect on what you’ve seen so far:
- In your own words, write a few sentences summarizing what you learned in this lesson.
- What did you find challenging in this lesson? Come up with some questions you might ask your peers or the course staff to help you better understand that concept.
- What was familiar about what you saw in this lesson? How might you relate it to things you have learned before?
- Throughout the lesson, there were a few Food for thought questions. Try exploring one or more of them and see what you find.
In-Class¶
When you come to class, we will work together on answering the conceptual questions here (which are also in your Canvas Quiz)!
Statistics Concepts¶
- What are the mean and standard deviation for the data described below (rounded to the nearest hundredth)?
import numpy as np
data = np.array([137, 134, 205, 19, 25, 26, 831, 108, 42.6])
- (True / False) Increasing the number of random samples that we take from a normal distribution makes the sampling distribution look closer to the bell curve.
- (True / False) Increasing the size of random samples that we take from a normal distribution makes the sample distribution look closer to the bell curve.
- Suppose that we have an experiment where we are testing 100 hypotheses simultaneously. If our initial threshold was 0.03, what is the new threshold we will need to use for each individual test after implementing the Bonferroni correction?
Research Scenarios¶
For each of these scenarios, determine which of the following issues is relevant to the setup:
- p-hacking
- Testing multiple hypotheses without correction
- HARKing
In class, we’ll discuss what we might do instead!
- A team of researchers is investigating whether a new drug improves cognitive function in elderly patients. They have a predefined hypothesis that the drug will increase attention span, based on prior studies showing that similar compounds can help with focus. After running the clinical trials and analyzing the results, the researchers find instead that the drug has no effect on attention span but does show statistically significant differences in weight loss. When publishing their findings, the team reports that their hypothesis was that the drug would help with weight loss, and they don’t mention attention span at all.
- A climate scientist is running simulations for tornado paths based on various environmental factors such as temperature, humidity, wind speed, location, time of year, and others. They want to explore the effect of each of these factors on tornado paths, so they test all the variables (and interactions between the variables) at once. They also run multiple simulations for each variable and interaction. At first, they find that all the variables in their model have a significant effect on the tornado path except for the location. The scientist runs the models several more times until they find a significant effect from the location. When they publish their findings, they write that all variables in their model have a significant effect on the tornado path.
Canvas Quiz¶
All done with the lesson? Complete the Canvas Quiz linked here!