Imagine you just developed a new strategy for coin flipping. Right before you release the coin, you blink 4 times. You want to test if this strategy will result in more heads than tails, so you set up a test where you flip a coin 20 times. You observe 13 heads and 7 tails. Your strategy resulted in heads 65% of the time! You rush out to start gambling on coin flips.
You probably already realized the error in our analysis. Even if our new strategy had zero impact, we would still expect to flip exactly 13 heads about 7.4% of the time, and 13 or more heads about 13% of the time. This raises an important question: how do we know whether our results are the product of our strategy or of randomness? We can’t know for certain, but we can set up our experiment so that the odds of getting the result by chance are small enough that we are confident the effect comes from our strategy.
For example, we can flip the coin 1000 times instead of 20. The probability of getting exactly 65% heads by chance is now about 3.75*10^-22. So if we flipped 650 heads, maybe it is time to start gambling.
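The coin-flip numbers above can be checked with a few lines of Python. This is a minimal sketch that assumes a perfectly fair coin and independent flips; the function names are made up for illustration.

```python
from math import comb


def prob_exact_heads(heads: int, flips: int) -> float:
    """Probability of exactly `heads` heads in `flips` fair coin flips."""
    # comb(flips, heads) counts the sequences with that many heads;
    # each of the 2**flips sequences is equally likely for a fair coin.
    return comb(flips, heads) / 2 ** flips


def prob_at_least_heads(heads: int, flips: int) -> float:
    """Probability of `heads` or more heads in `flips` fair coin flips."""
    return sum(comb(flips, k) for k in range(heads, flips + 1)) / 2 ** flips


print(f"exactly 13 of 20:     {prob_exact_heads(13, 20):.4f}")
print(f"13 or more of 20:     {prob_at_least_heads(13, 20):.4f}")
print(f"exactly 650 of 1000:  {prob_exact_heads(650, 1000):.2e}")
```

With 20 flips, a 65% heads rate is an unremarkable outcome for a fair coin. With 1000 flips, the same rate has a probability on the order of 10^-22.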
This idea of separating chance from our results is known as statistical significance, and it is the topic of today’s post.
Significance of significance
Statistical significance is one of the most misunderstood ideas in statistics because its technical meaning is so different from the everyday one. In common parlance, significance is synonymous with “stuff that matters.” So we might assume that a statistically significant result is synonymous with “a result that matters.”
In reality, a statistically significant result can matter very little. For example, a significant result that changes our chance of getting a certain type of cancer from 0.002% to 0.001% probably doesn’t matter much.
In statistics, significance just means distinguishable from random chance. As we saw in the coin flipping example, our first result wasn’t distinguishable from random chance, so it was not statistically significant.
In 1925, Ronald Fisher published Statistical Methods for Research Workers. In it, he popularized the idea of using 5% as the cutoff for a statistically significant result. This means that, if there were no real effect, there would be only a 1 in 20 chance of seeing a result at least this extreme through random chance. Right off the bat, this should tell us that about 5 in every 100 tests of an effect that doesn’t exist will still come out significant. In reality, the share of published results that are wrong is far worse.
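To make the 5% cutoff concrete, here is a rough sketch that applies it to the 20-flip experiment from the opening. It assumes a fair coin as the "no effect" baseline and a one-sided test (we only care about extra heads). The post doesn’t specify either, and the function names are hypothetical.

```python
from math import comb

ALPHA = 0.05  # the conventional 5% cutoff


def one_sided_p_value(heads: int, flips: int) -> float:
    """Chance of seeing `heads` or more heads from a fair coin."""
    return sum(comb(flips, k) for k in range(heads, flips + 1)) / 2 ** flips


def min_heads_for_significance(flips: int, alpha: float = ALPHA):
    """Smallest head count whose one-sided p-value falls below `alpha`."""
    for heads in range(flips + 1):
        if one_sided_p_value(heads, flips) < alpha:
            return heads
    return None  # no head count is extreme enough for this many flips


print(one_sided_p_value(13, 20) < ALPHA)   # is 13 of 20 significant?
print(min_heads_for_significance(20))      # head count needed with 20 flips
```

Under these assumptions, 13 heads out of 20 does not clear the cutoff; we would have needed 15.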
Fooled by randomness
To understand the issue, let’s use a great example from a Veritasium video on this subject. Imagine there are 100 hypotheses being tested in a field. 10 of them are true relationships and 90 are false. We already know that there is a 5% chance of wrongly classifying each false hypothesis as true, so we expect about 4.5 false positives (90 * 0.05). But for the 10 true hypotheses, we also have the chance of getting a false negative. If our studies catch a real effect 80% of the time, we will only find 8 of the true hypotheses. The rest of the results are correctly classified as true negatives, but these results rarely get published. They make up only about 20% of the papers in a field.
So, in our body of published research, we will have 8 true positives, about 4.5 false positives, and about 3 true negatives. This means that roughly 30% of our published results, and more than a third of the published positive findings, will be wrong.
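The arithmetic behind this example is easy to reproduce. The sketch below keeps the unrounded 4.5 false positives and assumes an 80% detection rate for true effects (implied by finding 8 of 10) and 3 published negative results, all of them true negatives. The function and parameter names are made up for illustration.

```python
def published_error_rates(
    n_hypotheses: int = 100,
    n_true: int = 10,
    alpha: float = 0.05,              # false positive rate per false hypothesis
    power: float = 0.80,              # assumed chance of detecting a true effect
    published_negatives: float = 3,   # negative results that reach print
):
    """Return expected counts and the share of published results that are wrong."""
    n_false = n_hypotheses - n_true
    true_positives = n_true * power
    false_positives = n_false * alpha
    positives = true_positives + false_positives
    published = positives + published_negatives
    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "wrong_share_of_positives": false_positives / positives,
        "wrong_share_of_published": false_positives / published,
    }


for name, value in published_error_rates().items():
    print(f"{name}: {value:.3f}")
```

Changing `n_true` shows how much the error rate depends on how many tested hypotheses are true to begin with: the rarer real effects are, the larger the share of positive findings that are false.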
Starting in 2011, a collaboration of 270 researchers attempted to replicate 100 published psychology studies. This was called the Reproducibility Project, and they found that only 39 of the studies were judged to have replicated, while only about 36% of the replications showed statistically significant results.
In 2015, John Bohannon published a study claiming that eating chocolate helps people lose weight. It was quickly picked up in the media and could be found in headlines around the world. The problem was that the whole thing was a hoax. NPR has a great article on the subject. Bohannon purposely devised the study to give a false result. He used a small sample, like our coin flips, and measured multiple factors, hoping to randomly get a hit on one. The hit was weight loss.
The flaws in the study design were obvious, but it was still published and shared. Bohannon wanted to show that the media cared more about clicks and views than about actually reviewing the work.
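Measuring many factors works as a trick because every extra measurement is another 5% chance at a false hit. Here is a back-of-the-envelope sketch, assuming the measurements are independent and that none of them has a real effect. The outcome counts are hypothetical and are not taken from Bohannon’s study.

```python
def chance_of_false_hit(n_outcomes: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `n_outcomes` null measurements is 'significant'."""
    # The chance that every measurement stays below the cutoff is
    # (1 - alpha) ** n_outcomes; one minus that is the chance of a hit.
    return 1 - (1 - alpha) ** n_outcomes


for n in (1, 5, 10, 20):
    print(f"{n:>2} outcomes measured: {chance_of_false_hit(n):.0%} chance of a false hit")
```

Under these assumptions, measuring 20 unrelated outcomes gives better than even odds of finding something "significant" in pure noise.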
As we’ve seen, even well-intentioned researchers are going to get it wrong about 30% of the time. When biases, p-hacking, and publishing incentives are added to the equation, the numbers get even worse.
Evan Esar, an American humorist, once described statistics as:
The only science that enables different experts using the same figures to draw different conclusions.
It is important to remember that all data needs to be interpreted. In our opening example, we might have found that blinking 4 times has a statistically significant effect on head outcomes, but we should still be able to reason that blinking 4 times isn’t really doing anything. In the same way, we should be able to reason that eating chocolate probably isn’t the key variable for losing weight. If you’re still not convinced, I have a rock that keeps tigers away for sale.
I’ll close with this: despite all the issues discussed with statistical significance, it is still the best way we have of knowing. We may get it wrong 30% of the time, but imagine how many times we would get it wrong without it.