McShane et al on Why We Should Abandon Statistical Significance

“The difference between ‘significant’ and ‘not significant’ is not itself statistically significant.” Gelman and Stern (2006).

Taking this premise as the core of their critique of the way much of modern science publishing is conducted, McShane, Gal, Gelman, Robert, and Tackett recently published a short paper called “Abandon Statistical Significance,” arguably the most important paper published this year.

This is a paper that implicates all science publishing, as well as government testing, popular science writing, and perhaps most critically, what we mean when we say something is meaningful in a scientific sense.

The crux of the paper is a deep question: how do we know when the results of a study are meaningful, as opposed to pure noise? Statisticians have traditionally answered it by testing a null hypothesis (that there is no difference between two populations) against an alternative hypothesis (that there is a meaningful difference between two populations).

As an example, if you wanted to test whether people born on weekdays live longer than people born on weekends, you’d collect actuarial data on when people were born and when they died, and then measure whether there was a difference in lifespan between the two populations (weekday births and weekend births).

We wouldn’t expect there to be much difference between these two populations. Especially if we collected data on large numbers of people, we’d expect the data to show little or no difference, and any difference we did find would likely be nothing more than random noise (in statistical terms, we would expect a large p-value, meaning the observed difference is entirely consistent with chance).
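
To make the example concrete, here is a minimal simulation sketch in Python (the article contains no code, so numpy, scipy, and all the numbers below are assumptions for illustration). Both groups’ lifespans are drawn from the same distribution, so the null hypothesis is true by construction, and any difference the t-test detects is pure sampling noise.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Toy model: lifespans for BOTH groups drawn from the same distribution,
# i.e., the day of the week you were born on truly has no effect
# (the null hypothesis is true by construction).
weekday_lifespans = rng.normal(loc=79.0, scale=10.0, size=50_000)
weekend_lifespans = rng.normal(loc=79.0, scale=10.0, size=20_000)

# Two-sample t-test: is the observed difference in mean lifespan larger
# than we'd expect from sampling noise alone?
t_stat, p_value = stats.ttest_ind(weekday_lifespans, weekend_lifespans)

print(f"difference in means: "
      f"{weekday_lifespans.mean() - weekend_lifespans.mean():+.3f} years")
print(f"p-value: {p_value:.3f}")
# With no real effect, a small p-value arises only by chance: across
# repeated runs of this simulation, p < .05 happens about 5% of the time.
```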

This is super weak evidence for rejecting the null hypothesis. In other words, there’s probably no connection between the day of the week you were born on and how long you’re going to live.

This part isn’t what’s controversial. What’s more controversial is what happens at the other end of the spectrum, because it’s hard to know the exact moment – the threshold – when we can say with confidence that the results of a study are meaningful.

This is what McShane & Company emphasize in “Abandon Statistical Significance”: there is no magic point at which a study becomes meaningful. Rather, you have to look at all the circumstances surrounding the study to know what it means. A single statistical analysis cannot establish that something is significant, because other factors matter. Those “other factors” include “prior and related evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain.”

Traditionally, scientists have rather arbitrarily decided that the 5% threshold (p < .05) is the mark of when something is meaningful. (Note what that threshold actually means: if there were truly no effect, data this extreme would turn up less than 1 time in 20. It is not a “1 in 20 chance of being wrong.”) But there is no scientific magic fairy dust that transforms data from meaningless to meaningful between a 6% and a 5% chance of occurring at random. And that’s what Gelman and Stern mean when they say that “[t]he difference between ‘significant’ and ‘not significant’ is not itself statistically significant.”
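
The arithmetic behind that quote is worth working through once. The sketch below uses hypothetical numbers of my choosing (not figures from either paper): Study A’s result clears the .05 line, Study B’s does not, yet a direct comparison of the two studies finds no significant difference between them.

```python
import numpy as np
from scipy import stats

def two_sided_p(z):
    """Two-sided p-value for a z-statistic."""
    return 2 * stats.norm.sf(abs(z))

# Hypothetical pair of studies (illustrative numbers, not from the paper):
# each reports an effect estimate and its standard error.
est_a, se_a = 25.0, 10.0   # Study A: z = 2.5 -> "significant"
est_b, se_b = 10.0, 10.0   # Study B: z = 1.0 -> "not significant"

print(f"Study A: z = {est_a / se_a:.2f}, p = {two_sided_p(est_a / se_a):.3f}")  # ~0.012
print(f"Study B: z = {est_b / se_b:.2f}, p = {two_sided_p(est_b / se_b):.3f}")  # ~0.317

# Is Study A's result significantly different from Study B's?
diff = est_a - est_b
se_diff = np.sqrt(se_a**2 + se_b**2)  # standard errors add in quadrature
print(f"A vs. B: z = {diff / se_diff:.2f}, p = {two_sided_p(diff / se_diff):.3f}")  # ~0.29
```

Study A is “significant” and Study B is not, yet the comparison between them is nowhere near significant. Calling one meaningful and the other meaningless is an artifact of the threshold, not of the evidence.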

There is a huge incentive for researchers to find evidence that something is meaningful. Few careers are born of papers showing that something isn’t meaningful or significant. As a result, scientists are highly motivated to hunt for statistical significance and publish those results. This bias in favor of searching for statistical significance leads to lower-quality research and more confusion about what is truly significant.

This is why we so often hear contradictory claims about which diets, foods, and lifestyles are good for us and which are bad: scientists trying to make names for themselves search for apparently meaningful results and pitch them to the press.

McShane & Company convincingly argue that scientists and science publishers should instead look more deeply into the totality of the evidence related to any given study, rather than to magic thresholds that supposedly purport to show significance.