The p-Value Crisis: Why Most Published Research May Be Wrong
The 0.05 significance threshold has governed scientific publishing for 80 years. It was never meant for that purpose — and the resulting crisis is reshaping statistics.
In 2005, Stanford epidemiologist John Ioannidis published a paper titled "Why Most Published Research Findings Are False." It became one of the most downloaded papers in the history of PLOS Medicine. Its argument was statistical, not anecdotal: under typical conditions of scientific publishing, the majority of published findings with p < 0.05 are expected to be false positives.
Understanding why requires understanding what a p-value actually measures — and what it does not.
The definition
The p-value is the probability of obtaining data as extreme as or more extreme than what was observed, assuming the null hypothesis is true. It is written formally as:
p = P(data as extreme as observed | H₀ is true)

What p-value IS:
- Probability of the data, given no effect exists

What p-value IS NOT:
- Probability the null hypothesis is true
- Probability the result is due to chance
- Probability that your conclusion is correct
- Measure of the effect's importance or size
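To make the definition concrete, here is a minimal sketch that computes an exact p-value for a coin-flip experiment. The scenario (60 heads in 100 flips, null hypothesis of a fair coin) and the function name `binomial_two_sided_p` are illustrative assumptions, not taken from any study discussed here.

```python
from math import comb


def binomial_two_sided_p(heads: int, flips: int) -> float:
    """Exact two-sided p-value for H0: the coin is fair (P(heads) = 0.5)."""
    expected = flips / 2
    observed_gap = abs(heads - expected)
    # Count every outcome at least as far from the expected count as the
    # observed one; under H0 each sequence of flips is equally likely.
    extreme_sequences = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - expected) >= observed_gap
    )
    return extreme_sequences / 2 ** flips


# Hypothetical experiment: 60 heads in 100 flips.
p = binomial_two_sided_p(60, 100)
print(f"p = {p:.4f}")
```

The result is about 0.057: if the coin were fair, a split at least this lopsided would occur in roughly 6% of experiments. That number says nothing about the probability that the coin is actually fair.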
Where the 0.05 threshold came from
Ronald Fisher, in his 1925 book Statistical Methods for Research Workers, suggested that a probability of 1 in 20 (0.05) was a "convenient" level for declaring an effect worthy of investigation. He explicitly intended it as a rough heuristic for individual experiments, not as a universal publication threshold.
The base rate problem
Suppose 10% of tested hypotheses reflect a real effect (a reasonable estimate in early-stage research). Out of 100 studies, each run with 80% power and α = 0.05, the expected outcomes are:
| | H₀ true (90 studies) | H₀ false (10 studies) |
|---|---|---|
| Significant result (p < 0.05) | 4.5 false positives | 8 true positives |
| Non-significant result | 85.5 true negatives | 2 missed effects |
Among the 12.5 significant results, 4.5 are false positives, a false discovery rate of 36%. If the prior probability of a true effect is lower (5%), the same arithmetic gives 4.75 false positives against 4 true positives, and the false discovery rate rises above 50%.
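The arithmetic in the table can be reproduced with a short sketch. The function name `false_discovery_rate` and its parameter names are assumptions chosen for illustration; the inputs are the prior, power, and α values used above.

```python
def false_discovery_rate(prior: float, power: float, alpha: float) -> float:
    """Expected share of significant results that are false positives."""
    true_positives = prior * power          # real effects that reach significance
    false_positives = (1 - prior) * alpha   # null effects that reach significance anyway
    return false_positives / (true_positives + false_positives)


for prior in (0.10, 0.05):
    fdr = false_discovery_rate(prior, power=0.80, alpha=0.05)
    print(f"prior = {prior:.0%}: false discovery rate = {fdr:.1%}")
```

Varying `prior` and `power` shows how quickly the false discovery rate grows when true effects are rare or studies are underpowered, even though α stays fixed at 0.05.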
The multiple comparisons problem
If you run k independent tests at α = 0.05:

P(at least one false positive) = 1 − (1 − 0.05)ᵏ

- k = 1: 5.0% false positive probability
- k = 10: 40.1%
- k = 20: 64.2%
- k = 50: 92.3%

The Bonferroni correction: use α/k per test to maintain overall α. With k = 20 tests: significance threshold = 0.05/20 = 0.0025
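A small sketch, assuming fully independent tests as in the formula above, reproduces these figures and shows what the Bonferroni threshold does to the overall error rate. The function names `familywise_error_rate` and `bonferroni_threshold` are illustrative.

```python
def familywise_error_rate(k: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across k independent tests."""
    return 1 - (1 - alpha) ** k


def bonferroni_threshold(k: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the overall error rate at or below alpha."""
    return alpha / k


for k in (1, 10, 20, 50):
    uncorrected = familywise_error_rate(k)
    per_test = bonferroni_threshold(k)
    # Error rate across all k tests once each one uses the stricter threshold
    corrected = familywise_error_rate(k, alpha=per_test)
    print(f"k = {k:>2}: uncorrected {uncorrected:.1%}, "
          f"per-test threshold {per_test:.4f}, corrected {corrected:.1%}")
```

With the corrected threshold, the chance of at least one false positive stays at or just under 5% for every k, at the cost of making each individual test harder to pass.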
What is changing
The American Statistical Association issued a formal statement in 2016 warning against relying on p-value thresholds for publication decisions. Some journals, notably Basic and Applied Social Psychology, have banned p-values entirely. Many fields now encourage or require pre-registration of hypotheses, larger samples, effect size reporting (Cohen's d, η², r), and replication before publication.
The p-value is not useless; it is one valid signal among several. The crisis arose from treating it as the sole arbiter of truth. A small p-value tells you the data would be surprising if the null hypothesis were true. It does not tell you the effect is real, large, reproducible, or important.