The p-Value Crisis: Why Most Published Research May Be Wrong

The 0.05 significance threshold has governed scientific publishing for 80 years. It was never meant for that purpose — and the resulting crisis is reshaping statistics.

The Probability Lab Team
July 22, 2025

In 2005, epidemiologist John Ioannidis (now at Stanford) published a paper titled "Why Most Published Research Findings Are False." It became one of the most downloaded papers in the history of PLOS Medicine. Its argument was statistical, not anecdotal: under typical conditions of scientific publishing, the majority of published findings with p < 0.05 are expected to be false positives.

Understanding why requires understanding what a p-value actually measures — and what it does not.

The definition

The p-value is the probability of obtaining data as extreme as or more extreme than what was observed, assuming the null hypothesis is true. It is written formally as:

p-Value Definition
p = P(data at least as extreme as observed | H₀ is true)

What p-value IS:
  - Probability of data at least as extreme as observed, given no effect exists

What p-value IS NOT:
  - Probability the null hypothesis is true
  - Probability the result is due to chance
  - Probability that your conclusion is correct
  - Measure of the effect's importance or size
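To make the definition concrete, here is a minimal sketch of an exact one-sided p-value for a coin-flip experiment. The scenario (60 heads in 100 flips, with a fair coin as the null hypothesis) and the function name `binomial_tail_p_value` are illustrative assumptions, not taken from any study discussed here.

```python
from math import comb


def binomial_tail_p_value(successes: int, trials: int, null_prob: float = 0.5) -> float:
    """One-sided p-value: P(X >= successes) when X ~ Binomial(trials, null_prob).

    This is the probability of data at least as extreme as what was
    observed, computed under the assumption that the null hypothesis is true.
    """
    return sum(
        comb(trials, k) * null_prob**k * (1 - null_prob) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical experiment: 60 heads in 100 flips, H0 = the coin is fair.
p = binomial_tail_p_value(60, 100)
print(f"p = {p:.4f}")
```

The result is roughly 0.028. That number says how often a fair coin would produce 60 or more heads in 100 flips. It says nothing on its own about how likely the coin is to be fair.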

Where the 0.05 threshold came from

Ronald Fisher, in his 1925 book Statistical Methods for Research Workers, suggested that a probability of 1 in 20 (0.05) was a "convenient" level for declaring an effect worthy of investigation. He explicitly intended it as a rough heuristic for individual experiments, not as a universal publication threshold.

Fisher later wrote that a scientific fact should be demonstrated by repetition under varied conditions — not validated by a single p < 0.05 result. The binary publication criterion he inspired was not what he advocated.

The base rate problem

Suppose 10% of tested hypotheses reflect real effects (a reasonable estimate in early-stage research). For every 100 studies run with 80% power and α = 0.05, the expected outcomes are:

                                 H₀ true (90 studies)    H₀ false (10 studies)
Significant result (p < 0.05)    4.5 false positives     8 true positives
Non-significant result           85.5 true negatives     2 missed effects

Among the 12.5 significant results, 4.5 are false positives: a false discovery rate of 36%. If the prior probability of a true effect is lower (5%), the false discovery rate rises above 50%.
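The arithmetic above can be sketched in a few lines of Python. The function name `false_discovery_rate` and its parameter names are our own choices for illustration; the inputs are the prior, power, and α used in the example.

```python
def false_discovery_rate(prior: float, power: float = 0.80, alpha: float = 0.05) -> float:
    """Expected share of significant results that are false positives.

    prior: fraction of tested hypotheses where a real effect exists
    power: P(significant | real effect)
    alpha: P(significant | no effect)
    """
    true_positives = prior * power          # real effects that are detected
    false_positives = (1 - prior) * alpha   # null effects that cross the threshold
    return false_positives / (true_positives + false_positives)


for prior in (0.10, 0.05):
    fdr = false_discovery_rate(prior)
    print(f"prior = {prior:.0%}: false discovery rate = {fdr:.1%}")
```

With a 10% prior this reproduces the 36% figure, and with a 5% prior it gives about 54%. Lowering the prior or the power pushes the rate higher, while tightening α pulls it down.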

The multiple comparisons problem

Family-Wise Error Rate
If you run k independent tests at α = 0.05:
  P(at least one false positive) = 1 − (1 − 0.05)ᵏ

k = 1:   5.0% false positive probability
k = 10:  40.1%
k = 20:  64.2%
k = 50:  92.3%

The Bonferroni correction: use α/k per test to keep the family-wise error rate at or below α.
With k = 20 tests: significance threshold = 0.05/20 = 0.0025
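The figures above are easy to check directly. The following sketch assumes independent tests, as the formula does; the function names `family_wise_error_rate` and `bonferroni_threshold` are illustrative.

```python
def family_wise_error_rate(k: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k


def bonferroni_threshold(k: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / k


for k in (1, 10, 20, 50):
    print(f"k = {k:>2}: {family_wise_error_rate(k):.1%}")

# With the corrected per-test threshold, the family-wise rate stays under 0.05.
k = 20
corrected = bonferroni_threshold(k)
print(f"Bonferroni threshold for k = {k}: {corrected}")
print(f"Family-wise rate after correction: {family_wise_error_rate(k, corrected):.4f}")
```

Bonferroni is conservative: after correction the family-wise rate sits slightly below 0.05 rather than exactly at it, and the price is reduced power for each individual test.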

What is changing

The American Statistical Association issued a formal statement in 2016 warning against relying on p-value thresholds for publication decisions. Some journals, notably Basic and Applied Social Psychology, have banned p-values entirely. Many journals and fields now encourage or require pre-registration of hypotheses, larger samples, effect size reporting (Cohen's d, η², r), and independent replication.

The p-value is not useless; it is one valid signal among several. The crisis arose from treating it as the sole arbiter of truth. A small p-value tells you the data would be surprising if the null hypothesis were true. It does not tell you the effect is real, large, reproducible, or important.