
P-Hacking: 20 Random Strategies. One Looks Brilliant. By Luck Alone.

The math says if you test 20 strategies, one looks significant by pure chance. That one is usually what gets published.

Dominic Tschan
April 16, 2026 · 5 min read

Here's a cute statistical fact that should unsettle you.

If you test 20 random strategies on Bitcoin data, one of them will look statistically significant. Just by chance.

Not because it works. Because 20 random tries at a 5-percent threshold yield one random hit, on average.

That one hit is what the guru sells you.

What P-Hacking Is

P-hacking (from "p-value hacking") is the academic name for: test enough strategies, and by pure luck, some will look like real edge.

The statistics work against you. At the standard 5-percent significance level, 1 in every 20 random patterns will pass the test. Even if all 20 are worthless.

If someone tests 100 strategies and shows you the 5 that passed, you're looking at nothing but noise.
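
To make the 1-in-20 effect concrete, here is a minimal simulation sketch (my own illustration, not one of the BotLab backtests): 20 "strategies" whose daily returns are pure noise, a t-test on each, and a count of how many clear p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

N_STRATEGIES = 20   # number of random "strategies" tested
N_DAYS = 1500       # roughly 6 years of daily returns
ALPHA = 0.05        # the standard 5-percent significance threshold

false_positives = 0
for _ in range(N_STRATEGIES):
    # Pure noise: zero-mean daily returns with ~1% daily volatility.
    returns = rng.normal(loc=0.0, scale=0.01, size=N_DAYS)
    # One-sample t-test: "is the mean daily return different from zero?"
    _, p_value = stats.ttest_1samp(returns, popmean=0.0)
    if p_value < ALPHA:
        false_positives += 1

# Expected: N_STRATEGIES * ALPHA = 1 worthless strategy that looks "significant".
print(f"{false_positives} of {N_STRATEGIES} pure-noise strategies passed p < {ALPHA}")
```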

The Retail Version

You see Twitter threads like this:

"I tested 50 trading strategies on Bitcoin. These 3 beat HODL. [chart]"

That's p-hacking made visible.

50 strategies at a 5-percent threshold: 2.5 should look significant by pure random chance. Finding 3 is right in line with the expected noise.

None of those 3 strategies have proven anything. They might be real. They might be statistical ghosts.
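
The "expected noise" claim is a one-line binomial calculation. A quick sketch, assuming the 50 tests are independent:

```python
from scipy.stats import binom

n_tested, alpha = 50, 0.05

expected_hits = n_tested * alpha                # 2.5 false positives expected
p_three_or_more = binom.sf(2, n_tested, alpha)  # P(at least 3 random "winners")

print(expected_hits)      # 2.5
print(p_three_or_more)    # roughly 0.46: finding 3 is close to a coin flip
```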

The Academic Version — More Sinister

Researchers do this too. Differently.

A finance professor tries 10 hypotheses. Nine fail. One passes p < 0.05. Guess which one becomes the paper?

The published paper looks like a clean discovery. Behind it: 9 invisible failures.

In academic finance, Harvey, Liu and Zhu (2016) documented 316 "factors" that had passed the 5% significance bar in published papers. After correcting for multiple testing, they concluded that a large share of those factors are likely false discoveries — the 1-in-20 random hits, dressed up as science. Marcos López de Prado proved the related False Strategy Theorem, which quantifies the same effect from the strategy-selection side.

This is why so many "proven" factors from 2010 papers stopped working after 2015. They weren't real. They got p-hacked into publication, then disappeared when tested on new data.
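
The standard repair Harvey, Liu and Zhu argue for is a much stricter per-test significance bar once many tests are in play. A minimal sketch of the two classic corrections (Bonferroni and Šidák), assuming independent tests; their actual method is more sophisticated and also handles correlated factors:

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value bar so the chance of ANY false positive stays below alpha."""
    return alpha / n_tests

def sidak_threshold(alpha: float, n_tests: int) -> float:
    """Slightly less conservative bar, exact under independence."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)

# With 316 factor tests, the naive 5-percent bar shrinks by a factor of ~316.
print(bonferroni_threshold(0.05, 316))  # ~0.00016
print(sidak_threshold(0.05, 316))       # ~0.00016
```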

My Own Close Call

In the Alpha Hunt for Sandra's portfolio, I tested 11 momentum variants:

  • Top 5, 12M lookback, $5B floor
  • Top 5, 12M, $10B floor
  • Top 10, 12M, $5B floor
  • Top 10, 12M, $10B floor
  • Top 20, 12M, $5B floor
  • Top 10, 6M, $5B floor
  • Top 10, 3M, $5B floor
  • Top 10, 12M, $5B, trend filter
  • Top 5, 12M, $5B, trend filter
  • Top 10, 12M, $5B, with 10bps costs
  • Top 5, 12M, $5B, with 10bps costs

9 of 11 beat SPY. Looked great.

But wait: 11 tests at a 5-percent threshold mean 0.55 expected random hits. Nine is well above that. Could still be real, could still be partial p-hacking.

What saved me from full p-hacking: all 11 were variants of the same academic theory (Jegadeesh-Titman 1993 momentum), not random unrelated strategies. Each test is partially redundant with the others.

That's actually the right defense. Don't test 20 wild ideas and pick the 1 that works. Test 1 idea in 20 ways and look for robustness.
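
The difference between cherry-picking and robustness is easy to put in code. A sketch of the check I mean, where backtest_sharpe is a hypothetical stand-in for your own backtester (here it just returns simulated numbers so the snippet runs):

```python
import numpy as np

def backtest_sharpe(top_n: int, lookback_months: int, mcap_floor: float) -> float:
    """Hypothetical stand-in for a real backtester. Returns simulated values only."""
    rng = np.random.default_rng(abs(hash((top_n, lookback_months, mcap_floor))) % 2**32)
    return rng.normal(0.7, 0.2)

# Variants of ONE hypothesis (Jegadeesh-Titman momentum), not 20 unrelated ideas.
variants = [(top_n, lb, floor)
            for top_n in (5, 10, 20)
            for lb in (3, 6, 12)
            for floor in (5e9, 10e9)]

sharpes = np.array([backtest_sharpe(*v) for v in variants])

# P-hacking: report max(sharpes) and hide the rest.
# Robustness: ask whether the WHOLE family of variants holds up.
print("best Sharpe:   ", sharpes.max().round(2))
print("median Sharpe: ", np.median(sharpes).round(2))
print("worst Sharpe:  ", sharpes.min().round(2))
print("share of variants above 0.5:", (sharpes > 0.5).mean())  # 0.5 = illustrative hurdle
```

If the median and worst variants still look decent, the family is telling you something. If only the best one does, you just rediscovered the 1-in-20 effect.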

How to Spot P-Hacking in a Claim

Red flag 1: "I tested X strategies and these Y worked." Where X >> Y. That's p-hacking by ratio.

Red flag 2: oddly specific parameters. "RSI(14) threshold 27.3 with exit at RSI 71.5, max hold 17 days." That specificity screams "I tuned to exactly one data sample."

Red flag 3: no out-of-sample test. If the strategy was both designed and validated on the same data, it's untrustworthy.

Red flag 4: new parameters, fresh discoveries. Every quarter there's a "new factor" that beats everything. Most evaporate by the next quarter. That's the p-hack graveyard.

Red flag 5: magic numbers from random tests. If a strategy's logic depends on a specific SMA period "23" and nothing near 23 works, that's a p-hack.
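
Red flag 5 can be turned into a quick self-check: scan the neighborhood of the "magic" parameter and see whether performance degrades smoothly or spikes at exactly one value. A toy sketch on random-walk prices (the SMA rule here is only an illustration, not one of the bots):

```python
import numpy as np

def sma_strategy_sharpe(period: int, prices: np.ndarray) -> float:
    """Toy rule: long when price is above its SMA(period), flat otherwise."""
    sma = np.convolve(prices, np.ones(period) / period, mode="valid")
    rets = np.diff(np.log(prices[period - 1:]))                 # next-day log returns
    signal = (prices[period - 1:-1] > sma[:-1]).astype(float)   # no look-ahead
    strat = signal * rets
    return strat.mean() / (strat.std() + 1e-12) * np.sqrt(252)

rng = np.random.default_rng(7)
prices = 100 * np.exp(np.cumsum(rng.normal(0.0, 0.01, 1500)))   # random-walk "asset"

# A real edge degrades gracefully around its best parameter; a p-hacked one
# collapses as soon as you move one step away from the magic number.
for period in range(18, 29):
    print(period, round(sma_strategy_sharpe(period, prices), 2))
```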

How to Test for It in Your Own Work

1. Control your hypothesis count. Before testing, write down what you're testing and why. Each new test AFTER the first is a degree of freedom. Budget them.

2. Use an out-of-sample holdout. Design on 70 percent of the data. Test on the unseen 30 percent. If the strategy breaks on the holdout, it was p-hacked. (A minimal code sketch follows this list.)

3. Test on different markets. If momentum works in US stocks, does it also work in European stocks? Japanese? If yes, it's likely real. If it only works in your test market, it's a p-hack.

4. Be extra skeptical of "new" discoveries. 98 years of academic research on markets. Everything simple has been tried. If you found "the one weird trick" nobody else found, it's probably noise.
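
Point 2 in code: a minimal sketch of a chronological 70/30 split. The split must be chronological, never shuffled, or future information leaks into the design set. strategy_returns stands in for your own backtest output; random numbers are used here so the snippet runs.

```python
import numpy as np

def chrono_split(returns: np.ndarray, design_frac: float = 0.70):
    """Design on the early part of the history, hold out the unseen tail."""
    cut = int(len(returns) * design_frac)
    return returns[:cut], returns[cut:]

def annualized_sharpe(returns: np.ndarray) -> float:
    return returns.mean() / (returns.std() + 1e-12) * np.sqrt(252)

rng = np.random.default_rng(1)
strategy_returns = rng.normal(0.0003, 0.01, 1500)  # placeholder for real strategy returns

design, holdout = chrono_split(strategy_returns)
print("design Sharpe: ", round(annualized_sharpe(design), 2))
print("holdout Sharpe:", round(annualized_sharpe(holdout), 2))
# A great design Sharpe plus a collapsed holdout Sharpe means the design set was p-hacked.
```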

The Deeper Lesson

The finance industry runs on p-hacked strategies. Fund managers justify expensive fees with backtests. Those backtests are usually p-hacked, retroactively fit to the data.

This is why most active funds underperform index funds long-term. Their "edge" was a statistical ghost.

Vanguard's John Bogle figured this out in the 1970s. Most "alpha" is illusion. The few real edges are tiny and hard to capture. For most investors, an index fund beats trying to be clever.

Momentum is an exception, documented across 98 years and 50+ markets. Value is another. Quality is a third. These survived decades of replication. That's how you separate real from p-hacked.

What We Do on BearBullRadar

Every strategy in the BotLab has:

  • Single, theory-driven hypothesis (not "test 50 things")
  • At least 6 years of data
  • Multiple timeframes tested
  • Transparent parameter count
  • Walk-forward validation when possible

When I can't do one of these, I label it "experimental" and keep it in the lab, not deployed.
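
"Walk-forward validation" in the checklist above means: re-fit on a rolling window, score the strategy only on the untouched window that follows it, and repeat across the whole history. A minimal sketch; the "fitting" step is just a sign rule standing in for whatever parameters a real bot would tune:

```python
import numpy as np

def walk_forward(returns: np.ndarray, train_len: int = 252, test_len: int = 63):
    """Roll a train window through the history; collect only out-of-sample results."""
    oos_returns = []
    start = 0
    while start + train_len + test_len <= len(returns):
        train = returns[start:start + train_len]
        test = returns[start + train_len:start + train_len + test_len]
        position = 1.0 if train.mean() > 0 else 0.0   # stand-in for real parameter fitting
        oos_returns.extend(position * test)
        start += test_len
    return np.array(oos_returns)

rng = np.random.default_rng(3)
oos = walk_forward(rng.normal(0.0002, 0.01, 1500))    # placeholder return series
print("out-of-sample days:", len(oos), "| mean daily return:", round(oos.mean(), 5))
```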

The two strategies we actually trust enough to deploy with real money — Andromeda (DM+LD filter on BTC) and the new Momentum Top 10 — both pass all five tests.

The others are still learning. Including the Volume Spike bot, which is now labeled "p-hacking example: looked great on 16 months, broke on 6 years."

The Simple Rule

If a strategy "discovered" its edge, be skeptical.

If a strategy applies an edge that was discovered decades ago and replicated across markets, trust it more.

Trend is not your friend. Proven theory is.




Your Dominic, who tested 11 variants of ONE theory so he wouldn't accidentally p-hack 11 random ideas.



Disclaimer: This is not financial advice. All backtests are based on historical data and do not guarantee future results. Only invest what you can afford to lose.

Dominic Tschan

MSc Physics, ETH Zurich · Physics teacher · Crypto investor · Bot builder

ETH physicist who tested 200+ trading strategies on 6 years of real market data. Runs 5 tier-labeled bots — 1 on real capital, 3 paper, 1 backtest-only. Here I share everything: results, mistakes, and lessons.
