
P-Hacking: 20 Random Strategies. One Looks Brilliant. By Luck Alone.

The math says if you test 20 strategies, one looks significant by pure chance. That one is usually what gets published.

Dominic Tschan
April 16, 2026 · 5 min read

Here's a cute statistical fact that should unsettle you.

If you test 20 random strategies on Bitcoin data, one of them will look statistically significant. Just by chance.

Not because it works. Because 20 random tries at a 5-percent threshold yield one random hit, on average.

That one hit is what the guru sells you.

What P-Hacking Is

P-hacking (from "p-value hacking") is the academic name for: test enough strategies, and by pure luck, some will look like real edge.

The statistics work against you. At the standard 5-percent significance level, 1 in every 20 random patterns will pass the test. Even if all 20 are worthless.

If someone tests 100 strategies and shows you the 5 that passed, you're looking at nothing but noise.
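
To make the 1-in-20 effect concrete, here is a minimal simulation sketch (my own illustration, not one of the BotLab backtests): 20 "strategies" whose daily returns are pure noise, a t-test on each, and a count of how many clear p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

N_STRATEGIES = 20   # number of random "strategies" tested
N_DAYS = 1500       # roughly 6 years of daily returns
ALPHA = 0.05        # the standard 5-percent significance threshold

false_positives = 0
for _ in range(N_STRATEGIES):
    # Pure noise: zero-mean daily returns with ~1% daily volatility.
    returns = rng.normal(loc=0.0, scale=0.01, size=N_DAYS)
    # One-sample t-test: "is the mean daily return different from zero?"
    _, p_value = stats.ttest_1samp(returns, popmean=0.0)
    if p_value < ALPHA:
        false_positives += 1

# Expected: N_STRATEGIES * ALPHA = 1 worthless strategy that looks "significant".
print(f"{false_positives} of {N_STRATEGIES} pure-noise strategies passed p < {ALPHA}")
```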

The Retail Version

You see Twitter threads like this:

"I tested 50 trading strategies on Bitcoin. These 3 beat HODL. [chart]"

That's p-hacking made visible.

50 strategies at a 5-percent threshold: 2.5 should look significant by pure random chance. Finding 3 is right in line with the expected noise.

None of those 3 strategies have proven anything. They might be real. They might be statistical ghosts.
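
The "expected noise" claim is a one-line binomial calculation. A quick sketch, assuming the 50 tests are independent:

```python
from scipy.stats import binom

n_tested, alpha = 50, 0.05

expected_hits = n_tested * alpha                # 2.5 false positives expected
p_three_or_more = binom.sf(2, n_tested, alpha)  # P(at least 3 random "winners")

print(expected_hits)      # 2.5
print(p_three_or_more)    # roughly 0.46: finding 3 is close to a coin flip
```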

The Academic Version — More Sinister

Researchers do this too. Differently.

A finance professor tries 10 hypotheses. Nine fail. One passes p < 0.05. Guess which one becomes the paper?

The published paper looks like a clean discovery. Behind it: 9 invisible failures.

In academic finance, Harvey, Liu and Zhu (2016) documented 316 "factors" that had passed the 5% significance bar in published papers. After correcting for multiple testing, they concluded that a large share of those factors are likely false discoveries — the 1-in-20 random hits, dressed up as science. Marcos López de Prado proved the related False Strategy Theorem, which quantifies the same effect from the strategy-selection side.

This is why so many "proven" factors from 2010 papers stopped working after 2015. They weren't real. They got p-hacked into publication, then disappeared when tested on new data.
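
The standard repair Harvey, Liu and Zhu argue for is a much stricter per-test significance bar once many tests are in play. A minimal sketch of the two classic corrections (Bonferroni and Šidák), assuming independent tests; their actual method is more sophisticated and also handles correlated factors:

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value bar so the chance of ANY false positive stays below alpha."""
    return alpha / n_tests

def sidak_threshold(alpha: float, n_tests: int) -> float:
    """Slightly less conservative bar, exact under independence."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)

# With 316 factor tests, the naive 5-percent bar shrinks by a factor of ~316.
print(bonferroni_threshold(0.05, 316))  # ~0.00016
print(sidak_threshold(0.05, 316))       # ~0.00016
```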

My Own Close Call

In the Alpha Hunt for Sandra's portfolio, I tested 11 momentum variants:

  • Top 5, 12M lookback, $5B floor
  • Top 5, 12M, $10B floor
  • Top 10, 12M, $5B floor
  • Top 10, 12M, $10B floor
  • Top 20, 12M, $5B floor
  • Top 10, 6M, $5B floor
  • Top 10, 3M, $5B floor
  • Top 10, 12M, $5B, trend filter
  • Top 5, 12M, $5B, trend filter
  • Top 10, 12M, $5B, with 10bps costs
  • Top 5, 12M, $5B, with 10bps costs

9 of 11 beat SPY. Looked great.

But wait: 11 tests at a 5-percent threshold mean 0.55 expected random hits. Nine is well above that. Could still be real, could still be partial p-hacking.

What saved me from full p-hacking: all 11 were variants of the same academic theory (Jegadeesh-Titman 1993 momentum), not random unrelated strategies. Each test is partially redundant with the others.

That's actually the right defense. Don't test 20 wild ideas and pick the 1 that works. Test 1 idea in 20 ways and look for robustness.
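
The difference between cherry-picking and robustness is easy to put in code. A sketch of the check I mean, where backtest_sharpe is a hypothetical stand-in for your own backtester (here it just returns simulated numbers so the snippet runs):

```python
import numpy as np

def backtest_sharpe(top_n: int, lookback_months: int, mcap_floor: float) -> float:
    """Hypothetical stand-in for a real backtester. Returns simulated values only."""
    rng = np.random.default_rng(abs(hash((top_n, lookback_months, mcap_floor))) % 2**32)
    return rng.normal(0.7, 0.2)

# Variants of ONE hypothesis (Jegadeesh-Titman momentum), not 20 unrelated ideas.
variants = [(top_n, lb, floor)
            for top_n in (5, 10, 20)
            for lb in (3, 6, 12)
            for floor in (5e9, 10e9)]

sharpes = np.array([backtest_sharpe(*v) for v in variants])

# P-hacking: report max(sharpes) and hide the rest.
# Robustness: ask whether the WHOLE family of variants holds up.
print("best Sharpe:   ", sharpes.max().round(2))
print("median Sharpe: ", np.median(sharpes).round(2))
print("worst Sharpe:  ", sharpes.min().round(2))
print("share of variants above 0.5:", (sharpes > 0.5).mean())  # 0.5 = illustrative hurdle
```

If the median and worst variants still look decent, the family is telling you something. If only the best one does, you just rediscovered the 1-in-20 effect.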

How to Spot P-Hacking in a Claim

Red flag 1: "I tested X strategies and these Y worked." Where X >> Y. That's p-hacking by ratio.

Red flag 2: oddly specific parameters. "RSI(14) threshold 27.3 with exit at RSI 71.5, max hold 17 days." That specificity screams "I tuned to exactly one data sample."

Red flag 3: no out-of-sample test. If the strategy was both designed and validated on the same data, it's untrustworthy.

Red flag 4: new parameters, fresh discoveries. Every quarter there's a "new factor" that beats everything. Most evaporate by the next quarter. That's the p-hack graveyard.

Red flag 5: magic numbers from random tests. If a strategy's logic depends on a specific SMA period "23" and nothing near 23 works, that's a p-hack.
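
Red flag 5 can be turned into a quick self-check: scan the neighborhood of the "magic" parameter and see whether performance degrades smoothly or spikes at exactly one value. A toy sketch on random-walk prices (the SMA rule here is only an illustration, not one of the bots):

```python
import numpy as np

def sma_strategy_sharpe(period: int, prices: np.ndarray) -> float:
    """Toy rule: long when price is above its SMA(period), flat otherwise."""
    sma = np.convolve(prices, np.ones(period) / period, mode="valid")
    rets = np.diff(np.log(prices[period - 1:]))                 # next-day log returns
    signal = (prices[period - 1:-1] > sma[:-1]).astype(float)   # no look-ahead
    strat = signal * rets
    return strat.mean() / (strat.std() + 1e-12) * np.sqrt(252)

rng = np.random.default_rng(7)
prices = 100 * np.exp(np.cumsum(rng.normal(0.0, 0.01, 1500)))   # random-walk "asset"

# A real edge degrades gracefully around its best parameter; a p-hacked one
# collapses as soon as you move one step away from the magic number.
for period in range(18, 29):
    print(period, round(sma_strategy_sharpe(period, prices), 2))
```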

How to Test for It in Your Own Work

1. Control your hypothesis count. Before testing, write down what you're testing and why. Each new test AFTER the first is a degree of freedom. Budget them.

2. Use an out-of-sample holdout. Design on 70 percent of the data. Test on the unseen 30 percent. If the strategy breaks on the holdout, it was p-hacked. (A minimal code sketch follows this list.)

3. Test on different markets. If momentum works in US stocks, does it also work in European stocks? Japanese? If yes, it's likely real. If it only works in your test market, it's a p-hack.

4. Be extra skeptical of "new" discoveries. 98 years of academic research on markets. Everything simple has been tried. If you found "the one weird trick" nobody else found, it's probably noise.
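
Point 2 in code: a minimal sketch of a chronological 70/30 split. The split must be chronological, never shuffled, or future information leaks into the design set. strategy_returns stands in for your own backtest output; random numbers are used here so the snippet runs.

```python
import numpy as np

def chrono_split(returns: np.ndarray, design_frac: float = 0.70):
    """Design on the early part of the history, hold out the unseen tail."""
    cut = int(len(returns) * design_frac)
    return returns[:cut], returns[cut:]

def annualized_sharpe(returns: np.ndarray) -> float:
    return returns.mean() / (returns.std() + 1e-12) * np.sqrt(252)

rng = np.random.default_rng(1)
strategy_returns = rng.normal(0.0003, 0.01, 1500)  # placeholder for real strategy returns

design, holdout = chrono_split(strategy_returns)
print("design Sharpe: ", round(annualized_sharpe(design), 2))
print("holdout Sharpe:", round(annualized_sharpe(holdout), 2))
# A great design Sharpe plus a collapsed holdout Sharpe means the design set was p-hacked.
```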

The Deeper Lesson

The finance industry runs on p-hacked strategies. Fund managers justify expensive fees with backtests. Those backtests are usually p-hacked, retroactively fit to the data.

This is why most active funds underperform index funds long-term. Their "edge" was a statistical ghost.

Vanguard's John Bogle figured this out in the 1970s. Most "alpha" is illusion. The few real edges are tiny and hard to capture. For most investors, an index fund beats trying to be clever.

Momentum is an exception, documented across 98 years and 50+ markets. Value is another. Quality is a third. These survived decades of replication. That's how you separate real from p-hacked.

What We Do on BearBullRadar

Every strategy in the BotLab has:

  • Single, theory-driven hypothesis (not "test 50 things")
  • At least 6 years of data
  • Multiple timeframes tested
  • Transparent parameter count
  • Walk-forward validation when possible

When I can't do one of these, I label it "experimental" and keep it in the lab, not deployed.
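
"Walk-forward validation" in the checklist above means: re-fit on a rolling window, score the strategy only on the untouched window that follows it, and repeat across the whole history. A minimal sketch; the "fitting" step is just a sign rule standing in for whatever parameters a real bot would tune:

```python
import numpy as np

def walk_forward(returns: np.ndarray, train_len: int = 252, test_len: int = 63):
    """Roll a train window through the history; collect only out-of-sample results."""
    oos_returns = []
    start = 0
    while start + train_len + test_len <= len(returns):
        train = returns[start:start + train_len]
        test = returns[start + train_len:start + train_len + test_len]
        position = 1.0 if train.mean() > 0 else 0.0   # stand-in for real parameter fitting
        oos_returns.extend(position * test)
        start += test_len
    return np.array(oos_returns)

rng = np.random.default_rng(3)
oos = walk_forward(rng.normal(0.0002, 0.01, 1500))    # placeholder return series
print("out-of-sample days:", len(oos), "| mean daily return:", round(oos.mean(), 5))
```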

The two strategies we actually trust enough to deploy with real money — Andromeda (DM+LD filter on BTC) and the new Momentum Top 10 — both pass all five tests.

The others are still learning. Including the Volume Spike bot, which is now labeled "p-hacking example: looked great on 16 months, broke on 6 years."

The Simple Rule

If a strategy "discovered" its edge, be skeptical.

If a strategy applies an edge that was discovered decades ago and replicated across markets, trust it more.

Trend is not your friend. Proven theory is.




Your Dominic, who tested 11 variants of ONE theory so he wouldn't accidentally p-hack 11 random ideas.



Disclaimer: This is not financial advice. All backtests are based on historical data and do not guarantee future results. Only invest what you can afford to lose.

Dominic Tschan

MSc Physics, ETH Zurich · Physics teacher · Crypto investor · Bot builder

ETH physicist who tested 200+ trading strategies on 6 years of real market data. Runs 5 tier-labeled bots — 1 on real capital, 3 paper, 1 backtest-only. Here I share everything: results, mistakes, and lessons.
