Yesterday, I had a gold nugget.
The Volume Spike Bot. +45.7% return over 16 months of Bitcoin data. 56% win rate. I called it the best bot in the BotLab. I celebrated.
Today, I extended the backtest to 6 years.
Result: -27.6%.
Not plus. Minus. The gold nugget was pyrite.
This is overfitting. The most common killer of trading strategies.
What Overfitting Is
Your backtest found a pattern that fits your test data perfectly. So well that you're convinced it will keep working forever.
The problem: the pattern doesn't generalize. It fit the test period because you tuned it — consciously or not — to that specific period's quirks.
Like a student who memorizes the test answers. Perfect score on the practice exam. Disaster on the real one.
The Volume Spike Story
Here's what the bot did. When Bitcoin volume spiked 2.5 standard deviations above its 20-day average AND the price had dropped 3% in the last session, buy. Sell on a 4% rise or after 5 days.
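The whole edge fits in a few lines. Here's a minimal sketch, assuming daily OHLCV bars in a pandas DataFrame with `close` and `volume` columns (the column names, entry at the signal bar's close, and one-position-at-a-time handling are my assumptions, not the exact BotLab implementation):

```python
import pandas as pd

def volume_spike_signals(df: pd.DataFrame) -> pd.Series:
    """Entry: volume above its 20-day mean by 2.5 standard deviations,
    and the close down at least 3% versus the prior session."""
    vol_mean = df["volume"].rolling(20).mean()
    vol_std = df["volume"].rolling(20).std()
    spike = df["volume"] > vol_mean + 2.5 * vol_std
    dip = df["close"].pct_change() <= -0.03
    return spike & dip

def backtest(df: pd.DataFrame, take_profit: float = 0.04, max_hold: int = 5) -> float:
    """Buy at the signal bar's close; exit on a 4% rise or after 5 sessions."""
    signals = volume_spike_signals(df)
    closes = df["close"].to_numpy()
    equity, i = 1.0, 0
    while i < len(closes):
        if signals.iloc[i]:
            entry = closes[i]
            exit_idx = min(i + max_hold, len(closes) - 1)   # time-stop exit
            for k in range(i + 1, exit_idx + 1):
                if closes[k] >= entry * (1 + take_profit):
                    exit_idx = k                            # take-profit exit
                    break
            equity *= closes[exit_idx] / entry
            i = exit_idx + 1   # one position at a time; re-enter on the next signal
        else:
            i += 1
    return equity - 1.0        # cumulative return, e.g. 0.457 for +45.7%
```

Note how many magic numbers are already in there. That matters later.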
On 2024-2025 data, this fired 8 times. 5 winners. +45.7% cumulative.
Beautiful. Publish the article.
Then I ran the same rules on 2020 through 2023. The bot fired 31 more times. 9 winners.
The new total: -27.6%.
What happened? The 2024-2025 period had a specific pattern: high-volume dips were followed by quick recoveries. V-shaped. Clean.
2022 was different. High-volume dips were followed by more dips. The bot kept buying. And buying. And buying. Averaging down into a crater.
The bot didn't know it was averaging into a bear market. It only knew the rules. The rules were tuned to conditions that didn't exist anymore.
Why Your Brain Does This
When you design a strategy, you iterate. Try rules. Run backtest. See result. Adjust rules. Re-run.
Each iteration is a form of tuning. Even if you don't realize it.
After 50 iterations, your strategy fits the test period very well. That doesn't mean it works going forward. It means the test period is a 50-iteration exam and you've memorized it.
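You can watch this effect with pure noise. A hypothetical sketch: 50 strategy variants with zero real edge, and we "publish" whichever one looks best on the first half of the data:

```python
import numpy as np

rng = np.random.default_rng(42)
# 50 "strategy variants", each just coin-flip daily returns: zero real edge.
daily = rng.normal(0.0, 0.02, size=(50, 500))
in_sample, out_of_sample = daily[:, :250], daily[:, 250:]

best = in_sample.sum(axis=1).argmax()   # the winner of our 50 "iterations"
print(f"in-sample:     {in_sample[best].sum():+.1%}")
print(f"out-of-sample: {out_of_sample[best].sum():+.1%}")
# The winner looks impressive in-sample by construction,
# and roughly flat out-of-sample, because the edge was never real.
```

The best of 50 random strategies will reliably show a strong in-sample return. That's your 50-iteration exam.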
How Much Does It Matter?
In my systematic testing of the BotLab over 6 years vs 16 months:
| Bot | 16 months | 6 years | Verdict |
|---|---|---|---|
| Halving Countdown | +91% | +2,104% | Robust — more data, better |
| Seasonality | +30% | +1,483% | Robust |
| Contrarian RSI | +55% | +454% | Robust |
| Volume Spike | +45.7% | -27.6% | Overfit |
| VWAP Bot | +15.0% | -12.5% | Overfit |
| MACD Crossover | +22% | -5% | Marginal |
The pattern is clear. Robust strategies get BETTER with more data. Overfit strategies get WORSE.
If a strategy looks great on 16 months but collapses on 6 years, it was overfit. Always.
The Rules I Now Follow
1. Test on at least 6 years. One full market cycle minimum. Better: 10+ years, which means stock data (since crypto is too young).
2. Count parameters honestly. Every threshold, every lookback, every filter is a parameter. More parameters = more degrees of freedom = more overfitting risk. Under 5 parameters is acceptable. Over 10 is almost always overfit.
3. Use an out-of-sample holdout. Design your strategy on 70% of the data. Test on the unseen 30%. If it breaks on the 30%, you've overfit the 70%.
4. Walk-forward validation. Train on years 1-3. Test on year 4. Retrain on 1-4, test year 5. Repeat. If performance is inconsistent across walk-forward windows, it's overfit. A sketch of this and the holdout split follows this list.
5. Accept smaller edges. A strategy claiming 45% per year is almost certainly overfit. A strategy claiming 15% per year might be real. The more extraordinary the claim, the more overfit the backtest.
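For rules 3 and 4, the mechanics are simple enough to sketch. Assumptions: daily prices in a DataFrame with a DatetimeIndex, and hypothetical `tune` and `evaluate` functions standing in for your own strategy code:

```python
import pandas as pd

def holdout_split(df: pd.DataFrame, design_frac: float = 0.7):
    """Rule 3: design on the first 70% of the data, judge on the unseen 30%."""
    cut = int(len(df) * design_frac)
    return df.iloc[:cut], df.iloc[cut:]

def walk_forward(df: pd.DataFrame, min_train_years: int = 3):
    """Rule 4: expanding windows. Train on years 1..k, test on year k+1."""
    years = sorted(df.index.year.unique())
    for k in range(min_train_years, len(years)):
        yield (df[df.index.year.isin(years[:k])],   # train slice
               df[df.index.year == years[k]])       # out-of-sample year

# Hypothetical usage; tune() and evaluate() are placeholders for your own code:
# for train, test in walk_forward(prices):
#     params = tune(train)            # all fitting happens here
#     print(evaluate(test, params))   # judged only on data the fit never saw
```

Splitting by time, never randomly, is the point: a shuffled split leaks the very regime you're trying to hold out.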
The Momentum Strategy Test
Remember the momentum strategy I published — top 10 US stocks by 12-month return, monthly rebalance? Let me show you why it probably isn't overfit.
Parameter count: 3. Top N (10), lookback (12 months), rebalance frequency (monthly). That's it.
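Those three parameters translate almost one-to-one into code. A minimal sketch, assuming a DataFrame of month-end closes with one column per ticker (my framing, not the published backtest):

```python
import pandas as pd

def momentum_picks(monthly_closes: pd.DataFrame,
                   top_n: int = 10, lookback: int = 12) -> dict:
    """Each month, hold the top_n tickers by trailing lookback-month return."""
    trailing = monthly_closes.pct_change(lookback)        # parameter 2: lookback
    picks = {}
    for date, row in trailing.dropna(how="all").iterrows():
        picks[date] = row.nlargest(top_n).index.tolist()  # parameter 1: top N
    return picks   # parameter 3: rebalance frequency = one entry per month
```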
Out-of-sample: The strategy was designed in academic literature 30+ years ago. Jegadeesh and Titman 1993. I didn't tune it. I applied it.
Cross-market validity: Works in US stocks, EU stocks, Japanese stocks, emerging markets, commodities, currencies. Documented across 98 years of equity data.
Walk-forward: Each year in my test is out-of-sample with respect to the years before. Consistent positive returns.
Compare that to the Volume Spike bot: 6 parameters, tuned on 16 months, worked only in one regime. Everything screams overfit in retrospect.
I just didn't see it until I extended the data.
The Uncomfortable Lesson
If you're tuning a strategy and it looks great, your next question should not be "how can I deploy this?" It should be "have I overfit?"
The tests:
- Does it work on data you didn't use for design?
- Does it work on a different market?
- Does it work on a different time window?
- Would a simpler version work almost as well?
If yes to all four: probably real.
If the strategy collapses when you change any of these: overfit.
What Happened to the Volume Spike Bot
I didn't delete it from the BotLab. I relabeled it.
"Volume Spike Bot — Overfitting Example. See what happens when you trust 16 months of data."
Every visitor to the BotLab now learns what overfitting looks like, using my own mistake as the textbook case.
The other 22 bots kept running. The ones that survived the 6-year stress test are candidates for live deployment. The ones that didn't are educational.
Our best bot was a lie. But the lie taught more than the truth would have.
Sources
- Bailey, Borwein, López de Prado — The Probability of Backtest Overfitting — gold-standard academic paper
- Marcos López de Prado — Advances in Financial Machine Learning — book-length treatment
- BotLab — all 23 experiments — where you can see overfitting in action
Your Dominic, who celebrated a 45% gold nugget yesterday and disproved it this morning.
Disclaimer: Not financial advice. Past performance does not guarantee future results.