
Benchmark Tunnel Vision: When the HODL Comparison Misleads

A reader pushed back on our HODL-anchored methodology. The objection is half right, half wrong. Here's the multi-metric reporting that addresses it.

Dominic Tschan
April 17, 2026 · 12 min read

I have argued — strongly, in Beat HODL or Don't Bother — that every strategy must clear the HODL benchmark over its full backtest window. That argument stands. HODL is the right anchor for evaluating whether a strategy creates wealth.

But there's a real methodological nuance I've understated, and a thoughtful reader pushed back this week. The objection roughly goes:

"You compare every strategy to HODL over the entire 8-year window. But strategies operate on different time horizons. A high-frequency bot that trades 50 times per year has a natural comparison horizon of weeks-to-months, not years. The strict full-window HODL comparison can be misleading for those strategies."

That objection is half wrong and half deeply right. This essay separates the two halves.


The Half That Is Wrong

The case for HODL as the dominant benchmark survives the objection. Three reasons that don't go away regardless of trading frequency:

1. Opportunity Cost Is Universal

Every dollar in your strategy is a dollar that's not in HODL. The comparison must happen on the same capital, over the same period. If HODL would have produced $80k from $10k over 8 years, and your bot produced $50k from $10k, your bot is strictly worse for that capital — irrespective of whether it trades 1× per year or 100× per year. Wealth is wealth.

You cannot escape this logic by saying "but my bot operates on a different time horizon." The investor's capital was at risk for 8 years either way. The 8-year outcome is what matters.

2. Fees And Taxes Compound Over Time

A high-frequency bot pays per-trade fees on every entry and exit. At 0.10% per side, a round-trip costs 0.20%, so 50 round-trips per year adds up to roughly 10% annual fee drag. Compounded over 8 years, this gap is enormous. Add tax: in Switzerland, monthly-or-faster rotation can trigger professional-trader classification (gains taxed as income, potentially 40%+); in the US, short-term capital gains are taxed as ordinary income.

These accumulate over the full window. Comparing to "HODL within each 30-day window" hides the multi-year fee+tax damage. You can only see it on the long horizon.
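The arithmetic can be sketched directly. A minimal illustration, assuming a flat 0.10% fee per side (0.20% per round-trip) and ignoring slippage and spreads, which would make the drag worse:

```python
# Sketch of cumulative fee drag for a high-frequency bot.
# Assumption: flat 0.10% fee per side; real exchange fee schedules,
# slippage, and spreads will differ.

def fee_multiplier(fee_per_side=0.001, round_trips_per_year=50, years=8):
    """Fraction of capital left after fees alone (gross returns held at zero)."""
    sides = 2 * round_trips_per_year * years
    return (1 - fee_per_side) ** sides

one_year = fee_multiplier(years=1)   # ~0.905 -> roughly 10% annual drag
eight_years = fee_multiplier()       # ~0.449 -> fees alone eat over half
print(f"after 1y: {one_year:.3f}, after 8y: {eight_years:.3f}")
```

The point of the sketch: the one-year number looks survivable, but the eight-year compounding is what the short-window comparison hides.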

3. Walk-Forward Statistical Robustness

Splitting an 8-year backtest into 3 windows of ~2.7 years each gives N=3 for the regime-robustness check. Splitting into shorter rolling windows gives more observations, but each is so short that noise dominates. You need long windows to detect regime-overfit strategies. Shortening to fit a high-frequency narrative means accepting weaker statistical evidence.
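The window-splitting logic can be sketched as follows. This is an illustration of the idea, not the site's actual test harness; the function names and the equal-size contiguous split are assumptions:

```python
# Minimal walk-forward sketch: split aligned per-period return series for
# the strategy and for HODL into N contiguous windows, and record in which
# windows the strategy's total return beats HODL's.

def total_return(returns):
    """Compound a list of simple per-period returns into one total return."""
    out = 1.0
    for r in returns:
        out *= 1 + r
    return out - 1

def walk_forward(strategy, hodl, n_windows=3):
    """Per-window verdicts: True where the strategy beats HODL."""
    assert len(strategy) == len(hodl)
    size = len(strategy) // n_windows
    verdicts = []
    for i in range(n_windows):
        lo = i * size
        hi = (i + 1) * size if i < n_windows - 1 else len(strategy)
        verdicts.append(total_return(strategy[lo:hi]) > total_return(hodl[lo:hi]))
    return verdicts
```

With only three windows each verdict carries real statistical weight; with thirty tiny windows each verdict is mostly noise, which is the trade-off described above.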

So HODL-over-the-full-window stays. It's the anchor.


The Half That Is Right

The objection points to a real reporting weakness: we report total return + max drawdown + walk-forward score, but we don't report the metrics that capture HF-specific quality. A bot that loses to HODL by 200pp over 8 years could still be a perfectly good strategy if:

  • It has dramatically lower drawdown
  • It has high Sharpe (low volatility per unit of return)
  • It beats HODL in 70%+ of rolling 6-month windows (most months it's a winner, even if the totals don't reflect that)
  • It correlates poorly with HODL (diversification value)
  • Its win rate / asymmetry profile is attractive at the trade level

By reporting only "total return vs HODL," we give readers a binary verdict where the truth is multi-dimensional. That's the legitimate critique.

Concrete example: Tactician 2.0

Tactician 2.0's results over 8 years:

  • Total return: +1,126%
  • HODL same period: +603%
  • Verdict by total-return frame: BEATS HODL by +523pp

But look at the walk-forward breakdown:

  • Window 1 (2018-2020): bot beats HODL by ~+90pp
  • Window 2 (2020-2023): bot beats HODL by ~+150pp
  • Window 3 (2023-2026): bot LOSES to HODL by ~-75pp

In the most recent regime (the bull market we're currently in), the bot underperforms. The +523pp full-period win is real, but it's heavily driven by W1 and W2 (the bear and recovery cycles). If we deployed in W3 only, we'd be losing.

A retail investor reading "+1,126% vs +603%" reasonably expects the bot to outperform NOW. The walk-forward shows that's not necessarily true. The full-period number, by itself, is misleading.

What we're missing in reporting

For each bot, we should also show:

  1. Sharpe Ratio. Return per unit of vol. Compares strategies on a risk-adjusted basis.
  2. Calmar Ratio. Return per unit of MaxDD. Already used internally, should be visible.
  3. Rolling N-Month Beat-Rate. "In what % of rolling 6-month windows does the bot beat HODL?" If a bot beats HODL in 70% of 6-month windows but loses on full-period because of one bad regime, that's important context.
  4. Win Rate. % of trades that close positive. Critical for HF strategies.
  5. % Time in Market. What fraction of capital is deployed vs sitting in cash? Relevant for cycle-filter bots.
  6. Correlation to HODL. A strategy with low correlation provides diversification value even if its standalone returns are mediocre.

These don't replace HODL comparison. They complement it.
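For readers who want to compute these themselves, here is a sketch of the quantitative metrics from aligned daily simple-return arrays. The annualization constant (365, crypto trades every day) and the 182-day rolling window are illustrative choices, not necessarily the exact settings used on the site:

```python
import numpy as np

# Sketches of the complementary metrics, from aligned daily simple-return
# arrays. Constants (365 periods/year, 182-day window) are illustrative.

def sharpe(returns, periods_per_year=365):
    """Annualized return per unit of volatility."""
    r = np.asarray(returns, dtype=float)
    return r.mean() / r.std(ddof=1) * np.sqrt(periods_per_year)

def max_drawdown(returns):
    """Worst peak-to-trough loss of the equity curve (negative number)."""
    equity = np.cumprod(1 + np.asarray(returns, dtype=float))
    peaks = np.maximum.accumulate(equity)
    return ((equity - peaks) / peaks).min()

def calmar(returns, periods_per_year=365):
    """CAGR per unit of max drawdown."""
    r = np.asarray(returns, dtype=float)
    years = len(r) / periods_per_year
    cagr = np.prod(1 + r) ** (1 / years) - 1
    return cagr / abs(max_drawdown(r))

def rolling_beat_rate(strategy, hodl, window=182):
    """Fraction of rolling windows in which the strategy beats HODL."""
    s, h = np.asarray(strategy, dtype=float), np.asarray(hodl, dtype=float)
    wins = [np.prod(1 + s[i:i + window]) > np.prod(1 + h[i:i + window])
            for i in range(len(s) - window + 1)]
    return float(np.mean(wins))

def hodl_correlation(strategy, hodl):
    """Pearson correlation of daily returns; lower = more diversification."""
    return float(np.corrcoef(strategy, hodl)[0, 1])
```

Win rate and % time in market fall out of the trade log rather than the return series, so they are omitted here.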


When To Weight Which Metric

For different bot types and different investor goals, different metrics matter most. Rough mapping:

Long-cycle filter (e.g., Watchdog)

  • Primary: Total return vs HODL (over multiple cycles)
  • Critical: Max Drawdown (the whole point is downside protection)
  • Less relevant: Win rate (you trade ~once a year, sample size is too small)

Daily momentum (e.g., Tactician 2.0)

  • Primary: Walk-Forward score (regime robustness)
  • Critical: Calmar ratio (return per pain)
  • Important: Rolling 6-month beat-rate (does it beat HODL most of the time?)

Cross-asset rotation (e.g., Rotator, Tri-Rotator)

  • Primary: Sharpe ratio (capturing the diversification benefit)
  • Critical: Walk-Forward score
  • Important: Correlation to single-asset HODL (lower = more diversification value)

Quality breakout (e.g., Volcano)

  • Primary: Calmar ratio (the whole point is risk-adjusted return)
  • Critical: Win rate (rare trades must hit)
  • Less relevant: Total return (low frequency means absolute return is naturally lower)

Sector momentum (e.g., Alpha Hunter)

  • Primary: Total return vs SPY (sector benchmark, not BTC HODL)
  • Critical: Max Drawdown vs SPY drawdown
  • Important: % time in market (how much of the equity premium is captured?)

For each bot, we should highlight the 2-3 metrics that matter most for that strategy class — not just paste the same row of stats everywhere.


What We're Doing About It

Three concrete changes shipping with this article:

1. New /methodology page

Explains the 3-test stack (continuous + walk-forward + parameter robustness) and the multi-metric reporting philosophy. Linkable from every bot card. Single source of truth for "how we test."

2. Bot Cards extended

Each Live Performance Card now shows a Backtest Stats section with: Sharpe, Calmar, Walk-Forward score, and Rolling N-Month Beat-Rate. Plus the existing Total Return vs HODL and Max Drawdown.

This gives readers the multi-dimensional view by default. The HODL comparison stays as the anchor, but it doesn't dominate alone.

3. This article

Honest discussion of the trade-off. Linked from /methodology and the new Bot Cards. So readers who wonder "why is this bot listed as a winner if it lost to HODL last year?" have a complete answer.


The Sharper Position

I want to be precise about what changed. Before this week, my position was:

"Beat HODL after fees over the full backtest window or you don't have a strategy."

That's still mostly correct. The strict version remains: a strategy that systematically loses to HODL across multiple regimes, after costs, is not worth deploying real capital on. The HODL filter still kills 90%+ of garbage strategies.

But the more nuanced version is:

"The HODL comparison over the full window is the necessary first filter. Strategies that pass it deserve a multi-metric evaluation that captures risk profile, regime consistency, and HF-specific quality. Some strategies that narrowly underperform HODL on total return can still be valuable in a portfolio context, especially if they offer dramatically lower drawdown, low correlation, or specific risk-management properties."

The second version is what we now apply. The first version is where the rejection-of-garbage filter still operates.

The original critique was right that we under-reported the second-stage metrics. It was wrong that the first-stage HODL filter is the wrong tool. Both halves had to be acknowledged.


The Even Sharper Position For Investors

If you're personally evaluating any trading strategy (ours or anyone else's), here are the questions that matter — in order:

  1. Did it beat HODL after fees over the longest available data? If not, ask why it deserves your time and risk. Most strategies stop here.
  2. Does it pass walk-forward in 2/3+ independent windows? If only 1/3, the historical alpha was regime-specific. Likely not generalizable.
  3. Is the parameter selection robust across a neighborhood? If only one specific number works, you've found a lottery ticket, not an edge.
  4. What's the Calmar Ratio vs HODL Calmar? Risk-adjusted comparison matters more than raw return.
  5. What % of rolling 6-month windows does it beat HODL? Tells you about consistency vs total dominance.
  6. Win rate, asymmetry, % time in market? HF-specific quality at the trade level.

If you only check (1), you're using the right anchor but missing 80% of the picture. If you skip (1), you're chasing a story without a baseline.

The right move is both. Anchor + complementary metrics. That's what the new methodology page and Bot Cards are designed to support.
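The ordered checklist can be sketched as a staged gate. The field names and thresholds here are hypothetical, taken straight from the text (first three questions are hard filters, 2-of-3 walk-forward windows, 70% six-month beat-rate):

```python
# Hypothetical sketch of the six-question checklist as a staged gate.
# The first three questions reject; the later ones add context.

def evaluate(stats):
    """stats: dict with keys 'beats_hodl_after_fees' (bool),
    'walk_forward_wins' (int, of 3), 'param_robust' (bool),
    'calmar', 'hodl_calmar' (floats), 'beat_rate_6m' (float 0..1).
    Returns (verdict, reasons)."""
    if not stats["beats_hodl_after_fees"]:
        return "reject", ["fails the HODL-after-fees anchor (question 1)"]
    if stats["walk_forward_wins"] < 2:
        return "reject", ["alpha looks regime-specific (question 2)"]
    if not stats["param_robust"]:
        return "reject", ["single-parameter lottery ticket (question 3)"]
    reasons = []
    if stats["calmar"] > stats["hodl_calmar"]:
        reasons.append("better risk-adjusted return than HODL (question 4)")
    if stats["beat_rate_6m"] >= 0.7:
        reasons.append("beats HODL in most 6-month windows (question 5)")
    return "candidate", reasons
```

The shape of the function is the point: questions 1 to 3 are filters that kill a strategy outright, while questions 4 to 6 only refine how a surviving candidate is judged.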




Not financial advice. Past performance does not guarantee future results. Multi-metric reporting reduces — but does not eliminate — the probability of being misled by any single number.


Dominic Tschan

MSc Physics, ETH Zurich · Physics teacher · Crypto investor · Bot builder

ETH physicist who tested 200+ trading strategies on 6 years of real market data. Runs 5 tier-labeled bots — 1 on real capital, 3 paper, 1 backtest-only. Here I share everything: results, mistakes, and lessons.
