How we test what we test
Three tests every strategy must pass before we deploy a dollar.
We rejected ~195 of 200 strategies tested. Here's the filter that did the rejecting, the metrics we report, and why HODL is our anchor benchmark — but never the only one.
The 3-Test Stack
Each test catches a different failure mode; skipping any one means accepting a higher rate of deploying broken strategies. Each filter rejects roughly 50-80% of what passed the previous one, compounding to the roughly 97% overall rejection rate (~195 of 200).
Continuous Full-Period Backtest
Catches: obviously broken strategies
Run the strategy over the longest available data window (we typically use 6-8 years of daily Bitcoin data from Binance/Bybit). Apply realistic 0.10% per-trade fees and 0.05% slippage. Compare total return to HODL baseline.
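As a sketch, the full-period comparison can be as simple as the loop below. The 0.10% fee and 0.05% slippage numbers come from the text; the long/flat signal convention and the function names are illustrative assumptions, not our production code.

```python
import numpy as np

FEE = 0.0010       # 0.10% per-trade fee (from the text above)
SLIPPAGE = 0.0005  # 0.05% slippage per trade (from the text above)

def backtest(prices, signals, fee=FEE, slippage=SLIPPAGE):
    """Equity multiple of a long/flat strategy with per-trade costs.

    signals[i] is the position (1 = long, 0 = cash) held over the bar
    from prices[i] to prices[i + 1]; costs are paid on every position change.
    """
    prices = np.asarray(prices, dtype=float)
    equity, position = 1.0, 0
    for i, target in enumerate(signals[:-1]):
        if target != position:                  # entering/exiting: pay costs
            equity *= 1.0 - (fee + slippage)
            position = target
        if position:                            # long over this bar
            equity *= prices[i + 1] / prices[i]
    return equity

def hodl(prices):
    """Buy-and-hold multiple over the same window."""
    return prices[-1] / prices[0]
```

Compare `backtest(prices, signals)` to `hodl(prices)`: a strategy that can't beat the latter after costs fails this first filter.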
3-Window Walk-Forward
Catches: regime-overfit strategies
Split the historical data into 3 non-overlapping equal windows. Run the strategy independently in each. Does it beat HODL in EACH window separately?
A strategy that beats HODL by +400% over 8 years might have produced ALL of that edge in one bull-market regime and lost in two others. The full-period number averages this away. Walk-forward exposes it.
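The split itself is mechanical; a minimal sketch, where `strategy_return` is a hypothetical callable you supply that maps a price window to a total-return multiple:

```python
import numpy as np

def walk_forward(prices, strategy_return, n_windows=3):
    """Run the strategy independently in n equal, non-overlapping windows
    and compare it to HODL inside each one."""
    windows = np.array_split(np.asarray(prices, dtype=float), n_windows)
    results = []
    for w in windows:
        strat = strategy_return(w)   # total-return multiple in this window
        bench = w[-1] / w[0]         # HODL over the same window
        results.append({"strategy": strat, "hodl": bench, "beats": strat > bench})
    return results

def passes(results):
    """The filter demands a win in EVERY window, not just on average."""
    return all(r["beats"] for r in results)
```

Note that `passes` is deliberately all-or-nothing: one losing regime fails the strategy, exactly the case the full-period average would have hidden.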
Parameter Robustness Sweep
Catches: noise-overfit strategies
For the tuned parameter (lookback period, threshold, etc.), test 2 to 5 neighboring values on each side. Does the response form a plateau or an isolated spike?
A real edge produces similar results across nearby parameter choices (plateau). A lottery-winning fluke produces a single magic number with mediocre neighbors (spike). Walk-forward CAN'T catch this — only neighborhood testing does.
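The neighborhood test can be sketched like this. Here `metric` is a hypothetical callable (e.g. strategy return divided by HODL return for a given lookback), and the two-thirds plateau threshold is an illustrative choice, not a standard:

```python
def parameter_sweep(metric, center, radius=3):
    """Evaluate metric(p) at the chosen parameter and its neighbors.

    metric(p) > 1.0 is read as "beats HODL". A real edge should clear
    that bar across most of the neighborhood (plateau); a fluke clears
    it only at the tuned value (spike).
    """
    values = {p: metric(p) for p in range(center - radius, center + radius + 1)}
    passing = sum(v > 1.0 for v in values.values())
    verdict = "plateau" if passing >= 2 * len(values) / 3 else "spike"
    return values, verdict
```

Returning the raw `values` alongside the verdict matters: the shape of the neighborhood is the evidence, the label is just a summary.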
Why HODL Is Our Anchor Benchmark
Every strategy is tested against buy-and-hold of the same asset over the same window. Here's why — and what HODL comparison doesn't tell you.
Why HODL anchors:
- Opportunity cost. Every dollar in your strategy is a dollar not in HODL. If HODL would have made more, your strategy is paying you a negative wage.
- Zero-skill baseline. Anyone can HODL. Beating HODL requires demonstrable skill or edge.
- Tax + fee accumulation. Over multi-year windows, fees and tax friction compound and can wipe out short-term alpha. HODL incurs almost none of these.
- Statistical robustness. Walk-forward over multiple regimes requires long windows, and over a long window buy-and-hold of the same asset is the natural same-data baseline.
What HODL comparison doesn't tell you:
- Time-horizon mismatch. A high-frequency bot trading 50x/year operates on a different natural timescale than 8-year HODL. Total-return comparison hides operational differences.
- Risk profile. HODL = constant exposure. A dynamic strategy = variable exposure. Same return, very different risk profile.
- Diversification value. A strategy with lower return but uncorrelated returns can be valuable in a portfolio context.
- Investor reality. Some investors must trade (tax-loss harvesting, regulatory constraints). For them, pure HODL isn't available.
What We Report (Not Just Total Return)
HODL anchors the comparison. But the full picture needs 5+ additional metrics.
| Metric | What it tells you | Most relevant for |
|---|---|---|
| Total Return vs HODL | Did the strategy create wealth vs the zero-skill alternative? | all strategies |
| Max Drawdown | Worst peak-to-trough loss. Psychological / margin-call relevance. | all strategies |
| Calmar Ratio | Annualized return ÷ \|Max Drawdown\|. Higher = better risk-adjusted. | all strategies |
| Sharpe Ratio | Excess return per unit of volatility. Standard risk-adjusted metric. | all strategies |
| Walk-Forward Score | Number of independent sub-periods where strategy beats benchmark. | regime-robustness |
| Rolling N-Month Beat-Rate | % of rolling 6/12-month windows where strategy beats HODL. | HF strategies, investor experience |
| Win Rate | % of trades that close positive after fees. | HF strategies |
| Avg Win / Avg Loss | Asymmetry profile. Strategies can have low win rate but high asymmetry. | HF strategies |
| % Time in Market | What fraction of time is capital deployed vs sitting in cash? | cycle-filter strategies |
| Parameter Plateau Width | Number of nearby parameter values that also pass walk-forward. | overfit-detection |
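The risk metrics in the table reduce to a few lines each. This sketch assumes daily bars and a zero risk-free rate, and uses the common annualized-return variant of Calmar:

```python
import numpy as np

def max_drawdown(equity):
    """Worst peak-to-trough loss of an equity curve, as a positive fraction."""
    equity = np.asarray(equity, dtype=float)
    peaks = np.maximum.accumulate(equity)   # running high-water mark
    return float(np.max(1.0 - equity / peaks))

def sharpe(returns, periods_per_year=365):
    """Annualized Sharpe ratio on per-period returns (risk-free rate = 0)."""
    r = np.asarray(returns, dtype=float)
    return float(r.mean() / r.std(ddof=1) * np.sqrt(periods_per_year))

def calmar(equity, periods_per_year=365):
    """Annualized return divided by |max drawdown|."""
    equity = np.asarray(equity, dtype=float)
    years = (len(equity) - 1) / periods_per_year
    cagr = (equity[-1] / equity[0]) ** (1.0 / years) - 1.0
    return cagr / max_drawdown(equity)
```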
What We Don't Do
Cherry-pick time windows
We test on the full available history, not just bull markets. The losing windows are reported.
Hide failed strategies
Every retired strategy is documented on /post-mortems with date, reason, and lesson.
Ignore fees and slippage
All backtests include realistic 0.10%-0.20% per-trade costs. A fee-free backtest is a lie.
Optimize on the test data
Parameter selection happens out-of-sample. We don't fit and test on the same window.
Show only Sharpe or only Total Return
Single-metric reports hide failure modes. We show 5+ metrics per bot.
Promise alpha
Backtests don't guarantee future returns. Every bot card carries a tier label and a clear caveat.
See the methodology applied
Every bot on /bots passes the 3-test stack and reports the multi-metric panel. Every retired strategy on /post-mortems failed at least one of the tests — explained.