How we test what we test
Seven tests every strategy must pass before it earns a tier.
Each test catches a different way a backtest can lie to you. Of 11 bots run through the original 5-test suite, exactly one cleared every test. Two more tests (Cluster-Removal, Random-Baseline) were added in 2026-04 and 2026-05 after they exposed mistakes in our own thinking. One bot was retired to /post-mortems. The rest landed in Tier 2-3. Tiers are honest. The tests are the same for everyone.
2026-04-28: All bots paper-tracking. Real-money graduation now requires ≥6 months of forward-validated proof — no bot graduates from backtest-strength alone.
Why this exists
Until April 2026, BBR bots were validated with informal phrases like "+2,942% over 6 years", "13/13 parameter variants profitable", "OOS validated 60/40 split". All three turned out to be technically true but uninformative or actively misleading when subjected to harder tests.
- +2,942% return = full-period total dominated by 1-2 winning regimes. Walk-Forward shows the bot loses to HODL in most rolling 2-year windows.
- 13/13 robust = parameter sweep on an irrelevant axis. The actually-relevant axis was fragile.
- OOS validated 60/40 split = single static train/test = sample-of-1.
This page exists so we never repeat that pattern. Every new bot must pass the same harsh tests that exposed the limits of every existing bot. No exceptions for favourites.
The 7-Test Suite
Each test catches a different way a backtest can lie. Skipping any one means accepting a higher rate of deploying broken strategies. The first five tests are run on every bot. Tests 6 and 7 were added when we discovered the failure modes they catch — both came from re-examining bots we'd already passed.
Walk-Forward Validation
Catches: strategies that worked once because of a lucky regime, not a structural edge
Roll a 2-year window across the entire history with a 90-day step. In each window, train the strategy's primary parameters on the first 60% of data and apply the chosen config to the last 40%. Compare the test-segment return to the primary benchmark (S&P 500). For bots that claim an in-crypto edge, also check against BTC HODL.
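The windowing described above can be sketched in a few lines of Python. This is a minimal illustration, not the harness BBR actually runs: `walk_forward_windows` and `beat_rate` are hypothetical names, the 2-year window is assumed to be 730 calendar days, and the train/test split is assumed to fall at 60% of the window length.

```python
from datetime import date, timedelta

def walk_forward_windows(start, end, window_days=730, step_days=90, train_frac=0.6):
    """Yield (train_start, split, test_end) for every rolling window that fits in [start, end]."""
    window = timedelta(days=window_days)
    step = timedelta(days=step_days)
    train = timedelta(days=int(window_days * train_frac))  # first 60% is the train segment
    cursor = start
    while cursor + window <= end:
        yield cursor, cursor + train, cursor + window
        cursor += step

def beat_rate(strategy_returns, benchmark_returns):
    """Share of test segments in which the strategy out-returned the benchmark."""
    pairs = list(zip(strategy_returns, benchmark_returns))
    if not pairs:
        raise ValueError("need at least one window")
    return sum(1 for s, b in pairs if s > b) / len(pairs)

# Hypothetical history span, used only to show the window count.
windows = list(walk_forward_windows(date(2020, 1, 1), date(2024, 1, 1)))
```

With four years of history this yields nine overlapping windows. Parameters would be chosen on each window's train segment only, and the beat-rate computed from the test-segment returns.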
Multi-X Robustness
Catches: edges that depend on the one specific asset / filter combo the bot ships with
Run the same rule with 5+ alternative configurations. For multi-asset bots: swap pairings (BTC + each of GDX, GLD, SLV, TLT, XLE, SPY, TIP). For multi-filter bots: turn each filter on / off individually. Genuine edges generalize across nearby variants.
Parameter Sensitivity Sweep
Catches: strategies sitting on a single magic number with mediocre neighbors
Sweep the bot's primary parameter (lookback period, threshold, percentile) across 7-13 values bracketing the deployed default. A real edge produces a plateau — most nearby values also work. A statistical fluke produces a single-peak optimum. We measure the spike ratio: (best − avg-of-neighbors) / |avg-of-neighbors|. >0.30 is a warning. >1.0 is a red flag.
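The spike ratio can be computed directly from the sweep results. The sketch below assumes "neighbors" means the values immediately adjacent to the best one in the sweep, which the text does not pin down; `spike_ratio` and `classify_spike` are hypothetical helper names.

```python
def spike_ratio(scores):
    """scores: one metric per swept value, ordered by parameter value (at least 2 entries)."""
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    # Assumption: neighbors are the directly adjacent sweep points.
    neighbors = [scores[i] for i in (best_idx - 1, best_idx + 1) if 0 <= i < len(scores)]
    avg = sum(neighbors) / len(neighbors)
    if avg == 0:
        return float("inf")  # ratio is undefined; treat as maximally spiky
    return (scores[best_idx] - avg) / abs(avg)

def classify_spike(ratio):
    """Apply the thresholds from the text: >1.0 red flag, >0.30 warning."""
    if ratio > 1.0:
        return "red flag"
    if ratio > 0.30:
        return "warning"
    return "plateau"

plateau = spike_ratio([1.0, 1.1, 1.2, 1.1, 1.0])   # gentle hill
single_peak = spike_ratio([0.5, 0.5, 2.0, 0.5, 0.5])  # isolated optimum
```

The first sweep has a best value only about 9% above its neighbors, so it reads as a plateau. The second sits 300% above its neighbors and lands in red-flag territory.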
Hidden Parameter Test
Catches: a deployment knob (rebalance freq, position sizing, grid spacing) that wasn't advertised but heavily affects results
Sweep the secondary knobs every strategy actually depends on. For rotation bots: rebalance frequency (daily / weekly / bi-weekly / monthly). For grid bots: grid range and bin count. For multi-position bots: top-N concurrent positions. Real edges generalize to lower frequencies. A strategy that only works at daily rebalance is exploiting high-frequency noise, not a structural edge.
Fees & Slippage Sensitivity
Catches: edges that only exist in zero-cost backtests
Sweep trading costs from 0.05%/side (maker) to 0.30%/side (worst-case retail), so a full round trip costs twice the per-side figure. The edge must survive realistic costs to be real. Most strategies pass this test, because fees rarely kill momentum/regime bots. The more common killer is path-dependence (Test 1).
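A cost sweep of this kind might look like the following. It is a simplified sketch under stated assumptions: each trade pays the per-side cost once on entry and once on exit, returns compound multiplicatively, and the trade returns in `gross_trades` are made-up illustration values rather than results from any bot.

```python
def total_return(gross_trade_returns, cost_per_side):
    """Compound per-trade returns, charging the cost on both entry and exit."""
    equity = 1.0
    for r in gross_trade_returns:
        equity *= (1 - cost_per_side) * (1 + r) * (1 - cost_per_side)
    return equity - 1

# Per-side costs as fractions: 0.05%, 0.10%, 0.20%, 0.30%.
COSTS = [0.0005, 0.001, 0.002, 0.003]

# Hypothetical gross per-trade returns, for illustration only.
gross_trades = [0.04, -0.02, 0.03, 0.05, -0.01]

sweep = {cost: total_return(gross_trades, cost) for cost in COSTS}
```

Reading `sweep` from the cheapest to the most expensive cost level shows how quickly the edge erodes. A strategy with many small trades loses far more to the same per-side cost than one with few large trades.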
Cluster-Removal Stress Test
Catches: bots whose headline edge depends on 2-3 lucky events
Identify the top-3 highest-PnL trade clusters in the backtest (a cluster = ≥2 trades within 7 days). Remove them. Re-compute total return, CAGR, MaxDD without those events. If the headline numbers collapse by >50%, the bot isn't a smooth-yield strategy — it's a crisis-liquidity provider that fires rarely but big. Still useful — but the bot card has to say so.
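The cluster-removal step can be sketched as below. Two simplifications are assumptions of this example and not part of the test definition: clusters are formed by chaining consecutive trades whose gap is at most 7 days, and shrinkage is measured on summed trade PnL instead of recomputing total return, CAGR, and MaxDD. The function names and demo trades are hypothetical.

```python
from datetime import date

def find_clusters(trades, max_gap_days=7):
    """Group (date, pnl) trades whose consecutive gaps are <= max_gap_days; keep groups of >= 2."""
    ordered = sorted(trades, key=lambda t: t[0])
    groups, current = [], []
    for trade in ordered:
        if current and (trade[0] - current[-1][0]).days > max_gap_days:
            groups.append(current)
            current = []
        current.append(trade)
    if current:
        groups.append(current)
    return [g for g in groups if len(g) >= 2]

def top_cluster_shrinkage(trades, top_n=3):
    """Fraction of total PnL that disappears when the top_n highest-PnL clusters are removed."""
    total = sum(pnl for _, pnl in trades)
    if total <= 0:
        raise ValueError("shrinkage is only meaningful for a positive headline PnL")
    cluster_pnls = sorted((sum(p for _, p in g) for g in find_clusters(trades)), reverse=True)
    return sum(cluster_pnls[:top_n]) / total

# Hypothetical trade log: two trades two days apart carry most of the profit.
demo_trades = [
    (date(2024, 1, 1), 50), (date(2024, 1, 3), 30),
    (date(2024, 3, 1), 5), (date(2024, 5, 1), 5), (date(2024, 7, 1), 10),
]
shrinkage = top_cluster_shrinkage(demo_trades)
```

Here a single cluster accounts for 80 of 100 units of PnL, so removing it shrinks the headline by 80%, well past the 50% collapse threshold.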
Random-Baseline Test
Catches: when a "signal" looks predictive but is actually statistical noise
Generate 100 random pseudo-signal sets with the same frequency as the real signal. Run each through the same backtest. Compare the real signal's Sharpe against the distribution of the 100 randoms. The real signal must beat the 95th percentile of the random distribution — otherwise it's noise that happened to align with returns.
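One way to build the random baseline is to shuffle the real signal, which keeps its frequency exactly the same. The sketch below makes several assumptions that are not in the text: the signal is a list of in-market flags, Sharpe is the unannualized mean over standard deviation of per-period returns, and all function names are hypothetical.

```python
import random
import statistics

def sharpe(returns):
    """Unannualized Sharpe: mean / population stdev, 0.0 if there is no variation."""
    sd = statistics.pstdev(returns)
    return statistics.fmean(returns) / sd if sd > 0 else 0.0

def strategy_returns(asset_returns, signal):
    """Earn the asset return when the signal is on, zero otherwise."""
    return [r if on else 0.0 for r, on in zip(asset_returns, signal)]

def random_baseline_test(asset_returns, signal, n_random=100, seed=0):
    """Return (real_sharpe, p95_of_random, passed)."""
    rng = random.Random(seed)
    real = sharpe(strategy_returns(asset_returns, signal))
    random_sharpes = []
    for _ in range(n_random):
        shuffled = list(signal)
        rng.shuffle(shuffled)  # same number of "on" periods, random placement
        random_sharpes.append(sharpe(strategy_returns(asset_returns, shuffled)))
    p95 = statistics.quantiles(random_sharpes, n=20)[-1]  # last of 19 cut points = 95th pct
    return real, p95, real > p95

# Synthetic demo: a signal with perfect foresight should clear the bar easily.
demo_rng = random.Random(1)
demo_returns = [demo_rng.gauss(0, 0.01) for _ in range(500)]
perfect_signal = [r > 0 for r in demo_returns]
real_sharpe, p95, passed = random_baseline_test(demo_returns, perfect_signal)
```

Shuffling is only one choice of null model. Drawing each flag independently at the signal's hit rate would match the frequency on average and not exactly.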
Per-Trade Win/Loss Ratio
Reveals whether the edge is from frequent small wins or rare big wins
Extract per-trade P&L from the strategy's trade log. Compute the ratio of average win size to average loss size. W/L ≥ 2.5:1 with ≥20 trades is genuine trend-following structure — even at 50% win rate, the structural expected value per trade is positive.
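The ratio and the expected-value claim can be checked with a short calculation. The helper names and the sample trade log are hypothetical; break-even trades (PnL of exactly zero) are assumed to count as neither wins nor losses.

```python
def win_loss_ratio(trade_pnls):
    """Average win size divided by average loss size; None if either side is empty."""
    wins = [p for p in trade_pnls if p > 0]
    losses = [-p for p in trade_pnls if p < 0]
    if not wins or not losses:
        return None
    return (sum(wins) / len(wins)) / (sum(losses) / len(losses))

def expected_value_per_trade(win_rate, avg_win, avg_loss):
    """Expected PnL per trade, with avg_loss given as a positive size."""
    return win_rate * avg_win - (1 - win_rate) * avg_loss

ratio = win_loss_ratio([30, 20, -10, -10])          # average win 25, average loss 10
ev = expected_value_per_trade(0.5, 2.5, 1.0)        # in units of the average loss
```

At a 50% win rate and a 2.5:1 ratio, the expected value is 0.75 average-loss units per trade, which is why the structure is positive even when half the trades lose.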
Locked Beat-Rate → Stop Digging
Knowing when further parameter-tuning is wasted effort
When walk-forward beat-rates are locked (identical to the percentage point) or strictly worse across multiple parameter variants of either component — entry-side filters OR exit-side rules — the signal class itself is the structural limit. No further tuning of that strategy is going to rescue it. The right move is to stop digging on this signal class and reallocate research time to a different signal class entirely.
Decision rule revised 2026-05-05 (v2.2)
From binary to score-based
Until 2026-05-05 a single test failure could disqualify a bot from any tier above 4. That rule was too brittle: bots cleared 5 of 7 tests strongly and still got retired because of one knife-edge or one cluster-dependent year. We replaced the binary rule with a score sum while keeping every test threshold intact.
How a bot earns its score (max 16)
- T1 Walk-Forward: PASS=3 / WEAK=1 / FAIL=0
- T2 Multi-X: STRONG=2 / PARTIAL=1 / WEAK=0
- T3 Param-Sensitivity: ROBUST=2 / PARTIAL=1 / FRAGILE=0
- T4 Hidden-Parameter: ROBUST=2 / PARTIAL=1 / FRAGILE=0
- T5 Fees & Slippage: ROBUST=2 / OK=1 / FRAGILE=0
- T6 Cluster-Removal: PASS=2 / WARN=1 / CONCENTRATED=0
- T7 Random-Baseline: PASS=2 / FAIL=0
- Bonus W/L: ≥3.0 = +1 / 2.0–2.99 = +0.5
Score → Tier mapping
- ≥ 12 Tier 1 — real-money-eligible
- 8 – 11 Tier 2 — paper-track
- 5 – 7 Tier 3 — defensive role only
- < 5 Tier 4 — disproven
- T1 FAIL vs primary benchmark
- T6 CONCENTRATED (>70% top-3 shrinkage)
- T7 FAIL (real signal indistinguishable from random)
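The point table and tier thresholds above translate into a small lookup. This sketch covers only the score sum and the score-to-tier mapping. It assumes a half-point score such as 11.5 falls into the lower tier, and it does not model the three T1/T6/T7 conditions listed last. All names are hypothetical.

```python
POINTS = {
    "T1": {"PASS": 3, "WEAK": 1, "FAIL": 0},
    "T2": {"STRONG": 2, "PARTIAL": 1, "WEAK": 0},
    "T3": {"ROBUST": 2, "PARTIAL": 1, "FRAGILE": 0},
    "T4": {"ROBUST": 2, "PARTIAL": 1, "FRAGILE": 0},
    "T5": {"ROBUST": 2, "OK": 1, "FRAGILE": 0},
    "T6": {"PASS": 2, "WARN": 1, "CONCENTRATED": 0},
    "T7": {"PASS": 2, "FAIL": 0},
}

def wl_bonus(wl_ratio):
    """Bonus for the per-trade win/loss ratio: >=3.0 gives 1, 2.0 up to 3.0 gives 0.5."""
    if wl_ratio is None:
        return 0.0
    if wl_ratio >= 3.0:
        return 1.0
    if wl_ratio >= 2.0:
        return 0.5
    return 0.0

def total_score(verdicts, wl_ratio=None):
    """verdicts: dict mapping 'T1'..'T7' to the verdict label for that test."""
    return sum(POINTS[test][verdicts[test]] for test in POINTS) + wl_bonus(wl_ratio)

def tier_for(score):
    if score >= 12:
        return 1
    if score >= 8:
        return 2
    if score >= 5:
        return 3
    return 4

best = {"T1": "PASS", "T2": "STRONG", "T3": "ROBUST", "T4": "ROBUST",
        "T5": "ROBUST", "T6": "PASS", "T7": "PASS"}
mixed = {"T1": "WEAK", "T2": "PARTIAL", "T3": "ROBUST", "T4": "ROBUST",
         "T5": "ROBUST", "T6": "WARN", "T7": "PASS"}
```

A bot with the `best` verdicts and a W/L ratio of 3.0 or more reaches the maximum of 16. The `mixed` bot scores 11 from the tests plus 0.5 bonus at a 2.5 ratio, which keeps it in Tier 2 under the assumption above.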
The 4-Tier System
After the 7+1 tests, every bot lands in one of four tiers. Tier dictates: real money, paper-track, defensive sleeve, or shelved.
Tier 1: Validation-Strong, Paper-Tracking
Cleared the v2.1 multi-benchmark gates on at least one path: Path 1 (return-superior vs S&P), Path 2 (Sharpe ≥1.0 + DD-protection), or Path 3 (diversification sleeve: Sharpe ≥0.8 + correlation to S&P ≤0.3 + capital preservation in S&P-bear windows). Currently paper-tracking under the all-paper policy adopted 2026-04-28.
Tier 2: Concept-Validated, Paper-Track
Concept-validated within their strategy class but don't meet Tier 1 v2.1 strict. Watchdog joined this tier 2026-04-28 after the formal recompute showed Sharpe 0.35 + DD-ratio 1.38 vs S&P (all 3 v2.1 paths fail strict).
Tier 3: Defensive Role Only
Walk-Forward FAIL but drawdown protection genuine (-30 to -50% MaxDD vs HODL's -70 to -85%). Useful as a portfolio sleeve, not a standalone alpha generator.
Tier 4: Disproven Hypothesis
Tested as a fix for a known problem and produced negative results. Documented for memory but not deployed.
Standards revised 2026-04-28 (v2.1)
Why we use S&P 500 as primary benchmark (not BTC HODL)
Until April 2026 we benchmarked every bot against BTC HODL. That was the wrong comparison for a real-money decision. For most investors the actual opportunity cost isn't "just hold Bitcoin"; it's "put the money in an S&P 500 index fund." v2.1 uses three benchmarks in parallel.
S&P 500
Forward expectation: ~10% nominal/year
The actual opportunity cost. If you weren't running this bot, you'd almost certainly hold an index fund. Hedge funds compare to S&P, not crypto. This is the bar for "should this exist as a real-money allocation?"
BTC HODL
Forward expectation: ~30% CAGR (degrading)
The "in-crypto edge" benchmark. Only used when a bot explicitly claims to add value over passive crypto holding. BTC HODL 2014-2026 returned +21,000% — that's not repeatable, so we use 30% forward CAGR.
T-bills (~5%)
Risk-free rate
Absolute floor. Any strategy that doesn't clear this is asking you to take risk for nothing. No exceptions.
Honest update history
- v1 (pre-April 2026): single gate, "beat BTC HODL in ≥60% of Walk-Forward windows." Worked, but biased toward FAIL because BTC HODL 2014-2026 was an exceptionally strong secular trend that few active strategies could match.
- v2 (2026-04-28 morning): added Path 2 — risk-adjusted-superior (Sharpe ≥1.0 + drawdown protection + cluster-test + W/L). Two parallel paths to each tier.
- v2.1 (2026-04-28 afternoon): switched primary benchmark from BTC HODL → S&P 500. Added Path 3 (diversification sleeve: market-neutral + low S&P correlation). Initially promoted Basis Sentinel + Alpha Hunter + Watchdog to Tier 1.
- v2.1 strict + all-paper-policy (2026-04-28 evening): formal Sharpe + DD-ratio recompute showed Watchdog at Sharpe 0.35 + DD-ratio 1.38 vs S&P — demoted to Tier 2 (all 3 v2.1 paths fail strict). Same evening: adopted all-paper policy. No bot graduates to real money until ≥6mo forward-validated proof. Watchdog real-money allocation paused since 2026-04-22.
We revise the standards openly — even when (especially when) the revision demotes our most-tested bot. The goal is not to make our own bots look good; the goal is to use the right comparison for each decision.
New section · all-paper policy adopted 2026-04-28
Real-Money Graduation Criteria
BBR is two weeks old. Backtest validation tells you a strategy could work; it doesn't tell you it does work in the actual conditions of the next six months. Effective 2026-04-28, no bot runs on real money until it accumulates forward-validated proof. Watchdog's historical real-money allocation is paused.
All four must hold for ≥6 months of live tracking:
- Tier 1 under v2.1 strict — bot must meet at least one of Path 1 (return-superior vs S&P 500), Path 2 (Sharpe ≥1.0 + DD-protection), or Path 3 (diversification sleeve: Sharpe ≥0.8 + correlation to S&P ≤0.3 + capital preservation in S&P-bear windows ≥70%).
- Forward beat-rate vs S&P ≥50% on rolling 12-month windows of live paper data — not backtest. The strategy must actually deliver in the regime it's being asked to deliver in.
- MaxDD ≤ backtest MaxDD × 1.2 — live drawdown stays within 20% of the backtest expectation. Larger live DD = strategy is being stressed in ways the backtest didn't capture.
- Sharpe within 0.5 of backtest estimate — live risk-adjusted return must be in the same neighborhood as the backtest claimed. Sharpe degradation is the most common live-vs-backtest gap.
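The four criteria plus the six-month minimum can be expressed as a checklist. This is a sketch with assumptions labeled: drawdowns are stored as positive magnitudes, "within 0.5" is read as an absolute difference in either direction, and `LiveRecord`, its fields, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class LiveRecord:
    months_live: float
    is_tier1: bool             # Tier 1 under v2.1 strict, decided elsewhere
    forward_beat_rate: float   # share of rolling 12-month live windows beating S&P
    live_max_dd: float         # positive magnitude, e.g. 0.25 for a -25% drawdown
    backtest_max_dd: float     # positive magnitude
    live_sharpe: float
    backtest_sharpe: float

def graduation_checks(r):
    """Evaluate each graduation criterion separately so failures are visible."""
    return {
        "six_months": r.months_live >= 6,
        "tier1": r.is_tier1,
        "beat_rate": r.forward_beat_rate >= 0.50,
        "max_dd": r.live_max_dd <= r.backtest_max_dd * 1.2,
        "sharpe": abs(r.live_sharpe - r.backtest_sharpe) <= 0.5,
    }

def can_graduate(r):
    return all(graduation_checks(r).values())

# Hypothetical records: identical except for live drawdown.
within_limits = LiveRecord(6.5, True, 0.55, 0.22, 0.20, 1.1, 1.4)
dd_breach = LiveRecord(6.5, True, 0.55, 0.30, 0.20, 1.1, 1.4)
```

Returning the individual checks alongside the final verdict makes it clear which criterion blocked a bot. In the second record only the drawdown limit fails (0.30 against an allowed 0.24).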
First eligibility window
The earliest any bot can graduate is 2026-10-28 — six months after today. Currently eligible to apply: Basis Sentinel, Alpha Hunter (both Tier 1). Watchdog returns to Tier 1 first if its forward Sharpe stabilizes ≥1.0 or it shows explicit S&P-uncorrelated crisis-alpha.
Why this is honest
BBR is a 2-week-old project. "Live since 2026-04-12" is not a forward-validated track record — it's 16 days. Calling that "real-money-eligible" would be using a backtest claim as if it were forward proof. The all-paper window forces the framework to stand on its own: backtests choose candidates, forward results promote them.
⚠ The one bias the suite cannot fully escape
All seven tests run on historical data. Even a strategy that passes every one is still a bet on regime continuity: the quiet assumption that the future will resemble the past. Walk-Forward catches regimes we've seen; it cannot catch regimes that haven't happened yet. That's why every bot card carries the "Past ≠ Future" chip. Why every backtest is a bet on the future →
What We Don't Do
Loosen criteria for favored bots
When a beloved strategy fails the suite, we change its tier — not the bar. The Surfer was demoted from "OOS validated" to "Tier 3 defensive" the day Walk-Forward exposed the original claim.
Cherry-pick time windows
Walk-Forward uses every 2-year window the data permits. Cherry-picking is structurally impossible.
Hide failed strategies
Every retired strategy is documented on /post-mortems with date, reason, and lesson. Hopper V2 lives there as a Tier 4 example.
Ignore fees and slippage
All backtests include realistic 0.10%-0.20% per-trade costs. Test 5 sweeps to 0.30%/side worst-case retail.
Optimize on the test data
Walk-Forward parameter selection happens on the train segment only. The test segment is used to score, never to tune.
Promise alpha
Backtests don't guarantee future returns. Tier labels are honest, not aspirational. Tier 2 means "concept validated, calibration questionable" — not "will make you rich."
See the methodology applied
Every bot on /bots renders its own validation panel — test results plus tier verdict. Every retired strategy on /post-mortems failed at least one test, explained.