How we test what we test

Seven tests every strategy must pass before it earns a tier.

Each test catches a different way a backtest can lie to you. Of 11 bots run through the original 5-test suite, exactly one cleared every test. Two more tests (Cluster-Removal, Random-Baseline) were added in 2026-04 and 2026-05 after they exposed mistakes in our own thinking. One bot was retired to /post-mortems. The rest landed in Tier 2-3. Tiers are honest. The tests are the same for everyone.

Tier 1 · Validation-strong | Tier 2 · Paper-track | Tier 3 · Defensive only | Tier 4 · Disproven

2026-04-28: All bots paper-tracking. Real-money graduation now requires ≥6 months of forward-validated proof — no bot graduates from backtest-strength alone.

Why this exists

Until April 2026, BBR bots were validated with informal phrases like "+2,942% over 6 years", "13/13 parameter variants profitable", "OOS validated 60/40 split". All three turned out to be technically true but uninformative or actively misleading when subjected to harder tests.

  • +2,942% return = full-period total dominated by 1-2 winning regimes. Walk-Forward shows the bot loses to HODL in most rolling 2-year windows.
  • 13/13 robust = parameter sweep on an irrelevant axis. The actually-relevant axis was fragile.
  • OOS validated 60/40 split = single static train/test = sample-of-1.

This page exists so we never repeat that pattern. Every new bot must pass the same harsh tests that exposed the limits of every existing bot. No exceptions for favourites.

The 7-Test Suite

Each test catches a different way a backtest can lie. Skipping any one means accepting a higher rate of deploying broken strategies. The first five tests are run on every bot. Tests 6 and 7 were added when we discovered the failure modes they catch — both came from re-examining bots we'd already passed.

Test 1 · The hardest, most important

Walk-Forward Validation

Catches: strategies that worked once because of a lucky regime, not a structural edge

Roll a 2-year window across the entire history with a 90-day step. In each window, train the strategy's primary parameters on the first 60% of data and apply the chosen config to the last 40%. Compare the test-segment return to the primary benchmark (S&P 500). For bots that claim in-crypto edge: also check vs BTC HODL.

  • PASS: ≥60% of OOS windows beat S&P 500
  • WEAK: 40-59%
  • FAIL: <40%

Result on our 11 bots: 7 of 7 momentum/regime bots tested in April 2026 FAILED here. Best was Alpha Hunter at 36%, all others 25-32%. Basis Sentinel passed 4 of 4 OOS windows (different strategy class — market-neutral carry).
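A minimal sketch of the window mechanics and the verdict thresholds described above. The function names are hypothetical, calendar days stand in for trading days, and a window "beats" the benchmark when its test-segment excess return is positive. None of this is the production backtester; it only shows how a 2-year window, a 90-day step, and a 60/40 split turn into a beat-rate.

```python
from datetime import date, timedelta

def walk_forward_windows(start, end, window_days=730, step_days=90, train_frac=0.6):
    """Return (train_start, train_end, test_end) for each rolling window."""
    windows = []
    cursor = start
    while cursor + timedelta(days=window_days) <= end:
        # First 60% of the window is for tuning, the last 40% for scoring.
        train_end = cursor + timedelta(days=round(window_days * train_frac))
        windows.append((cursor, train_end, cursor + timedelta(days=window_days)))
        cursor += timedelta(days=step_days)
    return windows

def classify_beat_rate(beat_rate):
    """Map the share of OOS windows that beat the benchmark to a verdict."""
    if beat_rate >= 0.60:
        return "PASS"
    if beat_rate >= 0.40:
        return "WEAK"
    return "FAIL"

def walk_forward_verdict(excess_returns):
    """excess_returns: test-segment return minus benchmark return, one per window."""
    beats = sum(1 for x in excess_returns if x > 0)
    rate = beats / len(excess_returns)
    return rate, classify_beat_rate(rate)
```

With these defaults, three years of history yields only five windows, which is why a long history matters for this test.
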
Test 2 · Strongest positive signal

Multi-X Robustness

Catches: edges that depend on the one specific asset / filter combo the bot ships with

Run the same rule with 5+ alternative configurations. For multi-asset bots: swap pairings (BTC + each of GDX, GLD, SLV, TLT, XLE, SPY, TIP). For multi-filter bots: turn each filter on / off individually. Genuine edges generalize across nearby variants.

  • STRONG: ≥80% of variants succeed
  • PARTIAL: 50-79%
  • WEAK: <50%

The Watchdog surprise: tested individually, the on-chain DM+LD filters added zero incremental value. SMA(100)-only ties Full Watchdog at +208% vs +204%. Every multi-signal bot must run this test. The honest update.
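The filter on/off procedure can be sketched as follows. This is an illustration under assumptions: the helper names are hypothetical, a variant's "success" is supplied by the caller as a boolean (the page does not define it further), and the filter names in the usage note are only labels.

```python
def ablation_variants(filters):
    """Full config plus one variant per filter with that single filter switched off."""
    full = {name: True for name in filters}
    variants = [full]
    for name in filters:
        variant = dict(full)
        variant[name] = False
        variants.append(variant)
    return variants

def classify_multi_x(successes):
    """successes: one boolean per tested variant."""
    rate = sum(successes) / len(successes)
    if rate >= 0.80:
        return "STRONG"
    if rate >= 0.50:
        return "PARTIAL"
    return "WEAK"
```

For a three-filter bot, `ablation_variants(["SMA100", "DM", "LD"])` gives four configs: the full bot and one with each filter removed. Comparing each against the full config is what shows whether a filter adds incremental value.
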
Test 3 · The one almost nobody runs

Parameter Sensitivity Sweep

Catches: strategies sitting on a single magic number with mediocre neighbors

Sweep the bot's primary parameter (lookback period, threshold, percentile) across 7-13 values bracketing the deployed default. A real edge produces a plateau — most nearby values also work. A statistical fluke produces a single-peak optimum. We measure the spike ratio: (best − avg-of-neighbors) / |avg-of-neighbors|. >0.30 is a warning. >1.0 is a red flag.

Empirical: Hopper had spike ratio +2.15 at N=40 (red flag). Surfer had a broad plateau 1.5-3.0% (the only momentum/regime bot to PASS this test). Most fall in between.
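A small sketch of the spike-ratio formula. One assumption is labeled loudly here: "neighbors" is taken to mean the sweep values immediately to the left and right of the best one, since the page does not pin that down. The "plateau" label for ratios at or below 0.30 is also ours, not an official verdict name.

```python
def spike_ratio(scores):
    """(best - avg_of_neighbors) / |avg_of_neighbors| over an ordered parameter sweep."""
    best = max(range(len(scores)), key=scores.__getitem__)
    # Assumption: neighbors are the adjacent sweep points on either side of the best.
    neighbors = [scores[i] for i in (best - 1, best + 1) if 0 <= i < len(scores)]
    avg = sum(neighbors) / len(neighbors)
    if avg == 0:
        raise ValueError("neighbor average is zero; ratio undefined")
    return (scores[best] - avg) / abs(avg)

def classify_spike(ratio):
    if ratio > 1.0:
        return "red flag"
    if ratio > 0.30:
        return "warning"
    return "plateau"
```

A sweep of `[10, 12, 30, 11, 9]` has a best value of 30 against neighbors averaging 11.5, a ratio of about 1.61, which lands in red-flag territory. A sweep of `[10, 11, 12, 11, 10]` gives about 0.09, a plateau.
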
Test 4 · Catches undocumented dependencies

Hidden Parameter Test

Catches: a deployment knob (rebalance freq, position sizing, grid spacing) that wasn't advertised but heavily affects results

Sweep the secondary knobs every strategy actually depends on. For rotation bots: rebalance frequency (daily / weekly / bi-weekly / monthly). For grid bots: grid range and bin count. For multi-position bots: top-N concurrent positions. Real edges generalize to lower frequencies. A strategy that only works at daily rebalance is exploiting high-frequency noise, not a structural edge.

Empirical: all four rotation bots (Hopper, Rotator, Tri-Rotator, Alpha Hunter) showed daily-rebalance dependency. Drop to weekly = lose 60-80% of return. Suspicious.
Test 5 · Necessary, but not sufficient

Fees & Slippage Sensitivity

Catches: edges that only exist in zero-cost backtests

Sweep the trading cost from 0.05%/side (maker) to 0.30%/side (worst-case retail). The edge must survive realistic costs to be real. Most strategies pass this: fees rarely kill momentum/regime bots. The killer is path-dependence (Test 1).

Result on our 11 bots: all 11 pass. Fees are not the bottleneck. Path-dependence is.
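A rough sketch of the cost sweep, under stated assumptions: each trade is a full-equity round trip that pays the per-side cost once on entry and once on exit, and the intermediate sweep points (0.10% and 0.20%) are our choice. The function names are hypothetical.

```python
def net_total_return(gross_trade_returns, cost_per_side):
    """Compound per-trade gross returns, charging cost_per_side on entry and on exit."""
    equity = 1.0
    for gross in gross_trade_returns:
        equity *= (1 - cost_per_side) * (1 + gross) * (1 - cost_per_side)
    return equity - 1.0

def fee_sweep(gross_trade_returns, costs=(0.0005, 0.0010, 0.0020, 0.0030)):
    """Total net return at each per-side cost level, from maker to worst-case retail."""
    return {cost: net_total_return(gross_trade_returns, cost) for cost in costs}
```

Because the cost compounds once per side per trade, the drag grows with trade count. A bot with few trades barely notices the sweep, while a high-turnover bot can lose its whole edge at the upper end.
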
Test 6 · Added 2026-04-28 (Lesson #13)

Cluster-Removal Stress Test

Catches: bots whose headline edge depends on 2-3 lucky events

Identify the top-3 highest-PnL trade clusters in the backtest (a cluster = ≥2 trades within 7 days). Remove them. Re-compute total return, CAGR, MaxDD without those events. If the headline numbers collapse by >50%, the bot isn't a smooth-yield strategy — it's a crisis-liquidity provider that fires rarely but big. Still useful — but the bot card has to say so.

  • PASS: headline shrinks <50% without top-3 clusters
  • WARN: 50-70% shrinkage (concentrated)
  • CONCENTRATED: >70%, must be re-framed as crisis-alpha

The Sentry surprise (2026-04-28): Sentry showed +35.4% / Sharpe 1.70 / MaxDD only −3.11% — the cleanest drawdown profile in BBR history. Cluster-Removal showed 70% of the profit came from a single 4-day liquidation event in October 2025. Without those four days the strategy is +11.2%. We re-framed Sentry from "smooth funding-yield bot" to "crisis-liquidity provider on funding-rate triggers." Still real edge — just not what the old bot card said.
Three momentum bots tested 2026-05-04: Surfer (37% from Trump-rally Q4-2024), Tactician (39% from the same event — these two are NOT portfolio-diversifiers), Alpha Hunter (35% from AVGO+C spike March 2026). All three got the WARN tag plus a persona-update on their cards. None demoted — but the framing on their cards is now honest.
Caveat (Lesson #26, added 2026-05-06): if a strategy's cooldown between trades is at least as long as the cluster-window definition (currently 7 days), T6 passes mechanically — clusters can't form inside a window the strategy structurally locks. The PASS becomes uninformative. Caught during the Resistance-Breakout falsification (cooldown=7d, shrinkage=0% on every timeframe). Going forward: when cooldown ≥ cluster-window, T6 must be redefined or skipped — never silently PASSed.
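The cluster-removal procedure can be sketched roughly like this. Assumptions, all ours: trades are `(date, pnl)` pairs, a cluster is a chain of trades where each follows the previous one within the window, and shrinkage is measured on summed PnL rather than on recomputed total return, CAGR, or MaxDD. The `cooldown_days` guard reflects the caveat above by refusing to run instead of passing silently.

```python
from datetime import date

def find_clusters(trades, window_days=7):
    """Group date-sorted trades into chains with gaps <= window_days; keep chains of >= 2."""
    ordered = sorted(trades, key=lambda t: t[0])
    clusters, current = [], []
    for trade in ordered:
        if current and (trade[0] - current[-1][0]).days > window_days:
            if len(current) >= 2:
                clusters.append(current)
            current = []
        current.append(trade)
    if len(current) >= 2:
        clusters.append(current)
    return clusters

def cluster_removal_shrinkage(trades, top_n=3, window_days=7, cooldown_days=0):
    """Share of total PnL that disappears when the top_n highest-PnL clusters are removed."""
    if cooldown_days >= window_days:
        raise ValueError("cooldown >= cluster window: test is uninformative, redefine or skip")
    total = sum(pnl for _, pnl in trades)
    if total <= 0:
        raise ValueError("shrinkage is only defined for a positive total PnL")
    ranked = sorted(find_clusters(trades, window_days),
                    key=lambda c: sum(pnl for _, pnl in c), reverse=True)
    removed = sum(pnl for cluster in ranked[:top_n] for _, pnl in cluster)
    return removed / total

def classify_shrinkage(shrinkage):
    if shrinkage > 0.70:
        return "CONCENTRATED"
    if shrinkage >= 0.50:
        return "WARN"
    return "PASS"
```

Raising an error when the cooldown is at least as long as the window is one possible way to make the uninformative case visible. Widening the window for such strategies would be another.
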
Test 7 · Added 2026-05-04 (Lesson #21)

Random-Baseline Test

Catches: when a "signal" looks predictive but is actually statistical noise

Generate 100 random pseudo-signal sets with the same frequency as the real signal. Run each through the same backtest. Compare the real signal's Sharpe against the distribution of the 100 randoms. The real signal must beat the 95th percentile of the random distribution — otherwise it's noise that happened to align with returns.

  • PASS: real signal Sharpe > p95 of randoms
  • FAIL: real signal sits inside the random distribution

The Sentry-news-aware story (2026-05-04): we tested four variants of an "event-aware" Sentry that adjusts position sizing around macro events (Fed, CPI, FOMC). All four showed a +15-20% Sharpe improvement vs baseline. Looked promising. Random-Baseline showed: fake-event sets produced the same +15-20% lift. The improvement wasn't from event-awareness — it was from the volatility filter that came along for the ride. Funding rate already captures the event signal implicitly. Research line closed; no new bot.
When this is mandatory: any study claiming "event X causes outcome Y" must clear this bar before becoming a bot. Without it, we can't tell signal from coincidence.
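A minimal sketch of the random-baseline comparison, assuming a simple long/flat signal (one boolean per period) and an unannualized per-period Sharpe. "Same frequency" is interpreted here as "same number of active periods". All names are hypothetical, and the real backtest is of course richer than this toy return model.

```python
import random
import statistics

def sharpe(returns):
    """Per-period mean over standard deviation (not annualized)."""
    sd = statistics.pstdev(returns)
    return 0.0 if sd == 0 else statistics.mean(returns) / sd

def strategy_returns(signal, asset_returns):
    """Earn the asset return when the signal is on, zero when it is off."""
    return [r if on else 0.0 for on, r in zip(signal, asset_returns)]

def random_baseline_test(signal, asset_returns, n_random=100, seed=0):
    """Compare the real signal's Sharpe with the 95th percentile of random signals."""
    rng = random.Random(seed)
    real = sharpe(strategy_returns(signal, asset_returns))
    n_on = sum(signal)
    random_sharpes = []
    for _ in range(n_random):
        # Fake signal with the same number of active periods, placed at random.
        on = set(rng.sample(range(len(signal)), n_on))
        fake = [i in on for i in range(len(signal))]
        random_sharpes.append(sharpe(strategy_returns(fake, asset_returns)))
    p95 = statistics.quantiles(random_sharpes, n=20)[-1]
    return real, p95, "PASS" if real > p95 else "FAIL"
```

The fixed seed only makes the sketch reproducible. In a real study the random sets should also preserve whatever side effects the real signal carries, such as a volatility filter, or the comparison will flatter the signal.
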
Bonus · The deepest signal

Per-Trade Win/Loss Ratio

Reveals whether the edge is from frequent small wins or rare big wins

Extract per-trade P&L from the strategy's trade log. Compute the ratio of average win size to average loss size. W/L ≥ 2.5:1 with ≥20 trades is genuine trend-following structure — even at 50% win rate, the structural expected value per trade is positive.

Why this matters: the Watchdog FAILED Walk-Forward (25%) but PASSED here with W/L = 3.44:1 across 24 trades. That structural asymmetry is the actual reason the bot survives paper-tracking even after the v2.1 demotion to Tier 2. Without this test, we'd have written off the only BTC bot that genuinely has trend-following edge.
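A short sketch of the W/L computation and the expected-value arithmetic behind it. Function names are hypothetical and the trade log is reduced to a plain list of per-trade P&L values. With an average loss of 1 unit, a 50% win rate and W/L = 2.5 give an expected value of 0.5 × 2.5 − 0.5 × 1 = +0.75 units per trade, and the break-even win rate at that ratio is 1 / (1 + 2.5), roughly 28.6%.

```python
def win_loss_ratio(trade_pnls):
    """Average winning trade size divided by average losing trade size."""
    wins = [p for p in trade_pnls if p > 0]
    losses = [-p for p in trade_pnls if p < 0]
    if not wins or not losses:
        raise ValueError("need at least one win and one loss")
    return (sum(wins) / len(wins)) / (sum(losses) / len(losses))

def expected_value_per_trade(win_rate, wl_ratio, avg_loss=1.0):
    """Expected P&L per trade, in units of avg_loss."""
    return win_rate * wl_ratio * avg_loss - (1 - win_rate) * avg_loss

def has_trend_structure(trade_pnls, min_ratio=2.5, min_trades=20):
    """W/L >= 2.5:1 on at least 20 trades, per the rule above."""
    return len(trade_pnls) >= min_trades and win_loss_ratio(trade_pnls) >= min_ratio
```
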
Meta-Diagnostic · Added 2026-05-06 (Lesson #27)

Locked Beat-Rate → Stop Digging

Knowing when further parameter-tuning is wasted effort

When walk-forward beat-rates are locked (identical to the percentage point) or strictly worse across multiple parameter variants of either component — entry-side filters OR exit-side rules — the signal class itself is the structural limit. No further tuning of that strategy is going to rescue it. The right move is to stop digging on this signal class and reallocate research time to a different signal class entirely.

The Resistance-Breakout case (2026-05-06): we ran four exit variants (SMA-20, SMA-50, trailing-30d-10%, trailing-15%-only) on the same entries — WF beat-rate vs HODL was identical 32.5% across all four. Then four entry-filter variants (volume-2×, vol-compression, trend-resumption, composite) on top of the strongest exit — every filter worsened the beat-rate to 25-30%. Two orthogonal rescue attempts, both confirming the signal itself is the limit. The strategy stays Tier 3 by structural fact, not parameter choice.
When to invoke: after any pair of orthogonal-component sweeps (one on entry, one on exit, both with at least 3 variants) where T1 doesn't cross the PASS threshold and the delta between best and worst variant is <10 percentage points. Stop further tuning. Document the structural limit. Move on.
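The invocation rule can be written as a small check. This is a sketch under one interpretive assumption: the "<10 percentage points" spread is evaluated within each sweep separately. The function name is hypothetical and beat-rates are given in percent.

```python
def should_stop_digging(entry_beat_rates, exit_beat_rates,
                        pass_threshold=60.0, max_spread_pp=10.0, min_variants=3):
    """True when both orthogonal sweeps are flat and none reaches the T1 PASS threshold."""
    sweeps = (entry_beat_rates, exit_beat_rates)
    if any(len(sweep) < min_variants for sweep in sweeps):
        return False  # not enough variants to draw a conclusion
    for sweep in sweeps:
        if max(sweep) >= pass_threshold:
            return False  # some variant actually passes Walk-Forward
        if max(sweep) - min(sweep) >= max_spread_pp:
            return False  # tuning still moves the result meaningfully
    return True
```

For example, an exit sweep locked at `[32.5, 32.5, 32.5, 32.5]` combined with an entry sweep such as `[25.0, 27.5, 30.0, 28.0]` (illustrative values inside the 25-30% range, not the recorded ones) returns `True`.
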
Combined effect: of 11 bots run through Tests 1-5, only 1 (Basis Sentinel) cleared everything. Test 6 (Cluster-Removal) was added after Sentry's "cleanest drawdown in BBR history" turned out to be 70% from a single October-2025 event — Sentry stayed Tier 2 but its bot card got re-framed. Test 7 (Random-Baseline) was added after a +15-20% Sharpe lift from "event-awareness" turned out to match what 100 random fake-event sets produced — research line closed, no new bot. Both tests were lessons we paid for in real work, then institutionalised so we don't pay twice. ALL bots are now paper-tracking under the all-paper policy adopted 2026-04-28; no bot graduates to real money until ≥6mo forward-validated proof.

Decision rule revised 2026-05-05 (v2.2)

From binary to score-based

Until 2026-05-05 a single test failure could disqualify a bot from any tier above 4. That rule was too brittle: bots cleared 5 of 7 tests strongly and still got retired because of one knife-edge or one cluster-dependent year. We replaced the binary rule with a score sum while keeping every test threshold intact.

How a bot earns its score (max 16)

  • T1 Walk-Forward: PASS=3 / WEAK=1 / FAIL=0
  • T2 Multi-X: STRONG=2 / PARTIAL=1 / WEAK=0
  • T3 Param-Sensitivity: ROBUST=2 / PARTIAL=1 / FRAGILE=0
  • T4 Hidden-Parameter: ROBUST=2 / PARTIAL=1 / FRAGILE=0
  • T5 Fees & Slippage: ROBUST=2 / OK=1 / FRAGILE=0
  • T6 Cluster-Removal: PASS=2 / WARN=1 / CONCENTRATED=0
  • T7 Random-Baseline: PASS=2 / FAIL=0
  • Bonus W/L: ≥3.0 = +1 / 2.0–2.99 = +0.5

Score → Tier mapping

  • ≥ 12 Tier 1 — real-money-eligible
  • 8 – 11 Tier 2 — paper-track
  • 5 – 7 Tier 3 — defensive role only
  • < 5 Tier 4 — disproven
Hard floors (overrule the score, cap at Tier 2):
  • T1 FAIL vs primary benchmark
  • T6 CONCENTRATED (>70% top-3 shrinkage)
  • T7 FAIL (real signal indistinguishable from random)
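The scoring table, tier mapping, and hard floors can be combined into one function. This is an illustrative sketch, not the production scorer: the names are hypothetical, and a fractional score such as 11.5 is assumed to fall into the lower band because it does not reach 12.

```python
POINTS = {
    "T1": {"PASS": 3, "WEAK": 1, "FAIL": 0},
    "T2": {"STRONG": 2, "PARTIAL": 1, "WEAK": 0},
    "T3": {"ROBUST": 2, "PARTIAL": 1, "FRAGILE": 0},
    "T4": {"ROBUST": 2, "PARTIAL": 1, "FRAGILE": 0},
    "T5": {"ROBUST": 2, "OK": 1, "FRAGILE": 0},
    "T6": {"PASS": 2, "WARN": 1, "CONCENTRATED": 0},
    "T7": {"PASS": 2, "FAIL": 0},
}

def wl_bonus(wl_ratio):
    """Bonus points from the per-trade win/loss ratio (None = not measured)."""
    if wl_ratio is None:
        return 0.0
    if wl_ratio >= 3.0:
        return 1.0
    if wl_ratio >= 2.0:
        return 0.5
    return 0.0

def score_and_tier(results, wl_ratio=None):
    """results maps 'T1'..'T7' to a verdict string; returns (score, tier)."""
    score = sum(POINTS[test][results[test]] for test in POINTS) + wl_bonus(wl_ratio)
    if score >= 12:
        tier = 1
    elif score >= 8:
        tier = 2
    elif score >= 5:
        tier = 3
    else:
        tier = 4
    # Hard floors cap the tier at 2; they never lift a weaker bot up to Tier 2.
    hard_floor = (results["T1"] == "FAIL"
                  or results["T6"] == "CONCENTRATED"
                  or results["T7"] == "FAIL")
    if hard_floor:
        tier = max(tier, 2)
    return score, tier
```

A bot with top marks everywhere except a T7 FAIL scores 13 before the bonus, which would map to Tier 1, but the hard floor holds it at Tier 2.
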
Alternative anchor for trend-following bots (v2.2): Path 1 (return-superior) gained a second route. A bot can clear it via the original WF ≥ 60% + Multi-X STRONG + W/L ≥ 2.5, or via the new WF ≥ 50% + W/L ≥ 3.0 + Win-Rate ≥ 50% + T6 PASS + T7 PASS. The alternative recognises that single-asset trend-following bots (like the Watchdog) can show structural edge through asymmetric per-trade payouts even when they don't generalize to other assets. Renaissance Medallion runs at ~55% accuracy; demanding 60% from every BBR bot was stricter than industry. The alternative trades raw beat-rate for a tighter W/L anchor (3.0 vs 2.5) plus mandatory passes on the two newest tests.
Retroactive recalibration (2026-05-05): after running Tests 6 + 7 retroactively on every live bot, the score system reshuffled the tiers. Hopper and Alpha Hunter dropped from Tier 1 to Tier 2 because the new Cluster-Removal data revealed that >70% of their headline edges concentrate in 3 trade clusters (T6 hard floor). Surfer rose from Tier 3 to Tier 2 (T3 plateau plus T7 PASS). Three rotation bots (Rotator, Tri-Rotator, Tactician) dropped to Tier 3 as defensive sleeves. The system is now stricter at the top (fewer Tier 1 bots) and fairer below (partial credit for partial wins).

The 4-Tier System

After the 7+1 tests, every bot lands in one of four tiers. Tier dictates: real money, paper-track, defensive sleeve, or shelved.

TIER 1

Validation-Strong, Paper-Tracking

Cleared the v2.1 multi-benchmark gates on at least one path: Path 1 (return-superior vs S&P), Path 2 (Sharpe ≥1.0 + DD-protection), or Path 3 (diversification sleeve: Sharpe ≥0.8 + correlation to S&P ≤0.3 + capital preservation in S&P-bear windows). Currently paper-tracking under the all-paper policy adopted 2026-04-28.

Currently here (2): Basis Sentinel (Path 3, Sharpe 2.0, market-neutral) and Alpha Hunter (Path 1, +93pp documented excess vs SPY over 6.6y). Real-money graduation requires ≥6mo forward-validated proof — see the Real-Money Graduation Criteria section below.
TIER 2

Concept-Validated, Paper-Track

Concept-validated within their strategy class but don't meet Tier 1 v2.1 strict. Watchdog joined this tier 2026-04-28 after the formal recompute showed Sharpe 0.35 + DD-ratio 1.38 vs S&P (all 3 v2.1 paths fail strict).

Currently here (6): Watchdog (demoted 2026-04-28), Hopper V1, Rotator, Tri-Rotator, Tactician (binary v1), Sentry (crisis-liquidity provider, cluster-dependent edge).
TIER 3

Defensive Role Only

Walk-Forward FAIL but drawdown protection genuine (-30 to -50% MaxDD vs HODL's -70 to -85%). Useful as a portfolio sleeve, not a standalone alpha generator.

Currently here: Surfer (re-positioned from "OOS validated" on 2026-04-26 after Walk-Forward exposed asymmetric Multi-X behavior — wins in bears, loses in bulls).
TIER 4

Disproven Hypothesis

Tested as a fix for a known problem and produced negative results. Documented for memory but not deployed.

Currently here: Hopper V2 (averaged-lookback ensemble — fixed the parameter overfit but Walk-Forward got 27pp worse, validating the lesson "ensembles fix overfit, not path-dependence").
Untested: any new bot idea remains untested until it has passed through this exact suite. Bots from earlier eras (Tactician 2.0, Contrarian, Genius, Sentry) carry an honest "not yet validated" marker on their cards until they are. We don't back-fill claims.

Standards revised 2026-04-28 (v2.1)

Why we use S&P 500 as primary benchmark (not BTC HODL)

Until April 2026 we benchmarked every bot against BTC HODL. That was wrong for a real-money decision. The actual opportunity cost for 99% of investors isn't "just hold Bitcoin" — it's "put the money in an S&P 500 index fund." v2.1 uses three benchmarks in parallel.

Primary

S&P 500

Forward expectation: ~10% nominal/year

The actual opportunity cost. If you weren't running this bot, you'd almost certainly hold an index fund. Hedge funds compare to S&P, not crypto. This is the bar for "should this exist as a real-money allocation?"

Secondary

BTC HODL

Forward expectation: ~30% CAGR (degrading)

The "in-crypto edge" benchmark. Only used when a bot explicitly claims to add value over passive crypto holding. BTC HODL 2014-2026 returned +21,000% — that's not repeatable, so we use 30% forward CAGR.

Floor

T-bills (~5%)

Risk-free rate

Absolute floor. Any strategy that doesn't clear this is paying you to take risk for nothing. No exceptions.

Honest update history

  • v1 (pre-April 2026): single gate — "beat BTC HODL in ≥60% of Walk-Forward windows." Worked, but biased toward FAIL because BTC HODL 2014-2026 is the strongest secular trend in modern finance.
  • v2 (2026-04-28 morning): added Path 2 — risk-adjusted-superior (Sharpe ≥1.0 + drawdown protection + cluster-test + W/L). Two parallel paths to each tier.
  • v2.1 (2026-04-28 afternoon): switched primary benchmark from BTC HODL → S&P 500. Added Path 3 (diversification sleeve: market-neutral + low S&P correlation). Initially promoted Basis Sentinel + Alpha Hunter + Watchdog to Tier 1.
  • v2.1 strict + all-paper-policy (2026-04-28 evening): formal Sharpe + DD-ratio recompute showed Watchdog at Sharpe 0.35 + DD-ratio 1.38 vs S&P — demoted to Tier 2 (all 3 v2.1 paths fail strict). Same evening: adopted all-paper policy. No bot graduates to real money until ≥6mo forward-validated proof. Watchdog real-money allocation paused since 2026-04-22.

We revise the standards openly — even when (especially when) the revision demotes our most-tested bot. The goal is not to make our own bots look good; the goal is to use the right comparison for each decision.

New section · all-paper policy adopted 2026-04-28

Real-Money Graduation Criteria

BBR is two weeks old. Backtest validation tells you a strategy could work; it doesn't tell you it does work in the actual conditions of the next six months. Effective 2026-04-28, no bot runs on real money until it accumulates forward-validated proof. Watchdog's historical real-money allocation is paused.

All four must hold for ≥6 months of live tracking:

  1. Tier 1 under v2.1 strict — bot must meet at least one of Path 1 (return-superior vs S&P 500), Path 2 (Sharpe ≥1.0 + DD-protection), or Path 3 (diversification sleeve: Sharpe ≥0.8 + correlation to S&P ≤0.3 + capital preservation in S&P-bear windows ≥70%).
  2. Forward beat-rate vs S&P ≥50% on rolling 12-month windows of live paper data — not backtest. The strategy must actually deliver in the regime it's being asked to deliver in.
  3. MaxDD ≤ backtest MaxDD × 1.2 — live drawdown stays within 20% of the backtest expectation. Larger live DD = strategy is being stressed in ways the backtest didn't capture.
  4. Sharpe within 0.5 of backtest estimate — live risk-adjusted return must be in the same neighborhood as the backtest claimed. Sharpe degradation is the most common live-vs-backtest gap.
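The four criteria plus the six-month minimum can be expressed as a checklist. This is a sketch with hypothetical field names. Drawdowns are stored as positive magnitudes, and "within 0.5" is read symmetrically, so a live Sharpe far above the backtest also flags a mismatch. That symmetric reading is our assumption.

```python
from dataclasses import dataclass

@dataclass
class LiveRecord:
    tier: int                 # current tier under v2.1 strict
    months_live: float        # months of live paper tracking
    forward_beat_rate: float  # share of rolling 12-month live windows beating S&P
    live_max_dd: float        # positive magnitude, e.g. 0.11 for -11%
    backtest_max_dd: float    # positive magnitude
    live_sharpe: float
    backtest_sharpe: float

def graduation_failures(record):
    """Return the list of unmet criteria; an empty list means eligible."""
    failures = []
    if record.months_live < 6:
        failures.append("less than 6 months of live tracking")
    if record.tier != 1:
        failures.append("not Tier 1 under v2.1 strict")
    if record.forward_beat_rate < 0.50:
        failures.append("forward beat-rate vs S&P below 50%")
    if record.live_max_dd > record.backtest_max_dd * 1.2:
        failures.append("live MaxDD exceeds backtest MaxDD x 1.2")
    if abs(record.live_sharpe - record.backtest_sharpe) > 0.5:
        failures.append("live Sharpe not within 0.5 of backtest")
    return failures
```

Returning the list of failed criteria rather than a single boolean keeps the verdict explainable, which matches how the bot cards report results.
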

First eligibility window

The earliest any bot can graduate is 2026-10-28 — six months after today. Currently eligible to apply: Basis Sentinel, Alpha Hunter (both Tier 1). Watchdog returns to Tier 1 first if its forward Sharpe stabilizes ≥1.0 or it shows explicit S&P-uncorrelated crisis-alpha.

Why this is honest

BBR is a 2-week-old project. "Live since 2026-04-12" is not a forward-validated track record — it's 16 days. Calling that "real-money-eligible" would be using a backtest claim as if it were forward proof. The all-paper window forces the framework to stand on its own: backtests choose candidates, forward results promote them.

A note on the previous Watchdog real-money allocation: The Watchdog (DM+LD) ran on a $6,000 BTC allocation from 2026-04-12 to 2026-04-22, when it was paused due to a chart-structure concern at the BUY trigger. Six days later, the v2.1 multi-benchmark recompute formally demoted it to Tier 2 and we adopted the all-paper policy. That history is documented at /post-mortems.

⚠ The one bias the suite cannot fully escape

All seven tests run on historical data. Even a strategy that passes every one is still a bet on regime continuity: the quiet assumption that the future will resemble the past. Walk-Forward catches regimes we've seen; it cannot catch regimes that haven't happened yet. That's why every bot card carries the "Past ≠ Future" chip. Why every backtest is a bet on the future →

What We Don't Do

Loosen criteria for favored bots

When a beloved strategy fails the suite, we change its tier — not the bar. The Surfer was demoted from "OOS validated" to "Tier 3 defensive" the day Walk-Forward exposed the original claim.

Cherry-pick time windows

Walk-Forward uses every 2-year window the data permits. Cherry-picking is structurally impossible.

Hide failed strategies

Every retired strategy is documented on /post-mortems with date, reason, and lesson. Hopper V2 lives there as a Tier 4 example.

Ignore fees and slippage

All backtests include realistic 0.10%-0.20% per-trade costs. Test 5 sweeps to 0.30%/side worst-case retail.

Optimize on the test data

Walk-Forward parameter selection happens on the train segment only. The test segment is used to score, never to tune.

Promise alpha

Backtests don't guarantee future returns. Tier labels are honest, not aspirational. Tier 2 means "concept validated, calibration questionable" — not "will make you rich."

See the methodology applied

Every bot on /bots renders its own validation panel — test results plus tier verdict. Every retired strategy on /post-mortems failed at least one test, explained.