I Made My Own Tests Stricter. Half My Tier-1 Bots Got Demoted.

Half of my Tier 1 bots just got demoted by my own framework.

Two weeks ago, BearBullRadar had four Tier 1 bots on the home page. Today there are three. The math on that change is brutal: one demoted, one demoted, one held its rank by a hair, and one came back from Tier 2 through a path that didn't even exist nine days ago.

This week.

That's the window. The framework I built to catch overstated bots caught two of mine. Then I changed the rules so a third could come back. Then I changed the decision logic from binary to score-based, because the old logic was retiring bots that had cleared 5 of 7 tests strongly.

The framework worked. I just didn't enjoy what it told me.

Here's what I changed in the framework, what it cost me, and why I'm publishing this instead of pretending it didn't happen.

Why am I telling you this?

Most trading sites quietly retire failures and quietly promote winners. Validation-framework changes happen in private, if they happen at all. I'm publishing this in the same article that demoted half my flagship bots, because that's the only way to prove the framework wasn't designed to make my bots look good.

If the rules can be changed silently, the rules don't mean anything.

So I'm changing them out loud.

What you'll get in this article

The two new tests added in the past two weeks (Cluster-Removal and Random-Baseline), plus the retroactive sweep
The score-based decision rule that replaced "any one fail = retired"
The trend-following alternative anchor that brought Der Wachter back
The result: who's in, who's out, who came back

Buckle up.

Test 6 — Cluster-Removal added

Sentry surprised me first.

The numbers in front of me looked perfect. +35.4% return. Sharpe 1.70. Max drawdown of just -3%. The cleanest backtest in BearBullRadar history.

Then I removed four days from the test. A four-day stretch in October 2025 when crypto liquidations went vertical.

Same bot. +11.2%.

Almost 70% of what looked like smooth alpha was the bot catching one chaotic event nobody could predict.

That single result created Test 6 — the Cluster-Removal Stress Test. The setup is simple. We sort all profitable trade-days by size. We delete the top three "cluster days." We re-run the equity curve. If the headline shrinks more than 70%, the bot is concentrated regime-luck, not broad alpha.
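A minimal sketch of that setup in Python, under simplifying assumptions: daily P&L is treated as additive (no compounding), and the function name `cluster_removal`, its parameters, and the two sample series are hypothetical, not the framework's actual code. It sorts the profitable days, removes the top three, and reports what share of the headline disappears.

```python
def cluster_removal(daily_pnl, top_n=3, threshold=0.70):
    """Remove the top_n most profitable days and report how much profit disappears."""
    total = sum(daily_pnl)
    if total <= 0:
        raise ValueError("total profit must be positive to measure deflation")
    # Sort profitable days by size, largest first (the "cluster days").
    winners = sorted((p for p in daily_pnl if p > 0), reverse=True)
    removed = sum(winners[:top_n])
    deflation = removed / total  # share of the headline that vanishes
    return {
        "total": total,
        "remaining": total - removed,
        "deflation": deflation,
        "verdict": "CONCENTRATED" if deflation > threshold else "PASS",
    }

# Made-up P&L series, for illustration only.
broad = [1.0] * 20
lumpy = [30.0, 25.0, 20.0] + [1.0] * 15 + [-1.0] * 5

broad_result = cluster_removal(broad)
lumpy_result = cluster_removal(lumpy)
print(broad_result["verdict"], round(broad_result["deflation"], 3))
print(lumpy_result["verdict"], round(lumpy_result["deflation"], 3))

# The Sentry figures quoted above: +35.4% falling to +11.2%.
sentry_shrink = (35.4 - 11.2) / 35.4
print(round(sentry_shrink, 3))
```

The evenly spread series loses 15% of its profit and passes. The lumpy one loses most of it and gets flagged. A real run would rebuild the equity curve with compounding and fees, which this sketch leaves out.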

Then I ran it retroactively on every BearBullRadar bot. The result was sobering.

Out of 10 testable bots, only two passed clean: Der Wachter (The Watchdog) and Basis Sentinel. Six came back CONCENTRATED. Their advertised alpha was, in plain English, three Tuesdays in disguise.

The Hopper's top three clusters: 89.9% of total profit. Alpha Hunter: 99.1%. Tri-Rotator: 94.6%. Rotator: 76.1%. Tactician: 58%. Pendle Yield-Curve, an alpha candidate I tested this week, came back concentrated too.

A backtest with great numbers can still be 70% built on three Tuesdays.

But cluster-concentration isn't the only way a backtest lies. There's a sneakier failure mode.

Test 7 — Random-Baseline added

I almost shipped a bot that wasn't real.

The story: I tested four variants of an "event-aware" Sentry that adjusted around macro events. Fed days. CPI prints. FOMC. All four variants showed +15-20% Sharpe improvement vs the baseline.

I had the headline drafted: "Sentry now reads the macro calendar."

Then I ran one more check. I generated 100 fake-event sets. Same frequency. Random dates. Ran my "smart" event-aware bot on those fake calendars.

Same +15-20% lift.

The improvement wasn't from event-awareness. It was the volatility filter that came along for the ride. Real macro days happened to be high-vol days. So did random fake-event days. The volatility filter was the real source of edge; the event calendar was decoration.

That afternoon I closed the research line. No new bot.

That research project gave us Test 7 — the Random-Baseline Test. Take the bot's actual signal frequency. Generate 100 random-signal sets at the same frequency. Run them through the same backtest. Compare the bot's Sharpe to the 95th percentile of the random distribution. If the bot doesn't beat random noise at the same trade rate, it's not a signal.
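Here is one way the procedure could look in Python, as a sketch rather than the framework's real implementation. The assumptions are loud ones: a "signal" means holding the market for that day and staying flat otherwise, Sharpe is a plain per-period mean over standard deviation, the percentile uses the nearest-rank method, and every name (`random_baseline`, `strategy_returns`, the toy market) is hypothetical.

```python
import random
import statistics

def sharpe(returns):
    """Per-period Sharpe (mean / stdev), not annualized; 0.0 when undefined."""
    if len(returns) < 2:
        return 0.0
    sd = statistics.stdev(returns)
    return statistics.mean(returns) / sd if sd > 0 else 0.0

def strategy_returns(market_returns, signal_days):
    """Take the market return on signal days, stay flat on all other days."""
    active = set(signal_days)
    return [r if i in active else 0.0 for i, r in enumerate(market_returns)]

def random_baseline(market_returns, signal_days, n_sets=100, pct=95, seed=0):
    """Compare the bot's Sharpe with random signals fired at the same frequency."""
    rng = random.Random(seed)
    n_days = len(market_returns)
    n_signals = len(set(signal_days))
    bot_sharpe = sharpe(strategy_returns(market_returns, signal_days))
    random_sharpes = sorted(
        sharpe(strategy_returns(market_returns, rng.sample(range(n_days), n_signals)))
        for _ in range(n_sets)
    )
    # Nearest-rank percentile of the random distribution (integer ceiling).
    rank = max(1, (pct * n_sets + 99) // 100)
    bar = random_sharpes[rank - 1]
    return {
        "bot_sharpe": bot_sharpe,
        "bar": bar,
        "verdict": "PASS" if bot_sharpe > bar else "FAIL",
    }

# Toy market: every tenth day gains 2%, every other day loses 1%.
market = [0.02 if i % 10 == 0 else -0.01 for i in range(200)]
good_bot = random_baseline(market, [i for i in range(200) if i % 10 == 0])
bad_bot = random_baseline(market, [i for i in range(200) if i % 10 == 1])
print(good_bot["verdict"], bad_bot["verdict"])
```

The first toy bot fires only on the winning days and clears the bar. The second fires only on losing days and cannot beat random picks at the same trade rate. Fixing the seed keeps the comparison reproducible between runs.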

We ran this retroactively too. Out of 10 testable bots: all 10 passed. The signals are real.

They just turned out to be more concentrated than we knew.

Real-but-concentrated is a different bot from real-and-broad. The published cards have to say which one.

Two new tests added. One bot retired. Three more candidates falsified this week. But the worst part wasn't the new tests. It was what happened when we kept the old binary decision rule.

Why the binary rule had to go

The old rule was simple: any single test failure capped a bot at Tier 4. Retired.

That worked when we had five tests. With seven tests, it became a meat grinder.

This week we tested three new candidates. Vol-Reversal V2. A BTC-ETH Momentum Rotator. A Pendle Yield-Curve strategy. Each of them passed three or more of the seven tests strongly. Each of them got retired because of one failure.

Vol-Reversal V2 had a knife-edge optimum at exactly window=40. Spike-ratio of 2.01. Move the parameter to 35 or 45 and the edge vanished. Hidden-Parameter test: FRAGILE. Retired.

The BTC-ETH Rotator failed Random-Baseline at p84 instead of the required p95. It wasn't worthless. It just didn't beat random noise hard enough. Retired.

The Pendle Yield-Curve strategy looked promising until I checked: random-rotation through the same Pendle Principal Tokens beat my "systematic" rule. The rule was decoration on top of underlying yield. Retired.

Three candidates. Three retirements.

Looking at the failure patterns, I noticed something. A bot that cleared 5 of 7 tests strongly was structurally different from a bot that cleared 0 of 7. Both used to score Tier 4. That couldn't be right.

So we replaced the binary rule with a score sum. Every test contributes points. Walk-Forward 0-3. Multi-X 0-2. Parameter-Robustness 0-2. Hidden-Parameter 0-2. Fees 0-1. Cluster-Removal 0-3. Random-Baseline 0-2. Per-Trade W/L bonus 0-1. Maximum: 16 points.

Score above 12 = Tier 1. 8-11 = Tier 2. 5-7 = Tier 3 defensive sleeve. Below 5 = Tier 4 retired.

But three hard floors stay. Walk-Forward FAIL. Cluster-Removal CONCENTRATED. Random-Baseline FAIL. Any of those caps the bot at Tier 2 maximum, no matter how high the score gets.

The safety guarantees aren't score-tradable.
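The rule is easy to express in code. The sketch below is an illustration, not the published v2.2 logic: the key names and `assign_tier` are hypothetical, and two choices are assumptions because the text does not settle them. A score of exactly 12 falls into Tier 2 (the text says "above 12" for Tier 1), and fractional scores such as 7.5 fall into the band below the next threshold. It also does not model the separate sticky-floor treatment of bots that were already retired.

```python
MAX_POINTS = {
    "walk_forward": 3,
    "multi_x": 2,
    "parameter_robustness": 2,
    "hidden_parameter": 2,
    "fees": 1,
    "cluster_removal": 3,
    "random_baseline": 2,
    "wl_bonus": 1,
}

def assign_tier(points, hard_floor_hit=False):
    """Sum the per-test points and map the total to a tier (1 is best)."""
    for name, value in points.items():
        if not 0 <= value <= MAX_POINTS[name]:
            raise ValueError(f"{name} must be between 0 and {MAX_POINTS[name]}")
    score = sum(points.values())
    if score > 12:
        tier = 1
    elif score >= 8:      # assumption: exactly 12 lands here
        tier = 2
    elif score >= 5:
        tier = 3
    else:
        tier = 4
    if hard_floor_hit:
        tier = max(tier, 2)  # capped at Tier 2, however high the score
    return score, tier

# Hypothetical breakdown, not any published bot's actual scorecard.
example = {
    "walk_forward": 3, "multi_x": 2, "parameter_robustness": 1, "hidden_parameter": 1,
    "fees": 1, "cluster_removal": 3, "random_baseline": 2, "wl_bonus": 0,
}
print(assign_tier(example))
print(assign_tier(example, hard_floor_hit=True))
```

The same 13 points earn Tier 1 without a hard-floor hit and Tier 2 with one, which is the sense in which the floors are not score-tradable.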

The bots got fairer. The flagships got fewer.

The alternative anchor

Now Der Wachter's story.

Der Wachter has been BearBullRadar's most-tested bot. Per-trade win/loss ratio of 3.44 to 1, the highest in the entire suite. Beat-rate against the S&P 500: 58.3% over the rolling walk-forward windows.

Under the old strict rule, Der Wachter needed 60% to clear Path 1. It hit 58.3%. That's a fail by 1.7 percentage points.

Then I went to look at Renaissance Medallion. The most successful hedge fund in modern finance. Industry estimates put Medallion's long-run signal accuracy at around 55%.

I was measuring my own trend-following bot against a bar that's stricter than Renaissance Medallion. Sitting in Zurich with a paper-tracked Bitcoin bot. Demanding it beat the most profitable quant fund of the last forty years.

That's not rigor. That's silliness wearing a lab coat.

So I added an alternative anchor for trend-following bots. Not a loosening. A re-shaping of the bar:

Walk-Forward beat-rate ≥ 50% (industry-standard, not 60%)
W/L ratio ≥ 3.0 (tighter than the original 2.5)
Win-Rate ≥ 50% (sanity check)
Cluster-Removal PASS (mandatory, not score-tradable)
Random-Baseline PASS (mandatory, not score-tradable)

The trade is explicit. Lower beat-rate threshold in exchange for a tighter per-trade asymmetry threshold, plus mandatory passes on the two newest tests. Trend-following bots show edge through asymmetric per-trade payouts. Big wins, small losses. They don't show edge through consistent beat-rates, because trend signals are sparse by design.
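The five conditions above can be written as a single check. This is a sketch with hypothetical names (`BotStats`, `clears_alt_anchor`). The Der Wachter inputs use the figures quoted in this article, with one exception: the article does not publish its win-rate, so the 0.55 below is a placeholder and not a reported number.

```python
from dataclasses import dataclass

@dataclass
class BotStats:
    wf_beat_rate: float          # share of walk-forward windows that beat the benchmark
    wl_ratio: float              # average win divided by average loss, per trade
    win_rate: float              # share of trades that were profitable
    cluster_pass: bool           # Cluster-Removal result
    random_baseline_pass: bool   # Random-Baseline result

def clears_alt_anchor(stats):
    """Return (passed, per-check results) for the trend-following alternative anchor."""
    checks = {
        "wf_beat_rate >= 50%": stats.wf_beat_rate >= 0.50,
        "wl_ratio >= 3.0": stats.wl_ratio >= 3.0,
        "win_rate >= 50%": stats.win_rate >= 0.50,
        "cluster_removal": stats.cluster_pass,
        "random_baseline": stats.random_baseline_pass,
    }
    return all(checks.values()), checks

der_wachter = BotStats(
    wf_beat_rate=7 / 12,                 # 58.3% of the walk-forward windows
    wl_ratio=3.44,
    win_rate=0.55,                       # placeholder: not published in the article
    cluster_pass=True,                   # 10% deflation, broad edge
    random_baseline_pass=2.02 > 1.51,    # bot Sharpe vs 95th-percentile bar
)
passed, detail = clears_alt_anchor(der_wachter)
print(passed)

# A hypothetical bot with the same stats but a 2.8 W/L ratio misses the tighter bar.
weaker = BotStats(7 / 12, 2.8, 0.55, True, True)
weaker_passed, weaker_detail = clears_alt_anchor(weaker)
print(weaker_passed)
```

Returning the per-check dictionary alongside the verdict makes it clear which condition a bot missed, which matters when two of the five are mandatory passes.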

Der Wachter cleared the alternative anchor cleanly. WF 58.3%. W/L 3.44. Cluster broad-edge (10% deflation, the strongest pass in the suite). Random-Baseline Sharpe 2.02 against a 95th-percentile bar of 1.51. Real signal.

It came back to Tier 1. At the same time, The Hopper and Alpha Hunter, both of which had cleared the old strict rule, got demoted because their cluster-removal numbers came out CONCENTRATED.

The framework rewards what's actually working, not what looks good in a press release.

Still with me?

The new tier list

Where everything stands as of today.

Tier 1 (3): Basis Sentinel (cleared all tests cleanly, score 13/16). Carry Router (continuous-yield design, separate test category). Der Wachter (alternative anchor: WF 58.3%, W/L 3.44, broad-edge cluster, signal real). Der Wachter is the only one trading real money on Bybit. The other two are paper-tracked.

Tier 2 (5): The Hopper and Alpha Hunter — both demoted from Tier 1 because their headline edges came back cluster-concentrated (Hopper 89.9%, Alpha Hunter 99.1%). The Surfer — promoted from Tier 3 because the new score system gave partial credit for its parameter-plateau and random-baseline pass (score 8/16). Sentry and SOL-Carry — unchanged at Tier 2 paper-tracking.

Update 2026-05-05 evening: Pendle Yield Hunter was promoted to Tier 1 later the same day after a custom continuous-yield 7-test schema scored it 15/16 — the highest in BBR history. The edge turned out to be the audit-curation step itself, not the within-set selection rule. See Lesson #25 (Random-Baseline Dual-Mode for curation strategies) for the methodology refinement that surfaced the distinction. Carry Router was also confirmed Tier 1 the same evening (score 13/16). Tier 1 is now four bots: Basis Sentinel, Carry Router, Der Wachter, Pendle Yield Hunter.

Tier 3 (3): Rotator, Tri-Rotator, Tactician — all demoted from Tier 2 because cluster-concentration plus parameter-fragility means they're defensive sleeves, not standalone alpha generators.

Tier 4 (5 retired): Hopper V2 (8.0 score, hard floor on Test 6). Contrarian (5.0 score). Scout (4.0). Scout v2 (0.0). Sharpshooter (7.5 score, sticky-floor caught the original Test 3 fragility despite the higher score). All five were re-tested under the new score system. None earned an un-retire.

The framework retired more bots than it promoted. That's the right direction for a framework whose job is catching false confidence.

What this means for you

If you're a reader who came here looking for a trading bot to copy, the honest answer is: only one of mine is currently trading real money. Der Wachter. And that one cleared a tighter version of the alternative anchor that didn't exist a week ago. The other 11 are paper-tracked.

You could read that as discouraging. I read it as the opposite.

Every retail trading site I've seen lets bots stay flagship even when their numbers are 99% from three Tuesdays. None of them publish a Cluster-Removal test. None of them publish a Random-Baseline test. None of them retire 5 of their own bots in public on a Monday.

When you compare your shopping options on the bot marketplace, here's the test. Ask the platform what its bots' Cluster-Removal results are. If they don't have an answer, the headline numbers you're seeing might be 99% concentrated.

That's the test. And on most platforms, you won't get an answer.

A short recap

Two weeks ago: 4 Tier 1 bots.

Two new tests added, plus a retroactive sweep. One score-based decision rule. One trend-following alternative anchor.

Today: 3 Tier 1 bots. 6 Tier 2. 3 Tier 3. 5 retired in the post-mortems ledger.

The framework demoted The Hopper and Alpha Hunter. It caught Sharpshooter's fragility a second time despite a high score. It brought Der Wachter back through a path designed for trend-following structures. It also caught me trying to ship an event-aware Sentry that turned out to be a volatility filter wearing a calendar costume.

That's the system working, not failing.

The full v2.2 score system and alternative-anchor description live on the methodology page. The current tier list with all 12 bots is at /bots. The five retired bots, with their reasons, are at /post-mortems. If you want the seven tests in detail, that's the seven-tests article. If you want the comparison to retail platforms and hedge funds, that's our-bots-vs-the-rest.

For Quants: the raw numbers

Cluster-Removal percentages, alt-anchor thresholds, and score breakdowns:

Tier 1 final (3):

Basis Sentinel: score 13/16. WF 4/4 windows. Cluster-Removal pass. Random-Baseline pass.
Carry Router: continuous-yield design (separate test category). Tier 1 via Path 3 (diversification sleeve).
Der Wachter: alternative anchor cleared. WF 58.3% (7/12 windows), W/L 3.44, Cluster-Removal 10% deflation (broadest in suite), Random-Baseline Sharpe 2.02 vs p95 1.51.

Tier 2 (6 at publication, 5 after the evening update):

The Hopper: demoted from Tier 1. Cluster-Removal 89.9% concentration.
Alpha Hunter: demoted from Tier 1. Cluster-Removal 99.1% concentration.
The Surfer: promoted from Tier 3. Score 8/16.
Sentry, SOL-Carry: unchanged at Tier 2.
Pendle Yield Hunter: promoted Tier 2 → Tier 1 on the evening of the same day, score 15/16 (highest in BBR history) — see update note above.

Tier 3 (3):

Rotator: Cluster-Removal 76.1%.
Tri-Rotator: Cluster-Removal 94.6%.
Tactician: Cluster-Removal 58%, Test 5 (Hidden-Parameter) FRAGILE.

Tier 4 retired (5):

Hopper V2: 8.0 score, hard floor on Test 6.
Contrarian: 5.0 score (W/L bonus tempted but didn't save).
Scout: 4.0.
Scout v2: 0.0.
Sharpshooter: 7.5 score, original Test 3 sticky-floor.

Falsified candidates this week (3 retired before promotion):

Vol-Reversal V2: spike-ratio 2.01 at w=40 (Hidden-Parameter FRAGILE).
BTC-ETH Momentum Rotator: Random-Baseline FAIL at p84.
Pendle Yield-Curve: random-rotation beats systematic rule.

Score-based v2.2 decision rule (max 16 points):

Walk-Forward 0-3, Multi-X 0-2, Parameter-Robustness 0-2, Hidden-Parameter 0-2, Fees 0-1, Cluster-Removal 0-3, Random-Baseline 0-2, Per-Trade W/L bonus 0-1.
Score >12 = Tier 1, 8-11 = Tier 2, 5-7 = Tier 3, <5 = Tier 4.
Hard floors (cap at Tier 2 max regardless of score): Walk-Forward FAIL, Cluster-Removal CONCENTRATED (>70% top-3), Random-Baseline FAIL (<p95).

Trend-following alternative anchor (the path only Der Wachter currently uses):

Walk-Forward beat-rate ≥ 50% (vs 60% standard)
W/L ratio ≥ 3.0 (vs 2.5 standard)
Win-Rate ≥ 50%
Cluster-Removal PASS (mandatory, not score-tradable)
Random-Baseline PASS (mandatory, not score-tradable)

Methodology change log (2026-04-28 to 2026-05-04):

Added Test 6 Cluster-Removal (Lesson #13).
Added Test 7 Random-Baseline (Lesson #21).
Replaced binary "any-fail = retire" with score-based 16-point rule.
Added trend-following alternative anchor.
Retroactively re-tested all 12 bots + 5 already-retired bots under new framework.

Sources

This is not financial advice. All numbers shown are from backtests or paper-tracking, not real-money deployment. Under our all-paper policy, no BBR bot runs on real money until it has at least six months of forward-validated proof. Der Wachter is the single exception, on a small private allocation predating the all-paper policy.

Hit reply if you want to argue with the new tier list. I'd rather argue than discover later that I let you down.

— Dominic, the guy who demoted his own flagship bots in public so the framework could mean something.

Dominic Tschan

TheBot-Letter
