Seven Tests Every Bot Must Pass Before It Earns A Tier

It's April 28, 2026. Late afternoon. My desk in Zurich.

I'm staring at a backtest result for a bot called Sentry. The numbers in front of me are the cleanest I have ever seen on a crypto strategy.

Sharpe ratio of 1.70. Max drawdown of just -3%. Total return over the test window: +35.4%.

I almost wrote a celebration post. I had the headline already drafted: "The crypto bot with the smoothest equity curve in BBR history."

Then I ran one more test. The Cluster-Removal Test.

It removed exactly four days of data from the entire run, a four-day stretch in October 2025 when crypto liquidations went vertical. Same bot. Same parameters. Same trading logic. Just minus those four days.

The new total return: +11.2%.

Same bot. Different story. Almost 70% of what looked like "smooth alpha" was the bot catching one chaotic event nobody could predict.

That night I rewrote the bot card. Sentry is no longer the "smooth-yield bot." It's a "crisis-liquidity-provider," a strategy that earns most of its profit during a few rare days of market panic, and quietly does very little the rest of the year.

Same numbers. Honest framing.

A backtest will never tell you it's lying. It just answers the question you asked, with the data you gave it.

Why this framework exists

BearBullRadar launched in April 2026. Three of the first phrases I published on bot cards turned out to be technically true and completely useless:

"+2,942% over 6 years." True. The number is correct. It's also mostly Bitcoin going from $3,500 to $80,000. The bot's contribution was a small fraction.
"13 of 13 parameter variants profitable." True. They were all profitable. They were also all on the same single Bitcoin time series. One asset, one regime, dressed up as 13 independent confirmations.
"OOS validated 60/40 split." True. I held back 40% of the data and the bot still made money on it. Sounds rigorous. It isn't. A single in-sample / out-of-sample split is one experiment. It can pass by luck.

Three sentences I once published. Three sentences that were technically accurate and quietly misleading.

So I built a test suite that catches those lies. Today there are seven tests, plus one bonus check, plus a rule about benchmarks.

Here is what each test asks, why it exists, and what it has cost me in admissions.

Buckle up.

Test 1 — Walk-Forward

The question: Did the bot win in many windows, or one lucky window?

A walk-forward test cuts the historical data into many overlapping segments. The bot trades each segment as if it were live. Then I count: in how many of those segments did the bot beat just holding the asset?

The bar I set: 60% or higher.
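
For readers who want to see the counting step, here is a minimal sketch. It assumes the bot's per-period returns and the buy-and-hold returns have already been produced for the same periods; the function names, window length and step size are illustrative, not BBR's actual settings.

```python
from typing import Sequence

PASS_BAR = 0.60  # the 60% bar from the text


def compound(returns: Sequence[float]) -> float:
    """Compound a series of per-period returns into one total return."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def walk_forward_beat_rate(
    bot_returns: Sequence[float],
    hold_returns: Sequence[float],
    window: int,
    step: int,
) -> float:
    """Share of windows in which the bot out-earned simply holding the asset."""
    if len(bot_returns) != len(hold_returns):
        raise ValueError("both series must cover the same periods")
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    starts = range(0, len(bot_returns) - window + 1, step)
    if len(starts) == 0:
        raise ValueError("series is shorter than one window")
    wins = sum(
        compound(bot_returns[s:s + window]) > compound(hold_returns[s:s + window])
        for s in starts
    )
    return wins / len(starts)


# Toy data (made up): the bot beats holding in the first half, loses in the second.
bot = [0.01] * 10 + [-0.01] * 10
hold = [0.0] * 20
rate = walk_forward_beat_rate(bot, hold, window=5, step=5)
print(rate, rate >= PASS_BAR)
```

A step smaller than the window gives overlapping segments, as described above. A stricter version would also re-fit parameters on the data before each window, which this sketch leaves out.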

In April 2026, I ran walk-forward on seven momentum-style bots that I had been running for months. I expected most to clear 60%.

Result: none of them did.

The best was Alpha Hunter, an equity-momentum bot, at 36%. Watchdog, our oldest BTC trend-follower, came in at 25%. Most of the rest landed between 28% and 32%.

That was the day I realized: most "OOS validated" claims out there are one lucky split, not a robust pattern.

A real edge wins in many windows. A lucky parameter wins in one.

Test 2 — Multi-X Robustness

The question: Does the edge work on similar assets, or only this one?

If a strategy survives only on Bitcoin, but breaks on Ethereum and Solana, that's a clue. Real edges generalize. Single-asset edges are usually narrative dressing on top of one lucky time series.

Watchdog was sold for a long time as "three independent filters working together." A 100-day moving average. A directional movement filter. A liquidity filter. Three checks. Triple confidence.

Then I ran Test 2. I disabled two of the three filters and kept only the moving average.

Full Watchdog made +204%. SMA-only made +208%.

The other two filters were cosmetic. The whole "three independent filters" story collapsed in one afternoon.

I rewrote the bot card the same week. New honest framing: "100-day moving average with 4% hysteresis. The other indicators are tracked for transparency, but they don't change decisions."

Still with me?

Test 3 — Parameter Sweep

The question: Is the value we deployed sitting on a plateau, or balancing on a knife-edge?

For every parameter the bot uses (lookback length, threshold, position size), I sweep through nearby values and check whether returns stay similar or collapse.

A real edge produces a plateau. Returns are similar across many neighboring values. Confidence: high.

A lucky edge produces a spike. The deployed value works. Anything 10% higher or lower loses badly. Confidence: zero.

Hopper, our cross-sector rotation bot, had a parameter called N (the number of look-back days). At N=40 it had a "spike ratio" of +2.15. That means the deployed value was dramatically better than its neighbors. N=35 and N=45 were both significantly worse.

That's a knife-edge. We re-tested. We re-tuned. We documented it on the bot card.
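
The plateau-versus-spike check can be sketched in a few lines. The article does not define how its "spike ratio" is calculated, so the `spike_score` below is one possible definition and purely an assumption: the gap between the deployed value's return and the average of its neighbors, measured in units of the neighbors' own spread. The numbers are invented.

```python
import math
from statistics import mean, pstdev
from typing import Dict


def spike_score(results: Dict[int, float], deployed: int, radius: int = 2) -> float:
    """How far the deployed value's return sits above its neighbors.

    results maps each tested parameter value to its backtest return.
    Hypothetical definition: (deployed - mean(neighbors)) / pstdev(neighbors).
    """
    grid = sorted(results)
    i = grid.index(deployed)
    neighbor_keys = grid[max(0, i - radius):i] + grid[i + 1:i + 1 + radius]
    neighbors = [results[k] for k in neighbor_keys]
    if len(neighbors) < 2:
        raise ValueError("need at least two neighboring values")
    gap = results[deployed] - mean(neighbors)
    spread = pstdev(neighbors)
    if spread == 0.0:
        return 0.0 if gap == 0.0 else math.copysign(math.inf, gap)
    return gap / spread


# Made-up sweeps over a look-back parameter N.
plateau = {30: 0.20, 35: 0.22, 40: 0.21, 45: 0.20, 50: 0.22}
spike = {30: 0.02, 35: 0.03, 40: 0.30, 45: 0.02, 50: 0.03}
print(spike_score(plateau, deployed=40))  # close to zero: a plateau
print(spike_score(spike, deployed=40))    # large and positive: a knife-edge
```

A score near zero means the deployed value looks like its neighbors. A large positive score means it stands alone.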

Test 4 — Hidden Parameter

The question: Is the strategy quietly relying on a knob nobody told the reader about?

The most common hidden parameter in my own bots: rebalance frequency.

Every rotation bot I built worked beautifully on daily rebalance. Switch the frequency to weekly (same logic, same picks, just rebalanced less often) and 60-80% of the return disappeared.

That's not edge. That's high-frequency noise being harvested through transaction cost margins. Real edge survives different rebalance cadences. Noise-harvesting doesn't.

If a bot only works on one rebalance schedule, that schedule is a parameter. It belongs on the bot card with the others.
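
A toy sketch of the cadence check, assuming a single asset, a target weight per day that was decided before that day's return, and no costs. The function names and data are hypothetical. The idea is only to rerun the same targets at a slower cadence and see how much return is left.

```python
from typing import Sequence


def backtest_with_cadence(
    target_weights: Sequence[float],
    asset_returns: Sequence[float],
    every: int,
) -> float:
    """Total return when the held weight is refreshed only every `every` days."""
    if every <= 0:
        raise ValueError("every must be positive")
    growth = 1.0
    held = 0.0
    for day, (target, r) in enumerate(zip(target_weights, asset_returns)):
        if day % every == 0:
            held = target  # rebalance day: adopt today's target weight
        growth *= 1.0 + held * r
    return growth - 1.0


def retained_share(
    target_weights: Sequence[float],
    asset_returns: Sequence[float],
    base_every: int,
    slow_every: int,
) -> float:
    """Fraction of the baseline return that survives the slower cadence."""
    base = backtest_with_cadence(target_weights, asset_returns, base_every)
    slow = backtest_with_cadence(target_weights, asset_returns, slow_every)
    if base <= 0.0:
        raise ValueError("baseline cadence must be profitable to compare")
    return slow / base


# Made-up series whose profit lives entirely in day-to-day flips.
weights = [1.0, 0.0] * 5
returns = [0.01, -0.01] * 5
print(backtest_with_cadence(weights, returns, every=1))  # positive
print(backtest_with_cadence(weights, returns, every=2))  # slightly negative
print(retained_share(weights, returns, base_every=1, slow_every=2))
```

In this toy case nothing survives the slower cadence. By the article's numbers, a rotation bot that loses 60-80% of its return on weekly rebalance would show a retained share of roughly 0.2 to 0.4.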

Test 5 — Fees and Slippage

The question: Does the edge survive realistic costs?

Honest answer for our suite: yes, almost always.

Of the eleven bots I ran through tests 1 to 5 in April 2026, all eleven survived realistic fees and slippage. Path-dependence (Test 1) and cluster-dependence (Test 6) killed bots first. Fees were rarely the executioner.

That's not surprising. Most BBR bots hold positions for days or weeks, not minutes. Fee drag is real but secondary at our trading frequency.

Still, every bot has to clear it.
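
As a rough illustration of why holding period matters here, the sketch below charges a cost on every change in position. The cost model (a flat fraction of the traded amount, 10 basis points in the example) and all names are assumptions for illustration, not BBR's actual fee schedule.

```python
from typing import Sequence


def total_return_after_costs(
    positions: Sequence[float],
    asset_returns: Sequence[float],
    cost_per_turnover: float,
) -> float:
    """Total return when each position change pays a proportional cost."""
    growth = 1.0
    previous = 0.0
    for position, r in zip(positions, asset_returns):
        traded = abs(position - previous)           # fraction of capital traded today
        growth *= 1.0 - traded * cost_per_turnover  # pay fees and slippage on it
        growth *= 1.0 + position * r                # then earn the day's return
        previous = position
    return growth - 1.0


# Made-up example: one entry, then hold for 20 days.
positions = [1.0] * 20
returns = [0.002] * 20
gross = total_return_after_costs(positions, returns, cost_per_turnover=0.0)
net = total_return_after_costs(positions, returns, cost_per_turnover=0.001)
print(gross, net)
```

With one trade in 20 days the drag is a single 0.1% haircut. A bot that flipped its position daily would pay that haircut every day, which is where fees start to decide the outcome.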

Test 6 — Cluster-Removal

The question: Does the headline edge depend on a handful of lucky events?

This is the test that broke Sentry. It also broke the story I was telling about two other bots.

The setup: I sort all profitable trade-days by size. The biggest ones go on top. Then I ask: what happens if I delete the top three "cluster days" and re-run the equity curve?

Sentry: top single 4-day cluster contributed about 70% of the headline edge. Without it, the bot returns +11.2% instead of +35.4%. Same bot. Different bot, really.
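
Here is a minimal sketch of the idea, assuming daily returns as input and treating a "cluster" as the single best contiguous stretch of a given length. That simplification, the function names and the toy data are mine; only the two Sentry totals in the last lines come from the text above.

```python
from typing import List, Sequence


def compound(returns: Sequence[float]) -> float:
    """Compound daily returns into one total return."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def best_window_start(returns: Sequence[float], length: int) -> int:
    """Start index of the contiguous stretch with the highest compound return."""
    if length <= 0 or length > len(returns):
        raise ValueError("length must be between 1 and len(returns)")
    starts = range(len(returns) - length + 1)
    return max(starts, key=lambda s: compound(returns[s:s + length]))


def return_without_top_cluster(returns: Sequence[float], length: int) -> float:
    """Total return after deleting the best stretch and recompounding the rest."""
    s = best_window_start(returns, length)
    kept: List[float] = list(returns[:s]) + list(returns[s + length:])
    return compound(kept)


def cluster_share(full_return: float, return_without: float) -> float:
    """Fraction of the headline return that vanishes with the cluster."""
    return (full_return - return_without) / full_return


# Toy data (made up): quiet drift plus one violent four-day stretch.
daily = [0.001] * 100
daily[50:54] = [0.05] * 4
full = compound(daily)
stripped = return_without_top_cluster(daily, length=4)
print(best_window_start(daily, 4), full, stripped, cluster_share(full, stripped))

# The Sentry totals quoted above: +35.4% with the cluster, +11.2% without.
print(cluster_share(0.354, 0.112))  # about 0.68, the "almost 70%"
```

Deleting the days and recompounding is the same as assuming the bot sat flat through them, which is the most generous reading of "what if that event never happened."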

But wait. It gets more interesting.

I ran the same test on Surfer (a momentum-rotation bot) and Tactician (a regime-switcher). Surfer's top cluster: Q4-2024, the post-election Trump rally. That single cluster: 37% of total profit. Tactician's top cluster: same Q4-2024, same Trump rally. 39% of total profit.

Same event. Two different bots. Both depending on the same four-week regime trigger.

Which means: holding Surfer and Tactician together is not actually diversification. You think you have two bots. You have one bet on one rally, dressed up in two different bot-card icons.

I added a "cluster-warn" badge to both cards yesterday.

You can have two real edges that vanish on the same Tuesday.

Test 7 — Random-Baseline

The question: Does the signal beat noise of the same frequency?

This is the newest test. I added it after a research project that nearly fooled me.

I had this idea: make Sentry "event-aware." Have it adjust position size around major macro events. Fed days. CPI prints. ETF approvals. Sounds smart, right? News-aware bot adjusts to news.

I built four variants. Pre-event boost. Post-event V-bottom. Event blackout. Combined. I ran them all on real historical event dates.

Each variant showed +15% to +20% improvement in risk-adjusted return.

I almost shipped it. Then I ran the random-baseline check.

I generated 100 fake event calendars at the same frequency as real events. Random dates. Same number of "events" per month. Then I ran my bot variants on those fake calendars.

The result: the same +15-20% lift.

The improvement wasn't from event-awareness. It was the volatility filter that came along for the ride. Real macro days happened to be high-vol days, but so did random fake-event days, and the volatility filter was the real source of edge.

I closed the research line that afternoon. No new bot.

The bar Test 7 sets: a real signal must beat the 95th percentile of random noise at the same frequency. That's a much higher bar than "the backtest improved."
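
A sketch of that bar in code. It assumes you already have some function, called `lift_fn` here, that reruns the strategy on a given list of event days and returns the measured lift; that function and the other names are hypothetical. This version matches only the total number of events over the run, not the per-month count described above.

```python
import math
import random
from typing import Callable, List


def random_baseline_threshold(
    lift_fn: Callable[[List[int]], float],
    n_days: int,
    n_events: int,
    n_calendars: int = 100,
    percentile: float = 95.0,
    seed: int = 0,
) -> float:
    """Lift at the given percentile across randomly generated event calendars."""
    rng = random.Random(seed)
    lifts = []
    for _ in range(n_calendars):
        fake_calendar = sorted(rng.sample(range(n_days), n_events))
        lifts.append(lift_fn(fake_calendar))
    lifts.sort()
    rank = math.ceil(percentile * n_calendars / 100.0)  # nearest-rank percentile
    return lifts[min(max(rank, 1), n_calendars) - 1]


def beats_random_baseline(
    real_lift: float,
    lift_fn: Callable[[List[int]], float],
    n_days: int,
    n_events: int,
    **kwargs,
) -> bool:
    """True only if the real calendar's lift exceeds the random-calendar bar."""
    return real_lift > random_baseline_threshold(lift_fn, n_days, n_events, **kwargs)


# A made-up "signal" whose lift ignores the calendar entirely, like the
# volatility filter in the story: every calendar, real or fake, gets 0.17.
def calendar_blind(event_days: List[int]) -> float:
    return 0.17


print(beats_random_baseline(0.17, calendar_blind, n_days=1000, n_events=20))
```

A lift that shows up on every fake calendar cannot clear the bar, no matter how large it is.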

The bonus — Per-Trade Win/Loss

There's one extra check that has saved a bot from retirement.

Per-trade Win/Loss ratio: divide the average profit on winning trades by the average loss on losing trades.

Watchdog failed walk-forward at 25%. By that test alone, it should have been demoted out of any tier. But its per-trade W/L ratio across 24 historical trades was 3.44 to 1.

That structural property means even at a 50% win rate, the expected value per trade is positive. It's a kind of edge that path-dependence tests don't always catch.
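
The arithmetic behind that claim, as a short sketch. Expected value is expressed in units of the average loss, so a W/L ratio of R and a win rate of p give p × R − (1 − p). The function names are illustrative; the 3.44 figure is Watchdog's ratio from above.

```python
from statistics import mean
from typing import Sequence


def win_loss_ratio(trade_pnls: Sequence[float]) -> float:
    """Average profit on winning trades divided by average loss on losing trades."""
    wins = [p for p in trade_pnls if p > 0]
    losses = [-p for p in trade_pnls if p < 0]
    if not wins or not losses:
        raise ValueError("need at least one winning and one losing trade")
    return mean(wins) / mean(losses)


def expected_value_per_trade(win_rate: float, wl_ratio: float) -> float:
    """Expected profit per trade, measured in units of the average loss."""
    return win_rate * wl_ratio - (1.0 - win_rate)


def breakeven_win_rate(wl_ratio: float) -> float:
    """Win rate at which the expected value per trade is exactly zero."""
    return 1.0 / (1.0 + wl_ratio)


print(expected_value_per_trade(0.5, 3.44))  # about 1.22 average losses per trade
print(breakeven_win_rate(3.44))             # about 0.225
```

With a 3.44 to 1 ratio, the bot only needs to win a little more than 22% of its trades to break even before costs.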

Watchdog stayed in our suite. Demoted, but kept. Without that check, I would have buried the only BTC bot in BBR history with genuine trend-following structure.

Why I keep demoting my own bots

Here's the part that's hardest to write.

In April 2026 I demoted Watchdog from real-money status to paper-track only. I rewrote Sentry's persona from "smooth-yield" to "crisis-liquidity-provider." Yesterday I added cluster-warn badges to Surfer and Tactician.

Update 2026-05-05: Watchdog returned to Tier 1 the day after this article was published, via a new alternative anchor for trend-following bots (a 50% beat-rate threshold instead of 60%, plus a tighter W/L anchor of 3.0). It now trades real capital on Bybit again. The framework that demoted it also brought it back, and that's the point. See "Half My Tier-1 Bots Got Demoted" for the full recalibration story.

None of these bots is bad. They just aren't what the original framing claimed.

The harder truth: I'm the one who wrote the original framing. I'm also the one who tests it later. Same person. Both honest. Both biased in opposite directions.

My own incentive is to make the bots look good. The tests are the only thing protecting you from that incentive.

That's why every bot on /bots has a tier label. That's why every demotion gets a post-mortem. That's why the methodology page lists every test in detail with parameters.

Not because I love admitting failures. Because the alternative is a wall of "OOS validated" claims with no public demotions. That's the trader-marketing trick I built BearBullRadar to push back against.

What the tests cannot do

Every test in this suite runs on historical data.

Walk-forward catches regimes we have already seen. Cluster-removal catches dependence on past rare events. Random-baseline catches noise that mimics the past distribution.

None of them catches a regime that hasn't happened yet.

That's the "Past != Future" chip on every bot card. It isn't marketing copy. It's the one thing the framework cannot fix. A strategy that passes all seven tests is still a bet that the future will look reasonably like the past. If 2027 brings a market structure we haven't seen since 2020, every backtest in this suite becomes a curiosity.

That's why no BBR bot graduates to real money until it has six months of forward-tracked paper performance on top of the seven tests.

The tests buy honesty about the past. Forward-tracking is the only thing that buys evidence about the future.

A short recap, in plain language

Seven questions. One bonus. That's it.

Did the bot win in many windows or one lucky one?
Does the edge work on similar assets, or only this one?
Is the deployed parameter on a plateau or a spike?
Is there a hidden knob nobody mentioned?
Does the edge survive real fees and slippage?
Does the headline edge depend on a few rare days?
Does the signal beat random noise of the same frequency?
(Bonus) Is the per-trade win/loss ratio structurally positive?

If a bot passes all of those, it earns a tier and goes to paper-tracking. After six months of clean forward performance, it can graduate to real money.

Most bots do not earn that. Most bots get demoted at least once. That's the system working, not failing.

This is not financial advice. All numbers shown are from backtests or paper-tracking, not real-money deployment. Under our all-paper policy, no BBR bot runs on real money until it has at least six months of forward-validated proof.

— Dominic, the guy who built 12 bots and demoted half of them on purpose.