"I ran 135 AI analyses on historical stocks. It didn't help."
That's the honest headline of today's article. $4.64 spent on Claude API calls. 135 company evaluations across 8 years. Result: −1.4 percentage points vs plain quantitative screening.
Why? Because Claude already knew how the story ended.
This is look-ahead bias. It's the most subtle of the ten biases in this series.
The One-Sentence Version
Look-ahead bias is when your backtest uses information that wasn't available at the time of the trade. Even if you didn't mean to.
The Obvious Form
You're testing a strategy. On March 1, 2022, your rule says "buy if earnings beat expectations." But the earnings report wasn't released until March 15.
If your backtest uses March 15 data for a March 1 trade, that's look-ahead. You bought with information you couldn't have had.
Most serious backtest frameworks prevent this. Bybit's API gives you proper timestamps. SimFin includes publish dates.
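The guard against the obvious form is cheap to write: filter every record by the date it became public before letting a trade see it. A minimal sketch, with invented field names and figures (this is not any specific provider's schema):

```python
from datetime import date

# Each record carries the date it became public, not just the fiscal period it covers.
fundamentals = [
    {"ticker": "ACME", "period": "2021-Q4",
     "published": date(2022, 3, 15), "eps_surprise": 0.12},
]

def usable_on(records, trade_date):
    """Keep only records that were already public on the trade date."""
    return [r for r in records if r["published"] <= trade_date]

# On March 1 the Q4 report does not exist yet; on March 20 it does.
print(len(usable_on(fundamentals, date(2022, 3, 1))))   # 0
print(len(usable_on(fundamentals, date(2022, 3, 20))))  # 1
```

The key design choice: the cutoff is the publish date, never the fiscal period end, because the gap between the two is exactly where the obvious form of look-ahead hides.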
But there's a sneakier version. And it destroyed my Claude experiment.
The Claude Version
I wanted to add Sandra's qualitative scoring to her backtest. Moat strength. Management quality. Those things she judges by gut.
Claude should be able to help. Feed it financials plus a business description, get back a structured moat score. Do this for every stock in every year, 2019-2026. Use those scores to re-rank the selection.
On paper, brilliant. In practice, silent poison.
Claude Sonnet 4.5 was trained on data up to early 2025. That means Claude knows:
- NVIDIA became the AI chip king
- Netflix survived the 2022 subscriber panic
- Meta's metaverse was a disaster but their ad business roared back
- Tesla's competitive moat crumbled against BYD
- Silicon Valley Bank collapsed
Even when I explicitly told Claude "evaluate this company as of January 2019, using only 2019-era knowledge," it couldn't un-know.
It's like asking a friend who already saw the sports results to bet on the game. They can try to pretend, but the outcome tugs at every judgment.
The Experiment
For each year 2019 through 2026, I ran the same pipeline:
- Quantitative screen picks top 20 by fundamentals
- Claude analyzes all 20 (moat score 0-100, mgmt score 0-100)
- Re-rank using: quant × 0.5 + moat × 0.3 + mgmt × 0.2
- Take top 10
- Apply Sandra's execution
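The re-ranking step above reduces to a weighted blend of the three scores. A sketch with made-up tickers and scores (the real pipeline keeps the top 10 of 20):

```python
def blended_score(quant, moat, mgmt):
    """Re-ranking weights from the pipeline: quant 0.5, moat 0.3, mgmt 0.2 (all 0-100)."""
    return 0.5 * quant + 0.3 * moat + 0.2 * mgmt

# Hypothetical screen output: (ticker, quant, moat, mgmt). Numbers are invented.
candidates = [
    ("AAA", 90, 40, 50),
    ("BBB", 70, 95, 80),
    ("CCC", 85, 60, 70),
]

ranked = sorted(candidates, key=lambda c: blended_score(*c[1:]), reverse=True)
top = ranked[:2]  # toy cut; the real pipeline takes top 10
print([t for t, *_ in top])  # ['BBB', 'CCC']
```

Note what the weighting does: a stock with a mediocre quant score but stellar qualitative scores (BBB here) can displace a pure quant leader (AAA). That displacement is exactly where contaminated moat scores do their damage.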
Over 8 years, that's 135 unique company-year analyses. At about 3.4 cents each via Sonnet 4.5, the total came to $4.64.
Results:
| Approach | Aggregate Return | Mean per Ticker |
|---|---|---|
| Quant-only rolling | +28.5% | +27.7% |
| +Claude qualitative | +27.1% | +29.5% |
Claude's qualitative layer made the portfolio slightly worse. Mean return per ticker ticked up (+27.7% to +29.5%), but the aggregate fell 1.4 points. Net effect: zero alpha, and a small negative from rotation costs.
I was genuinely surprised.
Why Adding Intelligence Made It Worse
Three possibilities. Probably all three.
1. Look-ahead bias cancels out the value. If Claude's moat scores are partly contaminated by "I know NVIDIA won," then its high moat score for NVIDIA in 2019 is obvious in hindsight. Everyone who bought NVIDIA in 2019 won. Claude's hindsight picks the same winner as plain quant. No alpha.
2. Qualitative judgment is harder than we think. Maybe moat and management really don't add much on top of the quantitative signals. The fundamentals already capture what matters. The narrative is post-hoc.
3. Annual rotation kills compounding. Even if Claude picked slightly better stocks, forcing a sell on the first of each year cuts off the long-run winners. The top 10 list reshuffles. Compounding breaks.
My guess: it's mostly #3 plus #2. Look-ahead probably didn't even help Claude — it just prevented Claude from being obviously worse.
The Broader Lesson
Any backtest that uses modern data to make historical decisions is suspect. This includes:
- AI models trained past your test date. GPT-4 knows about 2023. Claude Sonnet 4.5 knows about 2024-2025. Any use of these for "historical" research has a look-ahead stain.
- Academic factor data. The Fama-French factor library publishes slightly revised historical factors each year. If you use the latest version for 2015 data, that version might include minor corrections from 2020 research. Tiny, but real.
- Dividend-adjusted prices. If a stock split in 2020 and you're pulling "adjusted close" from Yahoo today, those 2015 prices are adjusted using information from 2020 on. For most backtest purposes, small. For dividend-heavy strategies, it matters.
- Sentiment tools. Many sentiment scores are recalibrated periodically. The "fear and greed index" you pull today might not match the value someone saw in real time on the test date.
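The adjusted-close point is easiest to see with a toy number: adjusting a 2015 price by a split factor known only from 2020 bakes future information into the past. Figures here are invented:

```python
# A 4-for-1 split in 2020 rescales every earlier "adjusted close" you download today.
raw_close_2015 = 120.00   # what a trader actually saw on the screen in 2015
split_factor_2020 = 4     # knowable only from 2020 onward

adjusted_close_2015 = raw_close_2015 / split_factor_2020  # 30.00 in today's data
# Percentage returns survive (both ends of the return are scaled equally), but any
# rule keyed to absolute price levels ("buy under $50") silently uses 2020 knowledge.
assert adjusted_close_2015 == 30.00
```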
The Cure
For AI specifically: Don't use modern LLMs for qualitative historical scoring. Either use quantitative signals (which are time-stamped) or accept that qualitative scoring is forward-only.
For data in general: Use point-in-time databases where every value has a "as reported on" timestamp. SimFin does this. CRSP does this. Free data usually doesn't.
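A point-in-time lookup boils down to "the latest value reported on or before the decision date." A minimal sketch of that query, not SimFin's or CRSP's actual schema; tickers and values are invented:

```python
from datetime import date

# Point-in-time storage: one row per (ticker, field, value, as_reported_on).
history = [
    ("ACME", "revenue", 100.0, date(2019, 2, 1)),  # original filing
    ("ACME", "revenue", 97.5,  date(2020, 6, 1)),  # later restatement
]

def as_of(history, ticker, field, on):
    """Latest value that had been reported on or before the given date."""
    rows = [r for r in history
            if r[0] == ticker and r[1] == field and r[3] <= on]
    return max(rows, key=lambda r: r[3])[2] if rows else None

# A 2019 backtest must see the original filing, never the restatement.
print(as_of(history, "ACME", "revenue", date(2019, 12, 31)))  # 100.0
print(as_of(history, "ACME", "revenue", date(2021, 1, 1)))    # 97.5
```

The restatement row is the whole point: a non-point-in-time database would overwrite 100.0 with 97.5 and hand your 2019 backtest a number nobody had in 2019.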
For your own work: If you catch yourself "remembering" how a stock did to make a decision about what would have happened — you have look-ahead. You can't fix your memory. The only fix is discipline.
The Rule I Now Follow
For my Andromeda bot and for the Momentum Live bot on bearbullradar.com, rule one: nothing that generates signals has seen post-test data.
- Prices: adjusted close is fine; it carries slight look-ahead, material only for dividend strategies
- Fundamentals: strict publish-date cutoff
- No LLMs in the signal loop
- No sentiment indices built after the test period
- No "quality screens" using criteria developed after the test
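Rule one can even be enforced mechanically: before a signal fires, assert that none of its inputs is stamped after the signal date. A hypothetical guard, not code from the actual bots:

```python
from datetime import date

def check_no_lookahead(signal_date, data_timestamps):
    """Fail loudly if any input to the signal is stamped after the signal date."""
    leaks = [t for t in data_timestamps if t > signal_date]
    if leaks:
        raise ValueError(
            f"look-ahead leak: {len(leaks)} input(s) dated after {signal_date}"
        )

# Clean inputs pass silently; a future-dated input would raise ValueError.
check_no_lookahead(date(2022, 3, 1), [date(2022, 2, 28), date(2022, 1, 15)])
```

Cheap insurance: the check costs one line per signal and turns a silent bias into a loud crash.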
The result: an honest number. Probably a smaller number. But honest.
What This Cost Me
Remember Sandra's +28.5 percent rolling test? That was my honest baseline. I expected Claude's layer to bump it to +40 or +50 percent. When it instead dropped to +27.1 percent, I assumed I had a bug.
I didn't. I had look-ahead bias — in reverse. Claude's hindsight didn't add value because Sandra's quantitative screen was already capturing it.
Alpha from qualitative judgment that a computer can produce: zero, to the limit of my ability to measure.
Alpha from qualitative judgment that a human with decades of experience can produce: unknown, but not zero.
Sandra still has her edge. Just probably not an automatable one.
-> Previous: Survivorship Bias — how Sandra's 14 picks hid 12 losers
-> Next: Hindsight Bias — "it was obvious, wasn't it?"
-> Back to pillar
Sources
- Anthropic — Claude training data cutoff — model card
- Cochrane on look-ahead bias in finance — academic reference
- SimFin Point-in-Time Data — the dataset I used
Your Dominic, who spent 4.64 dollars to learn that AI can't help you trade the past.
Disclaimer: Not financial advice. Past performance does not guarantee future results.