ChatGPT was not trained to predict stock returns. It was trained to predict the next token in a text sequence. This is not a minor technical distinction: it is the central fact that determines where the tool is useful and where it is dangerous. Traders who understand this distinction use LLMs as force multipliers. Traders who don't can spend six months optimizing strategies that fail the moment they go live.
How Large Language Models Actually Work
GPT-4, Claude, Llama: all of these are autoregressive language models. They are pretrained on a massive corpus of text (books, websites, code, papers) with a single objective: predict the next token given all previous tokens. The entire architecture (billions of parameters, transformer attention, positional embeddings) is optimized to minimize next-token prediction error on text. Later fine-tuning stages such as instruction tuning and RLHF shape how the model responds, but they do not train it to forecast returns either.
This objective produces extraordinarily capable systems. They can reason through problems, write code in any language, summarize documents, explain complex concepts, and generate coherent long-form text. But it has a fundamental implication for trading:
An LLM trained to predict the next token has never been explicitly trained to predict the next stock return. When it appears to do so — when it says "NVDA is likely to outperform this quarter" — it is pattern-matching against similar text in its training data, not running a financial model. The confidence in its language does not correspond to predictive accuracy.
A statement like that is generated text, statistically consistent with how confident financial commentary sounds in the training data. The same holds for numbers: ask for a Sharpe ratio estimate and the figure you get was not computed; it was predicted as plausible text. This distinction is not pedantic. It has a direct financial consequence.
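To make "predicted as plausible text" concrete, here is a toy next-token model. It is a bigram counter, so it conditions only on the previous token, whereas a real LLM conditions on the whole context with a neural network. The three-sentence corpus is invented for illustration. The point is only that the "Sharpe" it prints is copied from text it has seen, not calculated from any returns.

```python
import random
from collections import defaultdict

# Tiny made-up "training corpus" of confident-sounding commentary (illustrative only).
CORPUS = [
    "the strategy has a sharpe of 1.4",
    "the strategy has a sharpe of 1.8",
    "the fund has a sharpe of 0.9",
]

def train_bigrams(sentences):
    """Record which token follows which: the crudest possible next-token model."""
    followers = defaultdict(list)
    for sentence in sentences:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            followers[prev].append(nxt)
    return followers

def generate(followers, start, max_tokens=10, seed=0):
    """Sample one token at a time, each conditioned only on the previous token."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < max_tokens and out[-1] in followers:
        out.append(rng.choice(followers[out[-1]]))
    return " ".join(out)

model = train_bigrams(CORPUS)
sentence = generate(model, "the")
print(sentence)  # a fluent sentence ending in a number lifted from the corpus
```

No price series appears anywhere in that program, yet it produces a Sharpe figure every time it runs.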
What LLMs Can Actually Do for Traders
Within the correct mental model, LLMs are remarkably useful in a quantitative trading workflow. The key is treating them as coding and reasoning assistants, not as alpha generators.
Code generation for backtesting infrastructure
Writing correct backtesting code is tedious and error-prone. Vectorized pandas operations, handling timezone-aware timestamps, implementing rolling windows without look-ahead — these are exactly the tasks where LLMs shine. Claude Code in particular can write, debug, and refactor backtesting pipelines faster than most intermediate Python programmers. The bottleneck shifts from "can I write this?" to "do I know what to write?"
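As a minimal sketch of what "rolling windows without look-ahead" means in practice, the snippet below lags a trailing-return signal by one bar and checks it with a truncation test: if recomputing the signal on a shortened history changes past values, the signal was using future data. The function names, the three-bar lookback, and the synthetic price series are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def momentum_signal(prices: pd.Series, lookback: int = 3) -> pd.Series:
    """Trailing return over `lookback` bars, lagged one bar so the value
    at time t uses only prices up to t-1."""
    trailing_return = prices / prices.shift(lookback) - 1.0
    return trailing_return.shift(1)

def uses_future_data(signal_fn, prices: pd.Series, cut: int) -> bool:
    """Truncation check: recompute the signal on prices[:cut] and compare it
    with the full-sample signal over the same dates. Any difference means the
    full-sample values depended on data after the cut."""
    full = signal_fn(prices).iloc[:cut]
    truncated = signal_fn(prices.iloc[:cut])
    return not full.equals(truncated)

def leaky_signal(p: pd.Series) -> pd.Series:
    # Deliberately wrong: a centered window averages in the next bar's price.
    return p.rolling(3, center=True).mean()

# Synthetic, timezone-aware business-day prices (made up for the example).
idx = pd.date_range("2024-01-02", periods=10, freq="B", tz="America/New_York")
prices = pd.Series(np.linspace(100.0, 109.0, 10), index=idx)

print(uses_future_data(momentum_signal, prices, cut=6))  # False
print(uses_future_data(leaky_signal, prices, cut=6))     # True
```

A check like this is cheap to run on any generated signal function, which makes it a reasonable first review step before looking at an equity curve.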
Hypothesis articulation and stress-testing
One of the hardest parts of systematic trading is articulating a hypothesis precisely enough to test it. LLMs are excellent sparring partners for this. You describe a pattern you've noticed; the model presses you on the mechanism, the failure modes, the data requirements. This is the Socratic dialogue that López de Prado argues should precede any backtest. "Why does this edge exist? Who is on the other side? What happens when the mechanism breaks?"
Natural language processing on financial text
Earnings call transcripts, 10-K filings, Fed speeches, analyst reports — these contain weakly predictive signal that is genuinely hard to extract with traditional NLP tools. LLMs can parse ambiguity, identify hedging language, and assess tone in ways that bag-of-words approaches cannot. This is a real use case with peer-reviewed evidence behind it (FinBERT, GPT-based sentiment features). The alpha is small and requires careful backtesting, but it's real.
The CLAUDE.md workflow: codified methodology
The most powerful use of LLMs in systematic trading is codifying your entire research methodology in a CLAUDE.md file that Claude reads at the start of every session. You no longer have to re-explain the López de Prado pipeline each time, because the model already has it in context: triple barrier labels, meta-labeling, combinatorial purged cross-validation (CPCV) with purging and embargo, sample uniqueness weights, point-in-time (PIT) universe construction. All the code it generates is then written against that methodology by default.
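Of the techniques listed above, triple barrier labeling is the easiest to show in a few lines. The following is a simplified sketch, not the full published method: it assumes fixed percentage barriers and close prices only, whereas the method as usually described scales the barriers by a volatility estimate. The function name, parameters, and prices are invented for the example.

```python
import pandas as pd

def triple_barrier_labels(close: pd.Series, profit_take: float,
                          stop_loss: float, max_holding: int) -> pd.Series:
    """Label each bar +1 / -1 / 0 by which barrier the forward path hits first.

    +1: upper (profit-take) barrier, -1: lower (stop-loss) barrier,
     0: neither was hit within `max_holding` bars (vertical barrier).
    Bars without a full forward window are left unlabeled.
    """
    labels = {}
    for i in range(len(close) - max_holding):
        entry = close.iloc[i]
        path = close.iloc[i + 1 : i + 1 + max_holding]  # bars after entry only
        label = 0
        for price in path:
            if price >= entry * (1 + profit_take):
                label = 1
                break
            if price <= entry * (1 - stop_loss):
                label = -1
                break
        labels[close.index[i]] = label
    return pd.Series(labels, name="label")

# Made-up closes, purely to exercise the function.
close = pd.Series(
    [100.0, 101.0, 103.0, 99.0, 95.0, 96.0, 97.0],
    index=pd.date_range("2024-01-02", periods=7, freq="B"),
)
labels = triple_barrier_labels(close, profit_take=0.02, stop_loss=0.03, max_holding=2)
print(labels.tolist())  # [1, 0, -1, -1, 1]
```

Writing the rule down this explicitly in a methodology file is what lets the model reproduce it consistently instead of improvising a labeling scheme per session.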
Where LLMs Fail Catastrophically
Confabulated backtests
Ask ChatGPT to backtest a momentum strategy. It will write code that runs, produces a beautiful equity curve, and reports impressive Sharpe ratios. Most of the time, that code contains at least one of: look-ahead bias (using future data in current-period calculations), survivorship bias (using today's index constituents for historical periods), or incorrect position sizing assumptions. The model produces confident, well-formatted, plausible code. The bugs are subtle enough that you won't catch them without domain expertise.
```python
import pandas as pd
import yfinance as yf

# LLM-generated backtest on NDX-100 momentum.
# Runs without errors. Produces a Sharpe of 1.4. Completely wrong.
ndx = pd.read_html('https://en.wikipedia.org/wiki/Nasdaq-100')[4]  # today's list
tickers = ndx['Ticker'].tolist()  # 100 current survivors, not PIT
prices = yf.download(tickers, start='2010-01-01', end='2024-01-01',
                     auto_adjust=False)['Adj Close']  # auto_adjust=False keeps 'Adj Close'
momentum = prices.pct_change(63).rank(axis=1, pct=True)  # 63 trading days, about 3 months
positions = (momentum > 0.75).astype(float)  # top quartile
positions = positions.div(positions.sum(axis=1), axis=0)  # equal-weight
returns = (prices.pct_change() * positions.shift(1)).sum(axis=1)
# Sharpe: 1.4. CAGR: 19.2%.
# None of this happened. RIMM, YHOO, WFMI are gone from the universe.
# You got credit for avoiding every stock that failed. Not your strategy: the data.
```
The model's training data contains thousands of backtesting tutorials, blog posts, and Stack Overflow answers where people load SP500_current_holdings.csv and call it done. The model learned the flawed pattern from the majority of examples in its training set. You can add the instruction "avoid survivorship bias" and it will include a comment reading "# PIT universe loaded" while still pointing to a stale static CSV. The fix is not a better prompt. It is providing the model with the correct data file and a methodology that encodes proper universe construction.
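What "proper universe construction" looks like depends on your data vendor, but the core step can be sketched as turning a table of membership intervals into a date-by-ticker mask. Everything below is hypothetical: the tickers, the dates, and the column names `added` and `removed` are placeholders, not the schema of any real dataset.

```python
import pandas as pd

# Hypothetical membership intervals; a real file would come from a PIT data source.
membership = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "added": pd.to_datetime(["2009-01-01", "2009-01-01", "2012-06-01"]),
    "removed": pd.to_datetime(["2011-12-31", None, None]),  # NaT = still a member
})

def pit_universe_mask(membership: pd.DataFrame, dates: pd.DatetimeIndex) -> pd.DataFrame:
    """Boolean date x ticker frame: True where the ticker was a member on that date."""
    tickers = sorted(membership["ticker"].unique())
    mask = pd.DataFrame(False, index=dates, columns=tickers)
    for row in membership.itertuples(index=False):
        end = row.removed if pd.notna(row.removed) else dates.max()
        in_index = (dates >= row.added) & (dates <= end)
        mask.loc[in_index, row.ticker] = True
    return mask

dates = pd.to_datetime(["2011-06-30", "2012-01-31", "2013-01-31"])
mask = pit_universe_mask(membership, dates)
print(mask)
```

In a backtest, a mask like this would be applied before ranking, for example `momentum.where(mask)` on frames with matching dates and columns, so a stock can only be selected on dates when it was actually in the index. The price data also has to include the delisted names for this to work.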
Hallucinated statistics
LLMs will confidently cite Sharpe ratios, factor loadings, and historical returns for strategies that they have not run. They're synthesizing text that sounds statistically credible based on training data. If you ask "what's the historical Sharpe ratio of a momentum strategy on the Nasdaq-100?" the model will give you a number. That number is not from a real backtest — it's a prediction of what a plausible answer looks like. Treat all statistics from LLMs without code and data provenance as hallucinated.
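Provenance, in the simplest case, means the number comes out of a function applied to a return series you can inspect. A minimal sketch of an annualized Sharpe calculation follows. It assumes daily returns, 252 periods per year, and square-root-of-time scaling, and the sample returns are invented just to exercise the function.

```python
import math
import statistics

def annualized_sharpe(daily_returns, risk_free_daily=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from a sequence of per-period returns.

    Uses the sample standard deviation and sqrt-of-time scaling, which assumes
    returns are roughly independent across periods.
    """
    excess = [r - risk_free_daily for r in daily_returns]
    if len(excess) < 2:
        raise ValueError("need at least two return observations")
    vol = statistics.stdev(excess)
    if vol == 0:
        raise ValueError("zero volatility: Sharpe ratio is undefined")
    return statistics.mean(excess) / vol * math.sqrt(periods_per_year)

# Made-up returns purely to exercise the function; not a real strategy.
sample_returns = [0.01, -0.005, 0.003, 0.007, -0.002]
print(round(annualized_sharpe(sample_returns), 2))
```

Five observations say nothing about a strategy, of course. The point is that a Sharpe ratio you can trust is one you can trace back to a call like this and the data behind it.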
Stale universe data
LLMs have knowledge cutoffs. If you ask ChatGPT which companies are currently in the Nasdaq-100, it may give you an outdated list. More importantly, it has no reliable way to tell you which companies were in the index on a specific historical date. This is the survivorship bias problem in its most concrete form: the model simply doesn't have point-in-time membership data.
Task-by-Task Assessment
| Task | LLM Usefulness | Notes |
|---|---|---|
| Write backtesting code | Good (with review) | Always manually verify for look-ahead and bias — model cannot self-audit |
| Generate strategy hypotheses | Very good | Best as a structured dialogue, not a monologue |
| Parse earnings call transcripts | Good | Domain-fine-tuned models (FinBERT) outperform general LLMs |
| Explain ML methodology | Excellent | CLAUDE.md skill files make this systematic and consistent |
| Predict tomorrow's price | Useless | Wrong objective function — trained on tokens, not returns |
| Validate a backtest | Dangerous | Cannot detect survivorship bias or look-ahead in its own code |
| Know historical index membership | Cannot | Requires point-in-time data that LLMs don't have |
| Optimize portfolio weights | Unreliable | Use scipy/cvxpy with explicit constraints; don't rely on model math |
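For the last row of the table, "explicit constraints" can be as simple as handing the problem to a numerical optimizer instead of asking the model to do the arithmetic. The sketch below uses `scipy.optimize.minimize` with SLSQP for a long-only minimum-variance portfolio. The objective, the covariance matrix, and the function name are illustrative assumptions, not a recommended portfolio construction.

```python
import numpy as np
from scipy.optimize import minimize

def min_variance_weights(cov: np.ndarray, max_weight: float = 1.0) -> np.ndarray:
    """Long-only minimum-variance weights: minimize w' C w subject to
    sum(w) = 1 and 0 <= w_i <= max_weight."""
    n = cov.shape[0]
    result = minimize(
        fun=lambda w: w @ cov @ w,
        x0=np.full(n, 1.0 / n),          # start from equal weights
        method="SLSQP",
        bounds=[(0.0, max_weight)] * n,  # no shorting, optional position cap
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
        options={"ftol": 1e-12},
    )
    if not result.success:
        raise RuntimeError(f"optimizer failed: {result.message}")
    return result.x

# Illustrative diagonal covariance (made-up variances, no correlations).
cov = np.diag([0.04, 0.09, 0.16])
weights = min_variance_weights(cov)
print(weights.round(3))
```

With uncorrelated assets the optimizer should put the most weight on the lowest-variance asset, which gives you a sanity check that does not depend on trusting anyone's mental math.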
The Right Configuration: Claude Code + PIT Data
The workflow that works is not "ask ChatGPT what to buy." It's:
- Claude Code as your coding pair programmer — with a CLAUDE.md file that encodes the complete research methodology (hypothesis framework, PIT filtering, triple barrier labels, meta-labeling, CPCV). The model generates methodology-compliant code automatically.
- Survivorship-bias-free data — PIT index membership means the model's generated code works on a valid universe. The 208 bps bias isn't an abstraction; it's the difference between a strategy that looks real and one that is real.
- Human validation on the methodology — you own the hypothesis, the feature selection, and the interpretation of CPCV results. The LLM handles boilerplate; you handle judgment.
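Interpreting CPCV results is easier once you have seen what purging and embargo do to a training set. The sketch below handles a single contiguous test fold only; full CPCV combines several such folds and is not shown. It assumes each sample is observed at bar `i` and its label is resolved at bar `label_end[i]`, and both the function name and the toy numbers are made up for illustration.

```python
import numpy as np

def purged_train_indices(label_end: np.ndarray, test_start: int, test_stop: int,
                         embargo: int) -> np.ndarray:
    """Training indices for one contiguous test fold [test_start, test_stop).

    Purging drops training samples whose label window overlaps the span
    covered by the test labels; the embargo drops a further `embargo`
    samples immediately after that span.
    """
    idx = np.arange(len(label_end))
    # Last bar that any test-fold label depends on.
    test_last_bar = label_end[test_start:test_stop].max()
    # Keep earlier samples only if their label resolved before the test fold began.
    before = idx[(idx < test_start) & (label_end < test_start)]
    # Keep later samples only if they start after the test labels resolve, plus embargo.
    after = idx[idx > test_last_bar + embargo]
    return np.concatenate([before, after])

# Toy setup: 12 bars, each label looks 2 bars ahead (capped at the last bar).
label_end = np.minimum(np.arange(12) + 2, 11)
train = purged_train_indices(label_end, test_start=4, test_stop=8, embargo=1)
print(train.tolist())  # [0, 1, 11]
```

Out of eight non-test samples only three survive here, which is why leaking labels across the train/test boundary can flatter a backtest so much: the samples nearest the test fold are the ones most correlated with it.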
The NDX PIT Dataset includes 7 CLAUDE.md skill files covering the complete López de Prado pipeline: PIT universe construction, triple barrier labels, meta-labeling, CPCV, feature engineering, position sizing, and regime detection. Drop them into CLAUDE.md and Claude Code knows the methodology without any prompt engineering.
The Honest Verdict
ChatGPT and similar LLMs are transformative tools for quantitative trading workflows — as coding and reasoning assistants. They are not alpha generators, they cannot predict returns, and they will confidently produce flawed backtests that only domain expertise can catch.
The traders who get the most out of these tools are those who understand what they're actually doing: generating probabilistically plausible text that happens to include Python code. Use that superpower for what it's good at. Validate everything. Own the methodology.
"The bottleneck has shifted from whether you can write the code to whether you know what to write. That's a much harder problem — and a much more interesting one."
The model cannot know which companies were in the Nasdaq-100 on any given historical date. It has no mechanism to retrieve or verify this. If you ask it to load "the NDX-100 members as of Q2 2011," it will either hallucinate a list or fetch today's list and label it as the 2011 one. Neither is right. This is not a limitation of prompt engineering. It is a fundamental constraint of what a language model trained on static text can know about a dynamic historical record. The only solution is data.