AI Algorithmic Trading: What Works, What Doesn't, and What the Research Shows

Two things are called "AI" in trading and they are not the same thing. Generative AI — GPT-4, Claude, Llama — optimizes for next-token prediction on text. Predictive ML — random forests, gradient boosted trees, shallow neural networks — optimizes for a financial loss function on financial features. Different objective functions, different training regimes, different failure modes. One of them has peer-reviewed evidence of producing out-of-sample alpha. The other is genuinely useful for writing the code that tests whether you have any.

The Two AIs in Trading

Generative AI is optimized for next-token prediction. During training, the model sees a sequence of tokens and learns to predict what comes next. This produces systems that are extraordinarily good at pattern-matching over text — writing code, summarizing documents, reasoning through problems stated in natural language. But it says nothing about whether such a system can predict financial returns.

Returns prediction requires a model trained to minimize a financial loss function on financial data. This is predictive ML's domain: feature engineering, label construction (what exactly are we predicting?), cross-validation that respects temporal structure, and production deployment where the same signal that worked in 2018 still works in 2024.

Key distinction

GPT/Claude predicts the next token. A Random Forest trained on cross-sectional momentum features predicts relative returns. These are different problems, different tools, and different validity standards. Neither is inherently superior — they solve different things.

The reason this distinction matters: most retail "AI trading" content refers exclusively to generative AI. Prompting ChatGPT to "predict whether AAPL will go up tomorrow" is asking the wrong system the wrong question with no validation framework. The research-backed approach to ML in trading uses predictive models with proper validation — and leverages generative AI as a coding and reasoning assistant, not a signal generator.

What Actually Works

The academic literature on ML in finance is more skeptical than the marketing literature. Gu, Kelly, and Xiu (2020) in Empirical Asset Pricing via Machine Learning found that gradient boosted trees and neural networks do generate statistically significant out-of-sample alpha — but only with proper cross-validation, careful feature engineering, and realistic transaction cost assumptions. The raw improvement over a linear factor model was modest.

What consistently survives rigorous validation:

Regime-conditional momentum

Momentum (past returns predicting future returns) is one of the most robust factors in finance. But raw momentum is extremely regime-sensitive — it works in trending markets and violently reverses in drawdown regimes. ML adds value here not by predicting returns directly, but by learning when momentum is likely to work. A meta-label model trained on volatility regime features, market breadth, and macro indicators can substantially improve a simple momentum primary signal.

Nonlinear signal combination at daily frequency

Individual features at daily frequency are nearly useless: cross-sectional momentum, volume ratio, and distance from 52-week high each carry maybe 0.05% predictive power in isolation. The edge in GBTs and random forests at daily frequency is not discovering new signals; it's combining dozens of weak signals in a way that linear models can't. A linear regression won't discover the interaction between momentum and vol regime unless you hand it that term explicitly. A cross-validated GBT will. As Gu, Kelly, and Xiu (2020) showed on a universe of US equities, the resulting gain over a linear factor model is statistically significant but modest. The implication: you need a real edge in your hypothesis, not just a more complex model.
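A toy illustration of why the interaction matters, using synthetic data rather than market data. The target below depends only on the product of two features (stand-ins for momentum and a vol regime score), so each feature on its own has roughly zero linear correlation with it. The data-generating process and all names are assumptions made up for this sketch.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

rng = random.Random(0)
n = 5000
mom = [rng.gauss(0, 1) for _ in range(n)]         # synthetic "momentum"
vol_regime = [rng.gauss(0, 1) for _ in range(n)]  # synthetic "vol regime"
# Synthetic target: driven purely by the interaction, plus noise
target = [m * v + rng.gauss(0, 1) for m, v in zip(mom, vol_regime)]

interaction = [m * v for m, v in zip(mom, vol_regime)]
corr_mom = pearson(mom, target)
corr_vol = pearson(vol_regime, target)
corr_interaction = pearson(interaction, target)
print(f"mom: {corr_mom:+.3f}  vol: {corr_vol:+.3f}  mom*vol: {corr_interaction:+.3f}")
```

A linear model fed only the two raw features has nothing to work with here. Tree ensembles can approximate the product from the raw inputs, which is the kind of structure the paragraph above is describing.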

NLP on earnings calls and filings

This is the one place where generative AI's text understanding genuinely produces financial signal. Earnings call transcripts contain measurable signals: when management's language becomes more hedged ("approximately," "we believe," "subject to"), forward returns are worse. When specificity increases (exact dollar commitments, named product timelines), returns are better. FinBERT and fine-tuned transformers do outperform bag-of-words on this task. But the alpha is small, on the order of 20-30 bps per signal, and it decays quickly as more participants adopt similar tools. The signal is real; the infrastructure to harvest it profitably is not trivial.
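To make "hedging language" concrete, here is a deliberately crude lexicon baseline, the kind of bag-of-words feature that FinBERT-style models are said to beat. It only counts the three phrases quoted above; the function name and the sample sentences are hypothetical.

```python
import re

# Hedging phrases quoted in the text above; a real lexicon would be much larger
HEDGE_PHRASES = ("approximately", "we believe", "subject to")

def hedge_density(text):
    """Hedging-phrase occurrences per 100 words (crude lexicon baseline)."""
    lowered = text.lower()
    n_words = len(lowered.split())
    if n_words == 0:
        return 0.0
    hits = sum(
        len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
        for phrase in HEDGE_PHRASES
    )
    return 100.0 * hits / n_words

# Hypothetical sentences, not taken from any real transcript
hedged = "We believe revenue will be approximately flat, subject to market conditions."
specific = "We will ship the product in March and spend $40 million on capacity."
print(hedge_density(hedged), hedge_density(specific))
```

A usable signal would track the change in this density quarter over quarter for the same company, not its level, and would need timestamps for when each transcript became public.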

What Doesn't Work

Raw LLM price prediction

No serious quant research group is using GPT to predict whether a stock goes up tomorrow. The objective function mismatch is fundamental — you can't fine-tune a language model on "stock went up/down" labels and expect meaningful signal, because the model was already trained on billions of text tokens that dominate any financial fine-tuning. The signal-to-noise ratio in daily returns is extremely low, far lower than the tasks LLMs were designed for.

Unconstrained deep learning

Deep neural networks with many parameters and no strong inductive bias tend to overfit on financial data. Unlike in computer vision (where CNNs have a well-understood architectural advantage) or NLP (where the transformer architecture matches the sequential structure of language), there's no obvious architectural advantage for deep networks on structured financial tabular data over well-tuned gradient boosted trees. GBTs remain the go-to for most financial ML tasks.

Warning: overfitting masquerades as alpha

The most dangerous failure mode in financial ML is not a model that obviously fails. It's one that looks great in backtests and fails in production. Standard k-fold cross-validation is wrong for financial time series: shuffled folds ignore temporal ordering, so training data can postdate, and overlap with, the labels being tested. Backtests that use an index's current constituent list add survivorship bias on top of this; we measured 208 bps/year on a simple Nasdaq-100 momentum backtest.

Sentiment trading without real-time data

News sentiment is widely hyped as an alpha source. It can be, but only with institutional-grade data infrastructure: real-time news feeds, sub-second processing, and awareness of exactly when information becomes public. Backtesting news sentiment with end-of-day price data introduces look-ahead bias in both the signal timing and the universe.

The López de Prado Framework

Marcos López de Prado's Advances in Financial Machine Learning (2018) provides the most rigorous published framework for applying ML to trading. Three contributions stand out as genuinely important for practitioners:

Meta-labeling: separating what to trade from when to trade

The insight is that a simple primary signal (e.g., "buy when 20d momentum > 0") can be right about direction but wrong about timing. A secondary meta-label model learns from the features around each trade signal: is this the kind of signal setup that historically works? The primary model selects candidates. The meta-model filters them. Position sizing uses the meta-model's probability output directly via Kelly.

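from sklearn.ensemble import RandomForestClassifier

# Assumed to exist already: features (DataFrame), primary_events (one row per
# primary signal, with its triple-barrier label), uniqueness_weights, X_test
# (held-out rows with the meta-feature columns), and pt_multiple (assumed here
# to be the profit-take / stop-loss payoff ratio)

# Primary signal: simple, interpretable, well-motivated
primary_long = features['mom_20d'] > features['mom_20d'].rolling(252).mean()

# Meta-label: learn WHEN the primary signal is trustworthy
# (rows must line up with primary_events: one row per primary signal)
meta_features = features[['vol_ratio', 'breadth_zscore', 'mom_20d', 'px_pos_52w']]
meta_y = (primary_events['label'] == 1).astype(int)  # was the primary signal right?

meta_model = RandomForestClassifier(n_estimators=300, max_depth=4, class_weight='balanced')
meta_model.fit(meta_features, meta_y, sample_weight=uniqueness_weights)

# Position size is Kelly-scaled by meta_probability
meta_prob = meta_model.predict_proba(X_test)[:, 1]
kelly_f = (meta_prob * pt_multiple - (1 - meta_prob)) / pt_multiple  # simplified Kelly
kelly_f = kelly_f.clip(min=0)  # a negative Kelly fraction means no position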
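The simplified Kelly line can be checked by hand. The sketch below applies the same formula to scalars; `kelly_fraction` is a hypothetical helper, and treating `pt_multiple` as the payoff ratio (profit per unit risked) is an assumption about the snippet above.

```python
def kelly_fraction(p_win, payoff_ratio):
    """Kelly fraction for a bet that wins `payoff_ratio` per unit risked
    with probability `p_win` and loses the unit otherwise."""
    f = (p_win * payoff_ratio - (1.0 - p_win)) / payoff_ratio
    return max(f, 0.0)  # negative edge -> no position

# p = 0.6, payoff 2:1  ->  (0.6 * 2 - 0.4) / 2 = 0.4
strong = kelly_fraction(0.6, 2.0)
# p = 0.3, payoff 1:1  ->  (0.3 - 0.7) / 1 = -0.4, floored at 0
weak = kelly_fraction(0.3, 1.0)
print(strong, weak)
```

The break-even probability is 1 / (1 + payoff_ratio): below it the fraction is zero, which is how the meta-model ends up filtering trades as well as sizing them.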

Combinatorial Purged Cross-Validation (CPCV)

A single train/test backtest gives you one Sharpe ratio. You have no idea if that number is representative or lucky. CPCV addresses this by dividing the data into N groups and holding out every combination of K of them as the test set. With N = 6 and K = 2 that yields C(6,2) = 15 train/test combinations, called paths here (strictly, López de Prado recombines these 15 splits into 5 full-length backtest paths). Each one gives an out-of-sample Sharpe. Now you have a distribution, not a point estimate.

At every fold boundary, events whose labels extend into the test set are purged from training. Then an embargo period (typically 5-10 days) prevents information leakage from serial correlation in returns. This is more restrictive than standard cross-validation — intentionally so. If a strategy can't survive purging and embargo, it was leaking in the first place.
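A minimal sketch of the split generation with purging and embargo, under simplifying assumptions: observations are integer positions, every label has the same fixed horizon (observation i's label spans i through i + label_horizon), and the embargo is counted in observations. `cpcv_splits` is a hypothetical helper, not López de Prado's reference implementation.

```python
from itertools import combinations

def cpcv_splits(n_obs, label_horizon, n_groups=6, k_test=2, embargo=5):
    """Yield (train_idx, test_idx) for every combination of k_test test groups."""
    bounds = [round(g * n_obs / n_groups) for g in range(n_groups + 1)]
    groups = [range(bounds[g], bounds[g + 1]) for g in range(n_groups)]
    for test_groups in combinations(range(n_groups), k_test):
        test_idx = [i for g in test_groups for i in groups[g]]
        blocked = set(test_idx)
        for g in test_groups:
            start, end = bounds[g], bounds[g + 1]  # test block is [start, end)
            # Purge: training labels that would extend into the test block
            blocked.update(range(max(0, start - label_horizon), start))
            # After the block: observations still covered by test labels,
            # plus the embargo period
            blocked.update(range(end, min(n_obs, end + label_horizon + embargo)))
        train_idx = [i for i in range(n_obs) if i not in blocked]
        yield train_idx, test_idx

splits = list(cpcv_splits(n_obs=600, label_horizon=5))
print(len(splits))  # one split per combination of test groups
```

Real triple-barrier labels have variable end times, so a production version would purge on each label's actual start and end timestamps instead of a fixed horizon.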

The quality criterion: oos_sharpe.mean() / oos_sharpe.std() > 1.5. If your strategy produces a positive Sharpe on 13 of 15 paths, that's evidence. If it produces a positive Sharpe on 8 of 15 but the single-path number looked great, you were lucky. This distinction alone eliminates most of what people call "strategies."
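The criterion is easy to compute once the per-path Sharpe ratios exist. In the sketch below the two lists are hypothetical numbers chosen to mirror the 13-of-15 and 8-of-15 cases; the sample standard deviation (ddof = 1, which is also the pandas default) is an assumption about what `oos_sharpe.std()` means.

```python
import statistics

def path_stability(oos_sharpes):
    """Return (mean / std of the path Sharpes, share of paths with Sharpe > 0)."""
    mean = statistics.fmean(oos_sharpes)
    std = statistics.stdev(oos_sharpes)  # sample standard deviation (ddof=1)
    share_positive = sum(s > 0 for s in oos_sharpes) / len(oos_sharpes)
    return mean / std, share_positive

# Hypothetical out-of-sample Sharpes for 15 paths each
consistent = [0.8, 1.1, 0.9, 1.3, 0.7, 1.0, 1.2, 0.6, 0.9, 1.1, -0.1, 1.0, 0.8, -0.2, 1.2]
lucky = [2.1, -0.5, 0.3, -0.8, 1.5, -0.2, 0.4, -1.0, 0.9, -0.6, 0.2, -0.3, 1.8, -0.4, 0.1]

ratio_consistent, share_consistent = path_stability(consistent)
ratio_lucky, share_lucky = path_stability(lucky)
print(f"consistent: ratio={ratio_consistent:.2f}, positive={share_consistent:.0%}")
print(f"lucky:      ratio={ratio_lucky:.2f}, positive={share_lucky:.0%}")
```

The second list contains a 2.1 and a 1.8. Those are the paths you would have reported if you had run only one backtest and liked the result.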

Sample uniqueness weighting

When triple-barrier labels overlap (say, a 5-day label and a 4-day label that both started on the same day), training treats those samples as independent, overstating the effective sample size. Sample uniqueness corrects for this. At each bar, a label's uniqueness is 1 divided by the number of labels active at that bar, and each observation is weighted by the average of that value over its lifetime. A label that never overlaps another gets weight 1; heavily overlapped labels get less. The weights are passed as sample_weight to sklearn estimators. It's a small but important correction that prevents overstating model confidence.
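A small sketch of López de Prado's average uniqueness, which weights each bar of a label's lifetime by 1 / (number of labels active at that bar) and then averages. Labels are represented as inclusive (start, end) bar indices, which is an assumption of this example; `average_uniqueness` is a hypothetical helper.

```python
def average_uniqueness(spans):
    """spans: list of (start, end) inclusive bar indices, one per label.

    Returns one weight per label: the mean over its lifetime of 1 / concurrency.
    """
    concurrency = {}
    for start, end in spans:
        for t in range(start, end + 1):
            concurrency[t] = concurrency.get(t, 0) + 1
    weights = []
    for start, end in spans:
        lifetime = range(start, end + 1)
        weights.append(sum(1.0 / concurrency[t] for t in lifetime) / len(lifetime))
    return weights

# A 5-bar and a 4-bar label starting on the same bar, plus one isolated label
weights = average_uniqueness([(0, 4), (0, 3), (10, 12)])
print(weights)
```

The 5-bar label shares four of its five bars, so its weight is (4 × 0.5 + 1) / 5 = 0.6; the 4-bar label is shared throughout and gets 0.5; the isolated label keeps weight 1.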

◈

The data problem comes first

Every technique above requires a clean, survivorship-bias-free universe. Backtesting momentum on the 2026 Nasdaq-100 list introduced 208 bps/year of bias in our measurements, before a single line of ML code was written. The NDX PIT dataset gives you point-in-time Nasdaq-100 membership for every trading day from 2007 to 2026, the foundation that makes the López de Prado framework valid.

Data Quality Is the Real Bottleneck

The biggest predictor of whether a quantitative strategy is real or spurious is not which ML model you use — it's the quality of the data you train on. Three data quality problems dominate:

Survivorship bias in the universe

If your backtested universe only contains companies that survived to the current date, you're backtesting on winners by definition. In the Nasdaq-100 specifically, turnover is high: the index is reconstituted annually, rebalanced quarterly, and replaces members between reviews, and it has dropped dozens of companies over its history. YHOO was removed in 2017 when Verizon acquired it. RIMM/BlackBerry was removed as its business collapsed. WFMI (Whole Foods) fell 80% in the 2008 financial crisis and was eventually acquired. Backtesting momentum on the 2026 list means your backtest "avoids" these losers, not because your strategy avoided them, but because they aren't in your data at all. We measured this bias at 208 basis points per year on a simple momentum backtest.
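In code, the fix is to ask "who was in the index on this date?" instead of using today's list. A minimal sketch with hypothetical tickers and dates (not real index history), assuming membership is stored as add/remove intervals:

```python
from datetime import date

# Hypothetical membership intervals: (ticker, added, removed).
# removed=None means the ticker is still a member.
MEMBERSHIP = [
    ("AAA", date(2007, 1, 3), None),
    ("BBB", date(2007, 1, 3), date(2012, 6, 15)),
    ("CCC", date(2015, 12, 21), None),
]

def members_on(day, membership=MEMBERSHIP):
    """Tickers that were index members on `day` (added <= day < removed)."""
    return sorted(
        ticker
        for ticker, added, removed in membership
        if added <= day and (removed is None or day < removed)
    )

print(members_on(date(2010, 1, 4)))  # universe as it stood in 2010
print(members_on(date(2020, 1, 2)))  # universe as it stood in 2020
```

A backtest that loops over dates and calls something like `members_on` each day will trade BBB until it leaves and will not trade CCC before it joins, which is exactly what the current-list shortcut gets wrong.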

Look-ahead bias in labels

Triple barrier labels can introduce look-ahead if constructed carelessly. The label for event t is determined by what happens at t+1 through t+h. Features must be constructed only from information available at t. This sounds obvious but breaks in subtle ways: accounting ratios based on quarterly reports that weren't yet filed, analyst estimates that embed future information, or index membership that you didn't know on day t.
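The quarterly-report case can be handled with an as-of lookup keyed on filing date rather than period end. The sketch below uses hypothetical filings and EPS values, and assumes a report counts as known on its filing date (for after-close filings you would shift that by a day).

```python
from bisect import bisect_right
from datetime import date

# Hypothetical quarterly reports: (filing_date, period_end, eps), sorted by filing_date
REPORTS = [
    (date(2023, 2, 10), date(2022, 12, 31), 1.10),
    (date(2023, 5, 9), date(2023, 3, 31), 1.25),
    (date(2023, 8, 8), date(2023, 6, 30), 0.90),
]
FILING_DATES = [report[0] for report in REPORTS]

def eps_known_on(day):
    """Latest EPS that had actually been filed by `day`, or None if none was."""
    pos = bisect_right(FILING_DATES, day)
    return REPORTS[pos - 1][2] if pos else None

# On 2023-04-15 the March quarter has ended but has not been filed yet,
# so the feature must still use the December-quarter figure.
print(eps_known_on(date(2023, 4, 15)))
```

Joining on period end instead would hand the model the March-quarter number more than five weeks before anyone could have traded on it.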

Stale or adjusted price data

Price data needs to be properly adjusted for dividends and splits on a point-in-time basis. Using a single retrospective adjusted series introduces subtle biases when you're computing features like 52-week range position or drawdown — because the adjustment factors themselves are only known retrospectively.
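One way to read "point-in-time adjustment" is that a price series as seen on date t should only be adjusted for corporate actions that had happened by t. A minimal sketch for splits only (dividends are left out), with hypothetical prices and a hypothetical 2-for-1 split:

```python
from datetime import date

# Hypothetical raw closes around a 2-for-1 split with ex-date 2023-06-01
RAW_CLOSE = {
    date(2023, 5, 30): 200.0,
    date(2023, 5, 31): 204.0,
    date(2023, 6, 1): 103.0,
}
SPLITS = [(date(2023, 6, 1), 2.0)]  # (ex_date, new shares per old share)

def adjusted_close(price_day, as_of):
    """Split-adjusted close for `price_day`, using only splits with ex_date <= as_of."""
    if price_day > as_of:
        raise ValueError("price_day is after as_of: that would be look-ahead")
    factor = 1.0
    for ex_date, ratio in SPLITS:
        if price_day < ex_date <= as_of:
            factor *= ratio
    return RAW_CLOSE[price_day] / factor

print(adjusted_close(date(2023, 5, 31), as_of=date(2023, 5, 31)))  # before the split
print(adjusted_close(date(2023, 5, 31), as_of=date(2023, 6, 1)))   # after the split
```

The same historical close has two different adjusted values depending on when you look at it. Features computed on day t should use the series as it stood on day t.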

A Practical Starting Point

Here is what a real hypothesis-first strategy looks like on NDX data — not a generic template, but a specific example that illustrates the difference between data mining and research.

Hypothesis: Nasdaq-100 large-cap momentum persists at 20-day horizons because institutional rebalancing is slow and earnings surprises cluster. The friction is institutional size — a fund managing $50B cannot exit a position in two days without moving the market, so price trends extend longer than they would in a frictionless world.

Primary signal: Cross-sectionally z-scored 20-day return above 1.0 standard deviation. Simple, directly motivated by the hypothesis, interpretable. No interaction terms, no lookback optimization.
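The primary signal is a few lines of code. In this sketch the tickers and returns are hypothetical, and using the population standard deviation across the cross-section is an assumption; the text does not specify which estimator to use.

```python
import statistics

def primary_signal(returns_20d, threshold=1.0):
    """returns_20d: {ticker: 20-day return} for ONE date.

    Returns the tickers whose cross-sectional z-score exceeds `threshold`.
    """
    values = list(returns_20d.values())
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std across the cross-section
    if std == 0:
        return []
    return sorted(t for t, r in returns_20d.items() if (r - mean) / std > threshold)

# Hypothetical one-day cross-section
snapshot = {"AAA": 0.12, "BBB": 0.01, "CCC": -0.03, "DDD": 0.02, "EEE": 0.00, "FFF": -0.02}
print(primary_signal(snapshot))
```

The z-score is computed across tickers on a single date, never across time, so the signal uses nothing that was unknown on that date.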

Meta-label features: vol_ratio (rolling 5d vol / 20d vol — is uncertainty expanding?), index breadth z-score (are momentum signals correlated across the whole universe, which reduces edge?), sector concentration (is the momentum concentrated in one sector, suggesting macro not idiosyncratic?). The meta-model learns when the primary signal's environment is hostile.
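As one example of these features, vol_ratio can be sketched as the standard deviation of the last 5 daily returns over that of the last 20. The return series here is synthetic, and the window lengths are simply the ones named above.

```python
import statistics

def vol_ratio(daily_returns, short=5, long=20):
    """Std of the last `short` returns divided by std of the last `long` returns."""
    if len(daily_returns) < long:
        raise ValueError("need at least `long` observations")
    short_vol = statistics.stdev(daily_returns[-short:])
    long_vol = statistics.stdev(daily_returns[-long:])
    return short_vol / long_vol

# Synthetic series: 15 quiet days followed by 5 volatile days
expanding = [0.005, -0.005] * 7 + [0.005] + [0.03, -0.03, 0.03, -0.03, 0.03]
# The same returns in reverse order: volatility is contracting
contracting = expanding[::-1]
print(vol_ratio(expanding), vol_ratio(contracting))
```

A ratio above 1 says recent volatility exceeds the trailing month's, which is the "uncertainty expanding" condition the meta-model is meant to learn from.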

Validation: CPCV with C(6,2) = 15 paths. If oos_sharpe.mean() / oos_sharpe.std() > 1.5, the hypothesis survives. If not, you return to the hypothesis — not to parameter tuning. Changing the 20-day lookback to 18 days because it improves one path's Sharpe is exactly the behavior that destroys strategies in production.

Sizing: Meta-model probability into half-Kelly. 2% max per ticker. 20% max per sector. These limits exist not because of risk management convention but because momentum strategies are vulnerable to sector-level crashes — think semiconductors in 2001 or cloud software in 2022.
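The sizing rules can be sketched as follows. The order of operations (half-Kelly, then the per-ticker cap, then a proportional scale-down of any sector above its limit) is an assumption of this example, as are the function name, the default payoff ratio of 1, and the hypothetical tickers.

```python
def size_positions(meta_prob, sector, pt_multiple=1.0, max_ticker=0.02, max_sector=0.20):
    """Half-Kelly weights per ticker, capped at max_ticker, then scaled down so
    that no sector's total weight exceeds max_sector.

    meta_prob: {ticker: meta-model probability}; sector: {ticker: sector name}.
    """
    weights = {}
    for ticker, p in meta_prob.items():
        kelly = (p * pt_multiple - (1.0 - p)) / pt_multiple
        weights[ticker] = min(max(0.5 * kelly, 0.0), max_ticker)
    sector_total = {}
    for ticker, w in weights.items():
        sector_total[sector[ticker]] = sector_total.get(sector[ticker], 0.0) + w
    for ticker in weights:
        total = sector_total[sector[ticker]]
        if total > max_sector:
            weights[ticker] *= max_sector / total
    return weights

# Hypothetical book: 12 tech names the meta-model likes, 1 health name it does not
probs = {f"T{i}": 0.7 for i in range(12)}
probs["H0"] = 0.4
sectors = {ticker: ("tech" if ticker.startswith("T") else "health") for ticker in probs}
weights = size_positions(probs, sectors)
print(sum(w for t, w in weights.items() if sectors[t] == "tech"))
```

With a payoff ratio of 1, half-Kelly exceeds 2% for any probability above 0.52, so in this sketch the per-ticker cap binds almost every time and the sector cap does most of the real work.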

The honest verdict

Where ML creates real edge in trading

✓ Meta-labeling: identifying WHEN a signal is live vs. dead
✓ Regime detection: conditioning strategy behavior on market state
✓ NLP on structured text: earnings call sentiment, filing analysis
✓ Nonlinear factor combination: improving classic factor models
✗ LLM price prediction: wrong objective function, no validation framework
✗ Deep learning on tabular data: tends to overfit, and well-tuned GBTs are hard to beat
✗ Any ML trained on survivorship-biased data: garbage in, garbage out
