
Claude Code for Algorithmic Trading:
The CLAUDE.md Research Framework

How to encode the López de Prado methodology once and have every backtest Claude generates be survivorship-bias-free, correctly labeled, and properly validated — without re-explaining the pipeline in every prompt.

Claude Code generates backtesting code faster than a programmer can type it. It also generates backtests with survivorship bias, look-ahead bias, and incorrect cross-validation: confidently, without runtime errors, and with Sharpe ratios that look completely plausible. The difference between using Claude Code correctly and incorrectly is not which model you use or how you phrase your prompts. It's whether you've given the model a methodology to comply with.

The CLAUDE.md pattern solves this. You encode the complete research pipeline once. Every session that follows — every backtest, every feature, every position sizing calculation — is generated against that encoded methodology. The model doesn't need to rediscover the correct approach each time. It's already in context.

Why LLM-Generated Backtests Fail by Default

Claude and ChatGPT were trained on text, including enormous quantities of backtesting code, finance tutorials, and quantitative strategy articles. The problem is that the majority of this training material makes the same methodological errors that practitioners have been making for decades: loading current index constituents for historical periods (survivorship bias), computing features from data that wasn't yet available (look-ahead bias), using standard k-fold cross-validation on financial time series (temporal leakage).

When you ask the model to write a backtest, it generates code that is statistically consistent with the examples it was trained on — which means it generates code with these errors. You can instruct the model to "avoid survivorship bias." It will add a comment to that effect. The code will still load a static ticker file.

The core problem

Instructions in the prompt are context the model reasons about. Methodology encoded in CLAUDE.md is a concrete pattern with code examples that the model matches against. The difference in output reliability is substantial. A prompt instruction says what to do; a CLAUDE.md skill says exactly how to do it and shows the wrong alternative so it can avoid it.

For Nasdaq-100 strategies specifically, the survivorship bias problem is measurable: using the 2026 constituent list for all historical dates inflates CAGR by 208 basis points per year on a simple momentum strategy. CLAUDE.md doesn't fix this on its own — you still need the PIT data. But CLAUDE.md ensures that every generated backtest uses the PIT lookup function instead of loading a stale static file.

How CLAUDE.md Works with Claude Code

Claude Code reads the CLAUDE.md file in your project root at the start of every session. The contents become part of the context window for all subsequent interactions. Whatever is in that file, the model treats as established fact about your project — its conventions, its APIs, its methodology requirements.

The standard use of CLAUDE.md is project configuration: code style, testing framework, architecture conventions. The underutilized use is methodology encoding: writing down the complete research pipeline in enough detail that code generated by the model is compliant by construction.

# Install Claude Code
npm install -g @anthropic-ai/claude-code

# In your project directory, append skill files to CLAUDE.md:
cat ndx-pit-dataset/skills/SKILL_*.md >> CLAUDE.md

# Now open Claude Code — methodology is in context for this session
claude "build a momentum strategy on the NDX PIT data"
# Generated code: PIT universe lookup, CUSUM events, triple barrier labels,
# meta-label model, CPCV validation, Kelly sizing. No survivorship bias.
# No look-ahead. No standard k-fold.

This works because the skill files are not documentation — they're executable specifications. Each one contains the correct code pattern, the incorrect anti-pattern with an explanation of why it fails, and the economic reasoning behind every design decision. The model matches against these patterns when generating new code.

The Seven Skill Files

The complete López de Prado pipeline for Nasdaq-100 research requires seven distinct methodology components. Each one addresses a different failure mode that appears in LLM-generated code without it.

01 — SKILL_strategy_research.md: The Research Process

The most important skill file and the one most often skipped. It encodes the research process itself: economic hypothesis first, primary signal second, meta-label model third, CPCV validation fourth. Without this, the model jumps directly from a pattern to an optimized parameter set — data mining, not research. With it, the model generates code that tests a specific mechanism, not whichever indicator combination produces the highest in-sample Sharpe.

The red flag list this skill encodes: "I found a combination of 7 indicators that backtests at Sharpe 2.1." "The optimal lookback period is 23 days." "This works in every market except 2008." These are data-mining fingerprints. The model learns to flag them rather than reproduce them.

02 — SKILL_pit_dataset.md: Survivorship-Bias-Free Universe

The PIT skill encodes the exact API for universe construction from the point-in-time dataset. The model sees the correct pattern:

# CORRECT — PIT lookup for each backtest date
universe = pit_data[pit_data['date'] == date]['ticker'].tolist()

# WRONG — introduces 208 bps/year survivorship bias
universe = pd.read_csv('ndx100_tickers.csv')['ticker'].tolist()

When the model generates any code that requires a universe, it uses the first pattern. The anti-pattern and its measured bias are in context, so the model avoids it the same way it avoids eval() once eval() has been flagged in context as a known security vulnerability.
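As a slightly fuller illustration of the lookup, here is a minimal sketch. It assumes the PIT table holds one row per (date, ticker) membership with `date` and `ticker` columns, as in the snippet above. The helper name `get_universe` and the toy tickers are hypothetical, not part of the dataset's API.

```python
import pandas as pd

def get_universe(pit_data: pd.DataFrame, date) -> list[str]:
    """Return the tickers that were index members on `date` (hypothetical helper)."""
    day = pd.Timestamp(date)
    members = pit_data.loc[pit_data["date"] == day, "ticker"]
    return sorted(members.unique().tolist())

# Toy membership table: AAA drops out after 2020-01-02, CCC joins on 2020-01-03.
pit_data = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-02", "2020-01-02", "2020-01-03", "2020-01-03"]),
    "ticker": ["AAA", "BBB", "BBB", "CCC"],
})

universe_before = get_universe(pit_data, "2020-01-02")
universe_after = get_universe(pit_data, "2020-01-03")
```

A static ticker file would return the same list for both dates, which is exactly how a removed name like AAA disappears from history.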

03 — SKILL_triple_barrier.md: Labels and Meta-Labeling

Triple barrier labeling is the correct way to construct labels for financial ML — it captures the trade's actual outcome (profit target hit, stop loss hit, or time expiry) rather than arbitrary fixed-horizon returns. But it's complex enough that LLM-generated implementations frequently contain errors in barrier construction, CUSUM event detection, or label alignment.
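To make the three barriers concrete, here is a minimal sketch for a single long trade. It assumes fixed percentage barriers checked on closing prices only, and it labels a time expiry as 0. López de Prado's version scales the barriers by volatility, and other conventions label expiry by the sign of the return. The function name and defaults are illustrative, not the skill file's actual code.

```python
def triple_barrier_label(prices, start, profit_target=0.03, stop_loss=0.02, max_holding=20):
    """Label one long trade entered at index `start`.

    Returns +1 if the profit target is touched first, -1 if the stop is
    touched first, and 0 if the time barrier expires with neither touched.
    """
    entry = prices[start]
    end = min(start + max_holding, len(prices) - 1)
    for t in range(start + 1, end + 1):
        ret = prices[t] / entry - 1.0
        if ret >= profit_target:
            return 1
        if ret <= -stop_loss:
            return -1
    return 0

up_path = [100.0, 101.0, 102.0, 103.5, 104.0]    # crosses +3% on day 3
down_path = [100.0, 99.0, 97.5, 99.0, 104.0]     # crosses -2% on day 2, later rally is ignored
flat_path = [100.0, 100.5, 99.5, 100.2, 100.1]   # neither barrier touched

labels = [triple_barrier_label(p, start=0) for p in (up_path, down_path, flat_path)]
```

The second path shows why this differs from a fixed-horizon return: the trade is stopped out before the rally, so the label reflects what the position would actually have done.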

More importantly, this skill encodes meta-labeling: the two-model pipeline where a primary signal identifies trade candidates and a secondary model learns when the primary signal is trustworthy. Without this, the model treats the trading problem as a single binary classification task. With it, the model generates the full pipeline:

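# Meta-labeling pipeline
# cusum_filter and triple_barrier_labels are assumed project helpers, not library calls
from sklearn.ensemble import RandomForestClassifier

# Step 1: Primary signal (simple, interpretable, hypothesis-driven)
primary_long = features['mom_20d_zscore'] > 1.0

# Step 2: Generate events with the CUSUM filter, keep those where the primary signal is on
events = cusum_filter(daily_vol, threshold=0.02)  # assumed to return a DatetimeIndex
primary_events = events[primary_long.reindex(events, fill_value=False).to_numpy()]

# Step 3: Meta-label: did the primary signal's prediction verify?
# Labels come from the triple barrier outcome of each event, not from the event index itself
labels = triple_barrier_labels(close, primary_events)  # assumed: DataFrame with a 'label' column
meta_y = (labels['label'] == 1).astype(int)
meta_model = RandomForestClassifier(n_estimators=200, max_depth=4,
                                     class_weight='balanced')
meta_model.fit(X_meta, meta_y, sample_weight=uniqueness_weights)

# Step 4: Meta-probability drives position size (not a binary bet)
meta_prob = meta_model.predict_proba(X_test)[:, 1]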
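Step 2 relies on a CUSUM filter that the snippet does not define. The following is a minimal sketch of the symmetric CUSUM filter, assuming it is applied to a plain list of per-period returns and returns integer positions. The name `cusum_events` is hypothetical, and the skill file's `cusum_filter` may take different inputs.

```python
def cusum_events(returns, threshold):
    """Symmetric CUSUM filter: positions where cumulative drift exceeds the threshold."""
    events = []
    s_pos, s_neg = 0.0, 0.0
    for i, r in enumerate(returns):
        # Accumulate drift in each direction, resetting toward zero when it reverses
        s_pos = max(0.0, s_pos + r)
        s_neg = min(0.0, s_neg + r)
        if s_neg < -threshold:
            s_neg = 0.0
            events.append(i)
        elif s_pos > threshold:
            s_pos = 0.0
            events.append(i)
    return events

returns = [0.005, 0.010, 0.010, -0.002, -0.015, -0.010, 0.001]
events = cusum_events(returns, threshold=0.02)
```

The filter fires only after a sustained move, so events cluster around genuine shifts instead of being sampled at every bar.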

04 — SKILL_cpcv.md: Combinatorial Purged Cross-Validation

The most technically demanding skill file. It encodes purged cross-validation (removing training samples whose labels extend into the test period), embargo periods (preventing serial correlation leakage), and the combinatorial path generation that produces a Sharpe distribution instead of a point estimate.

Without this skill, the model generates standard k-fold cross-validation or a simple train/test split, both of which leak information on financial time series. With it, the model generates CPCV over C(6,2) = 15 train/test splits (which recombine into 5 complete backtest paths), computes the quality criterion (oos_sharpe.mean() / oos_sharpe.std()), and returns the full distribution for human evaluation.
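The split generation can be sketched in a few lines. This is a simplified illustration, not the skill file's implementation. It assumes samples are evenly spaced and each label spans `label_horizon` steps forward, so purging and embargo reduce to index arithmetic. The function name `cpcv_splits` and its parameters are hypothetical.

```python
from itertools import combinations

def cpcv_splits(n_samples, n_groups=6, n_test_groups=2, label_horizon=0, embargo=0):
    """Yield (train_idx, test_idx) for every combination of test groups.

    Assumes sample i's label uses information from i through i + label_horizon.
    """
    bounds = [round(g * n_samples / n_groups) for g in range(n_groups + 1)]
    groups = [range(bounds[g], bounds[g + 1]) for g in range(n_groups)]
    for combo in combinations(range(n_groups), n_test_groups):
        test_idx = [i for g in combo for i in groups[g]]
        excluded = set(test_idx)
        for g in combo:
            start, stop = groups[g].start, groups[g].stop
            # Purge: earlier training samples whose labels reach into the test block
            excluded.update(range(max(0, start - label_horizon), start))
            # Purge + embargo: later samples overlapping test labels, plus a buffer
            excluded.update(range(stop, min(n_samples, stop + label_horizon + embargo)))
        train_idx = [i for i in range(n_samples) if i not in excluded]
        yield train_idx, test_idx

splits = list(cpcv_splits(n_samples=60, label_horizon=3, embargo=2))
first_train, first_test = splits[0]
```

Each split trains a fresh model on `train_idx` and scores it on `test_idx`, and the spread of those out-of-sample scores is what replaces the single point estimate.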

05 — SKILL_feature_engineering.md: Features With Economic Rationale

The feature engineering skill does two things: it specifies the technical implementation (cross-sectional z-scoring, fractional differentiation for stationarity, rolling window alignment), and it provides the economic justification for each feature family.
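For readers unfamiliar with fractional differentiation, here is a minimal fixed-window sketch. It assumes a plain list of values and a hand-picked number of weights. The function names are illustrative, and a production version would choose the window from a weight-decay tolerance.

```python
def fracdiff_weights(d, n_weights):
    """Weights of the binomial expansion of (1 - B)^d for lags 0..n_weights-1."""
    weights = [1.0]
    for k in range(1, n_weights):
        weights.append(-weights[-1] * (d - k + 1) / k)
    return weights

def fracdiff(series, d, n_weights):
    """Fixed-window fractional difference; the first n_weights - 1 points are dropped."""
    w = fracdiff_weights(d, n_weights)
    return [
        sum(w[k] * series[t - k] for k in range(n_weights))
        for t in range(n_weights - 1, len(series))
    ]

weights_half = fracdiff_weights(0.5, 4)   # slowly decaying weights keep some memory
weights_one = fracdiff_weights(1.0, 3)    # d = 1 recovers the ordinary first difference
first_diff = fracdiff([1.0, 2.0, 3.0, 4.0], d=1.0, n_weights=2)
```

A fractional d between 0 and 1 sits between the raw series, which keeps all memory but is usually non-stationary, and the first difference, which is stationary but discards most of the memory.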

Why does cross-sectional momentum z-scoring matter? Because raw momentum is contaminated by market-wide regime effects — in a bull market, everything has positive 20-day returns. Z-scoring strips the common component and isolates relative momentum. The model generates z-scored features by default because the skill explains why unnormalized momentum is the wrong input to a cross-sectional model.
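A minimal sketch of the z-scoring step, assuming a wide DataFrame with one row per date and one column per ticker. The function name and toy returns are illustrative. A date where every ticker has the same value would produce a zero standard deviation and needs separate handling.

```python
import pandas as pd

def cross_sectional_zscore(values: pd.DataFrame) -> pd.DataFrame:
    """Z-score each row (one date) across its columns (tickers)."""
    mean = values.mean(axis=1)
    std = values.std(axis=1, ddof=0)
    return values.sub(mean, axis=0).div(std, axis=0)

# Toy 20-day returns: on the second date every ticker is up 10 points more,
# but the relative ordering is identical.
mom_20d = pd.DataFrame(
    {"AAA": [0.01, 0.11], "BBB": [0.02, 0.12], "CCC": [0.03, 0.13]},
    index=pd.to_datetime(["2020-01-02", "2020-06-01"]),
)
mom_z = cross_sectional_zscore(mom_20d)
```

Both rows produce the same z-scores, which is the point: the market-wide component is gone and only relative momentum remains.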

06 — SKILL_position_sizing.md: Kelly, Meta-Probability, Investment Sizing

This is where the meta-label probability connects to portfolio construction. The skill encodes the Kelly criterion correctly — using meta-probability as the win probability, the profit target multiple as the reward, and the stop loss as the risk — and then applies the standard half-Kelly conservatism:

import numpy as np

def kelly_fraction(p, b, q=None):
    """
    Half-Kelly fraction; accepts scalars or NumPy arrays.
    p = probability of winning (from meta-model)
    b = profit target / stop loss ratio
    q = loss probability (defaults to 1 - p)
    """
    p = np.asarray(p, dtype=float)
    if q is None:
        q = 1 - p
    f = (p * b - q) / b  # Kelly formula
    return np.maximum(f / 2, 0)  # half-Kelly, floored at zero (element-wise)

# Position size for each event
meta_prob = meta_model.predict_proba(X_event)[:, 1]  # array, one probability per event
f_kelly = kelly_fraction(meta_prob, pt_sl_ratio)
position = np.minimum(f_kelly, max_position_size)  # cap at 2% per ticker

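The skill also encodes portfolio-level constraints: 2% per ticker, 20% per sector, and inverse-volatility weighting, which approximates equal risk contribution when correlations are similar. The model generates position sizing that is conservative by default. Without this skill, models default to equal-weight or unconstrained mean-variance optimization. Equal weighting ignores differences in risk and signal confidence; unconstrained mean-variance concentrates the portfolio in whatever the estimation error happens to favor.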
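One plausible way to apply the two caps is sketched below. The article does not say how the skill resolves a sector breach, so the pro-rata scaling here is an assumption, and the function name `apply_caps` and the toy tickers are hypothetical.

```python
def apply_caps(raw_weights, sectors, max_ticker=0.02, max_sector=0.20):
    """Clip each position at the ticker cap, then scale down any sector over its cap."""
    capped = {t: min(w, max_ticker) for t, w in raw_weights.items()}
    sector_total = {}
    for t, w in capped.items():
        sector_total[sectors[t]] = sector_total.get(sectors[t], 0.0) + w
    scaled = {}
    for t, w in capped.items():
        total = sector_total[sectors[t]]
        # Assumption: an over-limit sector is scaled pro rata, never scaled up
        scale = min(1.0, max_sector / total) if total > 0 else 1.0
        scaled[t] = w * scale
    return scaled

# Twelve tech names at 3% each breach both caps; one utility at 1% breaches neither.
raw = {f"TECH{i}": 0.03 for i in range(12)}
raw["UTIL0"] = 0.01
sectors = {t: ("Tech" if t.startswith("TECH") else "Utilities") for t in raw}

weights = apply_caps(raw, sectors)
tech_total = sum(w for t, w in weights.items() if sectors[t] == "Tech")
```

Applying the ticker cap first means the sector check sees realistic position sizes. The caps only ever shrink positions, so the result can leave part of the portfolio in cash.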

07 — SKILL_regime_detection.md: Market State Conditioning

Momentum strategies have strongly regime-dependent performance. They work in trending, low-vol environments and reverse violently in risk-off, high-vol regimes. The regime detection skill encodes three layers: a binary 200-day MA filter (long-only when the index is above its 200d MA), a vol-expansion scalar (reduce position sizes as realized vol expands), and an optional 2-state HMM for continuous regime probability.

The critical implementation detail encoded in this skill: the T+1 lag rule. Regime signals computed at the close of day t are applied at the open of day t+1. Using end-of-day regime data to make same-day trading decisions introduces look-ahead. The skill has code examples showing correct and incorrect lag alignment.
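The lag rule is easy to get wrong by one row, so here is a minimal sketch of the binary moving-average filter with the shift applied. It assumes a daily close series for the index. The function name is illustrative, and the toy example uses a 3-day window in place of 200 so the output can be checked by hand.

```python
import pandas as pd

def regime_filter(index_close: pd.Series, window: int = 200) -> pd.Series:
    """True on days when trading is allowed, using only the prior day's close."""
    ma = index_close.rolling(window).mean()
    above_ma = index_close > ma  # known only after the close of day t
    # WRONG: returning above_ma directly would use day t's close for day t's trades
    return above_ma.shift(1, fill_value=False)  # applied on day t+1

index_close = pd.Series(
    [100.0, 101.0, 102.0, 103.0, 99.0, 98.0, 105.0],
    index=pd.bdate_range("2020-01-01", periods=7),
)
tradable = regime_filter(index_close, window=3)
```

The index closes below its average on the fifth day, yet that day is still tradable because the decision was made from the fourth day's close. The block only takes effect on the sixth day.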

A Complete Research Session

Here's what a session looks like with the full CLAUDE.md loaded:

# You:
"I want to test whether earnings surprise momentum persists in the NDX.
Hypothesis: stocks with positive earnings surprises outperform over the next
20 days because analyst estimate revisions lag the surprise. Primary signal:
SUE (standardized unexpected earnings) z-score above 1.0."

# Claude Code generates:
# 1. PIT universe construction using ndx_pit_daily.parquet
# 2. Earnings surprise feature from quarterly EPS data, z-scored cross-sectionally
# 3. CUSUM-filtered events at SUE threshold crossings
# 4. Triple barrier labels: 3% profit target, 2% stop, 20-day expiry
# 5. Meta-features: vol_ratio, index breadth, sector concentration, prior SUE
# 6. RandomForestClassifier with uniqueness weights, class_weight='balanced'
# 7. CPCV over C(6,2)=15 splits with 5-day embargo
# 8. Quality criterion report: oos_sharpe.mean() / oos_sharpe.std()
# 9. Kelly-sized positions with 2%/20% caps
# 10. Regime overlay: only trade when NDX > 200d MA

The entire pipeline in one prompt. No methodology arguments, no bias corrections, no re-explaining CPCV. The model's job is implementing a correct research pipeline for the hypothesis you provided. Your job is providing a hypothesis that has a real economic mechanism behind it.

The skill files and the data, together

The NDX PIT Dataset includes all seven CLAUDE.md skill files plus 4,900 days × 265 tickers of survivorship-bias-free Nasdaq-100 data. The skills tell Claude how to use the data correctly. The data makes the methodology valid. Neither is complete without the other.

What the Model Cannot Do

CLAUDE.md solves the implementation problem. It does not solve the research problem.

The economic hypothesis has to come from you. Why does earnings surprise momentum persist? Is it analyst herding, institutional inertia, or earnings quality persistence? The answer determines which features you build and what failure mode you're testing against. The model can stress-test your hypothesis once you've articulated it — "what would need to be true about the data for this mechanism to work?" — but it cannot generate the mechanism from first principles.

The CPCV result interpretation is yours too. A quality ratio of 1.3 with 11 of 15 splits showing a positive out-of-sample Sharpe is ambiguous. Whether that's worth trading depends on transaction costs, correlation with your existing positions, the size of the opportunity, and your confidence in the economic mechanism. These are judgment calls that require context the model doesn't have.

This division of labor is the right one. The model handles the mechanical complexity of implementing a research pipeline correctly — which is genuinely complex and where errors are genuinely consequential. You handle the judgment — which is where skill and domain knowledge actually matter. Before CLAUDE.md, a significant fraction of research time went to debugging bias errors, correcting cross-validation, and arguing with the model about methodology. That overhead goes to near zero. The time goes back to the hypothesis.

The practical summary

What CLAUDE.md changes in quantitative research

  • Every generated backtest uses PIT universe — 208 bps bias eliminated at source
  • Triple barrier labels and meta-labeling by default — not binary classification
  • CPCV instead of standard k-fold — Sharpe distribution, not a point estimate
  • Kelly-based sizing with caps — conservative and risk-aware by default
  • Regime overlay applied correctly — T+1 lag prevents look-ahead
  • Cannot generate the economic hypothesis — that's the researcher's job
  • Cannot interpret CPCV results in context — human judgment required
  • Cannot know historical index membership without PIT data provided