Machine learning and AI can accelerate idea generation and feature discovery for quantitative research, but they also introduce risks: leakage, overfitting, and opaque decisions. This article gives a practical recipe for using ML and LLMs safely for alpha discovery, describes reproducible feature-discovery pipelines, shows robust cross-validation strategies for time-series, and lays out governance controls needed before promoting a model to production.
Quick checklist:
Start by separating two different activities:
A good pipeline isolates discovery artifacts (notebooks, prompts, ad-hoc features) from production artifacts (canonical datasets, feature engineering code, CI-tested training scripts). Discovery informs production: the output is a small, well-tested feature set or model candidate that passes objective criteria.
Success criteria for promoting a discovery to production:
LLMs are excellent at accelerating literature review, summarizing research, and proposing candidate features or transformations. Use them as a research assistant, not a black-box model for alpha.
Safe LLM uses:
Dangerous LLM uses:
Examples: LLM prompt templates
Prompt: feature brainstorming
You are a quant researcher. Given a price timeseries with OHLCV, suggest 12 candidate features that may capture short-term mean-reversion and momentum on intraday timescales. For each feature, describe expected behavior under volatility regimes, how to compute it in pandas, and potential pitfalls. Keep suggestions actionable.
Prompt: transform natural language into code
Task: Given a pandas DataFrame `df` with columns ['open','high','low','close','volume'], produce a function `add_features(df)` that adds 10 features: returns over 1/5/15 bars, rolling volatilities, RSI-14, and a 2-window signed-volume imbalance. Return only code.
Use LLM outputs as starting points. Always validate generated code and features on real historical data.
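As an illustration, here is one way the generated `add_features` function might look after human review. This is a sketch, not vetted production code: the exact windows, the simple-moving-average RSI variant, and the sign-of-close-change proxy for trade direction are all assumptions.

```python
import numpy as np
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add candidate features to a bar DataFrame with columns
    ['open', 'high', 'low', 'close', 'volume']."""
    out = df.copy()
    # returns over 1/5/15 bars
    for n in (1, 5, 15):
        out[f"ret_{n}"] = out["close"].pct_change(n)
    # rolling volatilities of 1-bar returns
    for n in (5, 15):
        out[f"vol_{n}"] = out["ret_1"].rolling(n).std()
    # RSI-14, simple-moving-average variant
    delta = out["close"].diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    out["rsi_14"] = 100 - 100 / (1 + gain / loss)
    # signed-volume imbalance over two windows, signing volume by bar direction
    signed_vol = out["volume"] * np.sign(out["close"].diff())
    for n in (5, 15):
        out[f"imb_{n}"] = signed_vol.rolling(n).sum() / out["volume"].rolling(n).sum()
    return out
```

Even for code this simple, check edge cases (zero volume, flat prices dividing by zero in RSI) against real historical bars before trusting the features.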
A disciplined feature discovery pipeline has stages:
Practical implementation sketch (Python pseudocode)
# 1. generate features (pseudo)
candidates = generate_candidates(prices)  # many transforms

# 2. screen by information coefficient (IC)
ics = {}
for f in candidates:
    ics[f] = pearson_ic(candidates[f], future_return)

# 3. filter by median IC and stability across folds
selected = [f for f in ics if abs(median_ic(f)) > 0.02 and std_ic(f) < 0.015]

# 4. check economic impact
for f in selected:
    backtest = run_walkforward_backtest(strategy=single_feature_strategy(f), costs=tc_model)
    if backtest.net_return < threshold:
        discard(f)
Key screening metrics:
Important: always condition on realistic execution assumptions. A feature with high raw IC but requiring massive turnover or unrealistic fills should be discarded.
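To make that concrete, here is a minimal sketch of a cost-adjusted screen. The linear cost model and the `cost_per_turnover` value are illustrative assumptions, not a real fill model.

```python
import pandas as pd

def cost_adjusted_return(positions: pd.Series, returns: pd.Series,
                         cost_per_turnover: float = 0.0005) -> float:
    """Net return of a position series after linear transaction costs.

    positions: target position in [-1, 1] per bar
    returns:   per-bar returns, aligned with positions
    """
    gross = (positions.shift(1) * returns).sum()   # position held into each bar
    turnover = positions.diff().abs().sum()        # total position change
    return gross - cost_per_turnover * turnover
```

A signal that flips position every bar can show positive gross return yet net out negative once turnover costs are charged, which is exactly the failure mode this check catches.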
Data leakage is the most common silent killer of research—models that seem great in-sample but fail out-of-sample because they indirectly encode future information.
Common leakage sources:
Cross-validation recommendations for time-series:
Purged K-Fold example (sketch using scikit-learn style)
from mlfinlab.cross_validation import PurgedKFold

# t1_times: Series mapping each sample's start time to its label end time
pkf = PurgedKFold(n_splits=5, samples_info_sets=t1_times, pct_embargo=0.01)
for train, test in pkf.split(X):
    model.fit(X.iloc[train], y.iloc[train])
    preds = model.predict(X.iloc[test])
    evaluate(preds, y.iloc[test])
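If mlfinlab is unavailable, a small stand-in can at least enforce an embargo gap between train and test. Note this is a rough approximation of purging for labels spanning up to `embargo` bars, not a full label-overlap purge, and the split layout is an assumption.

```python
import numpy as np

def purged_walkforward_splits(n_samples: int, n_splits: int = 5, embargo: int = 15):
    """Yield (train_idx, test_idx) pairs where an embargo gap keeps
    labels that span up to `embargo` bars from leaking into training."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        test_start = k * fold
        train_idx = np.arange(0, max(test_start - embargo, 0))
        test_idx = np.arange(test_start, min(test_start + fold, n_samples))
        yield train_idx, test_idx
```

Each training window ends at least `embargo` bars before its test window begins, so a label computed at the end of training never overlaps the test period.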
Also do these checks:
Productionization requires governance. For each candidate model/feature promoted, maintain:
Example model card fields:
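Since the exact fields depend on your governance policy, here is an illustrative model card record; every value below is a placeholder, not a recommendation.

```python
import json

# Illustrative model card for a promoted signal; adapt fields and
# values to your own governance policy.
model_card = {
    "name": "intraday_mean_reversion_v1",
    "owner": "quant-research",
    "training_data": {"source": "equity_futures_bars",
                      "start": "2018-01-01", "end": "2023-12-31"},
    "validation": {"cv": "purged_kfold", "median_ic": 0.025, "oos_sharpe": 1.1},
    "assumptions": ["taker fee 5 bps", "no partial fills"],
    "limits": {"max_gross_exposure": 1.0, "kill_switch_drawdown": 0.05},
    "review_date": "2026-01-01",
}
print(json.dumps(model_card, indent=2))
```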
Scenario: intraday mean-reversion signal on equity futures.
Steps and code sketches:
# compute IC over rolling windows
window = 252 * 6  # number of intraday bars in the screening window
ics = rolling_ic(feature, future_return, window=window)
median_ic = np.nanmedian(ics)
std_ic = np.nanstd(ics)
Reject features with a low median_ic or a high std_ic.
# run an event-driven backtest with slippage
bt = Backtester(strategy=SignalStrategy(feature), costs=CostModel(taker=0.0005, maker=-0.0001))
results = bt.run(start, end)
print(results.net_return, results.sharpe)
LLMs speed up ideation and boilerplate code generation. Use these guardrails:
Prompt examples (for authorship and reproducibility):
Prompt ID: brainstorm/mean_rev_001
Model: gpt-4o-research
Prompt: "Given an intraday orderbook snapshot and time-series of trades, propose 8 candidate imbalance features that could indicate short-term mean reversion. For each, provide a one-line computation and expected behavior under high/low volatility. Output JSON."
Prompt ID: codegen/rsi_imbalance
Model: gpt-4o-code
Prompt: "Write a function `compute_signed_imbalance(trades_df)` that computes signed volume imbalance over 1/5/15-bar windows. trades_df has columns [timestamp, price, size, side]. Use pandas and numpy. Include docstring and unit test."
Store prompt, model used, and LLM output in experiment metadata alongside run results.
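One minimal way to do that is a structured record per LLM interaction; the schema below is an assumption, so adapt it to whatever registry you use.

```python
import hashlib
import time

def record_llm_artifact(prompt_id: str, model: str, prompt: str, output: str) -> dict:
    """Build an experiment-metadata record tying an exact prompt and its
    output to downstream run results."""
    return {
        "prompt_id": prompt_id,
        "model": model,
        # hash lets you detect prompt drift without diffing full text
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "prompt": prompt,
        "output": output,
        "logged_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
```

Storing the hash alongside the full prompt makes it cheap to verify later that a rerun used the exact same prompt text.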
Track experiments and artifacts using a registry (MLflow, DVC or in-house tooling). For each experiment store:
Automate regression tests: when data or feature code changes, re-run historical tests and compare key metrics to detect regressions.
In production, monitor:
Implement alerts on meaningful drifts (e.g., IC drop > 30% vs baseline or sudden increase in missing features).
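The IC-drop rule above can be encoded as a simple check; the threshold semantics (relative drop versus a stored baseline) are illustrative.

```python
def ic_drift_alert(current_ic: float, baseline_ic: float,
                   max_drop: float = 0.30) -> bool:
    """Return True when live IC has decayed more than max_drop
    (e.g. 0.30 = 30%) relative to the recorded baseline IC."""
    if baseline_ic == 0:
        return True  # no meaningful baseline: flag for review
    return (baseline_ic - current_ic) / abs(baseline_ic) > max_drop
```

In practice this runs on a rolling live IC estimate and feeds the same alerting channel as missing-feature and distribution-drift checks.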
This is a minimal example showing how to compute IC and run a rolling-stability filter. It's intentionally small — adapt and productionize with careful data handling.
import numpy as np
import pandas as pd

def pearson_ic(series, future_returns):
    return series.corr(future_returns)

def rolling_ic(feature_series: pd.Series, future_returns: pd.Series, window: int = 252):
    # align indexes and drop rows where either side is missing
    combined = pd.concat([feature_series, future_returns], axis=1).dropna()
    f = combined.iloc[:, 0]
    r = combined.iloc[:, 1]
    ics = []
    for i in range(window, len(combined)):
        ics.append(f.iloc[i - window:i].corr(r.iloc[i - window:i]))
    return np.array(ics)

# usage
# ic_values = rolling_ic(feature['f1'], returns.shift(-horizon), window=252)
# median_ic = np.nanmedian(ic_values)
ML and LLMs are powerful accelerants for alpha discovery when used with discipline. Use LLMs for ideation and boilerplate, but always follow reproducible pipelines, robust cross-validation, and governance. Promote only those discoveries that survive rigorous economic and stability checks.
The NordVarg Team builds software at NordVarg, specializing in high-performance financial systems and type-safe programming.