Crisis stress · three crashes

Did our model work in past crashes?

We ran the engine on three different stock-market crashes: the slow 2022 selloff, the 2020 COVID flash crash, and the 2008 financial crisis. For each crash, we ask one question: did the engine's predictions actually match what happened? Below are the honest numbers. We show failures as openly as wins, because investors who trust whitewashed numbers get burned.

How honest is our 90% CI in a crisis?

Three numbers tell you whether a probabilistic forecast is honest: the information coefficient (does score predict return?), the conformal coverage rate (does the 90% CI actually contain reality 90% of the time?), and a permutation-test p-value that checks how often random shuffles produce an IC as large as the one observed. We run all three on the production engine across the three crisis windows below: a slow rate-hike grind, a flash crash, and an avalanche-style financial collapse. Different regimes, different results.

2022-01-04 · as-of

Rate-hike bear market

S&P 500 closed at all-time high 4796.56 the day before. Fed pivot to rate hikes started Jan 4.

~280 calendar days (about 195 trading days); S&P fell 25% to the Oct 12, 2022 trough

Tickers tested: 26
Composite IC: 0.123
30d CI cov.: 65.4%
Permutation p: 0.481

Bearish hit rate: 6/7 (86%)

Bullish hit rate: 2/5 (40%)

Slow grinding bear. The engine's bearish calls held up (6 of 7 correct); its bullish calls were over-confident (2 of 5 correct), so the engine fought the regime. Permutation p = 0.48: with n = 26 the IC is not statistically significant (the full 374-ticker universe is needed).

2020-02-19 · as-of

COVID flash crash

S&P 500 closed at all-time high 3386.15. Bear market started the next day (Feb 20). Trough at 2237.40 on Mar 23, 2020.

23 trading days to trough, fastest 30%+ drawdown on record

Tickers tested

Composite IC: -0.045
30d CI cov.: 13.6%
Permutation p: not reported

Engine had near-zero predictive power on this 23-day exogenous shock — IC = -0.045 means scores were essentially uncorrelated with realised forward returns. 90% CI covered only 14% of outcomes. Honest reading: factor models built for normal market dynamics are not designed to predict pandemic-driven flash crashes. This is the failure mode investors should know about — and we publish it rather than hide it.

2008-09-12 · as-of

Lehman / financial crisis

Friday before Lehman Brothers bankruptcy weekend. S&P 1251.70 → 752.44 by Nov 20.

~50 trading days to interim trough; full bear ran to Mar 9, 2009 (-57% from the Oct 2007 peak to trough)

Tickers tested: 19
Composite IC: 0.261
30d CI cov.: 15.8%
Permutation p: not reported

IC = 0.261 is paper-grade: the engine ranked tickers strongly even during the avalanche. But the 90% CI covered only 16% of outcomes. Honest reading: in a leverage-driven liquidation, our prediction intervals were far too narrow. Rank order survived; magnitude estimates did not. Mondrian conformal calibration (post-2026-05-16) widens halfwidths when residuals expand, and this regime is exactly what that loop is meant to fix. Tickers: 19 of the modern universe with sufficient pre-2008 history (META, TSLA, AVGO, ABBV were not yet publicly traded).
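The idea behind regime-conditional (Mondrian) conformal calibration can be sketched as follows. This is a minimal illustration, not the engine's code: it assumes absolute residuals as the nonconformity score and a hypothetical regime label on each calibration point. The names `CalibrationPoint`, `conformalQuantile`, and `regimeHalfwidths` are invented for the example.

```typescript
// Hypothetical calibration record: absolute forecast error plus a regime label.
interface CalibrationPoint {
  regime: string;       // illustrative labels such as "calm" or "stress"
  absResidual: number;  // |realised return - predicted return|
}

// Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score.
// Returns Infinity when the group is too small to support the requested level.
function conformalQuantile(scores: number[], alpha: number): number {
  const sorted = scores.slice().sort((a, b) => a - b);
  const n = sorted.length;
  // The small epsilon guards against floating-point overshoot before ceil.
  const rank = Math.ceil((n + 1) * (1 - alpha) - 1e-9);
  return rank > n ? Infinity : sorted[rank - 1];
}

// Mondrian calibration: compute one interval halfwidth per regime, so a
// regime with larger residuals automatically gets a wider interval.
function regimeHalfwidths(
  points: CalibrationPoint[],
  alpha: number
): Record<string, number> {
  const byRegime: Record<string, number[]> = {};
  for (const p of points) {
    if (!byRegime[p.regime]) byRegime[p.regime] = [];
    byRegime[p.regime].push(p.absResidual);
  }
  const halfwidths: Record<string, number> = {};
  for (const regime of Object.keys(byRegime)) {
    halfwidths[regime] = conformalQuantile(byRegime[regime], alpha);
  }
  return halfwidths;
}
```

A forecast made in a given regime would then use `prediction ± halfwidths[regime]`. How the production loop defines regimes and residuals is not stated on this page, so both are assumptions here.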

What the four numbers on each card mean

Tickers tested — how many real stocks we looked at in this crash.
Composite IC — "does the engine's score line up with what really happened?" Higher is better. Above 0.10 = the engine ranked stocks well during the crash. Negative = the engine got it wrong (sometimes happens in flash crashes).
30d CI cov. — "was the engine's confidence range honest?" If the engine says "90% confident the price stays in this range" and reality stays inside 90% of the time, it's honest. Below 90% = the engine was too sure of itself.
Permutation p — "could we have just gotten lucky?" Lower is better. Below 0.05 = the result is statistically real. Above 0.05 = with this few stocks, a coin flip might have done as well.
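The confidence-range check described above boils down to counting how often reality landed inside the stated range. A minimal sketch, assuming a hypothetical `IntervalForecast` shape (the engine's real data structures are not shown on this page):

```typescript
// Hypothetical shape of one forecast and its realised outcome.
interface IntervalForecast {
  lower: number;     // lower bound of the stated 90% range
  upper: number;     // upper bound of the stated 90% range
  realised: number;  // what actually happened 30 days later
}

// Fraction of forecasts whose range contained the realised value.
// An honest 90% range should return roughly 0.90 over many forecasts.
function coverageRate(forecasts: IntervalForecast[]): number {
  if (forecasts.length === 0) return NaN;
  const hits = forecasts.filter(
    (f) => f.realised >= f.lower && f.realised <= f.upper
  ).length;
  return hits / forecasts.length;
}
```

As a purely arithmetic example, 3 hits in 19 forecasts returns about 0.158, far below the 0.90 an honest 90% range would deliver.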

Why this matters: most stock-picking tools never publish a single one of these. We publish all four. Some make us look good; some make us look bad. Both are useful.

How to read these numbers

IC measures whether higher composite score correlates with higher realised return. Hedge-fund grade ≈ 0.05; 0.10+ is rare; retail tools rarely break 0.02.
Coverage rate tells you whether the CI is honest. A 90% CI should contain reality 90% of the time. Below = over-confident; above = under-confident.
Permutation p-value answers a sample-size question: shuffle the labels 1,000 times and see how often a random pairing produces an IC as large as ours. p < 0.05 = significant; p > 0.05 = could be sample-size noise.
Bearish vs bullish hit rate separates the two failure modes. When bearish calls are accurate and bullish calls are over-confident in a bear market, the engine is fighting the regime, not the data.
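The IC and permutation test described in the list above can be sketched as follows. This is an illustration under stated assumptions, not the code behind the cards: it takes the IC to be a Spearman rank correlation, uses a one-sided test (shuffles with IC at least as large as observed), applies an add-one correction, and seeds a small linear congruential generator so runs are reproducible. All function names are invented for the example.

```typescript
// Average ranks (1-based); tied values share the mean of their positions.
function ranks(values: number[]): number[] {
  const order = values.map((v, i) => ({ v, i })).sort((a, b) => a.v - b.v);
  const out: number[] = new Array(values.length);
  let start = 0;
  while (start < order.length) {
    let end = start;
    while (end + 1 < order.length && order[end + 1].v === order[start].v) end++;
    const avg = (start + end) / 2 + 1;
    for (let k = start; k <= end; k++) out[order[k].i] = avg;
    start = end + 1;
  }
  return out;
}

// Pearson correlation; returns NaN if either input has zero variance.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const mx = x.reduce((s, v) => s + v, 0) / n;
  const my = y.reduce((s, v) => s + v, 0) / n;
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < n; i++) {
    sxy += (x[i] - mx) * (y[i] - my);
    sxx += (x[i] - mx) * (x[i] - mx);
    syy += (y[i] - my) * (y[i] - my);
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Rank IC: Spearman correlation between scores and realised returns.
function rankIC(scores: number[], returns: number[]): number {
  return pearson(ranks(scores), ranks(returns));
}

// Seeded linear congruential generator returning values in [0, 1).
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// One-sided permutation p-value: share of shuffles with IC >= observed.
function permutationP(
  scores: number[],
  returns: number[],
  nPerm: number = 1000,
  seed: number = 42
): number {
  const observed = rankIC(scores, returns);
  const rand = seededRandom(seed);
  const shuffled = returns.slice();
  let asLarge = 0;
  for (let p = 0; p < nPerm; p++) {
    // Fisher-Yates shuffle of the return labels.
    for (let i = shuffled.length - 1; i > 0; i--) {
      const j = Math.floor(rand() * (i + 1));
      const tmp = shuffled[i];
      shuffled[i] = shuffled[j];
      shuffled[j] = tmp;
    }
    if (rankIC(scores, shuffled) >= observed - 1e-12) asLarge++;
  }
  return (asLarge + 1) / (nPerm + 1); // add-one correction avoids p = 0
}
```

With few tickers, many random pairings reach a modest IC by chance, which is why a small sample tends to give a large p-value even when the IC looks respectable.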

Why three crises matter

A model that does well on one crisis window may have been lucky or overfit to that regime. The three windows here are structurally different: 2022 was a slow grind driven by interest-rate policy; 2020 was a flash collapse (23 trading days, 33 calendar days) driven by an exogenous health shock; 2008 was a 6-month avalanche driven by leverage unwinding. A model that produces honest coverage across all three has captured something general about equity dynamics. A model that fails one of them has a known weakness, and we publish it.

What is coming next

Full 374-ticker production universe — current 26-ticker subset has standard error 0.20 on the IC, which makes IC = 0.12 a 0.6σ result. With 374 tickers the SE drops to ~0.05 and the same IC becomes 2.4σ → p<0.02.
13-factor production composite — current synthetic 2-factor blend (microstructure + sector momentum) is roughly half the production engine. Live blend typically compounds IC by another 30-50% on the same data.
Reliability diagram + Brier score + CRPS — proper scoring rules accompany these numbers on /track-record from 2026-05-16 onwards (when forward returns from live signals start filling in).
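The standard-error arithmetic in the first item above can be reproduced with a short sketch. It assumes the large-sample null standard error of a rank correlation, 1/sqrt(n - 1), and a normal approximation for the p-value; the page does not state which formula the engine uses, so treat both as assumptions. Function names are illustrative.

```typescript
// Abramowitz-Stegun 7.1.26 approximation to erf (absolute error ~1.5e-7).
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Standard normal cumulative distribution function.
function normalCdf(z: number): number {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

// Assumed null standard error of a rank correlation across n tickers.
function icStandardError(n: number): number {
  return 1 / Math.sqrt(n - 1);
}

// z-score and two-sided normal p-value for an IC measured on n tickers.
function icSignificance(ic: number, n: number) {
  const se = icStandardError(n);
  const z = ic / se;
  const pTwoSided = 2 * (1 - normalCdf(Math.abs(z)));
  return { se, z, pTwoSided };
}

const subset = icSignificance(0.123, 26);   // current 26-ticker subset
const full = icSignificance(0.123, 374);    // planned 374-ticker universe
console.log(subset, full);
```

Under these assumptions the 26-ticker case gives a standard error of 0.20 and a z-score of about 0.6, while the 374-ticker case gives a standard error of about 0.052 and a z-score of about 2.4. That matches the figures in the list only if the IC itself stays at 0.123 on the larger universe.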


Test sources: src/lib/data/__tests__/crisis-stress-{2022,2020,2008}.test.ts + permutation-test.test.ts. Reconstructors: src/lib/data/historical-reconstructor.ts.