Crisis stress · three crashes
We ran the engine on three different stock-market crashes: the slow 2022 selloff, the 2020 COVID flash crash, and the 2008 financial crisis. For each crash, we ask one question: did the engine's predictions actually match what happened? Below are the honest numbers. We show failures as openly as wins, because investors who trust whitewashed numbers get burned.
Three numbers tell you whether a probabilistic forecast is honest: the information coefficient (does score predict return?), the conformal coverage rate (does the 90% CI actually contain reality 90% of the time?), and a permutation-test p-value contextualising the IC against random shuffles. We run all three on the production engine across three crisis windows below — slow rate-hike grind vs. flash crash vs. avalanche financial collapse. Different regimes, different results.
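As a rough illustration of the first number, the information coefficient can be computed as a rank correlation between scores and realised forward returns. The sketch below assumes a Spearman-style rank IC; the production engine may define it differently, and the names `averageRanks`, `pearson`, and `informationCoefficient` are hypothetical.

```typescript
// Assign 1-based ranks; tied values share the average of their positions.
function averageRanks(values: number[]): number[] {
  const order = values.map((v, i) => ({ v, i })).sort((a, b) => a.v - b.v);
  const ranks = new Array<number>(values.length);
  let start = 0;
  while (start < order.length) {
    let end = start;
    while (end + 1 < order.length && order[end + 1].v === order[start].v) end++;
    const avg = (start + end) / 2 + 1; // mean of positions start+1 .. end+1
    for (let k = start; k <= end; k++) ranks[order[k].i] = avg;
    start = end + 1;
  }
  return ranks;
}

// Pearson correlation; returns NaN if either input has zero variance.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const mx = x.reduce((s, v) => s + v, 0) / n;
  const my = y.reduce((s, v) => s + v, 0) / n;
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < n; i++) {
    const dx = x[i] - mx;
    const dy = y[i] - my;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Rank IC (assumed definition): Spearman correlation between scores and returns.
function informationCoefficient(scores: number[], forwardReturns: number[]): number {
  if (scores.length !== forwardReturns.length || scores.length < 2) {
    throw new Error("scores and forwardReturns must have equal length >= 2");
  }
  return pearson(averageRanks(scores), averageRanks(forwardReturns));
}
```

An IC near +1 means higher scores lined up with higher realised returns, an IC near 0 means no relationship, and a negative IC means the ranking was inverted.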
S&P 500 closed at all-time high 4796.56 the day before. Fed pivot to rate hikes started Jan 4.
~280 calendar days (about 195 trading days); the S&P fell 25% to its Oct 12, 2022 trough
Slow, grinding bear market. The engine's bearish calls were well calibrated, but its bullish calls were over-confident: the engine fought the regime. Permutation p = 0.48: with n = 26 tickers the IC is not statistically significant (the full 374-ticker universe is needed).
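To make the permutation p-value concrete, here is a minimal sketch of how such a test can be run: shuffle the realised returns many times, recompute the statistic each time, and count how often a random shuffle is at least as extreme as the observed value. The two-sided comparison, the default of 999 shuffles, the Pearson stand-in statistic, and all function names are assumptions for illustration, not the contents of permutation-test.test.ts.

```typescript
type Statistic = (x: number[], y: number[]) => number;

// Pearson correlation, used here as a stand-in for the engine's IC.
function correlation(x: number[], y: number[]): number {
  const n = x.length;
  const mx = x.reduce((s, v) => s + v, 0) / n;
  const my = y.reduce((s, v) => s + v, 0) / n;
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < n; i++) {
    sxy += (x[i] - mx) * (y[i] - my);
    sxx += (x[i] - mx) ** 2;
    syy += (y[i] - my) ** 2;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Fisher-Yates shuffle on a copy, leaving the input untouched.
function shuffled<T>(items: T[], rng: () => number): T[] {
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    const tmp = out[i];
    out[i] = out[j];
    out[j] = tmp;
  }
  return out;
}

// Two-sided permutation p-value with the usual +1 correction,
// so the result is never exactly zero.
function permutationPValue(
  scores: number[],
  forwardReturns: number[],
  nPermutations: number = 999,
  statistic: Statistic = correlation,
  rng: () => number = Math.random
): number {
  const observed = Math.abs(statistic(scores, forwardReturns));
  let atLeastAsExtreme = 0;
  for (let p = 0; p < nPermutations; p++) {
    const permuted = Math.abs(statistic(scores, shuffled(forwardReturns, rng)));
    if (permuted >= observed - 1e-12) atLeastAsExtreme++;
  }
  return (atLeastAsExtreme + 1) / (nPermutations + 1);
}
```

Read this way, p = 0.48 says that roughly half of all random shuffles produced an IC at least as large in magnitude as the one observed, which is why the 2022 result cannot be distinguished from chance at n = 26.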
S&P 500 closed at all-time high 3386.15. Bear market started the next day (Feb 20). Trough at 2237.40 on Mar 23, 2020.
23 trading days to trough, fastest 30%+ drawdown on record
Engine had near-zero predictive power on this 23-day exogenous shock — IC = -0.045 means scores were essentially uncorrelated with realised forward returns. 90% CI covered only 14% of outcomes. Honest reading: factor models built for normal market dynamics are not designed to predict pandemic-driven flash crashes. This is the failure mode investors should know about — and we publish it rather than hide it.
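The 14% figure is an empirical coverage rate: the share of realised outcomes that landed inside the engine's 90% intervals. A minimal sketch, assuming each prediction carries hypothetical `lower` and `upper` bounds on the forward return:

```typescript
interface PredictionInterval {
  lower: number; // lower bound of the 90% interval (assumed field name)
  upper: number; // upper bound of the 90% interval (assumed field name)
}

// Fraction of realised values that fall inside their interval, bounds included.
function coverageRate(intervals: PredictionInterval[], realised: number[]): number {
  if (intervals.length !== realised.length || intervals.length === 0) {
    throw new Error("intervals and realised must be non-empty and equal in length");
  }
  let covered = 0;
  for (let i = 0; i < intervals.length; i++) {
    if (realised[i] >= intervals[i].lower && realised[i] <= intervals[i].upper) {
      covered++;
    }
  }
  return covered / intervals.length;
}
```

A well-calibrated 90% interval should return a value close to 0.9 over a large enough sample; values far below that mean the intervals were too narrow for the regime.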
Friday before Lehman Brothers bankruptcy weekend. S&P 1251.70 → 752.44 by Nov 20.
~50 trading days to interim trough; full bear ran to Mar 9, 2009 (-57% peak-to-trough)
IC = 0.261 is publication-grade: the engine ranked tickers strongly even during the avalanche. But the 90% CI covered only 16% of outcomes. Honest reading: in a leverage-driven liquidation, our prediction intervals were far too narrow. Rank order survived; magnitude estimates did not. Mondrian conformal calibration (post-2026-05-16) widens halfwidths when residuals expand, and this is exactly the regime that loop is meant to fix. Tickers: 19 of the modern universe with sufficient pre-2008 history (META, TSLA, AVGO, and ABBV were not yet listed).
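For readers unfamiliar with the term, Mondrian conformal calibration computes the interval halfwidth separately for each category of observations, so a category with larger residuals gets wider intervals. The sketch below shows the general split-conformal recipe applied per group. The grouping by a regime label, the field names, and the function name are assumptions for illustration and are not taken from the engine's code.

```typescript
interface CalibrationPoint {
  group: string;    // hypothetical category label, e.g. "calm" or "crisis"
  residual: number; // realised return minus predicted return
}

// Per-group conformal halfwidth: the k-th smallest absolute residual,
// where k = ceil((n + 1) * (1 - alpha)). Returns Infinity when a group
// has too few calibration points to support the requested coverage.
function mondrianHalfwidths(
  points: CalibrationPoint[],
  alpha: number
): Record<string, number> {
  const byGroup: Record<string, number[]> = {};
  for (const p of points) {
    if (!byGroup[p.group]) byGroup[p.group] = [];
    byGroup[p.group].push(Math.abs(p.residual));
  }

  const halfwidths: Record<string, number> = {};
  for (const group of Object.keys(byGroup)) {
    const sorted = byGroup[group].slice().sort((a, b) => a - b);
    const n = sorted.length;
    // Small epsilon guards against floating-point overshoot before ceil.
    const k = Math.ceil((n + 1) * (1 - alpha) - 1e-9);
    halfwidths[group] = k > n ? Infinity : sorted[k - 1];
  }
  return halfwidths;
}
```

With `alpha = 0.1`, each group's interval is the point prediction plus or minus that group's halfwidth. When residuals in one group grow, only that group's halfwidth grows with them, which is the widening behaviour described above.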
Why this matters: most stock-picking tools never publish a single one of these. We publish all four. Some make us look good; some make us look bad. Both are useful.
A model that does well on one crisis window may have been lucky or overfit to that regime. The three windows here are structurally different: 2022 was a slow grind driven by interest-rate policy; 2020 was a flash collapse over 33 calendar days (23 trading days) driven by an exogenous health shock; 2008 was a 6-month avalanche driven by leverage unwinding. A model that produces honest coverage across all three has captured something general about equity dynamics. A model that fails one of them has a known weakness, and we publish it.
src/lib/data/__tests__/crisis-stress-{2022,2020,2008}.test.ts + permutation-test.test.ts. Reconstructors: src/lib/data/historical-reconstructor.ts.