Phase 3 clinical trial failure prediction — the Bayesian quant framework
How a Bayesian model built on industry base rates and live trial signals quantifies Phase 3 failure probability — with worked example, citations, and DeepVane's integration into equity scoring.
Phase 3 clinical trials fail about 42% of the time. That single number, the base rate, is the most important anchor in biotech investing, and the one most often ignored when a Wall Street analyst writes a bullish initiation note. In this post we walk through the Bayesian framework DeepVane uses to update that base rate with trial-specific evidence, what kinds of signals carry real weight, and how the resulting posterior feeds directly into our equity scoring.
The base rate
Hay et al. 2014 (Clinical Development Success Rates for Investigational Drugs, Nature Biotechnology) compiled FDA data from 2003-2013 across 10,000+ trials. Phase 3 success rate: 58%, failure rate: 42%. Subsequent updates from BioMedTracker and DiMasi show the rate is remarkably stable — it varies by therapeutic area but not by year.
- Oncology: ~45% failure (higher)
- Cardiovascular: ~45% failure
- Neurology: ~50% failure (highest)
- Infectious disease: ~36% failure (lowest)
- Metabolic / endocrine: ~38% failure
The first honest question in any biotech investment thesis: what's my prior P(failure)? For most retail analyses the answer is "the literature says 42%". Everything else is updating that prior with evidence specific to this trial.
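As a minimal sketch, the base rates above can be held in a small lookup that returns the prior for a therapeutic area and falls back to the overall 42% when the area is not listed. The dictionary name, the keys, and the fallback behaviour are illustrative assumptions, not DeepVane's actual code.

```python
# Approximate Phase 3 failure rates as quoted in this post.
# Names and keys are hypothetical, chosen for illustration only.
PHASE3_FAILURE_BASE_RATE = {
    "overall": 0.42,
    "oncology": 0.45,
    "cardiovascular": 0.45,
    "neurology": 0.50,
    "infectious_disease": 0.36,
    "metabolic_endocrine": 0.38,
}


def prior_failure(area: str) -> float:
    """Return the prior P(failure) for an area, or the overall rate if unlisted."""
    return PHASE3_FAILURE_BASE_RATE.get(area, PHASE3_FAILURE_BASE_RATE["overall"])


if __name__ == "__main__":
    for area in ("neurology", "infectious_disease", "dermatology"):
        print(f"{area}: prior P(failure) = {prior_failure(area):.2f}")
```

Falling back to the overall rate is one reasonable default; a stricter pipeline might refuse to score an unlisted area at all.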
The Bayesian update mechanism
We don't need a complicated model — we need the right one. For each signal s_i we observe (enrollment anomaly, endpoint amendment, mechanism risk, cash burn), we compute a log-likelihood ratio:
LLR_i = log( P(s_i | trial fails) / P(s_i | trial succeeds) )
The posterior log-odds of failure is the prior log-odds plus the sum of observed signal LLRs. Converting back to probability:
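P(fail | signals) = sigmoid( log(prior / (1-prior)) + Σ LLR_i )
This is Bayes' rule in log-odds form, with the signals treated as conditionally independent given the outcome so that their LLRs add. Nothing more. The edge is in picking signals whose LLR is backed by a published study, not in the algebra.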
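The update can be sketched in a few lines of Python. This is a minimal illustration of the formula, assuming natural logarithms and conditionally independent signals; the function name and input validation are our own choices for the example, not a description of the production engine.

```python
import math
from typing import Iterable


def posterior_failure(prior: float, llrs: Iterable[float]) -> float:
    """Posterior P(failure) from a prior and the LLRs of the signals that fired.

    Assumes natural-log LLRs that are additive (signals conditionally
    independent given the trial outcome).
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    log_odds = math.log(prior / (1.0 - prior)) + sum(llrs)
    # Logistic sigmoid maps log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    # With no signals firing, the posterior is just the prior.
    print(round(posterior_failure(0.42, []), 2))
```

With an empty signal list the function returns the prior unchanged, which is a useful sanity check on any implementation.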
Which signals actually carry weight
The five signal families DeepVane tracks, with the approximate LLR weight each contributes:
1. Enrollment velocity (LLR ≈ 0.6 when firing)
Trials that take 50%+ longer than their originally posted enrollment window fail about twice as often as on-pace ones (Carlisle et al. 2015). The mechanism is straightforward: enrollment delays usually reflect eligibility criteria that are too narrow, a patient population that isn't responding, or site-level operational issues. We pull the original and current expected enrollment counts from ClinicalTrials.gov and flag trials where the two diverge by more than 1.5×.
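A minimal sketch of a 1.5× flag, assuming the comparison is made on enrollment window length (days from trial start to the expected end of enrollment). The function names, the date inputs, and the choice of window length as the compared quantity are illustrative assumptions, not ClinicalTrials.gov field names or DeepVane's actual pipeline.

```python
from datetime import date


def enrollment_slippage(start: date, original_end: date, current_end: date) -> float:
    """Ratio of the currently projected enrollment window to the original one."""
    planned_days = (original_end - start).days
    projected_days = (current_end - start).days
    if planned_days <= 0:
        raise ValueError("original_end must be after start")
    return projected_days / planned_days


def enrollment_signal_fires(start: date, original_end: date, current_end: date,
                            threshold: float = 1.5) -> bool:
    """True when the window has stretched to at least `threshold` times the plan."""
    return enrollment_slippage(start, original_end, current_end) >= threshold


if __name__ == "__main__":
    fired = enrollment_signal_fires(date(2022, 1, 1), date(2023, 1, 1), date(2023, 9, 1))
    print("enrollment signal fires:", fired)
```

Keeping the ratio and the threshold test separate makes it easy to log the raw slippage alongside the binary flag.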
2. Endpoint amendments mid-trial (LLR ≈ 1.1 when firing)
Sponsors who amend the primary endpoint after patient enrollment has started are disproportionately those seeing their original endpoint fail. Not every amendment is bad — some reflect regulatory guidance updates — but the base rate of failure on amended-endpoint trials runs 65-70%. This is one of the strongest signals in the literature.
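One way to sanity-check a weight like this: because posterior odds equal prior odds times the likelihood ratio, the LLR implied by an observed failure rate is the difference between two log-odds. The sketch below applies that identity to the 65-70% range, assuming the overall 42% base rate as the comparison point; the function name is hypothetical.

```python
import math


def implied_llr(failure_rate_with_signal: float, base_rate: float) -> float:
    """LLR implied by a conditional failure rate relative to a base rate.

    Uses: logit(P(fail | signal)) - logit(P(fail)) = log-likelihood ratio.
    """
    def logit(p: float) -> float:
        return math.log(p / (1.0 - p))

    return logit(failure_rate_with_signal) - logit(base_rate)


if __name__ == "__main__":
    low = implied_llr(0.65, 0.42)
    high = implied_llr(0.70, 0.42)
    print(f"implied LLR range: {low:.2f} to {high:.2f}")
```

Under these assumptions the implied range brackets the ≈ 1.1 weight quoted above, so the stated LLR and the stated failure rate are consistent with each other.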
3. Mechanism of action risk (LLR ≈ 0.7 when firing)
This signal is searchable on PubMed: for each indication + intervention pair, we query historical failure rates of the same mechanism class. A new NMDA receptor antagonist for depression? Prior class failure rate is 78%. A GLP-1 agonist for obesity? Prior class failure rate is ~15%. Class history dominates trial-specific randomness.
4. Cash burn vs trial remaining (LLR ≈ 0.4 when firing)
When a small-cap biotech runs out of cash before trial readout, it either dilutes aggressively or abandons the trial. Both are bad outcomes for equity holders regardless of whether the underlying science works. We flag trials with < 6 months of primary-completion lead time at fewer than 15 sites as high cash-burn risk.
5. Principal investigator departure (LLR ≈ 0.9 when firing)
When the named PI on a trial leaves mid-study, the subsequent failure rate runs 60%+. Harder to scrape cleanly (PubMed co-author graphs work partially), but when confirmable it's a very strong signal.
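The five weights above can be collected into a single table, with the total LLR computed as the sum over whichever signals are firing. This is a sketch under the assumption, consistent with the worked example in this post, that a signal that is not firing contributes 0; the key names are hypothetical.

```python
# Approximate LLR contributed by each signal family when it fires,
# as quoted in this post. Key names are illustrative only.
SIGNAL_LLR = {
    "enrollment_velocity": 0.6,
    "endpoint_amendment": 1.1,
    "mechanism_risk": 0.7,
    "cash_burn": 0.4,
    "pi_departure": 0.9,
}


def total_llr(firing: set) -> float:
    """Sum the LLRs of the firing signals; absent signals add nothing."""
    unknown = firing - SIGNAL_LLR.keys()
    if unknown:
        raise KeyError(f"unknown signals: {sorted(unknown)}")
    return sum(SIGNAL_LLR[name] for name in firing)


if __name__ == "__main__":
    print(round(total_llr({"enrollment_velocity", "cash_burn"}), 2))
```

Rejecting unknown signal names, rather than silently ignoring them, guards against a typo quietly dropping evidence from the update.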
A worked example
Suppose we have a mid-cap oncology ticker whose lead asset is in a Phase 3 for glioblastoma. Signals observed:
- Enrollment 55% behind schedule → LLR +0.6
- No endpoint amendment → LLR 0 (signal absent)
- Mechanism class (small-molecule BRAF inhibitor in GBM) has 72% historical failure rate → LLR +0.7
- Cash runway 8 months, trial readout 14 months → LLR +0.4
- No PI departure → LLR 0
Total LLR: +1.7. Prior log-odds of failure: log(0.45/0.55) = -0.20 (base rate for oncology is 45%). Posterior log-odds: -0.20 + 1.7 = 1.50. Posterior P(failure): sigmoid(1.50) = 0.82.
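To see how each signal moves the estimate, the same arithmetic can be traced step by step. The sketch below is illustrative only: the helper names and signal labels are ours, and the order in which signals are applied is arbitrary, since the LLRs simply add.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def update_trace(prior, signals):
    """Return (label, cumulative P(failure)) for the prior and after each signal."""
    log_odds = math.log(prior / (1.0 - prior))
    trace = [("prior", sigmoid(log_odds))]
    for label, llr in signals:
        log_odds += llr
        trace.append((label, sigmoid(log_odds)))
    return trace


# Signals from the glioblastoma example; absent signals carry an LLR of 0.
glioblastoma_signals = [
    ("enrollment behind schedule", 0.6),
    ("endpoint amendment (absent)", 0.0),
    ("mechanism class risk", 0.7),
    ("cash runway short of readout", 0.4),
    ("PI departure (absent)", 0.0),
]

if __name__ == "__main__":
    for label, p in update_trace(0.45, glioblastoma_signals):
        print(f"{label:32s} P(failure) = {p:.2f}")
```

The trace starts at the 45% oncology prior and ends at the roughly 82% posterior computed above, with the absent signals leaving the estimate unchanged.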
We moved from a 45% prior to an 82% posterior P(failure) on three firing signals out of the five checked. The confidence bucket flips from "coin flip" to "about 82% likely to fail at readout". That's actionable.
How this integrates with equity scoring
The posterior P(failure) for each pharma ticker's most-urgent Phase 3 trial feeds directly into the APEX composite via three pattern overrides:
- PHARMA_FAILURE_HIGH — posterior ≥ 55%. APEX composite pulled down in proportion to distance above the base rate, confidence boosted. Fires above most factor patterns because Phase 3 outcomes dominate biotech returns.
- PHARMA_FAILURE_LOW — posterior ≤ 25%. De-risked pipeline. Composite pushed up, bullish verdict.
- PHARMA_CATALYST_NEAR — primary completion within 90 days. Interval widens to reflect binary-outcome variance regardless of direction.
For non-pharma tickers the pharma columns are null and none of these patterns fire. The engine won't accidentally tag a semiconductor company with a Phase 3 failure risk.
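The thresholds above can be sketched as a small classifier. This is a hedged illustration, not the APEX engine: the function name and inputs are hypothetical, and it assumes that PHARMA_CATALYST_NEAR can fire alongside either failure pattern and that a null posterior means a non-pharma ticker.

```python
from typing import List, Optional


def pharma_patterns(posterior: Optional[float],
                    days_to_primary_completion: Optional[int]) -> List[str]:
    """Return the pharma pattern names that match, using the thresholds in this post."""
    if posterior is None:
        # Non-pharma ticker (null pharma columns): no pattern can fire.
        return []
    patterns = []
    if posterior >= 0.55:
        patterns.append("PHARMA_FAILURE_HIGH")
    elif posterior <= 0.25:
        patterns.append("PHARMA_FAILURE_LOW")
    # Assumption: "within 90 days" means 0 to 90 days ahead, inclusive.
    if days_to_primary_completion is not None and 0 <= days_to_primary_completion <= 90:
        patterns.append("PHARMA_CATALYST_NEAR")
    return patterns


if __name__ == "__main__":
    print(pharma_patterns(0.82, 400))
    print(pharma_patterns(None, 30))
```

Returning a list rather than a single label leaves room for a ticker to be both high-risk and near its catalyst, which the descriptions above do not rule out.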
What we don't claim
We don't pretend this works for Phase 1 or Phase 2 assets. Base rates shift materially (Phase 1 failure rate is ~45%, Phase 2 is ~70%) and the signal → outcome linkages are weaker for earlier-stage trials, where the outcome space is larger.
Where to see this live
Every pharma ticker on DeepVane shows its current posterior failure probability on the stock page when an active Phase 3 exists. The three pattern pages (PHARMA_FAILURE_HIGH, PHARMA_FAILURE_LOW, PHARMA_CATALYST_NEAR) list all tickers currently matching each pattern, refreshed weekly by the ClinicalTrials.gov + PubMed scan.
For the full engine architecture, see the Methodology page and the What's defensible page, which explains why the integration of literature base rates with live-data Bayesian updates is the moat, not the individual math layers.