APEX factor · NLP
NLP scores the tone of management's narrative in the 10-K MD&A section using a finance-specific dictionary. Negative-leaning language predicts negative forward returns; obscure or hedge-laden language predicts uncertainty. The factor reads what management is signalling, not what the spreadsheet says.
When a CEO writes 'we faced significant headwinds in our consumer segment, characterised by inventory recalibration and competitive repositioning' instead of 'sales fell because customers switched to a cheaper competitor', the obfuscation is the signal. Loughran-McDonald counts how many such hedge words, negative-finance words, and uncertain qualifiers appear relative to the section length. Li adds the readability dimension — sentences over 25 words, paragraphs over 100 words, Latin-derived rare vocabulary. Both compress to one truth: when management writes opaquely, the next four quarters tend to disappoint.
for each 10-K MD&A section:
    neg_frac   = count(LM_negative_words)  / count(words)
    unc_frac   = count(LM_uncertain_words) / count(words)
    litig_frac = count(LM_litigious_words) / count(words)
    fog_index  = 0.4 · ( avg_words_per_sentence + 100 · hard_word_frac )
    nlp_raw    = -1 · ( w₁·neg_frac + w₂·unc_frac + w₃·litig_frac + w₄·fog_z )
    nlp        = z_score( nlp_raw )   // sign flipped: less negative = bullish
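The per-document step can be sketched in Python. This is a minimal sketch, not the production code: the dictionary sets below are toy subsets of the real Loughran-McDonald lists, the syllable counter is a vowel-group proxy for the Fog "hard word" count, and the weights are placeholders.

```python
import re

# Toy subsets of the Loughran-McDonald word lists; the real lists contain
# thousands of entries and are loaded from the published CSVs.
LM_NEGATIVE  = {"loss", "decline", "headwinds", "adverse"}
LM_UNCERTAIN = {"may", "approximately", "uncertain", "could"}
LM_LITIGIOUS = {"litigation", "plaintiff", "claims"}

def _syllables(word):
    # Vowel-group proxy for syllable count (adequate for a Fog estimate).
    return max(1, len(re.findall(r"[aeiouy]+", word)))

def score_mdna(text, w=(1.0, 1.0, 1.0, 1.0), fog_z=0.0):
    """One document's raw score; fog_z is the universe-relative Fog z-score."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n = len(words) or 1
    neg_frac   = sum(t in LM_NEGATIVE  for t in words) / n
    unc_frac   = sum(t in LM_UNCERTAIN for t in words) / n
    litig_frac = sum(t in LM_LITIGIOUS for t in words) / n
    hard_frac  = sum(_syllables(t) >= 3 for t in words) / n
    fog_index  = 0.4 * (n / max(1, len(sentences)) + 100 * hard_frac)
    nlp_raw    = -(w[0]*neg_frac + w[1]*unc_frac + w[2]*litig_frac + w[3]*fog_z)
    return {"neg_frac": neg_frac, "unc_frac": unc_frac,
            "litig_frac": litig_frac, "fog_index": fog_index,
            "nlp_raw": nlp_raw}
```

Note that `fog_z` is an input here, not computed inside the loop: the Fog z-score depends on the whole cross-section, so it is filled in after every ticker has been scored.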
The sign is flipped so the factor reads bullish-when-high. The Fog index is z-scored against the universe before being blended — large companies tend to have systematically denser legal language, so the absolute Fog level matters less than where a ticker sits relative to its peers. The four blend weights are not disclosed, but the four anchors (negative, uncertain, litigious, readability) are public LM/Li categories.
The NLP cron runs Sundays at 11:30 and 11:45 UTC. It pulls the latest 10-K from SEC EDGAR, extracts the Item 7 (MD&A) section via a regex over the embedded HTML, tokenises against the LM dictionaries (negative, positive, uncertain, litigious, modal, constraining), and scores. Coverage is currently ~74% of the universe — the regex fails on a long tail of non-standard 10-K formats, mostly older filings or REITs with irregular item numbering. We tune the regex pass monthly.
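The extraction step might look something like the sketch below. The production regex is not disclosed; this pattern is an assumption, and it shares the weakness noted above — it is naive about tables of contents and non-standard item numbering.

```python
import re

# Naive Item 7 matcher: start at "Item 7 ... Management's Discussion" and
# stop just before "Item 7A" or "Item 8". Assumes HTML tags already stripped.
ITEM7_RE = re.compile(
    r"item\s*7\s*[.:\-]?\s*management'?s\s+discussion.*?"
    r"(?=item\s*7a\b|item\s*8\b)",
    re.IGNORECASE | re.DOTALL,
)

def extract_mdna(filing_text):
    """Return the Item 7 (MD&A) span from stripped 10-K text, or None."""
    m = ITEM7_RE.search(filing_text)
    return m.group(0) if m else None
```

A real extractor would also skip the table-of-contents hit (e.g. by taking the last match, or requiring a minimum span length), which is exactly the long tail the monthly regex tuning chases.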
NLP is the strongest companion factor for PEAD — together they decompose earnings news into the number (PEAD) and the narrative around the number (NLP). Bullish PEAD + bullish NLP fires EARNINGS MOMENTUM CASCADE; bullish PEAD + bearish NLP fires EARNINGS DISSONANCE (Tetlock 2007's mechanism — when management hedges around a beat, the beat is suspect). NLP also pairs with Insider: officers buying while writing optimistically is doubly bullish (INSIDER + NARRATIVE CONFLUENCE pattern). Finally, NLP and Quality interact — high quality with deteriorating tone is an early warning the moat is cracking.
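The PEAD × NLP decomposition above reduces to a small dispatch table. A sketch; the pattern names are from the text, but the function name and the ±0.5 z-score thresholds are illustrative placeholders.

```python
def earnings_pattern(pead_z, nlp_z, threshold=0.5):
    """Classify the PEAD x NLP combination into a named pattern, if any."""
    bull_pead = pead_z > threshold
    bull_nlp  = nlp_z > threshold
    bear_nlp  = nlp_z < -threshold
    if bull_pead and bull_nlp:
        # The number and the narrative agree.
        return "EARNINGS MOMENTUM CASCADE"
    if bull_pead and bear_nlp:
        # Beat the number, hedged the narrative: the beat is suspect.
        return "EARNINGS DISSONANCE"
    return None

earnings_pattern(1.2, 0.9)    # -> "EARNINGS MOMENTUM CASCADE"
earnings_pattern(1.2, -0.8)   # -> "EARNINGS DISSONANCE"
```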
Three known failure modes. (1) Boilerplate inflation. Compliance counsel adds risk-factor language each year; the same company's 10-K has more LM-negative words in 2025 than in 2015 even if the business is unchanged. We partly mitigate via z-scoring across the universe, but the absolute level of negative tone has drifted up over time. (2) Foreign filers. 20-F filings (used by ADRs) follow a different structure than 10-K, and our extractor's coverage is weaker there — pending fix. (3) Annual-only cadence. The factor only updates on annual filings; intra-year tone shifts (10-Q MD&A, conference calls, press releases) are not yet incorporated. Adding 10-Q tone is on the post-16-May roadmap.
Every ticker page shows the per-factor decomposition. The NLP score is one of twelve composing the 0–100 APEX composite.