
Commit 68aee12

autosave: 2026-02-28 06:38:50
1 parent 0f6ceb7 commit 68aee12

File tree

4 files changed: +60, −65 lines


_freeze/posts/2026-02-26-hallucinations-and-alignment/execute-results/html.json

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 "hash": "71b8016fe0ab7c3a6ee846c53d3c2d93",
 "result": {
 "engine": "knitr",
-
"markdown": "---\ntitle: Hallucinations and Alignment\ndraft: true\nengine: knitr\nbibliography: ai.bib\n---\n\n\n\n\n\n\n\n\n\n## Introduction\n\nWhen an LLM answers a question, there are three possible outcomes: it **succeeds** (gives a correct answer), it **fails** (gives an incorrect answer, i.e. hallucinates), or it **abstains** (declines to answer). These three outcomes have different payoffs, and the optimal policy depends on the relative magnitudes of $\\pi_{\\text{succeed}}$, $\\pi_{\\text{fail}}$, and $\\pi_{\\text{abstain}}$.\n\nThis framing---treating hallucination control as a three-action decision problem---connects to classical pattern recognition theory [@chow1970optimum], the economics of decision under uncertainty (the Marschak-Machina triangle), and recent theoretical results showing that calibrated language models must hallucinate at a nonzero rate [@kalai2024calibrated].\n\n## Claims\n\n1. **Three-outcome framing.** Any question-answering system faces a choice among succeed, fail, and abstain. The value of the system depends on the probabilities of each outcome and their payoffs.\n\n2. **Payoff asymmetry determines policy.** The optimal abstain threshold depends on the ratio $\\pi_{\\text{succeed}} / |\\pi_{\\text{fail}}|$. When failure is very costly relative to success (e.g. medical or legal advice), the abstain region should expand.\n\n3. **Indifference-curve slope encodes preferences.** In the Marschak-Machina probability simplex, the slope of an expected-utility indifference curve is $-\\pi_{\\text{succeed}}/\\pi_{\\text{fail}}$ (after normalizing $\\pi_{\\text{abstain}}=0$). Different users/applications trace out different slopes.\n\n4. **Hallucination is structurally inevitable.** @kalai2024calibrated prove that calibrated LMs must hallucinate at a rate related to the fraction of training facts that appear exactly once. This means abstain/verification mechanisms are not optional add-ons but necessary components.\n\n5. 
**Chow's reject-option rule is the optimal policy.** Given the three payoffs, the Bayes-optimal rule is a posterior-threshold rule: predict only when confidence exceeds a threshold determined by the payoffs; otherwise abstain [@chow1970optimum].\n\n\n## Probability-Payoff Diagram\n\n\n\n\n\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](2026-02-26-hallucinations-and-alignment_files/figure-html/unnamed-chunk-1-1.png){width=672}\n:::\n:::\n\n\n\n\n\n\n\n\n\n## Marschak-Machina Diagram\n\nThe probability simplex has three vertices corresponding to the three pure outcomes: certain success ($p_s=1$), certain failure ($p_f=1$), and certain abstention ($p_a=1$). Any lottery over outcomes is a point in this triangle. Under expected utility with payoffs $\\pi_{\\text{succeed}}, \\pi_{\\text{fail}}, \\pi_{\\text{abstain}}$, indifference curves are parallel straight lines whose slope depends on the payoff ratios.\n\n\n\n\n\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](2026-02-26-hallucinations-and-alignment_files/figure-html/unnamed-chunk-2-1.png){width=672}\n:::\n:::\n\n\n\n\n\n\n\n\n\n### Indifference-curve slope derivation\n\nLet $p_s, p_f, p_a = 1-p_s-p_f$ denote the probabilities of succeed, fail, and abstain. Expected utility is\n\n$$\nU = \\pi_{\\text{succeed}}\\, p_s + \\pi_{\\text{fail}}\\, p_f + \\pi_{\\text{abstain}}\\, p_a.\n$$\n\n1. Substitute $p_a = 1 - p_s - p_f$:\n$$\nU = \\pi_{\\text{abstain}}\n + (\\pi_{\\text{succeed}}-\\pi_{\\text{abstain}})\\,p_s\n + (\\pi_{\\text{fail}}-\\pi_{\\text{abstain}})\\,p_f.\n$$\n\n2. Hold $U = \\bar U$ and solve for $p_f$:\n$$\np_f\n= \\frac{\\bar U - \\pi_{\\text{abstain}}}{\\pi_{\\text{fail}}-\\pi_{\\text{abstain}}}\n- \\frac{\\pi_{\\text{succeed}}-\\pi_{\\text{abstain}}}{\\pi_{\\text{fail}}-\\pi_{\\text{abstain}}}\\,p_s.\n$$\n\n3. 
The slope of the indifference curve in the $(p_s, p_f)$ plane is therefore:\n$$\n\\frac{dp_f}{dp_s}\\bigg|_{U=\\bar U}\n= -\\frac{\\pi_{\\text{succeed}}-\\pi_{\\text{abstain}}}{\\pi_{\\text{fail}}-\\pi_{\\text{abstain}}}.\n$$\n\n4. Normalizing $\\pi_{\\text{abstain}}=0$, this simplifies to:\n$$\n\\frac{dp_f}{dp_s}\\bigg|_{U=\\bar U}\n= -\\frac{\\pi_{\\text{succeed}}}{\\pi_{\\text{fail}}},\n\\qquad\np_f = \\frac{\\bar U}{\\pi_{\\text{fail}}} - \\frac{\\pi_{\\text{succeed}}}{\\pi_{\\text{fail}}}\\,p_s.\n$$\n\nThe slope depends only on the ratio of payoff differences relative to abstain. When failure is very costly ($|\\pi_{\\text{fail}}|$ large), the curves are flatter: the decision-maker tolerates little additional failure probability in exchange for more success probability.\n\n## Related Literature\n\n### Chow (1970): Optimal reject rules\n\n@chow1970optimum introduces the reject (abstain) option into classification and derives the optimal error-reject tradeoff. The key result is a posterior-threshold rule:\n\n> *\"The performance of a pattern recognition system is characterized by its error and reject tradeoff. This paper describes an optimum rejection rule and presents a general relation between the error and reject probabilities.\"*\n\nIn our notation, given payoffs $\\pi_{\\text{succeed}}, \\pi_{\\text{fail}}, \\pi_{\\text{abstain}}$ with $\\pi_{\\text{fail}} < \\pi_{\\text{abstain}} < \\pi_{\\text{succeed}}$, the Bayes-optimal rule is:\n\n$$\n\\max_y P(y \\mid x) \\ge \\frac{\\pi_{\\text{abstain}} - \\pi_{\\text{fail}}}{\\pi_{\\text{succeed}} - \\pi_{\\text{fail}}}\n\\quad\\Longrightarrow\\quad\n\\text{predict } \\arg\\max_y P(y \\mid x),\n$$\n\nand abstain otherwise. 
In the symmetric 0-1 loss case ($\\pi_{\\text{succeed}}=1$, $\\pi_{\\text{fail}}=0$, abstain cost $d$), this reduces to: predict iff $\\max_y P(y\\mid x) \\ge 1-d$.\n\nIn the Marschak-Machina triangle, this threshold corresponds to one of the indifference lines: the boundary between the region where prediction is preferred and the region where abstention is preferred.\n\n\n### Kalai and Vempala (2024): Calibrated models must hallucinate\n\n@kalai2024calibrated prove that hallucination is a structural inevitability for calibrated language models, not merely a failure of architecture or data quality:\n\n> *\"There is an inherent statistical lower-bound on the rate that pretrained language models hallucinate certain types of facts, having nothing to do with the transformer LM architecture or data quality. For 'arbitrary' facts whose veracity cannot be determined from the training data, we show that hallucinations must occur at a certain rate for language models that satisfy a statistical calibration condition.\"*\n\nTheir main bound is:\n\n$$\n\\text{Hallucination rate} \\ge \\widehat{MF} - \\text{Miscalibration} - O(1/\\sqrt{n}),\n$$\n\nwhere $\\widehat{MF}$ is the \"monofact\" rate (fraction of training facts appearing exactly once) and $n$ is the training set size. The implication for system design is that post-training interventions (RLHF, abstention mechanisms, retrieval augmentation) are not merely helpful add-ons but structurally necessary responses to a mathematical inevitability.\n",
+
"markdown": "---\ntitle: Hallucinations and Alignment\ndraft: true\nengine: knitr\nbibliography: ai.bib\n---\n\n\n\n## Introduction\n\nWhen an LLM answers a question, there are three possible outcomes: it **succeeds** (gives a correct answer), it **fails** (gives an incorrect answer, i.e. hallucinates), or it **abstains** (declines to answer). These three outcomes have different payoffs, and the optimal policy depends on the relative magnitudes of $\\pi_{\\text{succeed}}$, $\\pi_{\\text{fail}}$, and $\\pi_{\\text{abstain}}$.\n\nThis framing---treating hallucination control as a three-action decision problem---connects to classical pattern recognition theory [@chow1970optimum], the economics of decision under uncertainty (the Marschak-Machina triangle), and recent theoretical results on why language models hallucinate [@kalai2025why].\n\n## Claims\n\n1. **Three-outcome framing.** Any question-answering system faces a choice among succeed, fail, and abstain. The value of the system depends on the probabilities of each outcome and their payoffs.\n\n2. **Payoff asymmetry determines policy.** The optimal abstain threshold depends on the ratio $\\pi_{\\text{succeed}} / |\\pi_{\\text{fail}}|$. When failure is very costly relative to success (e.g. medical or legal advice), the abstain region should expand.\n\n3. **Indifference-curve slope encodes preferences.** In the Marschak-Machina probability simplex, the slope of an expected-utility indifference curve is $-\\pi_{\\text{succeed}}/\\pi_{\\text{fail}}$ (after normalizing $\\pi_{\\text{abstain}}=0$). Different users/applications trace out different slopes.\n\n4. **Hallucination is structurally inevitable.** @kalai2025why show that even with error-free training data, pretraining objectives produce hallucinations, and binary evaluation benchmarks then reward guessing over abstention. Abstain/verification mechanisms are not optional add-ons but necessary components.\n\n5. 
**Chow's reject-option rule is the optimal policy.** Given the three payoffs, the Bayes-optimal rule is a posterior-threshold rule: predict only when confidence exceeds a threshold determined by the payoffs; otherwise abstain [@chow1970optimum].\n\n\n## Probability-Payoff Diagram\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](2026-02-26-hallucinations-and-alignment_files/figure-html/unnamed-chunk-1-1.png){width=672}\n:::\n:::\n\n\n\n## Marschak-Machina Diagram\n\nThe probability simplex has three vertices corresponding to the three pure outcomes: certain success ($p_s=1$), certain failure ($p_f=1$), and certain abstention ($p_a=1$). Any lottery over outcomes is a point in this triangle. Under expected utility with payoffs $\\pi_{\\text{succeed}}, \\pi_{\\text{fail}}, \\pi_{\\text{abstain}}$, indifference curves are parallel straight lines whose slope depends on the payoff ratios.\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](2026-02-26-hallucinations-and-alignment_files/figure-html/unnamed-chunk-2-1.png){width=672}\n:::\n:::\n\n\n\n### Indifference-curve slope derivation\n\nLet $p_s, p_f, p_a = 1-p_s-p_f$ denote the probabilities of succeed, fail, and abstain. Expected utility is\n\n$$\nU = \\pi_{\\text{succeed}}\\, p_s + \\pi_{\\text{fail}}\\, p_f + \\pi_{\\text{abstain}}\\, p_a.\n$$\n\n1. Substitute $p_a = 1 - p_s - p_f$:\n$$\nU = \\pi_{\\text{abstain}}\n + (\\pi_{\\text{succeed}}-\\pi_{\\text{abstain}})\\,p_s\n + (\\pi_{\\text{fail}}-\\pi_{\\text{abstain}})\\,p_f.\n$$\n\n2. Hold $U = \\bar U$ and solve for $p_f$:\n$$\np_f\n= \\frac{\\bar U - \\pi_{\\text{abstain}}}{\\pi_{\\text{fail}}-\\pi_{\\text{abstain}}}\n- \\frac{\\pi_{\\text{succeed}}-\\pi_{\\text{abstain}}}{\\pi_{\\text{fail}}-\\pi_{\\text{abstain}}}\\,p_s.\n$$\n\n3. The slope of the indifference curve in the $(p_s, p_f)$ plane is therefore:\n$$\n\\frac{dp_f}{dp_s}\\bigg|_{U=\\bar U}\n= -\\frac{\\pi_{\\text{succeed}}-\\pi_{\\text{abstain}}}{\\pi_{\\text{fail}}-\\pi_{\\text{abstain}}}.\n$$\n\n4. 
Normalizing $\\pi_{\\text{abstain}}=0$, this simplifies to:\n$$\n\\frac{dp_f}{dp_s}\\bigg|_{U=\\bar U}\n= -\\frac{\\pi_{\\text{succeed}}}{\\pi_{\\text{fail}}},\n\\qquad\np_f = \\frac{\\bar U}{\\pi_{\\text{fail}}} - \\frac{\\pi_{\\text{succeed}}}{\\pi_{\\text{fail}}}\\,p_s.\n$$\n\nThe slope depends only on the ratio of payoff differences relative to abstain. When failure is very costly ($|\\pi_{\\text{fail}}|$ large), the curves are flatter: the decision-maker tolerates little additional failure probability in exchange for more success probability.\n\n## Related Literature\n\n### Chow (1957, 1970): Optimal reject rules\n\n@chow1970optimum introduces the **reject option** (which he calls the \"Indecision class\" $I$) into pattern recognition and derives the optimal error-reject tradeoff. Chow's terminology maps directly onto ours:\n\n| Chow's term | Our term |\n|---|---|\n| correct recognition | succeed |\n| error (misclassification) | fail |\n| rejection / indecision | abstain |\n\nChow's setup uses a cost function over these three outcomes:\n\n> *\"Let $C(i|j)$ be the cost incurred by classifying in $G_i$ when the true class is $G_j$. Then $C(i|j) = 0$ if $i=j$ [correct], $= 1$ if $i \\ne j$ [error], $= t$ if $i = I$ [rejection].\"*\n\nThe key result (\"Chow's rule\") is that the Bayes-optimal classification rule is a posterior-threshold rule: accept and classify when confident enough, reject otherwise:\n\n> *\"The classification rule which minimizes the risk can be stated as: $x \\in A$ [accept] if $\\max_i P(G_i | x) \\ge 1-t$; $x \\in D_I$ [reject] if $\\max_i P(G_i | x) < 1-t$.\"*\n\nAnd this rule is optimal in a strong sense:\n\n> *\"Chow's rule is optimal in the sense that for some reject rate specified by the threshold $t$, no other rule can yield a lower error rate.\"*\n\nThe abstract summarizes:\n\n> *\"The performance of a pattern recognition system is characterized by its error and reject tradeoff. 
This paper describes an optimum rejection rule and presents a general relation between the error and reject probabilities.\"*\n\nThe threshold $t$ is related to the costs of the three outcomes as $t = (C_r - C_c)/(C_e - C_c)$, where $C_e, C_r, C_c$ are the costs of error, rejection, and correct recognition. In our payoff notation ($\\pi_{\\text{succeed}}, \\pi_{\\text{fail}}, \\pi_{\\text{abstain}}$), Chow's rule becomes: predict iff\n\n$$\n\\max_y P(y \\mid x) \\ge \\frac{\\pi_{\\text{abstain}} - \\pi_{\\text{fail}}}{\\pi_{\\text{succeed}} - \\pi_{\\text{fail}}}.\n$$\n\nIn the Marschak-Machina triangle, this threshold corresponds to one of the indifference lines: the boundary between the region where prediction is preferred and the region where abstention is preferred. Despite the direct relevance, Chow (1970) does not appear to be cited in the LLM hallucination literature.\n\n\n### Kalai, Nachum, Vempala, and Zhang (2025): Why language models hallucinate\n\n@kalai2025why argue that hallucinations are not a mysterious glitch but a predictable consequence of how models are trained and evaluated. Their central thesis is that the three-outcome structure (succeed, fail, abstain) is systematically distorted by binary evaluation:\n\n> *\"Language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty. ... Hallucinations need not be mysterious---they originate simply as errors in binary classification.\"*\n\nThe paper makes two distinct arguments:\n\n**1. Pretraining origin.** Even with error-free training data, the statistical objective of pretraining produces hallucinations. 
The authors reduce the problem to binary classification (\"Is-It-Valid\"), showing that\n\n$$\n\\text{generative error rate} \\gtrsim 2 \\cdot \\text{IIV misclassification rate}.\n$$\n\nFor arbitrary facts (like someone's birthday) where there is no learnable pattern, the hallucination rate after pretraining is at least the fraction of facts appearing exactly once in the training data.\n\n**2. Post-training persistence.** Even after RLHF and other interventions, hallucinations persist because nearly all evaluation benchmarks use binary grading that penalizes abstention:\n\n> *\"Binary evaluations of language models impose a false right-wrong dichotomy, award no credit to answers that express uncertainty, omit dubious details, or request clarification. ... Under binary grading, abstaining is strictly sub-optimal.\"*\n\nThe fix they propose is exactly the payoff structure from our Marschak-Machina framework: penalize errors more than abstentions, with an explicit confidence threshold $t$ stated in the prompt. Their proposed scoring rule awards $+1$ for a correct answer, $-t/(1-t)$ for an incorrect answer, and $0$ for abstaining---so that answering is optimal iff confidence exceeds $t$. This is Chow's reject-option rule rediscovered in the LLM evaluation context.\n",
 "supporting": [],
 "filters": [
 "rmarkdown/pagebreak.lua"

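The scoring rule the revised post attributes to @kalai2025why (award $+1$ for a correct answer, $-t/(1-t)$ for an incorrect one, $0$ for abstaining) can be checked directly: the expected score of answering crosses zero exactly at confidence $t$, so answering beats abstaining iff confidence exceeds $t$. The helper name is illustrative.

```python
def expected_score_answer(p: float, t: float) -> float:
    """Expected score of answering with correctness probability p under the
    confidence-targeted rule: +1 if correct, -t/(1-t) if wrong (abstain scores 0)."""
    return p * 1.0 + (1.0 - p) * (-t / (1.0 - t))

# With target confidence t = 0.75, answering is optimal iff p >= t.
t = 0.75
print(expected_score_answer(0.80, t))  # positive: answer
print(expected_score_answer(0.70, t))  # negative: abstain
```

Algebraically, $p - (1-p)\,t/(1-t) \ge 0 \iff p \ge t$, which is Chow's posterior-threshold rule with the threshold stated in the prompt.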
0 commit comments
