
Commit a3c71b4

author
Marcel
committed
Tighten manuscript rigor: protocol details, stronger framing, and reviewer-defense clarifications
1 parent 5b8c275 commit a3c71b4

4 files changed

Lines changed: 166 additions & 12 deletions

File tree

paper/main.pdf

17 KB
Binary file not shown.

paper/main.tex

Lines changed: 166 additions & 12 deletions
Original file line number | Diff line number | Diff line change
@@ -65,6 +65,13 @@ \section{Scope and Final Claim}
6565

6666
The main contribution is \emph{not} a universal scalar ``topology of truth.''
6767
The stronger result is a readout-bottleneck interpretation supported by multiple controls.
68+
To avoid over-claiming, we explicitly do \emph{not} claim:
69+
\begin{itemize}
70+
\item a universal geometry-of-truth detector across unrestricted tasks,
71+
\item a complete causal account of decoder behavior,
72+
\item or that topology alone is the primary predictive tool in natural reasoning traces.
73+
\end{itemize}
74+
The claim is intentionally task-conditioned and mechanism-narrow.
6875

6976
\section{Why the Original Idea Failed}
7077
The original working hypothesis was that correct reasoning might have a cleaner global geometric or topological signature than incorrect reasoning \cite{carlsson2009,bauer2021ripser}.
@@ -94,8 +101,19 @@ \section{Why the Original Idea Failed}
94101
It prevented a misleading positive claim and forced a cleaner task design.
95102

96103
\section{Experimental Trajectory}
104+
\subsection{Why These Three Phases Belong in One Paper}
105+
The paper is not a grab-bag chronology.
106+
It is a causal progression in experimental design:
107+
\begin{enumerate}
108+
\item \textbf{Phase A falsified} the original global-scalar hypothesis under realistic trace conditions.
109+
\item \textbf{Phase B diagnosed} the dominant confound in that setting (non-convergence under fixed decoding).
110+
\item \textbf{Phase C redesigned} the task to isolate semantics directly, enabling a sharper representation-vs-readout test.
111+
\end{enumerate}
112+
The positive result is only interpretable in light of that redesign logic.
113+
97114
\subsection{Phase A: Global Topology on GSM8K (Negative)}
98115
Small paired runs on Qwen3.5-0.8B and Qwen3.5-2B produced weak dynamic-$H_0$ signals on non-capped 2B traces.
116+
The primary dynamic pilot used 10 paired 2B samples, with one capped trace removed for the non-capped comparison set (\(n=9\): 4 correct, 5 wrong).
99117
Representative pilot numbers were:
100118
\begin{itemize}
101119
\item \texttt{h0\_entropy\_final AUC = 0.55}
@@ -112,6 +130,7 @@ \subsection{Phase A: Global Topology on GSM8K (Negative)}
112130
\end{itemize}
113131

114132
These numbers are not compatible with a strong correctness-prediction claim.
133+
We treat them as descriptive pilot outcomes, not inferential proof that topology is universally uninformative.
115134

116135
\subsection{Phase B: Fixed-Decoding GSM8K (Convergence Result)}
117136
A second branch held decoding fixed for Qwen3.5-2B and shifted focus from topology to operational failure mode on a GSM8K slice \cite{cobbe2021gsm8k}.
@@ -145,6 +164,8 @@ \subsection{Phase B: Fixed-Decoding GSM8K (Convergence Result)}
145164

146165
A matched sensitivity rerun at 640 tokens for the originally capped cases showed that both truncation and deeper non-convergence mattered:
147166
some failures were rescued by a longer budget, but many remained wrong even after receiving more room to continue.
167+
Concretely, among 75 originally capped wrong runs, 27 flipped to correct and 48 remained wrong; 24 stayed both capped and wrong at 640.
168+
This rules out the trivial interpretation that the cap signal is only an artifact of an arbitrary token budget.
148169

149170
\subsection{Phase C: Procedural Micro-World Semantics (Main Positive Result)}
150171
The project became scientifically cleaner only after replacing benchmark reasoning with a procedurally generated semantic task, in line with controlled-evaluation guidance from recent LM analysis literature \cite{liang2022holistic}.
@@ -159,6 +180,15 @@ \subsection{Phase C: Procedural Micro-World Semantics (Main Positive Result)}
159180

160181
Each world yields 72 examples: 9 propositions times 8 paraphrases.
161182
The main sweeps used 20 train worlds and 20 eval worlds.
183+
The generator itself produces larger split files (default 100/25/100 worlds for train/dev/eval), and analysis subsets are explicitly selected from those outputs.
184+
185+
Critically, anti-shortcut controls are built into generation:
186+
\begin{itemize}
187+
\item split-specific nonce lexicon pools are disjoint for entities/attributes/relations,
188+
\item eval paraphrases use template variants 4--7, while train uses 0--3 (dev uses overlapping variants by design, as an intermediate stress test),
189+
\item per-world proposition sampling is label-balanced over \texttt{True}/\texttt{False}/\texttt{Unknown}.
190+
\end{itemize}
191+
These details target reviewer concerns about lexical leakage and template-only clustering.
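The three controls listed above can be checked mechanically. The sketch below is a hypothetical illustration, not the project's generator: the pool contents, `ENTITY_POOLS`, `TEMPLATE_VARIANTS`, `pools_are_disjoint`, and `balanced_labels` are assumed names, and only the variant ranges (train 0--3, eval 4--7) and the count of 9 propositions per world come from the text.

```python
import random

LABELS = ("True", "False", "Unknown")

# Hypothetical nonce pools; the real lexicons are not shown in the manuscript.
ENTITY_POOLS = {
    "train": {"mep", "grel", "nalo"},
    "eval": {"dax", "wug", "blick"},
}

# Template variant ranges as stated in the text (train 0-3, eval 4-7).
TEMPLATE_VARIANTS = {"train": range(0, 4), "eval": range(4, 8)}


def pools_are_disjoint(pools):
    """Return True if no nonce word appears in more than one split."""
    seen = set()
    for words in pools.values():
        if seen & words:
            return False
        seen |= words
    return True


def balanced_labels(n_props, rng):
    """Give every label an equal quota, then shuffle the order."""
    if n_props % len(LABELS) != 0:
        raise ValueError("n_props must be divisible by the number of labels")
    labels = list(LABELS) * (n_props // len(LABELS))
    rng.shuffle(labels)
    return labels


rng = random.Random(0)
world_labels = balanced_labels(9, rng)  # 9 propositions per world, 3 per label
template_overlap = set(TEMPLATE_VARIANTS["train"]) & set(TEMPLATE_VARIANTS["eval"])
```

With 9 propositions per world, an equal quota gives exactly three propositions per label, and the train and eval template ranges share no variant.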
162192

163193
\section{Task Definition}
164194
Let a world be a finite structured state
@@ -182,24 +212,35 @@ \section{Task Definition}
182212
It is exact non-entailment under the generator.
183213

184214
\subsection{Worked Micro-World Example}
185-
Table~\ref{tab:world_example} shows one illustrative world/query slice.
215+
Table~\ref{tab:world_facts} shows a compact latent world.
216+
Table~\ref{tab:world_queries} then shows one proposition per label class, plus a second paraphrase for one query to make template variation explicit.
186217

187218
\begin{table}[H]
188219
\centering
189-
\caption{Worked micro-world example (illustrative format).}
190-
\label{tab:world_example}
191-
\begin{tabular}{p{0.28\linewidth} p{0.46\linewidth} p{0.14\linewidth}}
220+
\caption{Worked world facts (illustrative).}
221+
\label{tab:world_facts}
222+
\begin{tabular}{p{0.96\linewidth}}
192223
\toprule
193-
World fact set & Query statement & Gold label \\
224+
\texttt{mep is falm. grel is not falm. nalo foshes sop.} \\
225+
\bottomrule
226+
\end{tabular}
227+
\end{table}
228+
229+
\begin{table}[H]
230+
\centering
231+
\caption{Label semantics on the worked world.}
232+
\label{tab:world_queries}
233+
\begin{tabular}{p{0.38\linewidth} p{0.12\linewidth} p{0.42\linewidth}}
234+
\toprule
235+
Query statement & Gold label & Why \\
194236
\midrule
195-
\texttt{mep is falm}; \texttt{grel is not falm}; \texttt{nalo foshes sop} &
196-
\texttt{The object mep has property falm.} & True \\
237+
\texttt{The object mep has property falm.} & True & Explicit positive fact in world state. \\
197238
\addlinespace
198-
\texttt{mep is falm}; \texttt{grel is not falm}; \texttt{nalo foshes sop} &
199-
\texttt{The object grel has property falm.} & False \\
239+
\texttt{The object grel has property falm.} & False & Explicit negative fact (\texttt{grel is not falm}). \\
200240
\addlinespace
201-
\texttt{mep is falm}; \texttt{grel is not falm}; \texttt{nalo foshes sop} &
202-
\texttt{The relation fosh holds from mep to nalo.} & Unknown \\
241+
\texttt{The relation fosh holds from mep to nalo.} & Unknown & Neither positive nor negative fact provided for this ordered pair. \\
242+
\addlinespace
243+
\texttt{The ordered pair (mep, nalo) has relation fosh.} & Unknown & Same proposition as previous row, different paraphrase template. \\
203244
\bottomrule
204245
\end{tabular}
205246
\end{table}
@@ -225,6 +266,18 @@ \section{What Is Stored for Each Example}
225266
\end{itemize}
226267

227268
These are the local decision states of the model near label emission.
269+
In code:
270+
\begin{itemize}
271+
\item \texttt{final\_prompt}: last hidden vector of prompt tokens,
272+
\item \texttt{prompt\_tail\_mean}: mean of last \(\min(5,\text{prompt\_len})\) prompt vectors,
273+
\item \texttt{verdict\_token}: hidden vector of the first generated token (in non-sweep extraction, a zero vector if no token was generated),
274+
\item \texttt{verdict\_span\_mean}: mean of first \(\min(3,\text{gen\_len})\) generated vectors (zero vector if none).
275+
\end{itemize}
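The four state keys above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the committed extraction code: `extract_states` and `mean_vector` are hypothetical names, hidden vectors are plain Python lists taken from a single unspecified layer, and the prompt is assumed to be non-empty.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def extract_states(prompt_hidden, gen_hidden, tail=5, span=3):
    """Build the four state keys from per-token hidden vectors.

    prompt_hidden: list of vectors, one per prompt token (non-empty).
    gen_hidden: list of vectors, one per generated token (may be empty).
    """
    dim = len(prompt_hidden[0])
    tail_len = min(tail, len(prompt_hidden))
    states = {
        "final_prompt": list(prompt_hidden[-1]),
        "prompt_tail_mean": mean_vector(prompt_hidden[-tail_len:]),
    }
    if gen_hidden:
        span_len = min(span, len(gen_hidden))
        states["verdict_token"] = list(gen_hidden[0])
        states["verdict_span_mean"] = mean_vector(gen_hidden[:span_len])
    else:
        # Zero-vector fallback described in the text for the non-sweep path.
        states["verdict_token"] = [0.0] * dim
        states["verdict_span_mean"] = [0.0] * dim
    return states


prompt = [[1.0, 0.0], [3.0, 2.0]]
generated = [[2.0, 2.0], [4.0, 0.0]]
states = extract_states(prompt, generated)
empty_states = extract_states(prompt, [])
```

The zero-vector fallback keeps array shapes uniform, but such rows carry no signal, which is presumably why the layer sweeps exclude them with a validity mask instead.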
276+
277+
\subsection{Inference and Prompt Protocol}
278+
Micro-world inference runs use short deterministic decoding (\texttt{temperature=0}, \texttt{max\_new\_tokens=4}) with strict label parsing.
279+
``No-think'' in this paper refers to runs where internal reasoning mode was disabled via the \texttt{enable\_thinking=False} generation flag.
280+
We evaluate both default chat-style prompting and raw/base-label prompt paths as controls.
228281

229282
\section{Methods}
230283
\subsection{Linear Probe}
@@ -254,6 +307,14 @@ \subsection{Linear Probe}
254307

255308
This probe is intentionally weak and follows the standard linear-probe setup \cite{alain2016probes}.
256309
If it succeeds, the information is already arranged in hidden space in a directly readable linear form.
310+
Implementation details (from the committed probe script):
311+
\begin{itemize}
312+
\item feature standardization: \texttt{StandardScaler(with\_mean=True, with\_std=True)},
313+
\item classifier: multinomial logistic regression (\texttt{lbfgs}, \(C=1.0\), \texttt{max\_iter}=4000),
314+
\item class labels: fixed order \(\{\texttt{True},\texttt{False},\texttt{Unknown}\}\),
315+
\item no class weighting, no hidden-state PCA, no extra feature engineering.
316+
\end{itemize}
317+
Train/test separation is by world split manifests (train worlds for fitting, held-out eval worlds for reporting), not random sentence-level splitting.
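A minimal sketch of a probe with these settings, using scikit-learn, is given below. It is not the committed probe script: `fit_probe`, `unknown_recall`, and `sample_split` are hypothetical helpers, and the features are synthetic Gaussian clusters standing in for hidden states from train and held-out eval worlds.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

LABELS = ["True", "False", "Unknown"]  # fixed reporting order


def fit_probe(train_x, train_y):
    """Standardize features, then fit a logistic-regression probe."""
    probe = make_pipeline(
        StandardScaler(with_mean=True, with_std=True),
        LogisticRegression(solver="lbfgs", C=1.0, max_iter=4000),
    )
    probe.fit(train_x, train_y)
    return probe


def unknown_recall(probe, eval_x, eval_y):
    """Recall of the Unknown class on held-out examples."""
    pred = probe.predict(eval_x)
    per_class = recall_score(
        eval_y, pred, labels=LABELS, average=None, zero_division=0
    )
    return float(per_class[LABELS.index("Unknown")])


# Synthetic stand-in for hidden states: three well-separated clusters.
rng = np.random.default_rng(0)
centers = {"True": [4.0, 0.0], "False": [-4.0, 0.0], "Unknown": [0.0, 4.0]}


def sample_split(n_per_class):
    """Draw one toy split; real splits would come from world manifests."""
    xs, ys = [], []
    for label in LABELS:
        xs.append(rng.normal(loc=centers[label], scale=0.3, size=(n_per_class, 2)))
        ys.extend([label] * n_per_class)
    return np.vstack(xs), np.array(ys)


train_x, train_y = sample_split(40)  # stands in for train-world examples
eval_x, eval_y = sample_split(40)    # stands in for held-out eval-world examples
probe = fit_probe(train_x, train_y)
recall_u = unknown_recall(probe, eval_x, eval_y)
```

Because the scaler sits inside the pipeline, its mean and standard deviation are estimated on train worlds only, so no eval-world statistics leak into fitting.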
257318

258319
\subsection{Within-World Geometry Gap}
259320
For one world and one state key, with distance \(d(\cdot,\cdot)\):
@@ -268,6 +329,20 @@ \subsection{Within-World Geometry Gap}
268329
\Delta = D_{\text{diff}} - D_{\text{same}}.
269330
\]
270331
A positive \(\Delta\) means same-label states lie closer together on average than different-label states.
332+
The main metric is cosine distance over \(L_2\)-normalized vectors:
333+
\[
334+
d_{\cos}(h_i,h_j)=1-\frac{h_i^\top h_j}{\lVert h_i\rVert_2 \lVert h_j\rVert_2}.
335+
\]
336+
All pairwise distances are computed within world and state key before aggregation.
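The gap can be computed for one world and one state key as sketched below. This assumes \(D_{\text{same}}\) and \(D_{\text{diff}}\) are plain means over unordered pairs, which may differ from the exact aggregation in the analysis scripts; `cosine_distance` and `geometry_gap` are illustrative names.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """1 - cosine similarity, matching d_cos in the text."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def geometry_gap(states, labels):
    """Delta = D_diff - D_same for one world and one state key.

    Assumes both terms are unweighted means over unordered pairs and that
    the world contains at least one same-label and one different-label pair.
    """
    same, diff = [], []
    for i, j in combinations(range(len(states)), 2):
        d = cosine_distance(states[i], states[j])
        (same if labels[i] == labels[j] else diff).append(d)
    return sum(diff) / len(diff) - sum(same) / len(same)


# Toy world: two labels whose states point in orthogonal directions.
world_states = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
world_labels = ["True", "True", "Unknown", "Unknown"]
delta = geometry_gap(world_states, world_labels)
```

In this toy world every same-label pair has distance 0 and every different-label pair has distance 1, so the gap is 1.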
337+
338+
\subsection{Sign-Test Reporting}
339+
World-level sign tests are reported as descriptive primary statistics (positive/zero/negative world counts).
340+
For calibration, one-sided exact binomial values under null \(p=0.5\) are:
341+
\begin{itemize}
342+
\item \(19/19\) positives: \(p=2^{-19}\approx 1.91\times10^{-6}\),
343+
\item \(20/20\) positives: \(p=2^{-20}\approx 9.54\times10^{-7}\).
344+
\end{itemize}
345+
These values are supportive but secondary to the held-out-world descriptive consistency.
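The two calibration values can be reproduced with an exact one-sided binomial tail. The helper name `one_sided_sign_test` is illustrative, and the sketch assumes that zero-gap worlds, if any, are excluded from `n_total` before the test.

```python
from math import comb


def one_sided_sign_test(n_positive, n_total, p=0.5):
    """P(X >= n_positive) for X ~ Binomial(n_total, p)."""
    return sum(
        comb(n_total, k) * p**k * (1 - p) ** (n_total - k)
        for k in range(n_positive, n_total + 1)
    )


p_19 = one_sided_sign_test(19, 19)  # all 19 worlds positive
p_20 = one_sided_sign_test(20, 20)  # all 20 worlds positive
```

When every world is positive the tail reduces to the single term \(0.5^{n}\), which is why the reported values are exactly \(2^{-19}\) and \(2^{-20}\).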
271346

272347
\subsection{Verdict-Step Label-Logit Metrics}
273348
Let verdict-step logits for canonical label tokens be
@@ -283,13 +358,22 @@ \subsection{Verdict-Step Label-Logit Metrics}
283358
m_U = \ell_U - \max(\ell_T,\ell_F).
284359
\]
285360
If \(m_U < 0\), Unknown is under-ranked against the strongest non-Unknown candidate.
361+
Canonical label-token scoring uses first-token variants for each label string with and without leading space.
362+
For each label, we keep unique first-token IDs and use the \emph{maximum} first-token log-prob across those variants.
363+
This reduces tokenizer-surface artifacts from a single textual form.
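A small sketch of this scoring rule follows. The token IDs and log-prob values are made up for illustration, and `label_scores` and `unknown_margin` are hypothetical names rather than functions from the analysis code.

```python
def label_scores(log_probs, variant_ids):
    """Score each label by its best first-token variant.

    log_probs: mapping token_id -> log-prob at the verdict step.
    variant_ids: mapping label -> iterable of first-token IDs.
    """
    return {
        label: max(log_probs[token_id] for token_id in set(ids))
        for label, ids in variant_ids.items()
    }


def unknown_margin(scores):
    """m_U = l_U - max(l_T, l_F)."""
    return scores["Unknown"] - max(scores["True"], scores["False"])


# Hypothetical first-token IDs for each label with and without a leading space.
variant_ids = {"True": [11, 12], "False": [21, 22], "Unknown": [31, 32]}
log_probs = {11: -0.5, 12: -3.0, 21: -2.0, 22: -1.5, 31: -4.0, 32: -2.5}

scores = label_scores(log_probs, variant_ids)
m_u = unknown_margin(scores)
```

Because log-probs differ from logits only by a constant shared across the vocabulary at a given step, the margin \(m_U\) is the same whether it is computed from logits or from log-probs.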
286364

287365
\subsection{Layer Sweeps}
288366
For each layer \(\ell\), evaluate the same probe protocol and report
289367
\[
290368
R_U^{(\ell)} = \text{Unknown recall of the probe at layer }\ell.
291369
\]
292370
This reveals where non-entailment is maximally linearly recoverable in the network.
371+
Sweep implementation details:
372+
\begin{itemize}
373+
\item \texttt{prompt\_last}: extracted for every example and every layer,
374+
\item \texttt{verdict\_token}: extracted at the first generated token; examples with no generated token are excluded from this branch via a validity mask,
375+
\item non-finite activations are replaced with zero before probe fitting (\texttt{nan\_to\_num}) to keep full sweeps stable.
376+
\end{itemize}
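The inclusion rules for the \texttt{verdict\_token} branch can be sketched with NumPy as follows. `prepare_verdict_features` is a hypothetical helper; mapping infinities to zero (rather than to the large finite values that `nan_to_num` uses by default) is an assumption based on the statement that non-finite activations are replaced with zero.

```python
import numpy as np


def prepare_verdict_features(acts, has_generated, labels):
    """Apply the validity mask, then zero out non-finite activations.

    acts: array of shape (n_examples, hidden_dim) for one layer.
    has_generated: booleans, True where a first generated token exists.
    labels: gold labels aligned with the rows of acts.
    """
    mask = np.asarray(has_generated, dtype=bool)
    kept = np.nan_to_num(acts[mask], nan=0.0, posinf=0.0, neginf=0.0)
    return kept, np.asarray(labels)[mask]


acts = np.array([[1.0, np.nan], [np.inf, 2.0], [3.0, 4.0]])
has_generated = [True, False, True]
labels = ["True", "False", "Unknown"]
features, kept_labels = prepare_verdict_features(acts, has_generated, labels)
```

Masking before fitting keeps examples without a generated token out of the probe entirely, unlike the zero-vector fallback used in non-sweep extraction.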
293377

294378
\section{Main Micro-World Results}
295379
\subsection{Decoder Behavior vs Hidden-State Recoverability}
@@ -321,13 +405,17 @@ \subsection{Decoder Behavior vs Hidden-State Recoverability}
321405

322406
\subsection{World-Level Geometry Consistency}
323407
World-level same-vs-different label distance gaps were positive in every evaluated world for main state keys in both Qwen3.5-2B and Gemma-3-4B-it.
408+
This includes all three class-pair comparisons (\texttt{True--False}, \texttt{True--Unknown}, \texttt{False--Unknown}) in the aggregate label-pair summaries, not only pooled different-vs-same averages.
409+
So the signal is not driven by one class boundary alone.
324410

325411
\begin{figure}[H]
326412
\centering
327413
\includegraphics[width=0.72\linewidth]{figures/fig5_geometry_sign_tests.png}
328414
\caption{Fraction of worlds with positive same-vs-different label distance gap.}
329415
\end{figure}
330416

417+
For the headline state keys, sign counts are \(19/19\) (Qwen3.5-2B) and \(20/20\) (Gemma-3-4B-it), matching the exact-binomial sanity values in the methods section.
418+
331419
\subsection{Cross-Family Replication}
332420
The dissociation replicates across Qwen and Gemma:
333421
\begin{enumerate}
@@ -339,16 +427,29 @@ \subsection{Cross-Family Replication}
339427
\section{Mechanistic Controls}
340428
\subsection{Constrained Decoding}
341429
To test whether free-form decoding alone caused the issue, decoding was constrained to \{\texttt{True}, \texttt{False}, \texttt{Unknown}\}.
342-
This did not repair Unknown collapse in the main models.
430+
This did not repair Unknown collapse in the main models:
431+
\begin{itemize}
432+
\item Qwen3.5-2B and Qwen3.5-4B remained at decoder Unknown recall \(=0.0\),
433+
\item Gemma-3-4B-it remained low (Unknown recall \(=0.0125\), unchanged in matched constrained/unconstrained eval runs).
434+
\end{itemize}
435+
So the bottleneck is not reducible to unconstrained text drift.
343436

344437
\subsection{Prompt-Path Control}
345438
Raw-prompt controls were used to rule out chat-template-only explanations.
346439
The core dissociation remained.
440+
For Gemma instruct under raw prompting, decoder Unknown recall increased relative to the no-think default path, but remained materially below hidden-state recoverability, preserving the central mismatch.
347441

348442
\subsection{Base vs Instruct}
349443
Gemma base initially showed severe parse failures under an unsuitable prompt path.
350444
After repairing prompt format with a base-specific label format, the parse confound disappeared.
351445
Yet base still showed decoder Unknown collapse while probes recovered substantial Unknown signal.
446+
Numerically:
447+
\begin{itemize}
448+
\item raw base prompt path parse-failure rate was \(97.5\%\),
449+
\item repaired base-format path reduced parse failure to \(0\%\),
450+
\item repaired base-format decoder Unknown recall remained \(0.0\).
451+
\end{itemize}
452+
This isolates readout behavior from trivial formatting failures.
352453

353454
\section{Verdict-Step Label-Logit Analysis}
354455
For gold-\texttt{Unknown} decoder failures:
@@ -357,6 +458,8 @@ \section{Verdict-Step Label-Logit Analysis}
357458
\item Gemma-3-4B-pt (basefmt): mean \(P(\texttt{Unknown})=0.177\), mean margin \(m_U=-1.045\).
358459
\end{itemize}
359460
So Unknown is often present but not competitive enough at final label-token competition.
461+
Importantly, this is a \emph{readout-stage} diagnosis: Unknown is not absent from representation, but is systematically under-ranked at the verdict step on failures where gold is Unknown.
462+
That distinction is what links probe recoverability and emitted-label collapse.
360463

361464
\begin{figure}[H]
362465
\centering
@@ -371,6 +474,7 @@ \section{Layer Sweeps}
371474
\item Gemma-3-4B-pt (basefmt): \texttt{prompt\_last} 0.825 at layer 29, \texttt{verdict\_token} 0.733 at layer 28
372475
\end{itemize}
373476
This shows strong recoverable Unknown signal exists internally in both instruct and base variants, even when emitted behavior still collapses that class.
477+
The layer locations differ by model variant (early-mid for instruct vs late for basefmt in this slice), which supports a ``signal location and readout alignment'' view rather than a simple ``more scale always better'' view.
374478

375479
\begin{figure}[H]
376480
\centering
@@ -400,6 +504,10 @@ \section{Why This Should Survive Review}
400504
\item \textbf{Parse objection}: Gemma base rerun with repaired prompt format.
401505
\item \textbf{Free-decoding objection}: constrained decoding tested.
402506
\item \textbf{``No internal Unknown'' objection}: probes, geometry, logits, and layer sweeps all counter it.
507+
\item \textbf{Synthetic-task objection}: synthetic design is deliberate to obtain exact entailment labels and control lexical leakage; the claim is scoped to this setting.
508+
\item \textbf{Tokenization-artifact objection}: label-logit analysis uses multiple first-token variants (with/without leading space) and takes the strongest per-label candidate.
509+
\item \textbf{Lexical-clustering objection}: eval uses held-out lexical pools and template variants, and geometry is evaluated within world across paraphrases.
510+
\item \textbf{``Probes are not causal'' objection}: agreed; probe results establish information availability, while constrained decoding and verdict logits target the usage/readout side.
403511
\end{itemize}
404512

405513
No single analysis carries the paper; strength comes from triangulation.
@@ -416,6 +524,18 @@ \section{Practical Artifact Map}
416524
\item layer sweeps: \texttt{artifacts/micro\_world\_v1/layer\_sweep\_*/}
417525
\end{itemize}
418526

527+
\section{Implications Beyond This Benchmark}
528+
The main result has implications beyond this specific micro-world generator.
529+
If semantic non-entailment is recoverable internally while decoder outputs collapse it, then evaluation based only on emitted labels can underestimate a model's internal uncertainty structure.
530+
That matters for:
531+
\begin{itemize}
532+
\item abstention and selective prediction design,
533+
\item post-hoc confidence calibration,
534+
\item safety analysis of over-assertive outputs,
535+
\item readout-head or decoding-policy interventions that target decision alignment rather than representation learning.
536+
\end{itemize}
537+
This paper does not claim direct transfer to all tasks, but it motivates testing representation--readout gaps in other controlled domains.
538+
419539
\section{Limitations}
420540
\begin{itemize}
421541
\item The micro-world benchmark is synthetic, even though controlled and compositional.
@@ -521,4 +641,38 @@ \section{Appendix A: Full Reproduction Commands}
521641
--props-per-world 9 --paraphrases-per-prop 8
522642
\end{verbatim}
523643

644+
\section{Appendix B: Protocol Details}
645+
\subsection{Dataset Generation Protocol}
646+
The generator samples partial worlds with 4--6 entities, 2 attributes, and 2 relations per world by default.
647+
For each candidate attribute or relation assignment, the fact is sampled as explicitly positive, explicitly negative, or omitted (which yields the Unknown state).
648+
For each world, proposition sampling is quota-balanced over \texttt{True}/\texttt{False}/\texttt{Unknown} labels before paraphrase rendering.
649+
650+
\subsection{Train/Test Separation}
651+
Probe training uses only train-world manifests (\texttt{status=ok}, label in \{\texttt{True},\texttt{False},\texttt{Unknown}\}).
652+
Evaluation uses held-out eval-world manifests.
653+
No sentence-level random split is used in the reported probe tables.
654+
655+
\subsection{Probe Fitting Defaults}
656+
All reported linear probes use:
657+
\begin{itemize}
658+
\item \texttt{StandardScaler} (mean/std normalization),
659+
\item \texttt{LogisticRegression(solver=lbfgs, C=1.0, max\_iter=4000)},
660+
\item three-class label set in fixed order \(\{\texttt{True},\texttt{False},\texttt{Unknown}\}\),
661+
\item zero-division-safe precision/recall/F1 reporting.
662+
\end{itemize}
663+
664+
\subsection{Layer Sweep Inclusion Rules}
665+
For \texttt{verdict\_token} sweeps, examples with no generated token are excluded via a validity mask.
666+
For \texttt{prompt\_last} sweeps, all \texttt{status=ok} examples are included.
667+
Non-finite activations are replaced with zero prior to fitting.
668+
669+
\section{Appendix C: Definitions and Notation}
670+
\begin{itemize}
671+
\item \textbf{Unknown (semantic class):} non-entailment in the generator's three-valued semantics.
672+
\item \textbf{Unknown (decoder output):} emitted label string parsed from model output.
673+
\item \textbf{Unknown recoverability:} recall of the Unknown class under a linear probe on hidden states.
674+
\item \textbf{Representation--decoder gap:} probe Unknown recall minus decoder Unknown recall on matched eval sets.
675+
\end{itemize}
676+
These are related but non-identical quantities, and they are reported separately throughout.
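The last definition can be made concrete with a short sketch. The label lists below are toy data, not results from the paper, and `recall` and `representation_decoder_gap` are illustrative names.

```python
def recall(gold, pred, target):
    """Fraction of gold-target examples predicted as target (0.0 if none)."""
    hits = sum(1 for g, p in zip(gold, pred) if g == target and p == target)
    total = sum(1 for g in gold if g == target)
    return hits / total if total else 0.0


def representation_decoder_gap(gold, probe_pred, decoder_pred, target="Unknown"):
    """Probe recall minus decoder recall for one class on a matched eval set."""
    return recall(gold, probe_pred, target) - recall(gold, decoder_pred, target)


# Toy matched eval set: the probe finds 3 of 4 Unknowns, the decoder finds none.
gold = ["True", "False", "Unknown", "Unknown", "Unknown", "Unknown"]
probe_pred = ["True", "False", "Unknown", "Unknown", "Unknown", "False"]
decoder_pred = ["True", "False", "True", "False", "True", "False"]
gap = representation_decoder_gap(gold, probe_pred, decoder_pred)
```

Both recalls must be computed over the same examples; otherwise the difference mixes a readout effect with a sampling effect.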
677+
524678
\end{document}

paper/paper.pdf

17 KB
Binary file not shown.
34.2 KB
Binary file not shown.
