You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
@@ -65,6 +65,13 @@ \section{Scope and Final Claim}
65
65
66
66
The main contribution is \emph{not} a universal scalar ``topology of truth.''
67
67
The stronger result is a readout-bottleneck interpretation supported by multiple controls.
68
+
To avoid over-claiming, we explicitly do \emph{not} claim:
69
+
\begin{itemize}
70
+
\item a universal geometry-of-truth detector across unrestricted tasks,
71
+
\item a complete causal account of decoder behavior,
72
+
\item or that topology alone is the primary predictive tool in natural reasoning traces.
73
+
\end{itemize}
74
+
The claim is intentionally task-conditioned and mechanism-narrow.
68
75
69
76
\section{Why the Original Idea Failed}
70
77
The original working hypothesis was that correct reasoning might have a cleaner global geometric or topological signature than incorrect reasoning \cite{carlsson2009,bauer2021ripser}.
@@ -94,8 +101,19 @@ \section{Why the Original Idea Failed}
94
101
It prevented a misleading positive claim and forced a cleaner task design.
95
102
96
103
\section{Experimental Trajectory}
104
+
\subsection{Why These Three Phases Belong in One Paper}
105
+
The paper is not a grab-bag chronology.
106
+
It is a causal progression in experimental design:
107
+
\begin{enumerate}
108
+
\item\textbf{Phase A falsified} the original global-scalar hypothesis under realistic trace conditions.
109
+
\item\textbf{Phase B diagnosed} the dominant confound in that setting (non-convergence under fixed decoding).
110
+
\item\textbf{Phase C redesigned} the task to isolate semantics directly, enabling a sharper representation-vs-readout test.
111
+
\end{enumerate}
112
+
The positive result is only interpretable in light of that redesign logic.
113
+
97
114
\subsection{Phase A: Global Topology on GSM8K (Negative)}
98
115
Small paired runs on Qwen3.5-0.8B and Qwen3.5-2B produced weak dynamic-$H_0$ signals on non-capped 2B traces.
116
+
The primary dynamic pilot used 10 paired 2B samples, with one capped trace removed for the non-capped comparison set (\(n=9\): 4 correct, 5 wrong).
99
117
Representative pilot numbers were:
100
118
\begin{itemize}
101
119
\item\texttt{h0\_entropy\_final AUC = 0.55}
@@ -112,6 +130,7 @@ \subsection{Phase A: Global Topology on GSM8K (Negative)}
112
130
\end{itemize}
113
131
114
132
These numbers are not compatible with a strong correctness-prediction claim.
133
+
We treat them as descriptive pilot outcomes, not inferential proof that topology is universally uninformative.
The project became scientifically cleaner only after replacing benchmark reasoning with a procedurally generated semantic task, in line with controlled-evaluation guidance from recent LM analysis literature \cite{liang2022holistic}.
Each world yields 72 examples: 9 propositions times 8 paraphrases.
161
182
The main sweeps used 20 train worlds and 20 eval worlds.
183
+
The generator itself produces larger split files (default 100/25/100 worlds for train/dev/eval), and analysis subsets are explicitly selected from those outputs.
184
+
185
+
Critically, anti-shortcut controls are built into generation:
186
+
\begin{itemize}
187
+
\item split-specific nonce lexicon pools are disjoint for entities/attributes/relations,
188
+
\item eval paraphrases use template variants 4--7, while train uses 0--3 (dev uses overlap variants by design for intermediate stress),
189
+
\item per-world proposition sampling is label-balanced over \texttt{True}/\texttt{False}/\texttt{Unknown}.
190
+
\end{itemize}
191
+
These details target reviewer concerns about lexical leakage and template-only clustering.
162
192
163
193
\section{Task Definition}
164
194
Let a world be a finite structured state
@@ -182,24 +212,35 @@ \section{Task Definition}
182
212
It is exact non-entailment under the generator.
183
213
184
214
\subsection{Worked Micro-World Example}
185
-
Table~\ref{tab:world_example} shows one illustrative world/query slice.
215
+
Table~\ref{tab:world_facts} shows a compact latent world.
216
+
Table~\ref{tab:world_queries} then shows one proposition per label class, plus a second paraphrase for one query to make template variation explicit.
186
217
187
218
\begin{table}[H]
188
219
\centering
189
-
\caption{Worked micro-world example (illustrative format).}
If \(m_U < 0\), Unknown is under-ranked against the strongest non-Unknown candidate.
361
+
Canonical label-token scoring uses first-token variants for each label string with and without leading space.
362
+
For each label, we keep unique first-token IDs and use the \emph{maximum} first-token log-prob across those variants.
363
+
This reduces tokenizer-surface artifacts from a single textual form.
286
364
287
365
\subsection{Layer Sweeps}
288
366
For each layer \(\ell\), evaluate the same probe protocol and report
289
367
\[
290
368
R_U^{(\ell)} = \text{Unknown recall of the probe at layer }\ell.
291
369
\]
292
370
This reveals where non-entailment is maximally linearly recoverable in the network.
371
+
Sweep implementation details:
372
+
\begin{itemize}
373
+
\item\texttt{prompt\_last}: extracted for every example and every layer,
374
+
\item\texttt{verdict\_token}: extracted at the first generated token; examples with no generated token are excluded from this branch via a validity mask,
375
+
\item non-finite activations are replaced with zero before probe fitting (\texttt{nan\_to\_num}) to keep full sweeps stable.
376
+
\end{itemize}
293
377
294
378
\section{Main Micro-World Results}
295
379
\subsection{Decoder Behavior vs Hidden-State Recoverability}
@@ -321,13 +405,17 @@ \subsection{Decoder Behavior vs Hidden-State Recoverability}
321
405
322
406
\subsection{World-Level Geometry Consistency}
323
407
World-level same-vs-different label distance gaps were positive in every evaluated world for main state keys in both Qwen3.5-2B and Gemma-3-4B-it.
408
+
This includes all three class-pair comparisons (\texttt{True--False}, \texttt{True--Unknown}, \texttt{False--Unknown}) in the aggregate label-pair summaries, not only pooled different-vs-same averages.
409
+
So the signal is not driven by one class boundary alone.
\caption{Fraction of worlds with positive same-vs-different label distance gap.}
329
415
\end{figure}
330
416
417
+
For the headline state keys, sign counts are \(19/19\) (Qwen3.5-2B) and \(20/20\) (Gemma-3-4B-it), matching the exact-binomial sanity values in the methods section.
418
+
331
419
\subsection{Cross-Family Replication}
332
420
The dissociation replicates across Qwen and Gemma:
To test whether free-form decoding alone caused the issue, decoding was constrained to \{\texttt{True}, \texttt{False}, \texttt{Unknown}\}.
342
-
This did not repair Unknown collapse in the main models.
430
+
This did not repair Unknown collapse in the main models:
431
+
\begin{itemize}
432
+
\item Qwen3.5-2B and Qwen3.5-4B remained at decoder Unknown recall \(=0.0\),
433
+
\item Gemma-3-4B-it remained low (Unknown recall \(=0.0125\), unchanged in matched constrained/unconstrained eval runs).
434
+
\end{itemize}
435
+
So the bottleneck is not reducible to unconstrained text drift.
343
436
344
437
\subsection{Prompt-Path Control}
345
438
Raw-prompt controls were used to rule out chat-template-only explanations.
346
439
The core dissociation remained.
440
+
For Gemma instruct under raw prompting, decoder Unknown recall increased relative to the no-think default path, but remained materially below hidden-state recoverability, preserving the central mismatch.
347
441
348
442
\subsection{Base vs Instruct}
349
443
Gemma base initially showed severe parse failures under an unsuitable prompt path.
350
444
After repairing prompt format with a base-specific label format, the parse confound disappeared.
351
445
Yet base still showed decoder Unknown collapse while probes recovered substantial Unknown signal.
446
+
Numerically:
447
+
\begin{itemize}
448
+
\item raw base prompt path parse-failure rate was \(97.5\%\),
449
+
\item repaired base-format path reduced parse failure to \(0\%\),
450
+
\item repaired base-format decoder Unknown recall remained \(0.0\).
451
+
\end{itemize}
452
+
This isolates readout behavior from trivial formatting failures.
\item Gemma-3-4B-pt (basefmt): mean \(P(\texttt{Unknown})=0.177\), mean margin \(m_U=-1.045\).
358
459
\end{itemize}
359
460
So Unknown is often present but not competitive enough at final label-token competition.
461
+
Importantly, this is a \emph{readout-stage} diagnosis: Unknown is not absent from representation, but is systematically under-ranked at the verdict step on failures where gold is Unknown.
462
+
That distinction is what links probe recoverability and emitted-label collapse.
360
463
361
464
\begin{figure}[H]
362
465
\centering
@@ -371,6 +474,7 @@ \section{Layer Sweeps}
371
474
\item Gemma-3-4B-pt (basefmt): \texttt{prompt\_last} 0.825 at layer 29, \texttt{verdict\_token} 0.733 at layer 28
372
475
\end{itemize}
373
476
This shows strong recoverable Unknown signal exists internally in both instruct and base variants, even when emitted behavior still collapses that class.
477
+
The layer locations differ by model variant (early-mid for instruct vs late for basefmt in this slice), which supports a ``signal location and readout alignment'' view rather than a simple ``more scale always better'' view.
374
478
375
479
\begin{figure}[H]
376
480
\centering
@@ -400,6 +504,10 @@ \section{Why This Should Survive Review}
400
504
\item\textbf{Parse objection}: Gemma base rerun with repaired prompt format.
\item\textbf{``No internal Unknown'' objection}: probes, geometry, logits, and layer sweeps all counter it.
507
+
\item\textbf{Synthetic-task objection}: synthetic design is deliberate to obtain exact entailment labels and control lexical leakage; the claim is scoped to this setting.
508
+
\item\textbf{Tokenization-artifact objection}: label-logit analysis uses multiple first-token variants (with/without leading space) and takes the strongest per-label candidate.
509
+
\item\textbf{Lexical-clustering objection}: eval uses held-out lexical pools and template variants, and geometry is evaluated within world across paraphrases.
510
+
\item\textbf{``Probes are not causal'' objection}: agreed; probe results establish information availability, while constrained decoding and verdict logits target the usage/readout side.
403
511
\end{itemize}
404
512
405
513
No single analysis carries the paper; strength comes from triangulation.
The main result has implications beyond this specific micro-world generator.
529
+
If semantic non-entailment is recoverable internally while decoder outputs collapse it, then evaluation based only on emitted labels can underestimate a model's internal uncertainty structure.
530
+
That matters for:
531
+
\begin{itemize}
532
+
\item abstention and selective prediction design,
533
+
\item post-hoc confidence calibration,
534
+
\item safety analysis of over-assertive outputs,
535
+
\item readout-head or decoding-policy interventions that target decision alignment rather than representation learning.
536
+
\end{itemize}
537
+
This paper does not claim direct transfer to all tasks, but it motivates testing representation--readout gaps in other controlled domains.
538
+
419
539
\section{Limitations}
420
540
\begin{itemize}
421
541
\item The micro-world benchmark is synthetic, even though controlled and compositional.
@@ -521,4 +641,38 @@ \section{Appendix A: Full Reproduction Commands}
521
641
--props-per-world 9 --paraphrases-per-prop 8
522
642
\end{verbatim}
523
643
644
+
\section{Appendix B: Protocol Details}
645
+
\subsection{Dataset Generation Protocol}
646
+
The generator samples partial worlds with 4--6 entities, 2 attributes, and 2 relations per world by default.
647
+
Per relation/attribute assignment, facts are sampled as explicit positive, explicit negative, or omitted (Unknown) states.
648
+
For each world, proposition sampling is quota-balanced over \texttt{True}/\texttt{False}/\texttt{Unknown} labels before paraphrase rendering.
649
+
650
+
\subsection{Train/Test Separation}
651
+
Probe training uses only train-world manifests (\texttt{status=ok}, label in \{\texttt{True},\texttt{False},\texttt{Unknown}\}).
652
+
Evaluation uses held-out eval-world manifests.
653
+
No sentence-level random split is used in the reported probe tables.
0 commit comments