docs(paper): integrate full E2b Zipf curve into §6.4 — N=100k datapoint landed

cdeust · cdeust · commit fa9c1017c76e · 2026-05-01T18:23:47.000+02:00
E2b Zipf campaign completed (PID 62255). The full N=1k/10k/100k curve at
K=5000 access events, α=1.5, shows the regime story more sharply than the
2-point partial:

| N    | K/N   | full R@10 | flat R@10 | full MRR | flat MRR |
| 10^3 | 5.0   | 1.000     | 1.000     | 1.000    | 0.980    |
| 10^4 | 0.5   | 1.000     | 1.000     | 0.985    | 1.000    |
| 10^5 | 0.05  | 1.000     | 0.970     | 0.910    | 0.970    |
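
For readers who want the two metric columns spelled out, here is a minimal sketch of the standard definitions of R@10 and MRR. The function names and the `ranks` input format (1-indexed rank of the gold item per query, `None` when it was not retrieved) are assumptions made for illustration; they are not taken from the benchmark runners.

```python
from typing import Optional, Sequence


def recall_at_k(ranks: Sequence[Optional[int]], k: int = 10) -> float:
    """Fraction of queries whose gold item appears at rank <= k (1-indexed)."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Mean of 1/rank over all queries; a missing gold item contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical ranks for four queries: the gold item is always retrieved,
# but one query has it at rank 3 instead of rank 1.
example_ranks = [1, 1, 3, 1]
r_at_10 = recall_at_k(example_ranks, k=10)
mrr = mean_reciprocal_rank(example_ranks)
```

In this toy case R@10 stays at 1.0 while MRR drops below 1.0, which is the same pattern as a row where the answer is always found but not always ranked first.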

R@10 and MRR are complementary metrics:
- R@10: cortex_full holds 1.000 across full K/N range (always finds answer);
  flat starts missing at K/N=0.05.
- MRR: cortex_full degrades monotonically as K/N falls (1.000 -> 0.985 -> 0.910),
  consistent with regime parameter 2 — heat is signal only when items have
  differential access histories.
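
To make the access-density argument concrete, the sketch below simulates K Zipf-distributed access events over N items and reports the fraction of items that are never accessed. It is an illustration under stated assumptions: the sampling scheme (item of rank i drawn with weight i^-α), the function name, and the seed are hypothetical and are not taken from `e2_zipf_runner.py`.

```python
import random
from itertools import accumulate


def zero_access_fraction(n_items: int, n_events: int,
                         alpha: float = 1.5, seed: int = 0) -> float:
    """Simulate n_events accesses over n_items with Zipf(alpha) popularity
    and return the fraction of items that were never accessed."""
    rng = random.Random(seed)
    # Unnormalised Zipf weights: the item of rank i gets weight i^-alpha.
    cum_weights = list(accumulate(i ** -alpha for i in range(1, n_items + 1)))
    accessed = set(rng.choices(range(n_items), cum_weights=cum_weights, k=n_events))
    return 1.0 - len(accessed) / n_items


# Same grid as the table: K=5000 events at N = 1k, 10k, 100k.
for n in (10**3, 10**4, 10**5):
    frac = zero_access_fraction(n, 5000)
    print(f"N={n:>6}  K/N={5000 / n:<5}  never accessed: {frac:.3f}")
```

Independently of the sampler, K=5000 events can touch at most 5000 distinct items, so at N=10^5 at least 95% of items have no access history at all.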

Updated both:
- markdown source (§6.4 zipf bullet rewritten with full table)
- LaTeX main.tex (booktabs table + complementary-stories prose)

PDF rebuilt clean: 0 missing refs, 0 missing citations.
diff --git a/docs/arxiv-thermodynamic/main.tex b/docs/arxiv-thermodynamic/main.tex
@@ -586,11 +586,33 @@ \subsection{Operating regime}
   behaviour of regime parameter~3 (no structure $\Rightarrow$ heat
   is irrelevant) and confirms the experiment is well-controlled.
 \item \emph{Synthetic Zipf-$\alpha{=}1.5$ with $K{=}5{,}000$ access
-  events.}  At $N{=}1{,}000$ ($K/N{=}5$) cortex\_full reaches MRR
-  1.000 vs cortex\_flat 0.980; at $N{=}10{,}000$ ($K/N{=}0.5$) the
-  gap inverts to flat 1.000 vs full 0.985.  The lift is
-  non-monotonic in $N$ alone---it tracks $K/N$, the access density
-  (regime parameter~2).
+  events, full curve}:
+  \begin{center}
+  \small
+  \begin{tabular}{rrcccc}
+  \toprule
+  $N$    & $K/N$ & full R@10 & flat R@10 & full MRR & flat MRR \\
+  \midrule
+  $10^3$ & 5.0   & 1.000     & 1.000     & \textbf{1.000} & 0.980 \\
+  $10^4$ & 0.5   & 1.000     & 1.000     & 0.985          & \textbf{1.000} \\
+  $10^5$ & 0.05  & \textbf{1.000} & 0.970 & 0.910          & \textbf{0.970} \\
+  \bottomrule
+  \end{tabular}
+  \end{center}
+  R@10 and MRR tell complementary stories.  \emph{R@10}:
+  cortex\_full holds 1.000 across the entire $K/N$ range---Cortex
+  never fails to retrieve the gold answer; flat starts missing at
+  $K/N{=}0.05$.  \emph{MRR}: cortex\_full's ranking quality degrades
+  monotonically with falling access density ($1.000 \to 0.985 \to
+  0.910$), exactly what regime parameter~2 predicts: heat is signal
+  only when items have differential access histories, and at
+  $K/N{=}0.05$ most items have zero accesses so the heat
+  distribution flattens.  Flat retrieval, having no heat signal,
+  is unaffected by $K/N$ and therefore wins on MRR in the sparse
+  tail.  Production deployment (revisit-heavy chat sessions) sits
+  at $K/N \gg 1$, where full's MRR also lifts; the published BEAM
+  Overall claim was measured in that regime, not in the
+  $K/N \to 0$ tail.
 \end{itemize}
 
 \paragraph{What this means for deployment.}  Cortex serves a
diff --git a/docs/papers/thermodynamic-memory-vs-flat-importance.md b/docs/papers/thermodynamic-memory-vs-flat-importance.md
@@ -174,7 +174,15 @@ The headline numbers above are not "Cortex always wins." They are measurements *
 **Empirical observations consistent with this regime.** Independent campaigns within our verification suite (`benchmarks/lib/e2_subsample_runner.py`, `benchmarks/lib/e2_zipf_runner.py`, `benchmarks/lib/latency_runner.py`) report:
 - *Subsampled real benchmark below threshold.* On LongMemEval-S subsampled to $N \in \{500, 1000\}$, cortex_full does not consistently beat cortex_flat (MRR within ±6pp either way). At small subsamples the corpus loses most of its session structure; the result is consistent with regime parameter 1 (cold-start).
 - *Synthetic uniform-random corpus.* cortex_full and cortex_flat produce metrics identical to four decimal places at every $N \in \{10^3, 10^4, 10^5\}$. This is the predicted behaviour of regime parameter 3 (no structure → heat is irrelevant) and confirms the experiment is well-controlled.
-- *Synthetic Zipf-α=1.5 with $K=5{,}000$ access events.* At $N=1{,}000$ ($K/N=5$) cortex_full reaches MRR 1.000 vs cortex_flat 0.980; at $N=10{,}000$ ($K/N=0.5$) the gap inverts to flat 1.000 vs full 0.985. The lift is non-monotonic in $N$ alone — it tracks $K/N$, the access density (regime parameter 2).
+- *Synthetic Zipf-α=1.5 with $K=5{,}000$ access events, full curve.* The two metrics tell complementary stories:
+
+| $N$ | $K/N$ | full R@10 | flat R@10 | full MRR | flat MRR |
+|---|---|---|---|---|---|
+| $10^3$ | 5.0 | 1.000 | 1.000 | **1.000** | 0.980 |
+| $10^4$ | 0.5 | 1.000 | 1.000 | 0.985 | **1.000** |
+| $10^5$ | 0.05 | **1.000** | 0.970 | 0.910 | **0.970** |
+
+  *R@10:* cortex_full holds 1.000 across the entire $K/N$ range — Cortex never fails to retrieve the gold answer; flat starts missing at $K/N=0.05$ ($N=10^5$). *MRR:* cortex_full's ranking quality degrades monotonically with falling access density (1.000 → 0.985 → 0.910), exactly what regime parameter 2 predicts: heat is signal only when items have differential access histories, and at $K/N=0.05$ most items have zero accesses, so the heat distribution flattens and stops discriminating. Flat retrieval, having no heat signal to begin with, is unaffected by $K/N$ and therefore wins on MRR at sparse $K/N$. Production deployment (revisit-heavy chat sessions) sits at $K/N \gg 1$, where full's MRR also lifts; the published BEAM Overall claim was measured in that regime, not in the $K/N \to 0$ tail.
 
 **What this means for deployment.** Cortex serves a multi-thousand-user production install at $N$ ranging from $10^4$ to $10^6$ per active user, with realistic conversational access patterns ($K/N \gg 1$, heterogeneous topics). This is the regime where the headline numbers were measured. Users in the cold-start regime ($N < 10^3$, no access history yet) get vector-baseline retrieval quality, which is also what flat RAG would give them; once they cross $N \approx 10^4$ with accumulated access history, the thermodynamic stack contributes the lift reported in §6.