| Knowledge updates | 0.925 | 100.0% | Heat decay naturally surfaces the newest version of a fact |
+| Temporal reasoning | 0.926 | 98.5% | Time anchors embedded directly in memory content |
+| Single-session (user) | 0.814 | 94.3% | User phrasing varies more than assistant responses |
+| Single-session (preference) | 0.668 | 93.3% | Preferences are implicit, so harder to retrieve by keyword |

Knowledge updates scored highest because heat-based decay naturally pushes newer information above older versions of the same fact. This wasn't designed for the benchmark. It's just how the thermodynamic model works.
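
The effect described above can be illustrated with a minimal sketch. It assumes heat decays exponentially with a fixed half-life; the source does not state the actual decay law, and `Memory`, `fact_key`, `heat`, and the 30-day half-life are hypothetical names and values chosen for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    fact_key: str    # hypothetical identifier grouping versions of the same fact
    content: str
    age_days: float  # time since the memory was written or last reinforced


def heat(age_days: float, half_life_days: float = 30.0) -> float:
    """Assumed decay law: heat halves every `half_life_days`."""
    return math.exp(-math.log(2) * age_days / half_life_days)


def hottest_version(memories: list[Memory], fact_key: str) -> Memory:
    """Return the version of a fact with the highest heat, which is the newest one."""
    versions = [m for m in memories if m.fact_key == fact_key]
    return max(versions, key=lambda m: heat(m.age_days))


memories = [
    Memory("user.city", "User lives in Berlin", age_days=200.0),
    Memory("user.city", "User moved to Lisbon", age_days=3.0),
]
print(hottest_version(memories, "user.city").content)  # User moved to Lisbon
```

Because heat falls monotonically with age under this assumption, the most recently written version of a fact always outranks older ones without any benchmark-specific logic.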
@@ -152,16 +152,16 @@ LoCoMo (Maharana et al., ACL 2024): 1,986 questions across 10 conversations, inc

| | Cortex | What it means |
|---|---|---|
-| Recall@10 | **92.6%** | Right memory in top 10 over 9 times out of 10 |
-| MRR | **0.794** | Correct answer is typically the first result |
+| Recall@10 | **94.2%** | Right memory in top 10 over 9 times out of 10 (n=1986, BASELINE_NO_CONSOLIDATION) |
+| MRR | **0.8278** | Correct answer is typically the first result |
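
For readers unfamiliar with the two metrics in this table, the following is a minimal sketch of how they are conventionally computed. It assumes Recall@10 is scored as a hit rate (a question counts as a hit if any relevant memory appears in the top 10), which matches the "right memory in top 10" description; the function names and the toy data are hypothetical and are not taken from the benchmark harness.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """1.0 if any relevant id appears in the top k results, else 0.0."""
    return 1.0 if any(r in relevant_ids for r in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, r in enumerate(ranked_ids, start=1):
        if r in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=10):
    """Average both metrics over (ranked_ids, relevant_ids) pairs."""
    n = len(runs)
    recall = sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / n
    mrr = sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / n
    return recall, mrr


# Toy data: three questions, each with a ranked result list and a gold set.
runs = [
    (["m3", "m1", "m7"], {"m1"}),  # relevant memory at rank 2
    (["m2", "m9"], {"m2"}),        # relevant memory at rank 1
    (["m4", "m5"], {"m8"}),        # relevant memory not retrieved
]
recall, mrr = evaluate(runs)
print(f"Recall@10={recall:.3f} MRR={mrr:.3f}")
```

An MRR close to 1.0 means the first relevant memory usually sits at rank 1, while a value near 0.5 would mean it typically sits at rank 2.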
| Single-hop | 0.741|94.0% | Direct factual questions — strong but room to improve |
+| Temporal | 0.577 | 78.3% | "When did X happen?" is the hardest category; needs better time-series matching |

No LLM at query time. No API calls. Just a 22MB embedding model, PostgreSQL with pgvector, and neuroscience algorithms doing the heavy lifting. Five retrieval signals fused server-side (vector similarity, full-text search, trigram matching, thermodynamic heat, recency), then reranked by a cross-encoder.
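
The fuse-then-rerank flow described above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the source does not say how the five signals are combined, so the weighted sum, the `WEIGHTS` values, the `Candidate` type, and the `pool_size`/`top_k` parameters are all hypothetical. The `overlap_reranker` function is a plain word-overlap stand-in for a cross-encoder so that the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable

SIGNALS = ("vector", "fulltext", "trigram", "heat", "recency")

# Hypothetical weights; the source does not state how the signals are fused.
WEIGHTS = {"vector": 0.4, "fulltext": 0.2, "trigram": 0.1, "heat": 0.2, "recency": 0.1}


@dataclass
class Candidate:
    memory_id: str
    text: str
    scores: dict[str, float]  # each signal assumed pre-normalised to [0, 1]


def fused_score(c: Candidate) -> float:
    """Weighted sum of the five signals; missing signals count as 0."""
    return sum(WEIGHTS[s] * c.scores.get(s, 0.0) for s in SIGNALS)


def retrieve(query: str, candidates: list[Candidate],
             rerank: Callable[[str, str], float],
             pool_size: int = 50, top_k: int = 10) -> list[Candidate]:
    """Stage 1: keep the best `pool_size` by fused score. Stage 2: rerank the pool."""
    pool = sorted(candidates, key=fused_score, reverse=True)[:pool_size]
    return sorted(pool, key=lambda c: rerank(query, c.text), reverse=True)[:top_k]


def overlap_reranker(query: str, text: str) -> float:
    """Stand-in for a cross-encoder: fraction of query words present in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


candidates = [
    Candidate("a", "User moved to Lisbon in March",
              {"vector": 0.9, "fulltext": 0.8, "trigram": 0.5, "heat": 0.9, "recency": 0.9}),
    Candidate("b", "User likes espresso", {"vector": 0.2, "heat": 0.5}),
    Candidate("c", "User lived in Berlin",
              {"vector": 0.85, "fulltext": 0.6, "trigram": 0.4, "heat": 0.1, "recency": 0.1}),
]
results = retrieve("where did the user move to", candidates, overlap_reranker,
                   pool_size=2, top_k=2)
print([c.memory_id for c in results])
```

Keeping the cheap fused score as a first stage limits how many candidates the more expensive reranker has to score, which is the usual motivation for a two-stage design of this kind.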
@@ -383,6 +383,16 @@ ruff format --check . # Format

---

+## Verification
+
+Every benchmark headline number above is backed by a per-mechanism ablation campaign, with each mechanism evaluated on the benchmark suited to its mechanism of action. The campaign comprises three artefact sets at full n on a single-seed protocol, with code SHAs, dirty flags, manifests, and per-row JSON outputs preserved alongside the writeups:
+
+- **LongMemEval-S, 17 rows, n=500**: `tasks/e1-v3-results.md`. Per-mechanism deltas across the integrated stack at the calibrated equilibrium; category-specialization analysis.
+- **LoCoMo, 14 rows, n=1986 (post-plasticity-fix bytes)**: `tasks/e1-v3-locomo-results-post-fix.md`. Re-run on commit `2f45bcb` (descendant of plasticity result-shape fix `5f737fe`); cadence-fix anchor agreement re-validated identically (ΔvsNO = +0.0014); two consolidation-only rows (HOMEOSTATIC_PLASTICITY, SCHEMA_ENGINE) recover positive contributions previously masked by the contract bug.
+
+Total: 45 per-mechanism evidence rows. The full paper, including the §6.3 per-mechanism evidence section and §6.3.4.1 plasticity-fix re-run subsection, is at `docs/arxiv-thermodynamic/main.pdf`.