| 1 | +# E1 v3 LoCoMo — STOP gate triggered on smoke |
| 2 | + |
| 3 | +**Status:** STOPPED at smoke per task validation gate. Awaiting human decision on |
| 4 | +which design path to pursue. Harness wiring committed; full sweep NOT launched. |
| 5 | + |
| 6 | +**Date:** 2026-05-02 |
| 7 | +**Code base SHA at smoke:** ca7f9d40888065bb3b30f17e0d56bd3e68417490 (tree dirty |
| 8 | +with concurrent agent's `mcp_server/handlers/remember*.py` and tests — those |
| 9 | +files do NOT touch the benchmark code path; benchmark uses |
| 10 | +`benchmarks.lib.bench_db.BenchmarkDB` → `core.memory_ingest.ingest_memories_batch` |
| 11 | ++ `core.pg_recall.recall`, never the `remember` handler). |
| 12 | + |
| 13 | +## Pre-registered validation gate |
| 14 | + |
| 15 | +From task #55 spec: |
| 16 | +> BASELINE LoCoMo MRR: established baseline from CLAUDE.md is 0.794 R@10=0.926. |
| 17 | +> With `--with-consolidation` enabled, this should be APPROXIMATELY similar |
| 18 | +> (consolidation may shift it slightly, but should be within ±0.05 MRR). If |
| 19 | +> WAY off — STOP and diagnose. |
| 20 | + |
| 21 | +## Smoke result (n=1 conversation, 197 questions) |
| 22 | + |
| 23 | +| Run | MRR | R@10 | Wall (s) | |
| 24 | +|---|---|---|---| |
| 25 | +| `--limit 1` (no consolidation) | **0.866** | **99.0%** | 221.7 | |
| 26 | +| `--limit 1 --with-consolidation` | **0.222** | **54.8%** | 176.3 (incl. 127.7s consol) | |
| 27 | + |
| 28 | +Δ MRR = **−0.644** vs the no-consolidation anchor. |
| 29 | +Δ MRR = **−0.572** vs the published 0.794 (CLAUDE.md headline). |
| 30 | + |
| 31 | +This is **WAY off** the ±0.05 tolerance. Stop-and-diagnose triggered. |
| 32 | + |
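The gate arithmetic can be reproduced with a minimal sketch. The MRR values and the ±0.05 tolerance are copied from the table and the task spec; the helper name `gate_passes` is hypothetical and is not part of the harness.

```python
# MRR values copied from the smoke table; tolerance is the +/-0.05 gate from the spec.
PUBLISHED_MRR = 0.794
NO_CONSOLIDATION_MRR = 0.866
WITH_CONSOLIDATION_MRR = 0.222
TOLERANCE = 0.05


def gate_passes(observed: float, anchor: float, tolerance: float = TOLERANCE) -> bool:
    """Hypothetical helper: True when observed MRR is within +/-tolerance of anchor."""
    return abs(observed - anchor) <= tolerance


delta_vs_anchor = WITH_CONSOLIDATION_MRR - NO_CONSOLIDATION_MRR
delta_vs_published = WITH_CONSOLIDATION_MRR - PUBLISHED_MRR

print(f"delta vs no-consolidation anchor: {delta_vs_anchor:+.3f}")
print(f"delta vs published headline:      {delta_vs_published:+.3f}")
print("gate passes:", gate_passes(WITH_CONSOLIDATION_MRR, PUBLISHED_MRR))
```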
| 33 | +## Diagnosis |
| 34 | + |
| 35 | +**Root cause: compression cycle fires on LoCoMo's old timestamps.** |
| 36 | + |
| 37 | +LoCoMo session dates are from May–November 2023 (real conversation timestamps |
| 38 | +preserved by the dataset). Current wall date is 2026-05-02, so loaded memories |
| 39 | +have `created_at` ≈ 3 years old. |
| 40 | + |
| 41 | +Consolidation defaults |
| 42 | +(`mcp_server/infrastructure/memory_config.py`): |
| 43 | +- `COMPRESSION_GIST_AGE_HOURS = 168.0` (7 days) → memories older than 7d get |
| 44 | + full-text → gist replacement. |
| 45 | +- `COMPRESSION_TAG_AGE_HOURS = 720.0` (30 days) → memories older than 30d get |
| 46 | + gist → tag replacement. |
| 47 | + |
| 48 | +LoCoMo memories are 3 years old at ingest time, so compression IMMEDIATELY |
| 49 | +collapses them to tags after one consolidation pass. Recall against the verbatim |
| 50 | +question text then misses, because the original session content has been |
| 51 | +replaced by terse tag form. |
| 52 | + |
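A minimal sketch of the age gating described above, assuming the tier is chosen from wall-clock age alone. The two thresholds are the values quoted from `memory_config.py`; the function `compression_tier` and the sample dates are hypothetical and only illustrate the collision, they are not the actual consolidation code.

```python
from datetime import datetime, timezone

# Threshold values as quoted in this report.
COMPRESSION_GIST_AGE_HOURS = 168.0  # 7 days
COMPRESSION_TAG_AGE_HOURS = 720.0   # 30 days


def compression_tier(created_at: datetime, now: datetime) -> str:
    """Hypothetical helper: pick a compression tier from wall-clock age only."""
    age_hours = (now - created_at).total_seconds() / 3600.0
    if age_hours > COMPRESSION_TAG_AGE_HOURS:
        return "tag"
    if age_hours > COMPRESSION_GIST_AGE_HOURS:
        return "gist"
    return "full"


now = datetime(2026, 5, 2, tzinfo=timezone.utc)
locomo_session = datetime(2023, 5, 8, tzinfo=timezone.utc)  # illustrative 2023 date
fresh_memory = datetime(2026, 5, 1, tzinfo=timezone.utc)    # one day old

print(compression_tier(locomo_session, now))  # tag
print(compression_tier(fresh_memory, now))    # full
```

Under this assumption, any memory stamped in 2023 is already past the 30-day threshold at load time, which matches the immediate collapse seen in the smoke run.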
| 53 | +This is a real architectural collision between: |
| 54 | +- Production cadence assumption: memories accumulate over wall-clock time, |
| 55 | + consolidation gradually compresses old ones. |
| 56 | +- Benchmark loading pattern: load N sessions instantly, run consolidation, query. |
| 57 | + If session timestamps are old, consolidation treats them as fully aged. |
| 58 | + |
| 59 | +This is the SAME shape of architectural mismatch the LME-S audit identified, but |
| 60 | +in the opposite direction: LME-S did not exercise consolidation-only mechanisms |
| 61 | +at all (so they showed Δ=0); LoCoMo exercises them so aggressively at first contact |
| 62 | +that they destroy the retrieval signal. Neither benchmark, as currently |
| 63 | +instrumented, isolates per-mechanism effects honestly. |
| 64 | + |
| 65 | +## Why pre-spec'd "single BASELINE_WITH_CONSOLIDATION" design is unsafe |
| 66 | + |
| 67 | +If we take the 0.222 anchor as baseline, then any ablation that disables a stage |
| 68 | +contributing to the destruction (e.g., MICROGLIAL_PRUNING, compression-adjacent |
| 69 | +parts of CASCADE) will show a LARGE POSITIVE ΔMRR — but that signal will |
| 70 | +conflate two distinct effects: |
| 71 | + |
| 72 | +1. "Mechanism X contributes to retrieval" (the desired ablation reading). |
| 73 | +2. "Mechanism X is the proximate cause of LoCoMo-timestamp-driven destruction" |
| 74 | + (a benchmark-instrumentation artifact). |
| 75 | + |
| 76 | +Reporting these as 1-row mechanism contributions in §6.3 of the paper would |
| 77 | +overstate the case. The corresponding LME-S rows for the same mechanisms showed |
| 78 | +Δ=0 because consolidation never fired; reporting Δ ≈ +0.4 on LoCoMo without the |
| 79 | +context "the ablation rescued recall from a benchmark-induced collapse" is |
| 80 | +not honest evidence for the paper claim. |
| 81 | + |
| 82 | +## Decision matrix (FOR HUMAN REVIEW BEFORE SWEEP LAUNCH) |
| 83 | + |
| 84 | +### Option A — Single BASELINE_WITH_CONSOLIDATION (as spec'd) — 13 rows |
| 85 | +- Honors task spec literally. |
| 86 | +- Anchor MRR ≈ 0.22. |
| 87 | +- Risk: ablation deltas conflate mechanism contribution with destruction-rescue. |
| 88 | + Each row's writeup MUST disclose this confound; paper §6.3 cannot use these |
| 89 | + numbers as standalone evidence for "mechanism X contributes Y MRR." |
| 90 | +- Useful if the goal is to document the LoCoMo+consolidation interaction |
| 91 | + rather than to make a per-mechanism contribution claim. |
| 92 | + |
| 93 | +### Option B — Two baselines — 14 rows (RECOMMENDED) |
| 94 | +- BASELINE_NO_CONSOLIDATION (≈0.866 anchor): used for all longitudinal |
| 95 | + read-path mechanisms (RECONSOLIDATION, CO_ACTIVATION, ADAPTIVE_DECAY) since |
| 96 | + these mechanisms do NOT depend on a consolidation pass — they accumulate |
| 97 | + state via cross-question reads. |
| 98 | +- BASELINE_WITH_CONSOLIDATION (≈0.22 anchor): used for the 8 consolidation-only |
| 99 | + mechanisms (CASCADE, INTERFERENCE, HOMEOSTATIC_PLASTICITY, SYNAPTIC_PLASTICITY, |
| 100 | + MICROGLIAL_PRUNING, TWO_STAGE_MODEL, EMOTIONAL_DECAY, TRIPARTITE_SYNAPSE) and |
| 101 | + SCHEMA_ENGINE — these can only fire if consolidation fires. |
| 102 | +- Ablation deltas reported against the relevant baseline, with the architectural |
| 103 | + finding stated alongside. Honest per-mechanism evidence preserved. |
| 104 | +- Sweep cost: 14 rows (4 × ~37 min + 10 × ~30 min) ≈ 7.5 hours. |
| 105 | + |
| 106 | +### Option C — Decouple consolidation cadence from wall-clock age |
| 107 | +- Change `COMPRESSION_GIST_AGE_HOURS`/`COMPRESSION_TAG_AGE_HOURS` to gate on |
| 108 | + access-count or relative recency rather than absolute timestamp diff. |
| 109 | +- Or: in benchmark mode, override `created_at` to be relative to the benchmark |
| 110 | + start so memories appear "fresh" to consolidation. |
| 111 | +- Out of scope for task #55 — task explicitly says "DO NOT modify |
| 112 | + recall_pipeline.py constants — Phase A+B calibration locked." Same prohibition |
| 113 | + reasonably extends to consolidation cadence constants without a separate task. |
| 114 | + |
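For reference only, the `created_at` override mentioned under Option C could look roughly like the sketch below. The function `rebase_timestamps` is hypothetical, nothing like it exists in the harness, and it is not proposed for task #55. It shifts every timestamp by a single offset so the newest memory lands at the benchmark start and the gaps between sessions are preserved.

```python
from datetime import datetime, timedelta, timezone


def rebase_timestamps(created_at: list[datetime], benchmark_start: datetime) -> list[datetime]:
    """Hypothetical helper: shift all timestamps so the newest equals benchmark_start."""
    if not created_at:
        return []
    offset = benchmark_start - max(created_at)
    return [ts + offset for ts in created_at]


first = datetime(2023, 5, 8, tzinfo=timezone.utc)  # illustrative session date
sessions = [first, first + timedelta(days=2)]
start = datetime(2026, 5, 2, tzinfo=timezone.utc)

for original, rebased in zip(sessions, rebase_timestamps(sessions, start)):
    print(original.date(), "->", rebased.date())
```

Because the gaps are preserved, a session more than 7 or 30 days older than the newest one would still cross the gist or tag threshold after rebasing, so this variant alone would not make every memory look fresh.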
| 115 | +### Option D — Drop the LoCoMo half — keep LME-S only |
| 116 | +- Reverts to the 17-row evidence base from `de1d316`. |
| 117 | +- Honest: "we couldn't isolate consolidation-only mechanism contributions on |
| 118 | + any benchmark currently in our suite without confound." |
| 119 | +- The mechanisms remain in the codebase, but unsupported by ablation evidence |
| 120 | + in §6.3. The paper would say so. |
| 121 | + |
| 122 | +## Wall-budget data (pre-recorded in case we proceed) |
| 123 | + |
| 124 | +- Per-conversation wall (with consolidation): 176.3s |
| 125 | + - Of which consolidation: 127.7s |
| 126 | + - Of which ingest+QA: 48.6s |
| 127 | +- Per-conversation wall (no consolidation): 221.7s (more time is spent in the QA loop, |
| 128 | + presumably because no records are compressed, so the candidate set is denser). |
| 129 | +- Full LoCoMo (10 conversations) per row, with consolidation: ≈30 min. |
| 130 | +- Full LoCoMo (10 conversations) per row, no consolidation: ≈37 min. |
| 131 | +- Option A (13 rows × 30 min) ≈ 6.5 h. |
| 132 | +- Option B (1 NO + 1 WITH baseline + 3 longitudinal vs NO + 9 consol vs WITH = |
| 133 | + 14 rows: 4 × 37 min + 10 × 30 min) ≈ 7.5 h. |
| 134 | + |
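The budget figures above follow from simple arithmetic, sketched here. The per-conversation timings come from the smoke run, and the 30 and 37 minute per-row values are the rounded estimates used in the list; the function names are hypothetical. Extrapolating one conversation to ten assumes the other conversations cost about the same, which the smoke run does not confirm.

```python
def row_minutes(per_conversation_s: float, n_conversations: int = 10) -> float:
    """Minutes for one sweep row, extrapolated from a single smoke conversation."""
    return per_conversation_s * n_conversations / 60.0


def sweep_hours(rows_no: int, rows_with: int,
                min_no: float = 37.0, min_with: float = 30.0) -> float:
    """Total hours given row counts without and with consolidation."""
    return (rows_no * min_no + rows_with * min_with) / 60.0


print(f"per row, with consolidation: {row_minutes(176.3):.1f} min")
print(f"per row, no consolidation:   {row_minutes(221.7):.1f} min")
print(f"Option A (13 rows WITH):          {sweep_hours(0, 13):.2f} h")
print(f"Option B (4 rows NO, 10 rows WITH): {sweep_hours(4, 10):.2f} h")
```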
| 135 | +## Smoke artifacts |
| 136 | + |
| 137 | +- `benchmarks/results/ablation/locomo_v3_smoke/SMOKE_NO_CONSOLIDATION.json` |
| 138 | +- `benchmarks/results/ablation/locomo_v3_smoke/SMOKE_WITH_CONSOLIDATION.json` |
| 139 | + |
| 140 | +## Harness wiring committed (independent of sweep decision) |
| 141 | + |
| 142 | +`benchmarks/locomo/run_benchmark.py` now supports `--with-consolidation`, |
| 143 | +`--ablate MECH`, `--results-out PATH`, mirroring the LME-S harness signature. |
| 144 | +Manifest block in `--results-out` JSON includes `with_consolidation`, |
| 145 | +`ablate_mechanism`, `ablate_env_var`, `n_conversations`, `n_questions`, |
| 146 | +`consolidation_call_count`, `consolidation_total_wall_s`. |
| 147 | + |
| 148 | +## What I have NOT done |
| 149 | + |
| 150 | +- Have not launched the 13-row sweep. |
| 151 | +- Have not written `benchmarks/lib/run_e1_v3_locomo.py` driver yet (depends on |
| 152 | + baseline-design decision). |
| 153 | +- Have not modified retrieval constants or consolidation-cadence constants |
| 154 | + (per task constraint and Zetetic standard — no source for arbitrary |
| 155 | + constant change). |
| 156 | +- Have not committed any code from the dirty tree (concurrent agent's work). |
| 157 | + |
| 158 | +## Recommendation |
| 159 | + |
| 160 | +Option B (two baselines, 14 rows). Strongest scientific design that respects |
| 161 | +the architectural finding and the task constraint not to mutate constants. The |
| 162 | +extra row is an honest acknowledgment that LoCoMo+consolidation has a confound, |
| 163 | +not a workaround. |
| 164 | + |
| 165 | +If the human reviewer prefers Option A for §6.3 page-budget reasons, document |
| 166 | +the confound in every consolidation-only row's writeup explicitly. |
| 167 | + |
| 168 | +If Option D, document why the LME-S evidence stands alone. |