
Commit 308ed41

chore: post-v3.15.1 audit-trail (uv.lock bump + LoCoMo investigation notes)
1 parent ff1a64a commit 308ed41

3 files changed

Lines changed: 293 additions & 1 deletion


tasks/e1-v3-locomo-resume-stop.md

Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
1+
# E1 v3 LoCoMo — STOP at sweep launch (concurrent product fix in flight)
2+
3+
**Status:** STOPPED before launching 14-row sweep. Harness flags committed
4+
(`b68c5ac`); driver written (`benchmarks/lib/run_e1_v3_locomo.py`,
5+
uncommitted). Awaiting human decision.
6+
7+
**Date:** 2026-05-03
8+
**Code base SHA at stop:** `b68c5ac` (HEAD on main).
9+
10+
## What was done
11+
12+
1. Step 1 committed cleanly: `b68c5ac feat(verif): --ablate +
13+
--with-consolidation flags for LoCoMo harness`. 167 LOC, no push.
14+
2. Step 2 written: `benchmarks/lib/run_e1_v3_locomo.py` driver, 14 rows,
15+
two-baseline design per Option B in `tasks/e1-v3-locomo-smoke-finding.md`.
16+
Mirrors LME-S driver structure (commit `ca7f9d4`). Currently uncommitted.
17+
3. First launch attempt at 12:27:12 UTC: subprocess started running but
18+
exited within ~10 minutes with no traceback and no per-conversation
19+
progress lines. Likely the parent shell's session was reaped under
20+
the Bash tool's worktree boundary; this is a launch-mechanism issue
21+
independent of the design.
22+
4. Second launch attempt at ~12:35 UTC: driver's dirty-tree gate fired
23+
(`[FATAL] tree is dirty; refusing to launch`). Inspection showed 6
24+
files modified in `mcp_server/core/` and `mcp_server/infrastructure/`
25+
that did NOT exist when the harness commit was made (`b68c5ac`).
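A dirty-tree gate of the kind described in step 4 could look roughly like the sketch below. The helper names (`is_dirty`, `require_clean_tree`) are hypothetical and the real driver's gate may be structured differently; only the `[FATAL]` message is taken from the log above.

```python
import subprocess
import sys


def is_dirty(porcelain_output: str) -> bool:
    """True if `git status --porcelain` listed any modified or untracked path."""
    return any(line.strip() for line in porcelain_output.splitlines())


def require_clean_tree(repo_dir: str = ".") -> str:
    """Return the HEAD SHA when the tree is clean; exit with an error otherwise.

    Hypothetical helper, written to illustrate the gate, not copied from the driver.
    """
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    if is_dirty(status):
        sys.exit("[FATAL] tree is dirty; refusing to launch")
    # A clean tree means a single SHA fully describes the code under test.
    return subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout.strip()
```

Splitting the parse step from the `git` call keeps the decision logic testable without a repository on disk.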
26+
27+
## What stopped me
28+
29+
A concurrent worktree agent is mid-flight implementing the exact follow-up
30+
product fix referenced in `e1-v3-locomo-smoke-finding.md`:
31+
32+
```
mcp_server/core/compression.py         | 46 +++++++++++++++++++++---------
mcp_server/core/decay_cycle.py         | 52 ++++++++++++++++++++++++++++------
mcp_server/core/write_gate.py          | 10 +++++--
mcp_server/core/write_post_store.py    |  9 +++++-
mcp_server/infrastructure/pg_schema.py | 27 ++++++++++++++++++
mcp_server/infrastructure/pg_store.py  |  2 +-
6 files changed, 119 insertions(+), 27 deletions(-)
```
41+
42+
Diff sample from `compression.py`:
43+
44+
> Compression cadence asks "has this memory had time to be revisited
> in MY system" — that is elapsed time since ingest, NOT elapsed time
> since the original event. Backfilled / imported memories carry a
> backdated created_at (e.g. a 2023 conversation imported in 2026);
> using created_at would compress them on the first consolidation
> pass, before retrieval ever runs (see
> tasks/e1-v3-locomo-smoke-finding.md).
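To make the clock choice in that comment concrete, here is a minimal sketch. The `Memory` dataclass and `cadence_age_hours` are illustrative names, not the code in `compression.py`; the 168-hour threshold is the `COMPRESSION_GIST_AGE_HOURS` value quoted in the smoke finding, and the timestamps are made-up examples of a 2023 conversation imported in 2026.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

COMPRESSION_GIST_AGE_HOURS = 168.0  # 7 days, value quoted in the smoke finding


@dataclass
class Memory:
    created_at: datetime   # original event time (backdated for imported data)
    ingested_at: datetime  # time the row entered this store


def cadence_age_hours(mem: Memory, now: datetime, use_ingested_at: bool) -> float:
    """Age the compression cadence would see under either choice of clock."""
    reference = mem.ingested_at if use_ingested_at else mem.created_at
    return (now - reference).total_seconds() / 3600.0


# Illustrative timestamps only: a 2023 event ingested ten minutes ago.
now = datetime(2026, 5, 3, 12, 0, tzinfo=timezone.utc)
imported = Memory(
    created_at=datetime(2023, 5, 1, tzinfo=timezone.utc),
    ingested_at=now - timedelta(minutes=10),
)

old_clock = cadence_age_hours(imported, now, use_ingested_at=False)
new_clock = cadence_age_hours(imported, now, use_ingested_at=True)
print(old_clock > COMPRESSION_GIST_AGE_HOURS)  # True: eligible on the first pass
print(new_clock > COMPRESSION_GIST_AGE_HOURS)  # False: left verbatim for now
```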
51+
52+
The concurrent agent is migrating consolidation cadence from `created_at`
53+
(age measured from the original event time) to `ingested_at` (age measured from ingest into this store). This is
54+
**the exact fix** the smoke finding pointed at, and it directly
55+
invalidates the design premise of the two-baseline sweep:
56+
57+
- BASELINE_WITH_CONSOLIDATION on the current SHA (`b68c5ac`) shows
58+
MRR=0.222 (smoke). The 9 rows measured against it (8 consolidation-only mechanisms plus SCHEMA_ENGINE) are designed to
59+
measure mechanism contribution within that collapse regime.
60+
- After the concurrent agent's fix lands, BASELINE_WITH_CONSOLIDATION
61+
will likely return to ~0.866 (no first-pass corpus collapse). The
62+
collapse-regime ablation deltas become obsolete — and worse, would
63+
be reported in the paper as "mechanism contributions" when they are
64+
actually "rescue from a bug that was fixed before publication."
65+
66+
## Why I will not run the sweep on the current SHA
67+
68+
Three problems, any one of which is sufficient to stop:
69+
70+
1. **Tree is dirty.** Driver's pre-flight dirty-check refuses (matches
71+
the LME-S driver's gate). Sweep cannot launch without bypassing the
72+
gate, which would defeat the purpose of recording a single SHA.
73+
2. **Code is mid-flight.** The 7h sweep would race a concurrent fix
74+
touching the very pipeline the sweep measures. Mid-sweep file
75+
changes would yield un-attributable deltas across rows.
76+
3. **Design premise is dissolving.** The smoke evidence (0.222) is
77+
the artifact of a soon-to-be-fixed bug. Locking in ~7 hours of
78+
measurement against that artifact and writing it into the paper
79+
would be precisely the kind of "confident wrong number that
80+
destroys trust" the Zetetic standard prohibits.
81+
82+
## Recommended next actions (FOR HUMAN DECISION)
83+
84+
### Option α — wait for the concurrent fix to land, then re-run
85+
- Concurrent agent commits the `ingested_at` fix and merges to main.
86+
- Re-smoke `--with-consolidation` to confirm BASELINE_WITH_CONSOLIDATION
87+
no longer collapses. Expected: MRR back near no-consolidation anchor
88+
(≈0.866), modulo whatever the 9 consolidation-only mechanisms actually
89+
contribute.
90+
- If the new BASELINE_WITH_CONSOLIDATION is healthy, the two-baseline
91+
design simplifies: both baselines should be near each other, and the
92+
full 14 rows measure honest per-mechanism contributions.
93+
- Then commit the driver and launch.
94+
95+
### Option β — collapse to single baseline post-fix
96+
- Once the fix lands, BASELINE_NO_CONSOLIDATION ≈ BASELINE_WITH_CONSOLIDATION
97+
may make the two-baseline design unnecessary. Could simplify to a
98+
13-row design (one BASELINE + 12 mechanism rows) like LME-S.
99+
- Cheaper sweep (~6.5h). Cleaner paper §6.3 narrative.
100+
101+
### Option γ — proceed now, document the regime
102+
- Run the 14 rows on `b68c5ac` regardless. The numbers honestly describe
103+
the consolidation pipeline as it currently exists. The paper §6.3 must
104+
then disclose that the "consolidation regime" measured here was
105+
superseded by the ingested_at fix in a follow-up commit.
106+
- This keeps deadline pressure but ships ablation evidence about a code
107+
state that no longer exists in main. Not recommended.
108+
109+
### Option δ — drop the LoCoMo half
110+
- Reverts to LME-S only as in `de1d316`. Mechanisms remain in the
111+
codebase but are not supported by ablation evidence in §6.3.
112+
113+
## Driver artifact (uncommitted, ready for relaunch)
114+
115+
`benchmarks/lib/run_e1_v3_locomo.py` exists and parses cleanly. 14 rows,
116+
two-baseline design, per-category breakdown, sanity gate against
117+
CLAUDE.md headline (0.794 ±0.05). When the human picks Option α/β, the
118+
driver can be committed (with adjusted ROWS list for β) and launched.
119+
120+
## Recommendation
121+
122+
**Option α.** The concurrent fix directly addresses the smoke-surfaced
123+
collision; running the sweep before it lands measures a stale state.
124+
~30 min wait + 7h sweep is cheaper than ~7h sweep + retraction.
Lines changed: 168 additions & 0 deletions
@@ -0,0 +1,168 @@
1+
# E1 v3 LoCoMo — STOP gate triggered on smoke
2+
3+
**Status:** STOPPED at smoke per task validation gate. Awaiting human decision on
4+
which design path to pursue. Harness wiring committed; full sweep NOT launched.
5+
6+
**Date:** 2026-05-02
7+
**Code base SHA at smoke:** ca7f9d40888065bb3b30f17e0d56bd3e68417490 (tree dirty
8+
with concurrent agent's `mcp_server/handlers/remember*.py` and tests — those
9+
files do NOT touch the benchmark code path; benchmark uses
10+
`benchmarks.lib.bench_db.BenchmarkDB` + `core.memory_ingest.ingest_memories_batch`
11+
+ `core.pg_recall.recall`, never the `remember` handler).
12+
13+
## Pre-registered validation gate
14+
15+
From task #55 spec:
16+
> BASELINE LoCoMo MRR: established baseline from CLAUDE.md is 0.794 R@10=0.926.
17+
> With `--with-consolidation` enabled, this should be APPROXIMATELY similar
18+
> (consolidation may shift it slightly, but should be within ±0.05 MRR). If
19+
> WAY off — STOP and diagnose.
20+
21+
## Smoke result (n=1 conversation, 197 questions)
22+
23+
| Run | MRR | R@10 | Wall (s) |
|---|---|---|---|
| `--limit 1` (no consolidation) | **0.866** | **99.0%** | 221.7 |
| `--limit 1 --with-consolidation` | **0.222** | **54.8%** | 176.3 (incl. 127.7s consol) |
27+
28+
Δ MRR = **−0.644** vs the no-consolidation anchor.
29+
Δ MRR = **−0.572** vs the published 0.794 (CLAUDE.md headline).
30+
31+
This is **WAY off** the ±0.05 tolerance. Stop-and-diagnose triggered.
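The gate arithmetic above can be reproduced with a few lines. `check_gate` is a hypothetical helper written for this note, not part of the harness; the MRR values and the ±0.05 tolerance are the ones quoted in this section.

```python
def check_gate(mrr: float, anchor: float, tol: float = 0.05) -> tuple[bool, float]:
    """Return (within_tolerance, delta) where delta = mrr - anchor."""
    delta = mrr - anchor
    return abs(delta) <= tol, delta


SMOKE_WITH_CONSOLIDATION = 0.222
SMOKE_NO_CONSOLIDATION = 0.866
CLAUDE_MD_HEADLINE = 0.794

for label, anchor in [
    ("no-consolidation anchor", SMOKE_NO_CONSOLIDATION),
    ("CLAUDE.md headline", CLAUDE_MD_HEADLINE),
]:
    ok, delta = check_gate(SMOKE_WITH_CONSOLIDATION, anchor)
    print(f"{label}: delta={delta:+.3f} within_tolerance={ok}")
```

Both comparisons fail the gate by more than ten times the allowed tolerance, which is why the stop was triggered regardless of which anchor is used.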
32+
33+
## Diagnosis
34+
35+
**Root cause: compression cycle fires on LoCoMo's old timestamps.**
36+
37+
LoCoMo session dates are from May–November 2023 (real conversation timestamps
38+
preserved by the dataset). Current wall date is 2026-05-02, so loaded memories
39+
have `created_at` ≈ 3 years old.
40+
41+
Consolidation defaults
42+
(`mcp_server/infrastructure/memory_config.py`):
43+
- `COMPRESSION_GIST_AGE_HOURS = 168.0` (7 days) → memories older than 7d get
44+
full-text → gist replacement.
45+
- `COMPRESSION_TAG_AGE_HOURS = 720.0` (30 days) → memories older than 30d get
46+
gist → tag replacement.
47+
48+
LoCoMo memories are 3 years old at ingest time, so compression IMMEDIATELY
49+
collapses them to tags after one consolidation pass. Recall against the verbatim
50+
question text then misses, because the original session content has been
51+
replaced by terse tag form.
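A minimal sketch of the age ladder described above, assuming compression stage is chosen purely from `now - created_at`. The function `compression_stage` is illustrative and is not the code in `mcp_server/core/compression.py`; the two thresholds are the defaults listed above, and the session date is an example from the newest end of the LoCoMo range.

```python
from datetime import datetime, timezone

COMPRESSION_GIST_AGE_HOURS = 168.0  # 7 days
COMPRESSION_TAG_AGE_HOURS = 720.0   # 30 days


def compression_stage(age_hours: float) -> str:
    """Map a memory's age to the representation it would be reduced to."""
    if age_hours > COMPRESSION_TAG_AGE_HOURS:
        return "tag"
    if age_hours > COMPRESSION_GIST_AGE_HOURS:
        return "gist"
    return "full"


# Even a session from the newest end of the range (November 2023) is far past 30 days.
wall = datetime(2026, 5, 2, tzinfo=timezone.utc)
newest_session = datetime(2023, 11, 30, tzinfo=timezone.utc)
age = (wall - newest_session).total_seconds() / 3600.0
print(compression_stage(age))  # tag
```

Under this assumption no LoCoMo memory can stay in "full" or "gist" form after one pass, which matches the collapse seen in the smoke run.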
52+
53+
This is a real architectural collision between:
54+
- Production cadence assumption: memories accumulate over wall-clock time,
55+
consolidation gradually compresses old ones.
56+
- Benchmark loading pattern: load N sessions instantly, run consolidation, query.
57+
If session timestamps are old, consolidation treats them as fully aged.
58+
59+
This is the SAME shape of architectural mismatch the LME-S audit identified, but
60+
in the opposite direction: LME-S never exercised the consolidation-only mechanisms
(so they showed Δ=0); LoCoMo exercises them so aggressively at first contact
62+
that they destroy the retrieval signal. Neither benchmark, as currently
63+
instrumented, isolates per-mechanism effects honestly.
64+
65+
## Why pre-spec'd "single BASELINE_WITH_CONSOLIDATION" design is unsafe
66+
67+
If we take the 0.222 anchor as baseline, then any ablation that disables a stage
68+
contributing to the destruction (e.g., MICROGLIAL_PRUNING, compression-adjacent
69+
parts of CASCADE) will show a LARGE POSITIVE ΔMRR — but that signal will
70+
conflate two distinct effects:
71+
72+
1. "Mechanism X contributes to retrieval" (the desired ablation reading).
73+
2. "Mechanism X is the proximate cause of LoCoMo-timestamp-driven destruction"
74+
(a benchmark-instrumentation artifact).
75+
76+
Reporting these as 1-row mechanism contributions in §6.3 of the paper would
77+
overstate the case. The corresponding LME-S rows for the same mechanisms showed
78+
Δ=0 because consolidation never fired; reporting Δ ≈ +0.4 on LoCoMo without the
79+
context "the ablation rescued recall from a benchmark-induced collapse" is
80+
not honest evidence for the paper claim.
81+
82+
## Decision matrix (FOR HUMAN REVIEW BEFORE SWEEP LAUNCH)
83+
84+
### Option A — Single BASELINE_WITH_CONSOLIDATION (as spec'd) — 13 rows
85+
- Honors task spec literally.
86+
- Anchor MRR ≈ 0.22.
87+
- Risk: ablation deltas conflate mechanism contribution with destruction-rescue.
88+
Each row's writeup MUST disclose this confound; paper §6.3 cannot use these
89+
numbers as standalone evidence for "mechanism X contributes Y MRR."
90+
- Useful if the goal is documentation of LoCoMo+consolidation interaction
91+
rather than per-mechanism contribution claim.
92+
93+
### Option B — Two baselines — 14 rows (RECOMMENDED)
94+
- BASELINE_NO_CONSOLIDATION (≈0.866 anchor): used for all longitudinal
95+
read-path mechanisms (RECONSOLIDATION, CO_ACTIVATION, ADAPTIVE_DECAY) since
96+
these mechanisms do NOT depend on a consolidation pass — they accumulate
97+
state via cross-question reads.
98+
- BASELINE_WITH_CONSOLIDATION (≈0.22 anchor): used for the 8 consolidation-only
99+
mechanisms (CASCADE, INTERFERENCE, HOMEOSTATIC_PLASTICITY, SYNAPTIC_PLASTICITY,
100+
MICROGLIAL_PRUNING, TWO_STAGE_MODEL, EMOTIONAL_DECAY, TRIPARTITE_SYNAPSE) and
101+
SCHEMA_ENGINE — these can only fire if consolidation fires.
102+
- Ablation deltas reported against the relevant baseline, with the architectural
103+
finding stated alongside. Honest per-mechanism evidence preserved.
104+
- Sweep cost: 14 rows × ~30 min/row = ~7 hours.
105+
106+
### Option C — Decouple consolidation cadence from wall-clock age
107+
- Change `COMPRESSION_GIST_AGE_HOURS`/`COMPRESSION_TAG_AGE_HOURS` to gate on
108+
access-count or relative recency rather than absolute timestamp diff.
109+
- Or: in benchmark mode, override `created_at` to be relative to the benchmark
110+
start so memories appear "fresh" to consolidation.
111+
- Out of scope for task #55 — task explicitly says "DO NOT modify
112+
recall_pipeline.py constants — Phase A+B calibration locked." Same prohibition
113+
reasonably extends to consolidation cadence constants without a separate task.
114+
115+
### Option D — Drop the LoCoMo half — keep LME-S only
116+
- Reverts to the 17-row evidence base from `de1d316`.
117+
- Honest: "we couldn't isolate consolidation-only mechanism contributions on
118+
any benchmark currently in our suite without confound."
119+
- The mechanisms remain in the codebase, but unsupported by ablation evidence
120+
in §6.3. The paper would say so.
121+
122+
## Wall-budget data (pre-recorded in case we proceed)
123+
124+
- Per-conversation wall (with consolidation): 176.3s
125+
- Of which consolidation: 127.7s
126+
- Of which ingest+QA: 48.6s
127+
- Per-conversation wall (no consolidation): 221.7s (more time spent in QA loop
128+
presumably because no compressed records — denser candidate set).
129+
- Full LoCoMo (10 conversations) per row, with consolidation: ≈30 min.
130+
- Full LoCoMo (10 conversations) per row, no consolidation: ≈37 min.
131+
- Option A (13 rows × 30 min) ≈ 6.5 h.
132+
- Option B (1 NO + 1 WITH baseline + 3 longitudinal vs NO + 9 consol vs WITH =
133+
14 rows: 4 × 37 min + 10 × 30 min) ≈ 7.5 h.
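The budget figures above follow from the smoke timings. The sketch below recomputes them; it uses the rounded per-row figures (30 and 37 minutes) for the option totals, as the bullets do, and the helper name `row_minutes` is introduced here for illustration.

```python
N_CONVERSATIONS = 10
WALL_WITH_CONSOLIDATION_S = 176.3  # per conversation, from the smoke run
WALL_NO_CONSOLIDATION_S = 221.7    # per conversation, from the smoke run


def row_minutes(per_conversation_s: float, n: int = N_CONVERSATIONS) -> float:
    """Wall time for one sweep row, extrapolated linearly from one conversation."""
    return per_conversation_s * n / 60.0


print(f"with consolidation: {row_minutes(WALL_WITH_CONSOLIDATION_S):.1f} min/row")
print(f"no consolidation:   {row_minutes(WALL_NO_CONSOLIDATION_S):.1f} min/row")

# Option totals using the rounded per-row figures from the list above.
option_a_hours = 13 * 30 / 60
option_b_hours = (4 * 37 + 10 * 30) / 60
print(f"Option A: {option_a_hours:.1f} h, Option B: {option_b_hours:.1f} h")
```

The linear extrapolation assumes the other nine conversations cost about the same as the smoke conversation, which the n=1 smoke cannot confirm.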
134+
135+
## Smoke artifacts
136+
137+
- `benchmarks/results/ablation/locomo_v3_smoke/SMOKE_NO_CONSOLIDATION.json`
138+
- `benchmarks/results/ablation/locomo_v3_smoke/SMOKE_WITH_CONSOLIDATION.json`
139+
140+
## Harness wiring committed (independent of sweep decision)
141+
142+
`benchmarks/locomo/run_benchmark.py` now supports `--with-consolidation`,
143+
`--ablate MECH`, `--results-out PATH`, mirroring the LME-S harness signature.
144+
Manifest block in `--results-out` JSON includes `with_consolidation`,
145+
`ablate_mechanism`, `ablate_env_var`, `n_conversations`, `n_questions`,
146+
`consolidation_call_count`, `consolidation_total_wall_s`.
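For orientation, a harness exposing these flags and manifest fields might be wired roughly like this. This is a sketch under assumptions, not the contents of `run_benchmark.py`: the flag names and manifest keys are taken from the text above, while `build_parser`, `build_manifest`, the defaults, and the example `consolidation_call_count=1` are placeholders.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    """Flags named in this note; defaults and help text are assumptions."""
    p = argparse.ArgumentParser(description="LoCoMo harness flags (sketch)")
    p.add_argument("--limit", type=int, default=None, help="number of conversations")
    p.add_argument("--with-consolidation", action="store_true")
    p.add_argument("--ablate", metavar="MECH", default=None)
    p.add_argument("--results-out", metavar="PATH", default=None)
    return p


def build_manifest(args: argparse.Namespace, *, ablate_env_var, n_conversations,
                   n_questions, consolidation_call_count,
                   consolidation_total_wall_s) -> dict:
    """Assemble the manifest block written alongside the results."""
    return {
        "with_consolidation": args.with_consolidation,
        "ablate_mechanism": args.ablate,
        "ablate_env_var": ablate_env_var,
        "n_conversations": n_conversations,
        "n_questions": n_questions,
        "consolidation_call_count": consolidation_call_count,
        "consolidation_total_wall_s": consolidation_total_wall_s,
    }


args = build_parser().parse_args(["--limit", "1", "--with-consolidation"])
manifest = build_manifest(
    args,
    ablate_env_var=None,         # no ablation in this example
    n_conversations=1,           # smoke run size
    n_questions=197,             # smoke run size
    consolidation_call_count=1,  # placeholder value, not a measured figure
    consolidation_total_wall_s=127.7,
)
print(json.dumps(manifest, indent=2))
```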
147+
148+
## What I have NOT done
149+
150+
- Have not launched the 13-row sweep.
151+
- Have not written `benchmarks/lib/run_e1_v3_locomo.py` driver yet (depends on
152+
baseline-design decision).
153+
- Have not modified retrieval constants or consolidation-cadence constants
154+
(per task constraint and Zetetic standard — no source for arbitrary
155+
constant change).
156+
- Have not committed any code from the dirty tree (concurrent agent's work).
157+
158+
## Recommendation
159+
160+
Option B (two baselines, 14 rows). Strongest scientific design that respects
161+
the architectural finding and the task constraint not to mutate constants. The
162+
extra row is honest acknowledgment that LoCoMo+consolidation has a confound,
163+
not a workaround.
164+
165+
If the human reviewer prefers Option A for §6.3 page-budget reasons, document
166+
the confound in every consolidation-only row's writeup explicitly.
167+
168+
If Option D, document why the LME-S evidence stands alone.

uv.lock

Lines changed: 1 addition & 1 deletion
