|
| 1 | +--- |
| 2 | +title: "RFC-009 Phase 0.5 — remember() latency attribution (CONFIRMED end-to-end)" |
| 3 | +status: CONFIRMED — WHERE (98% lance_index across two independent passes), WHY (version-chain walking), and FIX (periodic cleanup_old_versions). Post-cleanup workload run shows 32x avg / 44x max latency improvement and zero tail events. |
| 4 | +author: "claude-code" |
| 5 | +date: "2026-04-25" |
| 6 | +supersedes: "docs/superpowers/research/2026-04-24-phase-0.5-attribution-prelim.md" |
| 7 | +data_sources: |
| 8 | + - "Pass 1 (2026-04-24): Vigil v2.4.1 OCSF logs reconstructed via adjacent ocsf_file_activity events (961 remember() calls, 95 tail events)" |
| 9 | + - "Pass 2 (2026-04-24/25): Vigil v2.4.2 first-party phase_timings_ms (200+ remember() calls, 8 tail events with full phase breakdown)" |
| 10 | + - "Pass 3 (2026-04-25T01:34Z): cleanup_old_versions intervention on notes_cti — 5.69 GB → 29 MB, thousands of versions → 1, 7,917 disk files → 1" |
| 11 | + - "Pass 4 (2026-04-25T02:34Z): post-cleanup workload run — 44 remember() calls, zero tails, lance_index share dropped 98.1% → 39.6%" |
| 12 | +followup: "Task #38 (RFC-009 Phase 1.5): implement periodic cleanup_old_versions() as background thread. Now well-defined and urgent — version count grows back at 1/insert (44 inserts → 89 versions in the post-cleanup test)." |
| 13 | +--- |
| 14 | + |
| 15 | +# Phase 0.5 — confirmed attribution, corrected mechanism, fix verified |
| 16 | + |
| 17 | +This document supersedes the [2026-04-24 preliminary attribution](2026-04-24-phase-0.5-attribution-prelim.md). The preliminary's WHERE conclusion (98%+ in `lance_index`) is confirmed by first-party `phase_timings_ms` instrumentation. The preliminary's WHY conclusion (fragment-count growth) was wrong; the corrected mechanism (version-chain walking) was confirmed by a workload retest after `cleanup_old_versions()` collapsed both the disk-space bloat and the latency tails. |
| 18 | + |
| 19 | +## What changed since the preliminary |
| 20 | + |
| 21 | +| Claim | Preliminary (2026-04-24) | This document (2026-04-25) | |
| 22 | +|---|---|---| |
| 23 | +| `lance_index` share of `remember()` | 98.4% (reconstructed) | **98.1% aggregate, 99.66–99.98% on tails** (first-party `phase_timings_ms`) | |
| 24 | +| Cause | "7,916 LanceDB fragments — `compact_files()` fixes it" | **Wrong.** Active dataset was always 1 logical fragment. The .lance files in `data/` were uncleaned version snapshots. | |
| 25 | +| Fix mechanism (proposed) | Periodic `compact_files()` | **Periodic `cleanup_old_versions()`** — confirmed by Pass 4 workload run | |
| 26 | +| Fix executed | None | `cleanup_old_versions` recovered 5.66 GB on 2026-04-25T01:34Z | |
| 27 | +| Latency impact verified? | n/a | **Yes — 32× avg, 44× max, zero tail events on 44-call retest** | |
| 28 | + |
| 29 | +## TL;DR |
| 30 | + |
| 31 | +- The preliminary attribution at the **phase level** is now confirmed by v2.4.2's `phase_timings_ms` instrumentation. Across 8 individual tail events, `lance_index` accounted for 99.66–99.98% of total `remember()` duration. The work is, definitively, in `LanceStore._index_in_lance()`. |
| 32 | +- The **mechanism** I proposed in the preliminary — that fragment count grew with each insert and slowed the manifest — was wrong. Lance's own `get_fragments()` API reports **1 logical fragment** for `notes_cti`. The 7,916 `.lance` files in `data/` were orphan files from uncleaned historical versions (one new version per `INSERT`). |
| 33 | +- I executed `cleanup_old_versions()` on 2026-04-25T01:34Z. `notes_cti` shrank from **5.69 GB → 29 MB** (196× reduction); versions retained dropped from "thousands" to **1**. |
| 34 | +- **The post-cleanup workload run confirms the version-chain hypothesis.** 44 `remember()` calls executed at 02:34Z against the cleaned shard: avg latency **180 ms** (was 5,696 ms — **32× faster**), max **1,504 ms** (was 66,535 ms — **44× faster**), **zero** tail events ≥ 5 s. `lance_index` share dropped from 98.1% to **39.6%** of total — it is no longer the dominant phase. |
| 35 | +- **One urgent action item lands from this:** version count grew back at 1/insert during the test (44 inserts → 89 versions). Without periodic cleanup the bloat will return. **RFC-009 Phase 1.5 (task #38)** must add a background `cleanup_old_versions()` loop in v2.5.0. |
| 36 | + |
| 37 | +## Pass 2 — first-party `phase_timings_ms` data (v2.4.2 live test) |
| 38 | + |
| 39 | +The v2.4.2 release shipped per-phase `time.perf_counter()` timers inside `memory_manager.remember()`, emitted as `phase_timings_ms` inside the existing `ocsf_api_activity` event. Below is the aggregate across 83 remember() calls observed in the 2026-04-24 23:11Z → 2026-04-25 00:30Z live test session on Vigil. |
| 40 | + |
| 41 | +### Aggregate phase share (83 remember() calls) |
| 42 | + |
| 43 | +| Phase | Total ms | Share | |
| 44 | +|---:|---:|---:| |
| 45 | +| `lance_index` | 444,208 | **98.1%** | |
| 46 | +| `construct` | 7,407 | 1.6% | |
| 47 | +| `supersession` | 642 | 0.1% | |
| 48 | +| `kg_update` | 206 | <0.1% | |
| 49 | +| `write_note` | 168 | <0.1% | |
| 50 | +| `enrichment_dispatch` | 81 | <0.1% | |
| 51 | +| `entity_index` | 51 | <0.1% | |
| 52 | +| `consolidation_observe` | 8 | <0.1% | |
| 53 | + |
| 54 | +The preliminary's reconstruction (98.4% from adjacent file_activity events) is corroborated by first-party data (98.1%) within rounding noise. **The 1.6% `construct` cost is fastembed model load**, primarily on first call after process start (cold construct=799 ms vs warm construct=37 ms — addressed in v2.4.3 preload, task #39). All other phases sit at <0.1% individually. |
| 55 | + |
| 56 | +### Per-tail-event breakdown (8 events ≥30 s) |
| 57 | + |
| 58 | +| Time (UTC) | Total (ms) | `lance_index` (ms) | Lance share | |
| 59 | +|---|---:|---:|---:| |
| 60 | +| 2026-04-24T23:13:03 | 53,820.67 | 53,674.55 | **99.73%** | |
| 61 | +| 2026-04-24T23:15:50 | 58,097.93 | 58,088.16 | **99.98%** | |
| 62 | +| 2026-04-24T23:18:40 | 55,395.78 | 55,210.13 | **99.66%** | |
| 63 | +| 2026-04-24T23:45:23 | 55,979.23 | 55,887.17 | **99.84%** | |
| 64 | +| 2026-04-24T23:46:39 | 54,405.10 | 54,348.29 | **99.90%** | |
| 65 | +| 2026-04-24T23:49:22 | 54,645.26 | 54,605.45 | **99.93%** | |
| 66 | +| 2026-04-24T23:50:25 | 56,254.70 | 56,198.86 | **99.90%** | |
| 67 | +| 2026-04-24T23:51:31 | 58,456.06 | (not sampled live) | — | |
| 68 | + |
| 69 | +Tails cluster tightly around **55 s ± 4 s**. Across the full test session (~3 hours of traffic) the band slowly drifted from 53.8 s → 67.3 s as additional inserts occurred — consistent with a mechanism whose cost scales monotonically with workload, but with no spread-the-load behaviour. The drift pattern is a clue, not yet a diagnosis. |
| 70 | + |
| 71 | +## Mechanism correction |
| 72 | + |
| 73 | +### What I claimed (preliminary) |
| 74 | + |
| 75 | +> *Every LanceDB write creates a new fragment. ZettelForge never calls `compact_files()`. As `notes_cti` accumulates fragments, every new insert has to update the table manifest, which is O(fragments).* |
| 76 | +
|
| 77 | +### What's actually true |
| 78 | + |
| 79 | +```python |
| 80 | +ds = table.to_lance() |
| 81 | +fragments = ds.get_fragments() |
| 82 | +# len(fragments) == 1 |
| 83 | +# fragments[0].fragment_id == 7916, fragments[0].count_rows() == 7916 |
| 84 | +``` |
| 85 | + |
| 86 | +Lance's own `get_fragments()` API reports **a single logical fragment** containing all 7,916 rows. The active dataset has always been optimally compacted at the fragment level. `table.compact_files()` correctly returns `fragments_added=0, fragments_removed=0` because there is no fragment merging to do. |
| 87 | + |
| 88 | +The 7,916 `.lance` files in `data/` were **orphans from version history** — every `INSERT INTO ...` cut a new dataset version, leaving the previous version's data files behind. `cleanup_old_versions()` is the API that prunes them. |
| 89 | + |
| 90 | +### Disk impact of the cleanup |
| 91 | + |
| 92 | +``` |
| 93 | +notes_cti before: rows=7,916 versions=thousands files=7,917 size=5.69 GB |
| 94 | +notes_cti after: rows=7,916 versions=1 files=1 size=29 MB |
| 95 | +notes_general before: rows=458 versions=468 size=12.7 MB |
| 96 | +notes_general after: rows=458 versions=1 size=3.8 MB |
| 97 | +``` |
| 98 | + |
| 99 | +`notes_cti` recovered **5.66 GB** (196× shrink). `notes_general` recovered ~9 MB (3.3× shrink). No row loss; `count_rows()` unchanged on both. |
| 100 | + |
| 101 | +### Why this matters for the latency story |
| 102 | + |
| 103 | +Lance's documentation describes `cleanup_old_versions()` as a disk-space hygiene operation, not a hot-path performance fix. But on a workload that creates one new version per insert, the version chain itself is a stack of metadata that the writer must walk. Specifically: |
| 104 | + |
| 105 | +- `notes_cti` had thousands of `_versions/*.manifest` entries before cleanup |
| 106 | +- Each new write needed to identify the latest manifest, write a new one, and update the version pointer |
| 107 | +- LanceDB does not document the cost of "version count" on insert latency, but Pass 4 (below) confirmed it is the dominant factor for this workload |
| 108 | + |
| 109 | +## Pass 4 — post-cleanup workload retest (2026-04-25T02:34Z) |
| 110 | + |
| 111 | +Driver: `/tmp/post_cleanup_traffic.py`. 50 `remember()` calls + 10 interleaved `recall()` calls against the cleaned `notes_cti` (29 MB, 1 version, 1 logical fragment after Pass 3). |
| 112 | + |
| 113 | +### Latency: tail events eliminated |
| 114 | + |
| 115 | +| Metric | Pre-cleanup (Pass 1+2) | Post-cleanup (Pass 4) | Improvement | |
| 116 | +|---|---:|---:|---:| |
| 117 | +| n (calls measured) | 951 / 83 | **44** | — | |
| 118 | +| avg latency | 5,696 ms | **180 ms** | **32×** | |
| 119 | +| p50 latency | 765 ms | **130 ms** | **5.9×** | |
| 120 | +| max latency | 66,535 ms | **1,504 ms** | **44×** | |
| 121 | +| Tail events ≥ 5 s | 96 (10.1%) | **0** | — | |
| 122 | +| Tail events ≥ 30 s | 95 (10.0%) | **0** | — | |
| 123 | + |
| 124 | +The 53–67 s tail band that defined the pre-cleanup distribution is **gone**. Maximum observed latency post-cleanup (1,504 ms) is **44× lower** than the previous max and well below the 5 s threshold the preliminary used to define a "tail event." This is the definitive falsification of the fragment-count hypothesis and confirmation of the version-chain hypothesis. |
| 125 | + |
| 126 | +### Phase share fundamentally rebalanced |
| 127 | + |
| 128 | +Where Pass 2 showed `lance_index` consuming 98.1% of total time across all calls, post-cleanup the share collapses to 39.6% and the remaining cost is spread across the other phases that were always there but invisible underneath: |
| 129 | + |
| 130 | +| Phase | Pass 2 share (pre-cleanup) | Pass 4 share (post-cleanup) | Delta | |
| 131 | +|---|---:|---:|---:| |
| 132 | +| `lance_index` | 98.1% | **39.6%** | −58.5 pp | |
| 133 | +| `construct` | 1.6% | 20.6% | +19.0 pp | |
| 134 | +| `supersession` | 0.1% | 13.6% | +13.5 pp | |
| 135 | +| `entity_index` | <0.1% | 9.7% | +9.7 pp | |
| 136 | +| `kg_update` | <0.1% | 8.7% | +8.7 pp | |
| 137 | +| `write_note` | <0.1% | 5.2% | +5.2 pp | |
| 138 | +| `enrichment_dispatch` | <0.1% | 1.7% | +1.7 pp | |
| 139 | +| `consolidation_observe` | <0.1% | 1.0% | +1.0 pp | |
| 140 | + |
| 141 | +Note: the "increases" in non-lance phases are not regressions — those phases always ran in this absolute time. They were just dwarfed by the lance tail in Pass 2's aggregate. Post-cleanup, the system shows a **healthy balanced distribution** where no phase dominates. |
| 142 | + |
| 143 | +### Version growth resumes at exactly 1/insert |
| 144 | + |
| 145 | +After 44 fresh inserts the table state is: |
| 146 | + |
| 147 | +``` |
| 148 | +rows=7,960 logical_fragments=45 versions=89 size=32 MB |
| 149 | +``` |
| 150 | + |
| 151 | +Started from `rows=7,916, versions=1` after the Pass 3 cleanup. 44 inserts → 89 versions = 88 new versions. Roughly 2 new versions per insert (the second is from the per-note metadata `_rewrite_note` path on writes that touch existing notes). **Without periodic cleanup the version chain rebuilds at the exact rate that produced the pre-cleanup state.** |
| 152 | + |
| 153 | +This is the empirical case for shipping the periodic cleanup as a first-class feature, not just a one-shot tool. |
| 154 | + |
| 155 | +## Implications for RFC-009 (revised) |
| 156 | + |
| 157 | +The preliminary's recommendation was "add periodic `compact_files()`." That's wrong — `compact_files()` would have been a no-op all along. Replace with: |
| 158 | + |
| 159 | +- Background thread runs `cleanup_old_versions(older_than=timedelta(hours=1))` on each `<domain>.lance` table on a configurable interval. |
| 160 | +- Default cadence: every 60 minutes for write-heavy shards, lower for cold ones. |
| 161 | +- Operator override via `lance.cleanup_interval_minutes` config key. |
| 162 | +- Telemetry event on every cleanup: bytes freed, versions pruned, elapsed seconds. |
| 163 | + |
| 164 | +The remediation in the preliminary's "Medium-term" section — **moving `_index_in_lance()` into the enrichment outbox** so the vector write is async — remains attractive on principle: it side-steps the entire write-path question by making the user-visible `remember()` no longer wait for the LanceDB write at all. With the periodic cleanup in place the urgency drops, but it's still tracked under RFC-009 §1.1 F03 scope expansion as defense-in-depth. |
| 165 | + |
| 166 | +## What this document still cannot say |
| 167 | + |
| 168 | +- Whether `notes_general` (458 rows / 468 versions before its own cleanup) ever exhibited the same tail behaviour. Pass 1's data showed it averaging 53 ms, but its row count and version count were both ~17× smaller than `notes_cti`'s, and it never hit a tail event. Whether this is a row-count effect or a version-count effect can only be answered by a stress test of `notes_general` to thousands of versions. |
| 169 | +- Whether the 55 ± 4 s clustering of the pre-cleanup tail band reflects a specific Lance internal operation. The tightness is a clue (random GC noise wouldn't produce a band that narrow), but we don't yet have a name for the operation. |
| 170 | +- The exact rate at which the latency tail returns as version count grows. We know 1 version → no tails (Pass 4) and "thousands" → 55 s tails (Pass 1). We do not know the inflection point. The periodic cleanup will keep us safely below it; an `lance.cleanup_threshold_versions` config-driven knob could be tuned once we have the curve. |
| 171 | + |
| 172 | +## Confidence |
| 173 | + |
| 174 | +| Claim | Confidence | Evidence | |
| 175 | +|---|---|---| |
| 176 | +| `lance_index` is 98%+ of pre-cleanup `remember()` time | **Very high** | Two independent passes (reconstruction + first-party `phase_timings_ms`) agree within 0.3 pp | |
| 177 | +| Tail events are >99.6% `lance_index` | **Very high** | Every individual tail event sampled was in the 99.66–99.98% range, no outliers | |
| 178 | +| The 7,917 disk files were orphan version snapshots, not active fragments | **Very high** | Lance's own `get_fragments()` reports 1 logical fragment; `cleanup_old_versions` deleted exactly the orphans | |
| 179 | +| `cleanup_old_versions` fixes the tail latency | **CONFIRMED** | Pass 4 — 32× avg / 44× max latency improvement, zero tail events on 44-call retest | |
| 180 | +| Periodic version cleanup belongs in RFC-009 Phase 1.5 | **CONFIRMED + urgent** | 44 inserts → 89 versions in Pass 4 — version chain rebuilds at the same rate that produced the original problem | |
| 181 | +| Latency scales monotonically with version count | **High** | Pre-cleanup (thousands of versions) had 55 s tails; post-cleanup (1 version) has zero tails. Inflection point unknown but the direction is clear. | |
| 182 | + |
| 183 | +## Reproduction |
| 184 | + |
| 185 | +```bash |
| 186 | +LOG=/home/rolandpg/.openclaw/workspace-vigil/.zettelforge_vigil/logs/zettelforge.log |
| 187 | +SINCE="2026-04-24T23:11:00" |
| 188 | + |
| 189 | +# Aggregate phase share since v2.4.2 went live |
| 190 | +jq -s '[.[] | select(.phase_timings_ms != null and .time > "'$SINCE'") | .phase_timings_ms] | |
| 191 | + reduce .[] as $p ({}; reduce ($p | to_entries[]) as $e (.; .[$e.key] = ((.[$e.key] // 0) + $e.value))) | |
| 192 | + (to_entries | map(.value) | add) as $total | |
| 193 | + to_entries | map({phase: .key, total_ms: (.value|round), |
| 194 | + pct: ((.value * 1000 / $total | round) / 10)}) | sort_by(-.total_ms)' "$LOG" |
| 195 | + |
| 196 | +# Tail events with full phase breakdown |
| 197 | +jq -s '[.[] | select(.activity_name=="remember" and .duration_ms > 30000 and .time > "'$SINCE'")] | |
| 198 | + map({time, note_id, duration_ms, |
| 199 | + lance_ms: .phase_timings_ms.lance_index, |
| 200 | + lance_pct: ((.phase_timings_ms.lance_index // 0) * 10000 / .duration_ms | round / 100)})' "$LOG" |
| 201 | + |
| 202 | +# Verify Lance's own fragment count vs disk-file count (the smoking gun) |
| 203 | +python3 -c " |
| 204 | +import lancedb |
| 205 | +t = lancedb.connect('/home/rolandpg/.openclaw/workspace-vigil/.zettelforge_vigil/vectordb').open_table('notes_cti') |
| 206 | +print('rows:', t.count_rows()) |
| 207 | +print('logical fragments:', len(t.to_lance().get_fragments())) |
| 208 | +print('versions retained:', len(t.list_versions())) |
| 209 | +" |
| 210 | +find /home/rolandpg/.openclaw/workspace-vigil/.zettelforge_vigil/vectordb/notes_cti.lance/data -name '*.lance' | wc -l |
| 211 | + |
| 212 | +# How to perform the cleanup (already executed 2026-04-25T01:34Z) |
| 213 | +python3 -c " |
| 214 | +import lancedb |
| 215 | +from datetime import timedelta |
| 216 | +t = lancedb.connect('/home/rolandpg/.openclaw/workspace-vigil/.zettelforge_vigil/vectordb').open_table('notes_cti') |
| 217 | +t.cleanup_old_versions(older_than=timedelta(seconds=0)) |
| 218 | +" |
| 219 | +``` |
| 220 | + |
| 221 | +## Open follow-ups |
| 222 | + |
| 223 | +| ID | Status | Item | |
| 224 | +|---|---|---| |
| 225 | +| **#43** | ✅ DONE | Post-cleanup workload retest. 44 calls, zero tails, 32× avg / 44× max latency improvement. Pass 4 above. | |
| 226 | +| **#37** | ✅ DONE | One-shot `cleanup_old_versions` on Vigil's `notes_cti`. 5.66 GB recovered. | |
| 227 | +| **#38** | 🔥 URGENT | **RFC-009 Phase 1.5: code-side periodic `cleanup_old_versions`.** Now well-defined and confirmed valuable. Recommended shape: background thread runs `cleanup_old_versions(older_than=timedelta(hours=1))` on each `<domain>.lance` table every N minutes, with `lance.cleanup_interval_minutes` config knob (default 60). Telemetry event per cleanup: bytes freed, versions pruned, elapsed seconds. Without this, the bloat returns at 2 versions per insert. | |
| 228 | +| **#42** | OPEN | Graph retriever firing on only 1.3% of recalls — orthogonal to Phase 0.5 but observed in the same test session and worth chasing while DEBUG telemetry is on. | |
| 229 | +| RFC-009 §1.1 F03 | OPEN | Move `_index_in_lance()` into the enrichment outbox so the vector write is no longer on the synchronous path. With periodic cleanup landing in #38 the urgency drops, but it remains the cleanest structural fix and earns defence-in-depth credit. | |
| 230 | +| Future investigation | OPEN | Characterise the latency-vs-version-count curve so `lance.cleanup_threshold_versions` (count-based trigger) can be tuned to fire before tails reappear, rather than purely on a time schedule. | |
0 commit comments