Skip to content

Commit 5bb0cc5

Browse files
authored
docs: Phase 0.5 attribution artifact (Vigil v2.4.1 log analysis) (#91)
Preliminary Phase 0.5 attribution from 961 real remember() calls in Vigil's v2.4.1 OCSF log. Finds 98.4% of remember() wall-clock is LanceDB notes_cti writes, shard has 7,356 uncompacted fragments. Proposes expanding RFC-009 §1.1 F03 to cover periodic compaction.
1 parent bab29b2 commit 5bb0cc5

1 file changed

Lines changed: 120 additions & 0 deletions

File tree

Lines changed: 120 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,120 @@
1+
---
2+
title: "RFC-009 Phase 0.5 — remember() latency attribution (preliminary)"
3+
status: PRELIMINARY — confirm after v2.4.2 phase_timings_ms lands
4+
author: "claude-code"
5+
date: "2026-04-24"
6+
scope: "Attribute the 5.70s avg / 66.5s max remember() latency observed in Vigil's 2026-04-24 telemetry audit"
7+
data_source: "/home/rolandpg/.openclaw/workspace-vigil/.zettelforge_vigil/logs/zettelforge.log (9,510 events, 961 remember calls)"
8+
followup: "Re-run with phase_timings_ms after PR #90 (v2.4.2) ships to fully rule out fastembed / consolidation / KG-update"
9+
---
10+
11+
# Phase 0.5 — preliminary attribution
12+
13+
## TL;DR
14+
15+
**98.4% of Vigil's remember() wall-clock time is spent inside one LanceDB `Update` call on the `notes_cti` table.** The root cause is unbounded fragment accumulation: the CTI vector shard has **7,356 fragment files** because ZettelForge has never called LanceDB's `compact_files()`. Insert latency grows with fragment count, which manifests as a bimodal distribution: 90% of calls finish in <1.5s, while ~10% spike to 30–66 seconds when Lance does manifest-level work.
16+
17+
This is NOT the Ollama-timeout bug that RFC-010 just fixed, and RFC-009 Phases 1–6 (outbox, queue, circuit breaker) do not touch it. It needs its own fix.
18+
19+
## Method
20+
21+
Preliminary pass executed against production OCSF logs *before* the v2.4.2 `phase_timings_ms` instrumentation (PR #90) ships. We can already attribute to the **phase level** because each LanceDB table write happens to emit its own `ocsf_file_activity` event with `duration_ms`, which we match to the surrounding `ocsf_api_activity remember` event via `note_id`. The v2.4.2 data will refine — and can falsify — this analysis by attributing the *other* 1.6% to specific phases.
22+
23+
## Data
24+
25+
| Metric | Value |
26+
|---|---|
27+
| `remember` calls analysed | 961 |
28+
| avg duration | 5,695.75 ms ✓ (matches Vigil audit figure) |
29+
| max duration | 66,534.47 ms ✓ (matches Vigil audit figure) |
30+
| p50 | **765 ms** |
31+
| p95 | **49,750 ms** |
32+
| p99 | 55,790 ms |
33+
| calls > 5 s | 96 |
34+
| calls > 30 s | 95 |
35+
| calls > 60 s | 1 |
36+
37+
The distribution is **bimodal**, not a long tail. Half the calls run in under a second; about 10% spike into the 30–66 s band. Averaging a bimodal distribution obscures what is actually happening.
38+
39+
## Attribution
40+
41+
| Adjacent `ocsf_file_activity` event | Count | Avg duration | Max |
42+
|---|---|---|---|
43+
| `notes_cti` (LanceDB CTI vector shard) | 951 | **5,664 ms** | 66,501 ms |
44+
| `notes_general` (LanceDB general vector shard) | 9 | 53 ms ||
45+
46+
```
47+
remember() total time : 5,473,619 ms
48+
LanceDB notes_cti time : 5,387,226 ms
49+
= 98.42% of remember()
50+
```
51+
52+
All 95 slow (>30 s) events are LanceDB `notes_cti` updates. Non-LanceDB work (SQLite, fastembed, consolidation, KG update, entity index, enqueue) collectively accounts for 1.6% of total wall-clock — each of them below noise threshold for this pass.
53+
54+
## Root cause
55+
56+
```
57+
notes_cti.lance : 7,356 fragments, 4.7 GB on disk, ~5.66 s avg insert
58+
notes_general.lance : 458 fragments, 17 MB on disk, ~53 ms avg insert
59+
```
60+
61+
Two facts in combination:
62+
63+
1. **Every LanceDB write creates a new fragment.** This is by design — Lance is an append-optimized columnar store.
64+
2. **ZettelForge never calls `compact_files()`.** A grep for `compact_files`, `optimize`, or any equivalent in `src/zettelforge/` returns zero matches.
65+
66+
As the `notes_cti` shard accumulates fragments, every new insert has to update the table manifest, which in Lance is O(fragments). At 7,356 fragments, the manifest write dominates the insert. `notes_general` with 458 fragments shows the fast-path behaviour (53 ms). The tail events (30–66 s) likely coincide with a background metadata operation that scales worse.
67+
68+
## Implication for RFC-009
69+
70+
This finding does not invalidate RFC-009, but it **re-prioritises it**. The ordering in Section 7 currently puts LLM hardening + outbox + priority queue ahead of any write-path work. Reality:
71+
72+
- Phases 1–6 as specified will not materially change `remember()` latency. The LLM is already off the sync path.
73+
- The 2,329 enrichment-job drops/day are a **separate issue** (empty Ollama responses + queue overflow — RFC-009 phases 1–3 correctly address those).
74+
- `remember()` p50 and p95 both improve only when we fix LanceDB writes — either with periodic `compact_files()`, or by moving `_index_in_lance()` to the outbox (RFC-009 §1.1 F03 already flags this asymmetry as out of scope; Phase 0.5 now argues it should be *in* scope).
75+
76+
## Recommended actions
77+
78+
**Immediate (v2.4.2 or v2.4.3, ~1h work):**
79+
- Run a one-shot `compact_files()` on the live `notes_cti.lance` table in Vigil's workspace to measure how much of the latency recovers. This is an offline operation on the data dir and requires no release.
80+
- Capture before/after: fragment count, table size, next 100 `remember()` p50/p95.
81+
82+
**Short-term (v2.5.0, scope addition to RFC-009):**
83+
- Add an automatic compaction trigger. Options:
84+
1. Count-based: compact every N writes (N=100 per LanceDB guidance for write-heavy workloads).
85+
2. Time-based: background thread compacts every T minutes.
86+
3. Size-based: compact when fragment count exceeds a threshold.
87+
Recommendation: count-based with a small thread-safe counter on `LanceStore`, deferred to a background thread so the compaction itself does not block `remember()`.
88+
89+
**Medium-term (v2.5.0, part of RFC-009 Section 1.1 F03 scope expansion):**
90+
- Evaluate moving `_index_in_lance()` into the enrichment outbox so the vector write is no longer on the synchronous `remember()` path at all. This makes the SQLite txn the single durability boundary and lets compaction happen inline on the worker without user-visible latency.
91+
92+
## Confidence
93+
94+
**High** on the attribution itself: the data is from 961 real production calls with microsecond-precision per-event durations, and the 98.4% share leaves no room for another candidate. **Medium** on the root-cause mechanism (fragment count vs insert cost) — the pattern is consistent with LanceDB's documented behaviour, but definitive confirmation requires running a one-shot `compact_files()` and remeasuring.
95+
96+
## Followup once PR #90 (v2.4.2) lands
97+
98+
`phase_timings_ms` will split the remaining 1.6% into named phases. Expected result: `construct` shows a one-off fastembed first-load cost (~1–2s on cold start), `consolidation_observe` and `kg_update` both stay <10 ms, and `lance_index` continues to dominate. If any of those surprise us, this document gets revised; if not, this attribution stands and we close Phase 0.5.
99+
100+
## Appendix — reproduction
101+
102+
```bash
103+
LOG=/home/rolandpg/.openclaw/workspace-vigil/.zettelforge_vigil/logs/zettelforge.log
104+
105+
# remember() distribution
106+
jq -s '[.[] | select(.event=="ocsf_api_activity" and .activity_name=="remember") | .duration_ms]
107+
| {count: length, avg: (add/length), p50: (sort|.[length/2|floor]),
108+
p95: (sort|.[length*95/100|floor]), max: max}' "$LOG"
109+
110+
# LanceDB share of remember()
111+
jq -s 'def sum: add;
112+
([.[] | select(.event=="ocsf_api_activity" and .activity_name=="remember") | .duration_ms] | sum) as $R |
113+
([.[] | select(.event=="ocsf_file_activity" and .file.path=="notes_cti"
114+
and .duration_ms != null) | .duration_ms] | sum) as $L |
115+
{remember_ms: $R, lance_cti_ms: $L, lance_share_pct: ($L*100/$R)}' "$LOG"
116+
117+
# Fragment counts
118+
find "/home/rolandpg/.openclaw/workspace-vigil/.zettelforge_vigil/vectordb/notes_cti.lance/data" -name '*.lance' | wc -l
119+
find "/home/rolandpg/.openclaw/workspace-vigil/.zettelforge_vigil/vectordb/notes_general.lance/data" -name '*.lance' | wc -l
120+
```

0 commit comments

Comments
 (0)