Commit 8bfe844

Merge feature/brain-server: Phase 9 (Server)
Brings Phase 9 (sub-tasks 9.1 - 9.17) plus post-phase cleanups:
- Tokio connection layer + frame dispatcher (Tokio<->Glommio bridge)
- Cross-shard SUBSCRIBE fan-out
- ArcSwap-published routing table
- Health + metrics admin endpoints
- Graceful shutdown
- OpenAI / Ollama summarizer adapters (feature-gated)
- PLAN / REASON tombstone filter
- In-process end-to-end wire smoke
- HNSW snapshot wired into shard checkpoint
- Crate-level refactors: brain-workers (workers/), brain-ops (ops/ + writer/), brain-protocol (requests/ + responses/), brain-server (admin/ bootstrap/ config/ network/ shard/)

Tag phase-9-complete points at the exit-checklist commit.
2 parents 9d1c6d4 + 2e247b0 commit 8bfe844

154 files changed

Lines changed: 26405 additions & 6917 deletions

.claude/plans/phase-09-exit.md

Lines changed: 153 additions & 0 deletions
@@ -0,0 +1,153 @@
# Phase 9 — Exit checklist

Phase 9 has no `9.18` sub-task. The phase doc closes with a 6-item
exit checklist (lines 370–377 of `docs/phases/phase-09-server.md`)
that gates the `phase-9-complete` tag.

This plan walks each item to a verifiable conclusion, then tags.

---

## 1. Checklist items

### 1.1 All sub-tasks complete

`grep -c '\[x\]' docs/phases/phase-09-server.md` should equal the
number of `### Task 9.` headers (16 sub-tasks: 9.1–9.17, with the
mid-phase renumberings noted in the doc).

**Action:** visual inspection — confirm every `### Task 9.*` line
ends in `[x]`. Mark ✓ or surface any missing.

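The visual inspection could also be scripted. The sketch below assumes each sub-task header is a single line that starts with `### Task 9.` and ends in `[x]` once done; `check_subtasks` is a hypothetical helper, not something that exists in the repo. Anchoring both patterns on the header matters because a plain `grep -c '\[x\]'` would also count any ticked exit-checklist items in the same file.

```shell
# Hypothetical helper: compare the number of "### Task 9." headers with the
# number of those headers that end in "[x]". Assumes one header per line.
check_subtasks() {
    doc="$1"
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    total=$(grep -c '^### Task 9\.' "$doc" || true)
    checked=$(grep -c '^### Task 9\..*\[x\][[:space:]]*$' "$doc" || true)
    echo "checked ${checked}/${total}"
    # Succeed only if there is at least one header and all are ticked.
    [ "$total" -gt 0 ] && [ "$total" -eq "$checked" ]
}
```

Usage would be `check_subtasks docs/phases/phase-09-server.md`; a non-zero exit status means at least one header is still unticked.
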
### 1.2 `just verify` green

We've been running `just docker-verify` per sub-task. The phase
exit calls for `just verify` specifically (host fmt + workspace
tests + clippy). Both should be equivalent in green-state, but
the phase doc names `just verify` so we run that.

**Action:** `just verify`. If it diverges from `docker-verify`
(host-only flake), surface and decide whether to gate on the
docker run.

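A divergence between the two runs is easier to spot if both are executed by one wrapper. This is only a sketch: `both_green` is a hypothetical helper, and it assumes each recipe reports failure through its exit status.

```shell
# Hypothetical helper: run two verification commands and compare outcomes.
# Succeeds only when both pass; prints which case occurred.
both_green() {
    if sh -c "$1" >/dev/null 2>&1; then first=0; else first=1; fi
    if sh -c "$2" >/dev/null 2>&1; then second=0; else second=1; fi
    if [ "$first" -ne "$second" ]; then
        echo "disagree: '$1' exit=$first, '$2' exit=$second"
        return 1
    fi
    if [ "$first" -ne 0 ]; then
        echo "both failed"
        return 1
    fi
    echo "both green"
}
```

For this plan the call would be `both_green "just verify" "just docker-verify"`. Output is discarded here to keep the sketch short; a real run would want the logs kept.
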
### 1.3 `cargo run --bin brain-server` accepts a connection

Boot the server with a minimal config, connect a hand-rolled
client over loopback, complete the handshake.

**Action:** prepare a minimal `config.toml` under a temp dir,
`cargo run --bin brain-server -- --config <path>` in the
background, `nc 127.0.0.1 <port>` or a one-shot Rust client to
verify TCP accept + HELLO/WELCOME. Tear down with SIGTERM and
confirm a clean exit.

The simplest path: write a tiny ad-hoc client binary or shell
script under `target/scratch/` (not committed) that does the
handshake from `crates/brain-server/tests/e2e.rs` against a
real port. Or — even simpler — just `nc -z 127.0.0.1 <port>`
to prove the listener is up, since 9.17's tests already prove
the handshake works over loopback.

**Decision:** use `nc -z` for the connect check. The handshake
is already covered by 9.17. The checklist phrasing "accepts a
connection from a sample client" is satisfied by TCP accept;
deeper coverage would duplicate 9.17's test surface.

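Because the server starts in the background, the `nc -z` probe can run before the listener is bound. One way to handle that is a bounded poll, sketched below. `wait_until` is a hypothetical helper, the timeout has whole-second resolution, and the fractional `sleep` interval assumes a `sleep` that accepts it (GNU and BSD do).

```shell
# Hypothetical helper: retry a probe command until it succeeds or a
# deadline passes. Usage: wait_until <timeout_s> <interval> <cmd> [args...]
wait_until() {
    timeout_s="$1"
    interval="$2"
    shift 2
    deadline=$(( $(date +%s) + timeout_s ))
    until "$@" >/dev/null 2>&1; do
        # Give up once the deadline has passed.
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep "$interval"
    done
    return 0
}
```

With the 10 s timeout named under Risks, the connect check would read `wait_until 10 0.1 nc -z 127.0.0.1 "$PORT"`, where `PORT` stands in for whatever port the config binds.
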
### 1.4 E2E smoke test passes 100 iterations

9.17's `repeated_encode_recall_is_stable` is exactly this — 100
× (encode + recall) on one connection. Run the test 5–10 times
back-to-back to confirm it's deterministic, not flaky.

**Action:**
```
for i in $(seq 1 10); do
  cargo test -p brain-server --test e2e \
    repeated_encode_recall_is_stable -- --nocapture || break
done
```

Expect 10/10 passes. If a flake appears, surface it and stop.

### 1.5 `just run-server` boots in < 5 seconds with empty data

Cold boot on an empty data dir. Time from `cargo run` invocation
to "ready to accept connections" log line (or to the first
successful `nc -z`) should be < 5 s after the binary is built.

**Action:**
```
cargo build --release --bin brain-server   # warm the binary
rm -rf /tmp/brain-empty && mkdir -p /tmp/brain-empty
start=$(date +%s.%N)   # GNU date; %N is unavailable on BSD/macOS
# `time ... &` would time the whole server lifetime, so stamp manually
cargo run --release --bin brain-server -- --config <path> &
server_pid=$!
# wait for accept-loop
while ! nc -z 127.0.0.1 <port>; do sleep 0.1; done
awk -v s="$start" -v e="$(date +%s.%N)" 'BEGIN { printf "boot: %.2f s\n", e - s }'
# kill, then wait so the exit status is visible
kill -TERM "$server_pid"; wait "$server_pid"
```

Report the wall-time. If > 5 s, surface — likely arena/wal init
cost we haven't tuned.

**Note:** the spec says "< 5 seconds with empty data" without
specifying debug vs release. Release is the fair measurement
since it's what operators run. Debug-mode timing is informational.

### 1.6 Tag `phase-9-complete`

After the previous 5 items pass, create the tag. A lightweight tag
would be:
```
git tag phase-9-complete
```

Use the annotated form instead, with a short message summarizing
what shipped (running both commands would fail on the second,
because the tag name would already exist):
```
git tag -a phase-9-complete -m "Phase 9 — brain-server: ..."
```

We pick annotated because the prior phase tags appear to use the
same style (verify before tagging).

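The "verify before tagging" step can lean on the fact that `git cat-file -t` reports `tag` for an annotated tag and `commit` for a lightweight one. A minimal sketch follows; `tag_kind` is a hypothetical helper, and the `phase-*` naming in the usage line is an assumption extrapolated from `phase-9-complete`.

```shell
# Hypothetical helper: print "annotated" or "lightweight" for a tag in the
# current repository, based on the type of object the tag ref points to.
tag_kind() {
    case "$(git cat-file -t "refs/tags/$1" 2>/dev/null)" in
        tag)    echo annotated ;;
        commit) echo lightweight ;;
        *)      echo unknown; return 1 ;;
    esac
}
```

For example, `git tag --list 'phase-*' | while read -r t; do printf '%s %s\n' "$t" "$(tag_kind "$t")"; done` would list each earlier phase tag with its kind.
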
---

## 2. Files touched

- `docs/phases/phase-09-server.md` — flip each `[ ]` in the exit
  checklist to `[x]` as each item passes; mark the tag landed at
  the end.
- (No code changes expected.)

---

## 3. Risks

| Risk | Mitigation |
| ---- | ---------- |
| `just verify` (host) reveals a flake the docker run masked, or vice versa | Run both; if they disagree, surface and don't tag |
| Item 1.5's cold-boot exceeds 5 s on this machine | Report actual time; surface for the user to decide whether to gate the tag on it or relax the threshold for now |
| `repeated_encode_recall_is_stable` flakes 1/10 | Stop, root-cause the flake before tagging |
| Sample-client connect (1.3) hits a race where the listener isn't bound yet | Poll `nc -z` with a short interval and a 10 s timeout |

---

## 4. Done criteria

- [ ] All 6 exit-checklist items confirmed green.
- [ ] Phase doc updated: every checklist item marked `[x]`.
- [ ] Annotated tag `phase-9-complete` pushed onto the current
  commit.
- [ ] Single commit with the checklist flip; tag points at that
  commit.

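The last criterion can be checked mechanically, assuming the checklist-flip commit is `HEAD` when the check runs. The sketch below peels the tag to its commit with the `^{commit}` suffix, so it works for annotated tags too; `tag_at_head` is a hypothetical helper.

```shell
# Hypothetical helper: succeed if the tag resolves to the same commit as HEAD.
tag_at_head() {
    # "^{commit}" dereferences an annotated tag object to the commit it tags.
    tag_commit=$(git rev-parse --verify --quiet "refs/tags/$1^{commit}") || return 1
    head_commit=$(git rev-parse --verify --quiet HEAD) || return 1
    [ "$tag_commit" = "$head_commit" ]
}
```

Running `tag_at_head phase-9-complete` right after tagging should succeed; it fails if the tag is missing or points at an older commit.
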
---

## 5. Out of scope

- No new sub-tasks. If something is missing, that's a 9.18 (or
  Phase 10) conversation, not a smuggle-it-into-the-tag.
- No code changes outside the phase doc.
- No subprocess E2E (9.17's plan §8 defers it).
- No ROADMAP.md update (that happens when Phase 10 starts).

---

*Awaiting approval before executing the checklist.*

Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
# Phase-9 post-mortem audit

Pre-Phase-10 audit covering: spec alignment, code organization,
and the deferred-work backlog. Findings only — no changes made.

---

## 1. Spec alignment

### 1.1 Critical misalignment (0 — see correction)

- ~~AUTH-phase timeout not enforced~~: **false positive**. The
  spec-audit subagent missed `crates/brain-server/src/connection.rs:434`,
  which arms a `handshake_deadline` at `serve_connection` start,
  fires `ErrorCode::Unauthenticated` on expiry, and is cleared on
  `ConnPhase::Established`. The implementation is correct; only
  the firing path lacks a dedicated test (one test in
  `tests/connection.rs:127` exists for a passing handshake but
  not the timeout case). Not blocking Phase 10.

### 1.2 No other semantic violations

Connection FSM, handshake, error mapping, frame dispatch,
shard orchestration, shutdown discipline — all spec-faithful.
All 19 SD entries in `docs/spec-deviations.md` remain accurate
(none stale, none stealth-reconciled).

---

## 2. Code organization

The user's observation is correct: most crates pile files at
the `src/` root with no sub-module grouping. The crates that
already group files into sub-modules (brain-storage `arena/`+`wal/`,
brain-metadata `tables/`, brain-planner `executor/`+`plan/`,
brain-server `llm/`) are noticeably easier to navigate.

### 2.1 High-value refactors (low risk, big nav win)

| # | Crate | Move | Why now |
| -- | ----- | ---- | ------- |
| **A** | brain-workers | 12 worker files → `src/workers/` | 20 files in root, no grouping; workers are the textbook "one type per file" cluster |
| **B** | brain-ops | `encode.rs writer.rs forget.rs recall.rs plan.rs reason.rs link.rs subscribe.rs txn.rs` → `src/ops/` | 16 files in root mix op handlers with infra (context, dispatch, error) |
| **C** | brain-ops | Split `writer.rs` (1315 LOC) — extract `do_encode`/`do_forget`/`do_link`/`do_unlink` into `src/ops/writer/{encode,forget,link,unlink}.rs` | Single largest non-protocol file in the workspace; handlers are already independent functions |
| **D** | brain-protocol | Group `request.rs` (1026) + `response.rs` (1497) bodies into `src/requests/` + `src/responses/` sub-modules by op family (cognitive / link / txn / admin / subscribe) | Two of the three biggest files in the workspace; the enums stay at root, only the per-variant structs move |

All four are pure-renames + visibility tweaks; no semantic
changes. Each ships as its own commit; verify-after-each.

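File-size figures like the ones in the table can be approximated with standard tools. The sketch below is one way to do that, not necessarily how the audit gathered its numbers. `largest_rs_files` is a hypothetical helper; it counts raw lines including blanks and comments, and it assumes paths without spaces.

```shell
# Hypothetical helper: list the N largest .rs files under a directory as
# "<lines> <path>", largest first. N defaults to 5.
largest_rs_files() {
    root="$1"
    n="${2:-5}"
    find "$root" -type f -name '*.rs' -exec wc -l {} + \
        | grep -v ' total$' \
        | sort -rn \
        | head -n "$n" \
        | awk '{ print $1, $2 }'
}
```

Run from the workspace root as `largest_rs_files crates 5` to re-check the ranking after each refactor commit.
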
### 2.2 Medium-value refactors

| # | Crate | Move | Why later |
| -- | ----- | ---- | --------- |
| **E** | brain-server | Split `shard.rs` (975 LOC) — keep `ShardRequest` enum + `shard_main_loop` at root; extract worker-adapter glue into `src/shard_adapters/{rebuild,snapshot,retention,...}.rs` | Cleaner but riskier — touches the Tokio↔Glommio boundary, where bugs are expensive |
| **F** | brain-planner | Op handlers (`encode/recall/forget/reason/path.rs`) → `src/ops/`; analysis (`cost.rs explain.rs`) stays at root | Mirrors brain-ops naming; less urgent because `executor/` + `plan/` already give some structure |
| **G** | brain-index | Extract snapshot I/O out of `hnsw.rs` (1200 LOC) into `src/persistence/{codec,io}.rs` | The split needs private-struct visibility surgery; correctness-critical (CRC, magic, versioning) |

### 2.3 Skip

- **brain-server's 11 root files**: `<concern>.rs` naming is
  appropriate for a multi-concern server; root layout reads fine.
  Refactor E covers the one outlier (`shard.rs`).
- **brain-embed (9 files), brain-core (5), brain-metadata (4)**:
  small enough that sub-moduling adds ceremony without payoff.
- **Cross-crate duplication** — observed twice (shutdown signal
  pattern, metrics-snapshot wiring) but neither is acute enough to
  hoist into brain-core. Watch for a third occurrence before
  extracting.

### 2.4 Naming inconsistencies

None blocking. The brief surveyed every crate and found
consistent intra-crate conventions (verbs in brain-ops, kinds in
brain-workers, concerns in brain-server). The mix is intentional.

---

## 3. Deferred backlog

### 3.1 SDs closable with a one-line spec PR (S, batch)

- SD-2.3-1, SD-2.4-1 — CRC range typos (`[0..36]→[0..40]`,
  `[0..76]→[0..80]`).
- SD-3.5-1 — document the `IdempotencyEntry.request_hash` field.
- SD-4.5-1 — document the three-file HNSW snapshot layout.
- SD-5.1-1 — tighten §04/03 §11 to "safetensors only".

These are spec-text changes, not code changes. The user owns
spec edits, so this is a "queue these up next time we touch the
spec" item, not a Brain-side TODO.

### 3.2 SDs to keep deferred (structurally correct)

SD-2.8-1 (O_DIRECT + WAL pages), SD-2.8-2-b (two-syscall fsync),
SD-4.5-2 (Box::leak on HnswIo), SD-4.8-1 (RwLock vs ArcSwap on
HNSW), SD-5.1-2 (full-file safetensors), SD-10.6-1
(crossbeam-epoch). All have load-bearing constraints; revisit only
if benchmarks regress.

### 3.3 Phase 9 code-level punts

Only one real TODO landed in code:

- `crates/brain-server/src/shard_adapters.rs:225`: `hnsw.snapshot`
  in the snapshot worker is a no-op. Blocked on
  `HnswIndex::save_snapshot` (Phase 6 hadn't exposed the API
  when 9.12 shipped). Closing this is a brain-index +
  brain-server change (S). Worth doing before Phase 10 starts if
  Phase 10 touches snapshots; otherwise it's fine to carry.

All other 9.x deferrals (multi-frame streaming, SUBSCRIBE WAL
replay, per-IP rate limits, full admin surface, in-flight drain
accounting, signal-handling tests, multi-shard fan-out, crash
recovery E2E) are intentional v2 / Phase-16 scope.

---

## 4. Recommendation

Before Phase 10, in priority order:

1. **Cover the AUTH-phase timeout firing path with a test** (§1.1):
   enforcement is already implemented, so this is test-only and,
   per §1.1, not blocking.
2. **Land refactor A** (brain-workers → `workers/`) — sets the
   pattern; lowest-risk.
3. **Land refactor B** (brain-ops → `ops/`) — same template.
4. **Land refactor C** (split `writer.rs`) — pays off the biggest
   file in the workspace.
5. **Land refactor D** (brain-protocol bodies into sub-modules) —
   tackles the other two giants.

Stop here. Refactors E/F/G are nice but not pre-Phase-10
urgent; carry them as a backlog. The HNSW snapshot TODO and
the spec-text SDs can sit until they intersect new work.

Estimated effort for the top 5: ~1 commit each, ~30 min per
commit including verify. Total ~3 hours.

---

*Awaiting user direction on which items to execute.*
