|
| 1 | +# Phase-9 post-mortem audit |
| 2 | + |
| 3 | +Pre-Phase-10 audit covering: spec alignment, code organization, |
| 4 | +and the deferred-work backlog. Findings only — no changes made. |
| 5 | + |
| 6 | +--- |
| 7 | + |
| 8 | +## 1. Spec alignment |
| 9 | + |
| 10 | +### 1.1 Critical misalignment (0 — see correction) |
| 11 | + |
| 12 | +- ~~AUTH-phase timeout not enforced~~ — **false positive**. The |
| 13 | + spec-audit subagent missed `crates/brain-server/src/connection.rs:434` |
| 14 | + which arms a `handshake_deadline` at `serve_connection` start, |
| 15 | + fires `ErrorCode::Unauthenticated` on expiry, and is cleared on |
| 16 | + `ConnPhase::Established`. The implementation is correct; only |
| 17 | + the firing path lacks a dedicated test (one test in |
| 18 | + `tests/connection.rs:127` exists for a passing handshake but |
| 19 | + not the timeout case). Not blocking Phase 10. |
| 20 | + |
| 21 | +### 1.2 No other semantic violations |
| 22 | + |
| 23 | +Connection FSM, handshake, error mapping, frame dispatch, |
| 24 | +shard orchestration, shutdown discipline — all spec-faithful. |
| 25 | +All 19 SD entries in `docs/spec-deviations.md` remain accurate |
| 26 | +(none stale, none stealth-reconciled). |
| 27 | + |
| 28 | +--- |
| 29 | + |
| 30 | +## 2. Code organization |
| 31 | + |
| 32 | +The user's observation is correct: most crates pile files at |
| 33 | +the `src/` root with no sub-module grouping. The crates that |
| 34 | +have done it (brain-storage `arena/`+`wal/`, brain-metadata |
| 35 | +`tables/`, brain-planner `executor/`+`plan/`, brain-server |
| 36 | +`llm/`) are noticeably easier to navigate. |
| 37 | + |
| 38 | +### 2.1 High-value refactors (low risk, big nav win) |
| 39 | + |
| 40 | +| # | Crate | Move | Why now | |
| 41 | +| -- | ----- | ---- | ------- | |
| 42 | +| **A** | brain-workers | 12 worker files → `src/workers/` | 20 files in root, no grouping; workers are the textbook "one type per file" cluster | |
| 43 | +| **B** | brain-ops | `encode.rs writer.rs forget.rs recall.rs plan.rs reason.rs link.rs subscribe.rs txn.rs` → `src/ops/` | 16 files in root mix op handlers with infra (context, dispatch, error) | |
| 44 | +| **C** | brain-ops | Split `writer.rs` (1315 LOC) — extract `do_encode`/`do_forget`/`do_link`/`do_unlink` into `src/ops/writer/{encode,forget,link,unlink}.rs` | Single largest non-protocol file in the workspace; handlers are already independent functions | |
| 45 | +| **D** | brain-protocol | Group `request.rs` (1026) + `response.rs` (1497) bodies into `src/requests/` + `src/responses/` sub-modules by op family (cognitive / link / txn / admin / subscribe) | Two of the three biggest files in the workspace; the enums stay at root, only the per-variant structs move | |
| 46 | + |
| 47 | +All four are pure-renames + visibility tweaks; no semantic |
| 48 | +changes. Each ships as its own commit; verify-after-each. |
| 49 | + |
| 50 | +### 2.2 Medium-value refactors |
| 51 | + |
| 52 | +| # | Crate | Move | Why later | |
| 53 | +| -- | ----- | ---- | --------- | |
| 54 | +| **E** | brain-server | Split `shard.rs` (975 LOC) — keep `ShardRequest` enum + `shard_main_loop` at root; extract worker-adapter glue into `src/shard_adapters/{rebuild,snapshot,retention,...}.rs` | Cleaner but riskier — touches the Tokio↔Glommio boundary, where bugs are expensive | |
| 55 | +| **F** | brain-planner | Op handlers (`encode/recall/forget/reason/path.rs`) → `src/ops/`; analysis (`cost.rs explain.rs`) stays at root | Mirrors brain-ops naming; less urgent because `executor/` + `plan/` already give some structure | |
| 56 | +| **G** | brain-index | Extract snapshot I/O out of `hnsw.rs` (1200 LOC) into `src/persistence/{codec,io}.rs` | The split needs private-struct visibility surgery; correctness-critical (CRC, magic, versioning) | |
| 57 | + |
| 58 | +### 2.3 Skip |
| 59 | + |
| 60 | +- **brain-server's 11 root files** — `<concern>.rs` naming is |
| 61 | + appropriate for a multi-concern server; root layout reads fine. |
| 62 | + Refactor E covers the one outlier (`shard.rs`). |
| 63 | +- **brain-embed (9 files), brain-core (5), brain-metadata (4)** — |
| 64 | + small enough that sub-moduling adds ceremony without payoff. |
| 65 | +- **Cross-crate duplication** — observed twice (shutdown signal |
| 66 | + pattern, metrics-snapshot wiring) but neither is acute enough to |
| 67 | + hoist into brain-core. Watch for a third occurrence before |
| 68 | + extracting. |
| 69 | + |
| 70 | +### 2.4 Naming inconsistencies |
| 71 | + |
| 72 | +None blocking. The brief surveyed every crate and found |
| 73 | +consistent intra-crate conventions (verbs in brain-ops, kinds in |
| 74 | +brain-workers, concerns in brain-server). The mix is intentional. |
| 75 | + |
| 76 | +--- |
| 77 | + |
| 78 | +## 3. Deferred backlog |
| 79 | + |
| 80 | +### 3.1 SDs closable with a one-line spec PR (S, batch) |
| 81 | + |
| 82 | +- SD-2.3-1, SD-2.4-1 — CRC range typos (`[0..36]→[0..40]`, |
| 83 | + `[0..76]→[0..80]`). |
| 84 | +- SD-3.5-1 — document the `IdempotencyEntry.request_hash` field. |
| 85 | +- SD-4.5-1 — document the three-file HNSW snapshot layout. |
| 86 | +- SD-5.1-1 — tighten §04/03 §11 to "safetensors only". |
| 87 | + |
| 88 | +These are spec-text changes, not code changes. The user owns |
| 89 | +spec edits, so this is a "queue these up next time we touch the |
| 90 | +spec" item, not a Brain-side TODO. |
| 91 | + |
| 92 | +### 3.2 SDs to keep deferred (structurally correct) |
| 93 | + |
| 94 | +SD-2.8-1 (O_DIRECT + WAL pages), SD-2.8-2-b (two-syscall fsync), |
| 95 | +SD-4.5-2 (Box::leak on HnswIo), SD-4.8-1 (RwLock vs ArcSwap on |
| 96 | +HNSW), SD-5.1-2 (full-file safetensors), SD-10.6-1 (crossbeam- |
| 97 | +epoch). All have load-bearing constraints; revisit only if |
| 98 | +benchmarks regress. |
| 99 | + |
| 100 | +### 3.3 Phase 9 code-level punts |
| 101 | + |
| 102 | +Only one real TODO landed in code: |
| 103 | + |
| 104 | +- `crates/brain-server/src/shard_adapters.rs:225` — `hnsw.snapshot` |
| 105 | + in the snapshot worker is a no-op. Blocked on |
| 106 | + `HnswIndex::save_snapshot` (Phase 6 hadn't exposed the API |
| 107 | + when 9.12 shipped). Closing this is a brain-index + |
| 108 | + brain-server change (S). Worth doing before Phase 10 starts if |
| 109 | + Phase 10 touches snapshots; otherwise it's fine to carry. |
| 110 | + |
| 111 | +All other 9.x deferrals (multi-frame streaming, SUBSCRIBE WAL |
| 112 | +replay, per-IP rate limits, full admin surface, in-flight drain |
| 113 | +accounting, signal-handling tests, multi-shard fan-out, crash |
| 114 | +recovery E2E) are intentional v2 / Phase-16 scope. |
| 115 | + |
| 116 | +--- |
| 117 | + |
| 118 | +## 4. Recommendation |
| 119 | + |
| 120 | +Before Phase 10, in priority order: |
| 121 | + |
| 122 | +1. **Fix the AUTH-phase timeout** (§1.1) — 15 LOC, spec MUST. |
| 123 | +2. **Land refactor A** (brain-workers → `workers/`) — sets the |
| 124 | + pattern; lowest-risk. |
| 125 | +3. **Land refactor B** (brain-ops → `ops/`) — same template. |
| 126 | +4. **Land refactor C** (split `writer.rs`) — pays off the biggest |
| 127 | + file in the workspace. |
| 128 | +5. **Land refactor D** (brain-protocol bodies into sub-modules) — |
| 129 | + tackles the other two giants. |
| 130 | + |
| 131 | +Stop here. Refactors E/F/G are nice but not pre-Phase-10 |
| 132 | +urgent; carry them as a backlog. The HNSW snapshot TODO and |
| 133 | +the spec-text SDs can sit until they intersect new work. |
| 134 | + |
| 135 | +Estimated effort for the top 5: ~1 commit each, ~30 min per |
| 136 | +commit including verify. Total ~3 hours. |
| 137 | + |
| 138 | +--- |
| 139 | + |
| 140 | +*Awaiting user direction on which items to execute.* |
0 commit comments