|
| 1 | +# LLM Index — Architecture Document |
| 2 | + |
| 3 | +## Problem |
| 4 | + |
| 5 | +LLM agents (Claude Code, Cursor, etc.) working on ReScript projects need access to type information (function signatures, module contents, type definitions) to write correct code. They have no direct way to send LSP requests the way an editor does, so they need another way to query this information. |
| 6 | + |
| 7 | +The current solution is a Claude Code skill that: |
| 8 | + |
| 9 | +1. Hooks into `js-post-build` to run a Python script after each file compiles |
| 10 | +2. The script calls `rescript-tools doc` (subprocess) to extract type info from `.cmi`/`.cmt` files |
| 11 | +3. Writes the results into a SQLite database (`rescript.db`) |
| 12 | +4. LLMs query the database via `sqlite3 rescript.db "SELECT ..."` |
| 13 | + |
| 14 | +This works but has significant friction: |
| 15 | + |
| 16 | +- Requires Python (`uv`) as a runtime dependency |
| 17 | +- Requires `js-post-build` hook configuration in every `rescript.json` |
| 18 | +- Concurrent `js-post-build` invocations cause write contention (Python's `sqlite3.connect(timeout=30)` is the workaround) |
| 19 | +- Spawns a `rescript-tools doc` subprocess per file, per compile |
| 20 | +- The sync/update/discovery logic duplicates knowledge the compiler already has (package resolution, source directories, module graph) |
| 21 | + |
| 22 | +## Goal |
| 23 | + |
| 24 | +Move the index generation into the `rescript lsp` server so that the database stays in sync automatically, with zero user configuration. The skill simplifies to just the query layer. |
| 25 | + |
| 26 | +## Architecture |
| 27 | + |
| 28 | +``` |
| 29 | +┌─────────────────────────────────────────────────────────┐ |
| 30 | +│ rescript lsp │ |
| 31 | +│ │ |
| 32 | +│ ┌──────────────┐ ┌──────────────┐ ┌───────────────┐ │ |
| 33 | +│ │ Build Engine │ │ LSP Protocol │ │ LLM Index │ │ |
| 34 | +│ │ (rewatch) │ │ (tower-lsp) │ │ Writer │ │ |
| 35 | +│ │ │ │ │ │ │ │ |
| 36 | +│ │ Knows: │ │ │ │ After build: │ │ |
| 37 | +│ │ - All modules │ │ │ │ 1. Identify │ │ |
| 38 | +│ │ - Dep graph │ │ │ │ changed │ │ |
| 39 | +│ │ - cmi/cmt │ │ │ │ modules │ │ |
| 40 | +│ │ paths │ │ │ │ 2. Call │ │ |
| 41 | +│ │ │ │ │ │ analysis │ │ |
| 42 | +│ │ │ │ │ │ binary │ │ |
| 43 | +│ │ │ │ │ │ (batch) │ │ |
| 44 | +│ │ │ │ │ │ 3. Write │ │ |
| 45 | +│ │ │ │ │ │ SQLite │ │ |
| 46 | +│ └──────┬────────┘ └──────────────┘ └───────┬───────┘ │ |
| 47 | +│ │ │ │ |
| 48 | +│ └──────── triggers ───────────────────┘ │ |
| 49 | +└─────────────────────────────────────────────────────────┘ |
| 50 | + │ |
| 51 | + ▼ |
| 52 | + rescript.db ◄──── sqlite3 queries from LLM agents |
| 53 | +``` |
| 54 | + |
| 55 | +### Components |
| 56 | + |
| 57 | +**Analysis binary — new `docIndex` subcommand** |
| 58 | + |
| 59 | +A new subcommand for `rescript-editor-analysis` that processes multiple modules in a single invocation and outputs JSON tailored for direct SQLite insertion. This follows the same pattern as existing analysis subcommands: the Rust side sends a JSON blob on stdin, the OCaml binary processes it and writes JSON to stdout. |
| 60 | + |
| 61 | +Input (JSON via stdin): |
| 62 | + |
| 63 | +```json |
| 64 | +{ |
| 65 | + "files": [ |
| 66 | + { "cmt": "/path/to/lib/lsp/src/App.cmt", "cmi": "/path/to/lib/ocaml/App.cmi" }, |
| 67 | + { "cmti": "/path/to/lib/lsp/src/Types.cmti", "cmi": "/path/to/lib/ocaml/Types.cmi" } |
| 68 | + ], |
| 69 | + "runtimePath": "/path/to/node_modules/@rescript/runtime" |
| 70 | +} |
| 71 | +``` |
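As a rough illustration of how the Rust side might assemble this stdin payload, here is a minimal std-only sketch. The `IndexFile` struct, `json_escape`, and `build_stdin_json` are hypothetical names invented for this example; a real implementation in rewatch would more likely derive the payload from `BuildCommandState` with a JSON library rather than format strings by hand.

```rust
/// One entry of the `files` array (hypothetical struct, not part of rewatch).
pub struct IndexFile {
    /// Path to the `.cmt` or `.cmti` artifact.
    pub typed_tree: String,
    /// True when `typed_tree` points at a `.cmti` (interface) file.
    pub is_interface: bool,
    /// Path to the matching `.cmi`.
    pub cmi: String,
}

/// Escape a string so it can be embedded in a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build the `{ "files": [...], "runtimePath": ... }` document shown above.
pub fn build_stdin_json(files: &[IndexFile], runtime_path: &str) -> String {
    let entries: Vec<String> = files
        .iter()
        .map(|f| {
            // Interface files are sent under "cmti", implementations under "cmt".
            let key = if f.is_interface { "cmti" } else { "cmt" };
            format!(
                "{{\"{}\":\"{}\",\"cmi\":\"{}\"}}",
                key,
                json_escape(&f.typed_tree),
                json_escape(&f.cmi)
            )
        })
        .collect();
    format!(
        "{{\"files\":[{}],\"runtimePath\":\"{}\"}}",
        entries.join(","),
        json_escape(runtime_path)
    )
}
```

The escaping matters mostly for Windows-style paths, where backslashes would otherwise produce invalid JSON.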
| 72 | + |
| 73 | +Output (JSON on stdout) — structured to match the database schema directly, not the generic `rescript-tools doc` format. The Rust side should be able to iterate over this and insert rows without reshaping: |
| 74 | + |
| 75 | +```json |
| 76 | +[ |
| 77 | + { |
| 78 | + "moduleName": "App", |
| 79 | + "qualifiedName": "App", |
| 80 | + "sourceFilePath": "src/App.res", |
| 81 | + "types": [ |
| 82 | + { "name": "state", "kind": "record", "signature": "type state = {count: int}", "detail": "{\"items\":[{\"name\":\"count\",\"signature\":\"int\"}]}" } |
| 83 | + ], |
| 84 | + "values": [ |
| 85 | + { "name": "make", "signature": "(~title: string) => React.element", "paramCount": 1, "returnType": "React.element" } |
| 86 | + ], |
| 87 | + "aliases": [], |
| 88 | + "nestedModules": [ |
| 89 | + { |
| 90 | + "moduleName": "Inner", |
| 91 | + "qualifiedName": "App.Inner", |
| 92 | + "types": [], |
| 93 | + "values": [], |
| 94 | + "aliases": [], |
| 95 | + "nestedModules": [] |
| 96 | + } |
| 97 | + ] |
| 98 | + } |
| 99 | +] |
| 100 | +``` |
| 101 | + |
| 102 | +Key design choices for the output format: |
| 103 | + |
| 104 | +- `detail` is pre-serialized as a JSON string (not a nested object) — the Rust side stores it as-is in SQLite without re-serializing |
| 105 | +- `paramCount` and `returnType` are computed by the OCaml side, which has the typed tree and can derive them accurately instead of regex-counting `=>` |
| 106 | +- `sourceFilePath` is relative to the package root — the Rust side has the package path and can make it absolute for the database |
| 107 | +- Nested modules are inline — the Rust side handles `parent_module_id` assignment during insertion |
| 108 | + |
| 109 | +The Rust side provides the `.cmt`/`.cmti` paths (it knows these from `BuildCommandState`) and the `.cmi` paths (for hash-based invalidation). The analysis binary reads the `.cmt`/`.cmti` to extract type information. |
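The inline nesting and the `parent_module_id` assignment described above can be sketched as a depth-first walk. This is only an illustration under stated assumptions: `ModuleDoc` and `ModuleRow` are hypothetical types carrying just the fields needed here, and ids come from a local counter, whereas a real writer would presumably take them from SQLite after each insert.

```rust
/// Hypothetical in-memory form of one module object from the output JSON.
pub struct ModuleDoc {
    pub qualified_name: String,
    pub nested_modules: Vec<ModuleDoc>,
}

/// Hypothetical row for the `modules` table (only the columns relevant here).
#[derive(Debug, PartialEq)]
pub struct ModuleRow {
    pub id: i64,
    pub parent_module_id: Option<i64>,
    pub qualified_name: String,
}

/// Flatten the nested module tree into rows, parents before children.
pub fn flatten_modules(roots: &[ModuleDoc]) -> Vec<ModuleRow> {
    fn visit(
        module: &ModuleDoc,
        parent: Option<i64>,
        next_id: &mut i64,
        rows: &mut Vec<ModuleRow>,
    ) {
        // Assumption: a local counter stands in for database-assigned ids.
        let id = *next_id;
        *next_id += 1;
        rows.push(ModuleRow {
            id,
            parent_module_id: parent,
            qualified_name: module.qualified_name.clone(),
        });
        for child in &module.nested_modules {
            visit(child, Some(id), next_id, rows);
        }
    }

    let mut rows = Vec::new();
    let mut next_id = 1;
    for root in roots {
        visit(root, None, &mut next_id, &mut rows);
    }
    rows
}
```

Emitting parents before children means each child row can reference an id that already exists, which keeps the insertion order compatible with a foreign key on `parent_module_id`.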
| 110 | + |
| 111 | +**LLM Index Writer (Rust, in rewatch)** |
| 112 | + |
| 113 | +A new module in `rewatch/src/` responsible for: |
| 114 | + |
| 115 | +- Building the stdin JSON from `BuildCommandState` (it already knows all module paths, package paths, etc.) |
| 116 | +- Spawning the analysis binary once with `["rewatch", "docIndex"]` and parsing the stdout JSON |
| 117 | +- Owning the SQLite connection (single writer, no contention) |
| 118 | +- Inserting rows directly from the output format — no reshaping needed |
| 119 | +- Tracking `.cmi` hashes to skip unchanged modules (hash computed on the Rust side before calling the analysis binary, so unchanged modules are never sent) |
| 120 | + |
| 121 | +The writer does not need to resolve packages or discover source files — the `BuildCommandState` already has this information. |
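The hash-tracking step can be illustrated with a small filter. This is a sketch, not the planned implementation: `modules_to_reindex` is a hypothetical helper, hashes are passed in as already-computed strings, and the `stored` map stands in for whatever the writer reads back from `modules.file_hash`.

```rust
use std::collections::HashMap;

/// Return the names of modules whose current `.cmi` hash is new or differs
/// from the stored one. Only these would be sent to the analysis binary.
pub fn modules_to_reindex(
    stored: &HashMap<String, String>,
    current: &[(String, String)],
) -> Vec<String> {
    current
        .iter()
        // A module is stale when no hash is stored or the stored hash differs.
        .filter(|(name, hash)| stored.get(name) != Some(hash))
        .map(|(name, _)| name.clone())
        .collect()
}
```

Modules that disappear from the build are not handled here; deleting their rows would need a separate pass over names present in `stored` but absent from `current`.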
| 122 | + |
| 123 | +**Database file location** |
| 124 | + |
| 125 | +One `rescript.db` per workspace root, not per project root. This maps to the `ProjectMap.states: HashMap<PathBuf, BuildCommandState>` structure in the LSP — a single database can contain modules from multiple project roots (monorepo case, or multiple folders open in the editor). |
| 126 | + |
| 127 | +For the `rescript lsp` case, the database lives alongside the workspace. For the CLI case (`rescript sync`), it lives at the project root. |
| 128 | + |
| 129 | +### Trigger Points |
| 130 | + |
| 131 | +**Initial sync (on LSP startup)** |
| 132 | + |
| 133 | +After `initial_build()` completes and the `BuildCommandState` is populated: |
| 134 | + |
| 135 | +1. Spawn a background task (non-blocking — the LSP should be responsive immediately) |
| 136 | +2. Enumerate all `.cmi` files across all project roots in `ProjectMap.states` |
| 137 | +3. Call the analysis binary batch subcommand |
| 138 | +4. Write everything to `rescript.db` |
| 139 | +5. This includes dependencies (`@rescript/react`, `@rescript/webapi`, etc.) |
| 140 | + |
| 141 | +**Incremental update (after queue flush)** |
| 142 | + |
| 143 | +After the queue consumer finishes a flush cycle (builds + typechecks): |
| 144 | + |
| 145 | +1. Identify which modules were recompiled (the build engine already tracks this) |
| 146 | +2. For changed modules, call the analysis binary to extract updated docs |
| 147 | +3. Upsert into `rescript.db` |
| 148 | + |
| 149 | +Dependencies don't change during normal editing, so incremental updates only cover project modules. |
| 150 | + |
| 151 | +**CLI sync (`rescript sync`) — start here** |
| 152 | + |
| 153 | +A standalone subcommand that builds the project and writes `rescript.db`. This is the first thing to implement because it exercises the analysis binary subcommand + SQLite writer end-to-end without any async/LSP complexity. |
| 154 | + |
| 155 | +Usage: |
| 156 | + |
| 157 | +```bash |
| 158 | +rescript sync # build + index, writes rescript.db in project root |
| 159 | +rescript sync --folder ./packages/app # monorepo: specify project root |
| 160 | +``` |
| 161 | + |
| 162 | +What it does: |
| 163 | + |
| 164 | +1. Run `build::build()` (same as `rescript build`) to get a `BuildState` with all modules compiled |
| 165 | +2. Enumerate all modules from `BuildState.modules` + dependency packages + runtime |
| 166 | +3. For each module, compute `.cmi` hash and collect `.cmt`/`.cmti` paths |
| 167 | +4. Call the analysis binary once: `rescript-editor-analysis.exe rewatch docIndex` with the file list on stdin |
| 168 | +5. Create/open `rescript.db`, apply schema DDL |
| 169 | +6. Insert all rows from the analysis output |
| 170 | +7. Mark auto-opened modules (`Stdlib`, `Pervasives`, and `-open` flags from compiler config) |
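Step 7 can be sketched as a scan over the compiler flags. The helper name `auto_opened_modules` is hypothetical, and the sketch assumes flags arrive either as a single `"-open Belt"` string or as separate `"-open"` and `"Belt"` tokens; how rewatch actually exposes these flags is not specified here.

```rust
/// Collect the names of modules that should be marked `is_auto_opened`.
/// `Stdlib` and `Pervasives` are always included, as described in step 7.
pub fn auto_opened_modules(compiler_flags: &[&str]) -> Vec<String> {
    let mut opened = vec!["Stdlib".to_string(), "Pervasives".to_string()];

    // Split every flag on whitespace so "-open Belt" and ["-open", "Belt"]
    // are handled the same way.
    let tokens: Vec<&str> = compiler_flags
        .iter()
        .copied()
        .flat_map(|flag| flag.split_whitespace())
        .collect();

    let mut i = 0;
    while i < tokens.len() {
        if tokens[i] == "-open" && i + 1 < tokens.len() {
            let name = tokens[i + 1].to_string();
            // Avoid duplicates when a module is opened more than once.
            if !opened.contains(&name) {
                opened.push(name);
            }
            i += 2;
        } else {
            i += 1;
        }
    }
    opened
}
```

The returned names would then be matched against `modules.qualified_name` to set the `is_auto_opened` column.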
| 171 | + |
| 172 | +Implementation touches: |
| 173 | + |
| 174 | +- `rewatch/src/cli.rs` — add `Sync` variant to `Command` enum with a `FolderArg` |
| 175 | +- `rewatch/src/main.rs` — add `cli::Command::Sync { folder } => run_sync(&folder)` match arm |
| 176 | +- `rewatch/src/llm_index.rs` (new) — the SQLite writer module: schema DDL, insert logic, hash tracking |
| 177 | +- `analysis/bin/main.ml` — add `| ["llmIndex"] -> CommandsRewatch.llmIndex ()` to the rewatch dispatch |
| 178 | +- `analysis/src/LlmIndex.ml` (new) — `llmIndex` handler that reads file list from stdin, processes each `.cmt`/`.cmti`, outputs the schema-tailored JSON |
| 179 | + |
| 180 | +### Trying the `llmIndex` subcommand |
| 181 | + |
| 182 | +After building the project (`make lib`), you can test the analysis binary's `llmIndex` subcommand directly: |
| 183 | + |
| 184 | +```bash |
| 185 | +# First, build a ReScript project so .cmt files exist |
| 186 | +cd /path/to/your/rescript-project |
| 187 | +rescript build |
| 188 | + |
| 189 | +# Craft the stdin JSON and pipe it to the analysis binary |
| 190 | +cat <<'EOF' | rescript-editor-analysis.exe rewatch llmIndex |
| 191 | +{ |
| 192 | + "rootPath": "/path/to/your/rescript-project", |
| 193 | + "namespace": null, |
| 194 | + "suffix": ".mjs", |
| 195 | + "rescriptVersion": [13, 0], |
| 196 | + "genericJsxModule": null, |
| 197 | + "opens": [], |
| 198 | + "pathsForModule": { |
| 199 | + "MyModule": { |
| 200 | + "impl": { |
| 201 | + "cmt": "/path/to/your/rescript-project/lib/bs/src/MyModule.cmt", |
| 202 | + "res": "/path/to/your/rescript-project/lib/bs/src/MyModule.res" |
| 203 | + } |
| 204 | + } |
| 205 | + }, |
| 206 | + "projectFiles": ["MyModule"], |
| 207 | + "dependenciesFiles": [], |
| 208 | + "files": [ |
| 209 | + { "moduleName": "MyModule", "cmt": "/path/to/your/rescript-project/lib/bs/src/MyModule.cmt", "cmti": "" } |
| 210 | + ] |
| 211 | +} |
| 212 | +EOF |
| 213 | +``` |
| 214 | + |
| 215 | +```bash |
| 216 | +cat <<'EOF' | /Users/nojaf/Projects/rescript/packages/@rescript/darwin-arm64/bin/rescript-editor-analysis.exe rewatch llmIndex |
| 217 | +{ |
| 218 | + "rootPath": "/Users/nojaf/Projects/relocation", |
| 219 | + "namespace": null, |
| 220 | + "suffix": ".res.mjs", |
| 221 | + "rescriptVersion": [13, 0], |
| 222 | + "genericJsxModule": null, |
| 223 | + "opens": [], |
| 224 | + "pathsForModule": { |
| 225 | + "App": { |
| 226 | + "impl": { |
| 227 | + "cmt": "/Users/nojaf/Projects/relocation/lib/bs/src/App.cmt", |
| 228 | + "res": "/Users/nojaf/Projects/relocation/lib/bs/src/App.res" |
| 229 | + } |
| 230 | + } |
| 231 | + }, |
| 232 | + "projectFiles": ["App"], |
| 233 | + "dependenciesFiles": [], |
| 234 | + "files": [ |
| 235 | + { "moduleName": "App", "cmt": "/Users/nojaf/Projects/relocation/lib/bs/src/App.cmt", "cmti": "" } |
| 236 | + ] |
| 237 | +} |
| 238 | +EOF |
| 239 | +``` |
| 240 | + |
| 241 | +The output is a JSON array of module objects with `records`, `variants`, `typeAliases`, `values`, `moduleAliases`, and `nestedModules`. |
| 242 | + |
| 243 | +### Database Schema |
| 244 | + |
| 245 | +The schema is unchanged from the current skill, where it has worked well for LLM queries: |
| 246 | + |
| 247 | +```sql |
| 248 | +packages (id, name, path, rescript_json, config_hash) |
| 249 | +modules (id, package_id, parent_module_id, name, qualified_name, |
| 250 | + source_file_path, compiled_file_path, file_hash, is_auto_opened) |
| 251 | +types (id, module_id, name, kind, signature, detail) |
| 252 | +"values" (id, module_id, name, return_type, param_count, signature, detail) |
| 253 | +aliases (id, source_module_id, alias_name, alias_kind, target_qualified_name, docstrings) |
| 254 | +``` |
| 255 | + |
| 256 | +Key indexes: `qualified_name`, `compiled_file_path`, `is_auto_opened`, `alias_name`. |
| 257 | + |
| 258 | +Hash-based invalidation: `modules.file_hash` stores the SHA-256 of the `.cmi` file. On incremental update, skip modules whose hash hasn't changed. |
| 259 | + |
| 260 | +### What the Skill Becomes |
| 261 | + |
| 262 | +The skill reduces to: |
| 263 | + |
| 264 | +- `SKILL.md` with the schema documentation and query patterns |
| 265 | +- LLMs query directly: `sqlite3 rescript.db "SELECT ..."` |
| 266 | + |
| 267 | +No Python, no `uv`, no `js-post-build` hook, no sync/update scripts. |
| 268 | + |
| 269 | +## Key Files |
| 270 | + |
| 271 | +### Rust side (rewatch) |
| 272 | + |
| 273 | +| File | What's there | Relevance | |
| 274 | +|------|-------------|-----------| |
| 275 | +| `rewatch/src/lsp.rs` | `Backend` struct, `LanguageServer` impl, `ProjectMap` (maps project roots → `BuildCommandState`) | Top-level LSP orchestration. `ProjectMap.states` is the source of truth for all modules/packages. `initial_build()` (line 762) and queue startup (line 343) are the trigger points. | |
| 276 | +| `rewatch/src/lsp/analysis.rs` | `AnalysisContext`, `build_context_json()`, `spawn()` | Pattern to follow: builds JSON context from `BuildCommandState`, sends via stdin to analysis binary, parses stdout. The new `docIndex` subcommand follows this same pattern. | |
| 277 | +| `rewatch/src/lsp/queue.rs` | Unified debounced queue, `flush_inner()` (line 522) | After flush completes (builds + typechecks), this is where incremental index updates would be triggered. The `buildFinished` notification (line 681) marks the natural hook point. | |
| 278 | +| `rewatch/src/lsp/queue/file_build.rs` | Per-file incremental build | Knows which modules were recompiled — needed to identify what to re-index. | |
| 279 | +| `rewatch/src/build/build_types.rs` | `BuildCommandState` (line 666), `BuildState` (line 647), `Module` enum (line 572), `SourceFileModule` (line 464) | Core types. `BuildState.modules: HashMap<String, Module>` contains all modules with their paths, deps, and compilation stage. | |
| 280 | +| `rewatch/src/cli.rs` | CLI entry point, `Command` enum (line 388) | Where to add the `rescript sync` subcommand. | |
| 281 | +| `rewatch/src/lsp/initial_build.rs` | Full `TypecheckOnly` build on startup | Runs before the queue starts. After this completes, the initial index sync would begin as a background task. | |
| 282 | + |
| 283 | +### OCaml side (analysis binary) |
| 284 | + |
| 285 | +| File | What's there | Relevance | |
| 286 | +|------|-------------|-----------| |
| 287 | +| `analysis/bin/main.ml` | CLI dispatch, `rewatch` subcommand routing (line 135) | Where to add the `"docIndex"` match arm: `\| ["docIndex"] -> CommandsRewatch.docIndex ()` | |
| 288 | +| `analysis/src/CommandsRewatch.ml` | `withRewatchContext` (line 145), all rewatch subcommand handlers | Pattern to follow: reads JSON from stdin via `withRewatchContext`, calls into analysis logic, prints JSON to stdout. The new `docIndex` handler goes here. | |
| 289 | +| `analysis/src/DocumentSymbol.ml` | `command ~path ~source` — extracts symbols from a single file | Existing per-file symbol extraction. The `docIndex` implementation may reuse some of this logic but needs a different output shape. | |
| 290 | +| `tools/src/tools.ml` | `extractDocs` (line 421) — the function behind `rescript-tools doc` | This is what the current Python skill calls. Produces the generic doc JSON. The new `docIndex` subcommand replaces this with a schema-tailored output. | |
| 291 | +| `tools/bin/main.ml` | `rescript-tools` CLI, `"doc"` command (line 60) | Reference for how `extractDocs` is invoked today. Not modified by this work. | |
| 292 | + |
| 293 | +### Current skill (reference implementation to replace) |
| 294 | + |
| 295 | +| File | What's there | Relevance | |
| 296 | +|------|-------------|-----------| |
| 297 | +| `../relocation/.claude/skills/rescript/scripts/rescript-db.py` | Python sync/update/query CLI | The logic being replaced. Useful as reference for: schema DDL, `parse_module_documentation()` (the JSON→rows mapping), hash-based invalidation, auto-opened module detection. | |
| 298 | +| `../relocation/.claude/skills/rescript/SKILL.md` | Skill documentation, schema docs, query patterns | The query patterns and schema documentation survive as-is. The sync/update sections go away. | |
| 299 | + |
| 300 | +## Decisions Made |
| 301 | + |
| 302 | +- **Analysis binary input format**: JSON via stdin, consistent with how all other analysis subcommands work (see `analysis.rs` — `build_context_json` + `spawn()`). |
| 303 | +- **Output format**: Tailored for SQLite insertion, not the generic `rescript-tools doc` shape. The OCaml side computes `paramCount`/`returnType` accurately from the typed tree. `detail` is pre-serialized as a JSON string. |
| 304 | + |
| 305 | +## Open Questions |
| 306 | + |
| 307 | +- **Database path configuration**: Should the LSP accept an initialization option for the database path, or always use a fixed location relative to the workspace/project root? |
| 308 | +- **Dependency indexing frequency**: Dependencies only change on `bun install` / package updates. Should we track a hash of `node_modules` state to know when to re-index deps, or just re-index them on every full sync? |
| 309 | +- **WAL mode and readers**: SQLite WAL mode allows concurrent reads while the LSP writes. Do we need any additional coordination, or is WAL sufficient? |
| 310 | +- **Multi-root workspaces**: When multiple project roots exist in `ProjectMap.states`, should the database include a `project_root` column to disambiguate, or is the `packages` table sufficient? |
| 311 | +- **Auto-opened modules**: The current skill detects these from compiler flags (`-open`) and hardcodes `Stdlib`/`Pervasives` for `@rescript/runtime`. Should the analysis binary report `is_auto_opened` per module, or should the Rust side keep this logic? |