Commit 3834a65

Add llmIndex analysis binary subcommand
New `llmIndex` subcommand for rescript-editor-analysis that extracts module information (records, variants, type aliases, values, module aliases, nested modules) from .cmt/.cmti files and outputs structured JSON for SQLite insertion.

- analysis/src/LlmIndex.ml: walks typed tree, computes paramCount and returnType from type_expr, extracts record fields and variant constructors as structured data
- analysis/bin/main.ml: dispatch llmIndex to CommandsRewatch
- tests/rewatch_tests/tests/llm-index.test.mjs: snapshot test
- LLM_INDEX.md: updated with llmIndex docs and example
1 parent 6a04d52 commit 3834a65

6 files changed: +835 -0 lines changed

LLM_INDEX.md

Lines changed: 311 additions & 0 deletions
@@ -0,0 +1,311 @@
# LLM Index: Architecture Document

## Problem

LLM agents (Claude Code, Cursor, etc.) working on ReScript projects need access to type information (function signatures, module contents, type definitions) to write correct code. They cannot call LSP requests directly from the editor, so they need an alternative way to query this information.

The current solution is a Claude Code skill that:

1. Hooks into `js-post-build` to run a Python script after each file compiles
2. Calls `rescript-tools doc` (as a subprocess) from that script to extract type info from `.cmi`/`.cmt` files
3. Writes the results into a SQLite database (`rescript.db`)
4. Lets LLMs query the database via `sqlite3 rescript.db "SELECT ..."`

This works but has significant friction:

- Requires Python (`uv`) as a runtime dependency
- Requires `js-post-build` hook configuration in every `rescript.json`
- Concurrent `js-post-build` invocations cause write contention (Python's `sqlite3.connect(timeout=30)` is the workaround)
- Spawns a `rescript-tools doc` subprocess per file, per compile
- The sync/update/discovery logic duplicates knowledge the compiler already has (package resolution, source directories, module graph)

## Goal

Move the index generation into the `rescript lsp` server so that the database stays in sync automatically, with zero user configuration. The skill simplifies to just the query layer.

## Architecture

```
┌──────────────────────────────────────────────────────────┐
│                       rescript lsp                       │
│                                                          │
│  ┌───────────────┐  ┌──────────────┐  ┌───────────────┐  │
│  │ Build Engine  │  │ LSP Protocol │  │ LLM Index     │  │
│  │ (rewatch)     │  │ (tower-lsp)  │  │ Writer        │  │
│  │               │  │              │  │               │  │
│  │ Knows:        │  │              │  │ After build:  │  │
│  │ - All modules │  │              │  │ 1. Identify   │  │
│  │ - Dep graph   │  │              │  │    changed    │  │
│  │ - cmi/cmt     │  │              │  │    modules    │  │
│  │   paths       │  │              │  │ 2. Call       │  │
│  │               │  │              │  │    analysis   │  │
│  │               │  │              │  │    binary     │  │
│  │               │  │              │  │    (batch)    │  │
│  │               │  │              │  │ 3. Write      │  │
│  │               │  │              │  │    SQLite     │  │
│  └──────┬────────┘  └──────────────┘  └───────┬───────┘  │
│         │                                     │          │
│         └──────── triggers ───────────────────┘          │
└──────────────────────────────────────────────────────────┘

rescript.db ◄──── sqlite3 queries from LLM agents
```

### Components

**Analysis binary: new `llmIndex` subcommand**

A new subcommand for `rescript-editor-analysis` that processes multiple modules in a single invocation and outputs JSON tailored for direct SQLite insertion. This follows the same pattern as existing analysis subcommands: the Rust side sends a JSON blob on stdin, the OCaml binary processes it and writes JSON to stdout.

Input (JSON via stdin):

```json
{
  "files": [
    { "cmt": "/path/to/lib/lsp/src/App.cmt", "cmi": "/path/to/lib/ocaml/App.cmi" },
    { "cmti": "/path/to/lib/lsp/src/Types.cmti", "cmi": "/path/to/lib/ocaml/Types.cmi" }
  ],
  "runtimePath": "/path/to/node_modules/@rescript/runtime"
}
```

Output (JSON on stdout), structured to match the database schema directly rather than the generic `rescript-tools doc` format. The Rust side should be able to iterate over this and insert rows without reshaping:

```json
[
  {
    "moduleName": "App",
    "qualifiedName": "App",
    "sourceFilePath": "src/App.res",
    "types": [
      { "name": "state", "kind": "record", "signature": "type state = {count: int}", "detail": "{\"items\":[{\"name\":\"count\",\"signature\":\"int\"}]}" }
    ],
    "values": [
      { "name": "make", "signature": "(~title: string) => React.element", "paramCount": 1, "returnType": "React.element" }
    ],
    "aliases": [],
    "nestedModules": [
      {
        "moduleName": "Inner",
        "qualifiedName": "App.Inner",
        "types": [],
        "values": [],
        "aliases": [],
        "nestedModules": []
      }
    ]
  }
]
```

Key design choices for the output format:

- `detail` is pre-serialized as a JSON string (not a nested object), so the Rust side stores it as-is in SQLite without re-serializing
- `paramCount` and `returnType` are computed by the OCaml side (it has the typed tree, so it can compute them accurately instead of regex-counting `=>`)
- `sourceFilePath` is relative to the package root; the Rust side has the package path and can make it absolute for the database
- Nested modules are inline; the Rust side handles `parent_module_id` assignment during insertion
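
To illustrate why walking the typed tree is more reliable than counting `=>` in a printed signature, here is a minimal sketch in Rust. The real computation happens on the OCaml side against the compiler's `type_expr`; the `TypeExpr` enum and `param_count_and_return` function below are hypothetical simplifications introduced only for this example.

```rust
/// Hypothetical, heavily simplified stand-in for a typed-tree type expression.
/// The compiler's real `type_expr` is far richer (labels, type variables, etc.).
#[derive(Debug, Clone, PartialEq)]
enum TypeExpr {
    /// A named type such as `int` or `React.element`.
    Constr(String),
    /// A function arrow: argument type, then result type.
    Arrow(Box<TypeExpr>, Box<TypeExpr>),
}

/// Walks the outermost chain of arrows, counting one parameter per arrow.
/// Arrows nested inside an argument (callbacks) are never visited, so they
/// do not inflate the count the way a textual `=>` count would.
fn param_count_and_return(ty: &TypeExpr) -> (usize, &TypeExpr) {
    let mut count = 0;
    let mut current = ty;
    while let TypeExpr::Arrow(_, ret) = current {
        count += 1;
        current = &**ret;
    }
    (count, current)
}
```

For a signature such as `(int => int) => string`, counting `=>` in the text would report two parameters, whereas the walk above reports one parameter and a `string` return type.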

The Rust side provides the `.cmt`/`.cmti` paths (it knows these from `BuildCommandState`) and the `.cmi` paths (for hash-based invalidation). The analysis binary reads the `.cmt`/`.cmti` to extract type information.

**LLM Index Writer (Rust, in rewatch)**

A new module in `rewatch/src/` responsible for:

- Building the stdin JSON from `BuildCommandState` (it already knows all module paths, package paths, etc.)
- Spawning the analysis binary once with `["rewatch", "llmIndex"]` and parsing the stdout JSON
- Owning the SQLite connection (single writer, no contention)
- Inserting rows directly from the output format, with no reshaping needed
- Tracking `.cmi` hashes to skip unchanged modules (hash computed on the Rust side before calling the analysis binary, so unchanged modules are never sent)

The writer does not need to resolve packages or discover source files; the `BuildCommandState` already has this information.
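
As a rough illustration of how the writer might turn the inline `nestedModules` structure into flat rows with `parent_module_id` set, consider the sketch below. `ModuleEntry`, `ModuleRow`, and `flatten_modules` are hypothetical names, and the id counter is an assumption standing in for whatever ids the real SQLite insertion yields.

```rust
/// Hypothetical mirror of one module object from the analysis output
/// (only the fields needed for this example).
#[derive(Debug, Clone)]
struct ModuleEntry {
    module_name: String,
    qualified_name: String,
    nested_modules: Vec<ModuleEntry>,
}

/// Hypothetical row shape for the `modules` table.
#[derive(Debug, PartialEq)]
struct ModuleRow {
    id: i64,
    parent_module_id: Option<i64>,
    name: String,
    qualified_name: String,
}

/// Depth-first walk: each module gets the next id, and its nested modules
/// are emitted right after it with `parent_module_id` pointing back at it.
fn flatten_modules(
    entry: &ModuleEntry,
    parent: Option<i64>,
    next_id: &mut i64,
    rows: &mut Vec<ModuleRow>,
) {
    let id = *next_id;
    *next_id += 1;
    rows.push(ModuleRow {
        id,
        parent_module_id: parent,
        name: entry.module_name.clone(),
        qualified_name: entry.qualified_name.clone(),
    });
    for child in &entry.nested_modules {
        flatten_modules(child, Some(id), next_id, rows);
    }
}
```

Called on the `App` example from the output format with a counter starting at 1, this would yield a row for `App` with no parent, followed by a row for `App.Inner` whose `parent_module_id` is 1.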

**Database file location**

One `rescript.db` per workspace root, not per project root. This maps to the `ProjectMap.states: HashMap<PathBuf, BuildCommandState>` structure in the LSP: a single database can contain modules from multiple project roots (monorepo case, or multiple folders open in the editor).

For the `rescript lsp` case, the database lives alongside the workspace. For the CLI case (`rescript sync`), it lives at the project root.

### Trigger Points

**Initial sync (on LSP startup)**

After `initial_build()` completes and the `BuildCommandState` is populated:

1. Spawn a background task (non-blocking, so the LSP stays responsive immediately)
2. Enumerate all `.cmi` files across all project roots in `ProjectMap.states`
3. Call the analysis binary batch subcommand
4. Write everything to `rescript.db`

The initial sync includes dependencies (`@rescript/react`, `@rescript/webapi`, etc.).

**Incremental update (after queue flush)**

After the queue consumer finishes a flush cycle (builds + typechecks):

1. Identify which modules were recompiled (the build engine already tracks this)
2. For changed modules, call the analysis binary to extract updated docs
3. Upsert into `rescript.db`

Dependencies don't change during normal editing, so incremental updates only cover project modules.

**CLI sync (`rescript sync`): start here**

A standalone subcommand that builds the project and writes `rescript.db`. This is the first thing to implement because it exercises the analysis binary subcommand + SQLite writer end-to-end without any async/LSP complexity.

Usage:

```bash
rescript sync                            # build + index, writes rescript.db in project root
rescript sync --folder ./packages/app    # monorepo: specify project root
```

What it does:

1. Run `build::build()` (same as `rescript build`) to get a `BuildState` with all modules compiled
2. Enumerate all modules from `BuildState.modules` + dependency packages + runtime
3. For each module, compute `.cmi` hash and collect `.cmt`/`.cmti` paths
4. Call the analysis binary once: `rescript-editor-analysis.exe rewatch llmIndex` with the file list on stdin
5. Create/open `rescript.db`, apply schema DDL
6. Insert all rows from the analysis output
7. Mark auto-opened modules (`Stdlib`, `Pervasives`, and `-open` flags from compiler config)
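
Step 7 can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and it assumes compiler flags arrive as a list of strings in which `-open` and its module name may share one entry (`"-open Belt"`) or be split across two.

```rust
/// Collects the module names that follow an `-open` flag.
/// Assumption: each entry may hold several whitespace-separated tokens,
/// so every entry is split before scanning.
fn opened_modules(flags: &[&str]) -> Vec<String> {
    let mut tokens = flags.iter().flat_map(|f| f.split_whitespace());
    let mut opened = Vec::new();
    while let Some(token) = tokens.next() {
        if token == "-open" {
            if let Some(name) = tokens.next() {
                opened.push(name.to_string());
            }
        }
    }
    opened
}

/// Decides the `is_auto_opened` column for one module.
/// `Stdlib` and `Pervasives` count only when the module belongs to the
/// runtime package; everything else must appear in an `-open` flag.
fn is_auto_opened(module_name: &str, is_runtime_package: bool, opened: &[String]) -> bool {
    let runtime_default =
        is_runtime_package && matches!(module_name, "Stdlib" | "Pervasives");
    runtime_default || opened.iter().any(|m| m.as_str() == module_name)
}
```

Whether this logic stays on the Rust side or moves into the analysis binary is one of the open questions listed at the end of this document.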

Implementation touches:

- `rewatch/src/cli.rs`: add `Sync` variant to `Command` enum with a `FolderArg`
- `rewatch/src/main.rs`: add `cli::Command::Sync { folder } => run_sync(&folder)` match arm
- `rewatch/src/llm_index.rs` (new): the SQLite writer module (schema DDL, insert logic, hash tracking)
- `analysis/bin/main.ml`: add `| ["llmIndex"] -> CommandsRewatch.llmIndex ()` to the rewatch dispatch
- `analysis/src/LlmIndex.ml` (new): `llmIndex` handler that reads file list from stdin, processes each `.cmt`/`.cmti`, outputs the schema-tailored JSON

### Trying the `llmIndex` subcommand

After building the project (`make lib`), you can test the analysis binary's `llmIndex` subcommand directly:

```bash
# First, build a ReScript project so .cmt files exist
cd /path/to/your/rescript-project
rescript build

# Craft the stdin JSON and pipe it to the analysis binary
cat <<'EOF' | rescript-editor-analysis.exe rewatch llmIndex
{
  "rootPath": "/path/to/your/rescript-project",
  "namespace": null,
  "suffix": ".mjs",
  "rescriptVersion": [13, 0],
  "genericJsxModule": null,
  "opens": [],
  "pathsForModule": {
    "MyModule": {
      "impl": {
        "cmt": "/path/to/your/rescript-project/lib/bs/src/MyModule.cmt",
        "res": "/path/to/your/rescript-project/lib/bs/src/MyModule.res"
      }
    }
  },
  "projectFiles": ["MyModule"],
  "dependenciesFiles": [],
  "files": [
    { "moduleName": "MyModule", "cmt": "/path/to/your/rescript-project/lib/bs/src/MyModule.cmt", "cmti": "" }
  ]
}
EOF
```

```bash
cat <<'EOF' | /Users/nojaf/Projects/rescript/packages/@rescript/darwin-arm64/bin/rescript-editor-analysis.exe rewatch llmIndex
{
  "rootPath": "/Users/nojaf/Projects/relocation",
  "namespace": null,
  "suffix": ".res.mjs",
  "rescriptVersion": [13, 0],
  "genericJsxModule": null,
  "opens": [],
  "pathsForModule": {
    "App": {
      "impl": {
        "cmt": "/Users/nojaf/Projects/relocation/lib/bs/src/App.cmt",
        "res": "/Users/nojaf/Projects/relocation/lib/bs/src/App.res"
      }
    }
  },
  "projectFiles": ["App"],
  "dependenciesFiles": [],
  "files": [
    { "moduleName": "App", "cmt": "/Users/nojaf/Projects/relocation/lib/bs/src/App.cmt", "cmti": "" }
  ]
}
EOF
```

The output is a JSON array of module objects with `records`, `variants`, `typeAliases`, `values`, `moduleAliases`, and `nestedModules`.

### Database Schema

Same schema as the current skill, proven to work well for LLM queries:

```sql
packages (id, name, path, rescript_json, config_hash)
modules  (id, package_id, parent_module_id, name, qualified_name,
          source_file_path, compiled_file_path, file_hash, is_auto_opened)
types    (id, module_id, name, kind, signature, detail)
"values" (id, module_id, name, return_type, param_count, signature, detail)
aliases  (id, source_module_id, alias_name, alias_kind, target_qualified_name, docstrings)
```

Key indexes: `qualified_name`, `compiled_file_path`, `is_auto_opened`, `alias_name`.

Hash-based invalidation: `modules.file_hash` stores the SHA-256 of the `.cmi` file. On incremental update, skip modules whose hash hasn't changed.
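
The skip logic can be sketched like this. It is a minimal illustration, not the writer's actual code: `Candidate` and `changed_modules` are hypothetical names, the stored hashes are assumed to have been loaded from `modules.file_hash` into a map keyed by module name, and the SHA-256 digests are taken as precomputed hex strings because hashing itself is outside the standard library.

```rust
use std::collections::HashMap;

/// Hypothetical candidate for re-indexing: a module name plus the hex
/// digest of its current `.cmi` file (computed elsewhere).
struct Candidate {
    module_name: String,
    cmi_hash: String,
}

/// Returns the candidates that need re-indexing: modules whose stored hash
/// differs from the current one, and modules with no stored hash at all.
fn changed_modules<'a>(
    stored: &HashMap<String, String>,
    candidates: &'a [Candidate],
) -> Vec<&'a Candidate> {
    candidates
        .iter()
        .filter(|c| stored.get(&c.module_name) != Some(&c.cmi_hash))
        .collect()
}
```

Only the modules returned here would be sent to the analysis binary, which matches the intent that unchanged modules are never passed along.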

### What the Skill Becomes

The skill reduces to:

- `SKILL.md` with the schema documentation and query patterns
- LLMs query directly: `sqlite3 rescript.db "SELECT ..."`

No Python, no `uv`, no `js-post-build` hook, no sync/update scripts.

## Key Files

### Rust side (rewatch)

| File | What's there | Relevance |
|------|-------------|-----------|
| `rewatch/src/lsp.rs` | `Backend` struct, `LanguageServer` impl, `ProjectMap` (maps project roots → `BuildCommandState`) | Top-level LSP orchestration. `ProjectMap.states` is the source of truth for all modules/packages. `initial_build()` (line 762) and queue startup (line 343) are the trigger points. |
| `rewatch/src/lsp/analysis.rs` | `AnalysisContext`, `build_context_json()`, `spawn()` | Pattern to follow: builds JSON context from `BuildCommandState`, sends via stdin to analysis binary, parses stdout. The new `llmIndex` subcommand follows this same pattern. |
| `rewatch/src/lsp/queue.rs` | Unified debounced queue, `flush_inner()` (line 522) | After flush completes (builds + typechecks), this is where incremental index updates would be triggered. The `buildFinished` notification (line 681) marks the natural hook point. |
| `rewatch/src/lsp/queue/file_build.rs` | Per-file incremental build | Knows which modules were recompiled, which is needed to identify what to re-index. |
| `rewatch/src/build/build_types.rs` | `BuildCommandState` (line 666), `BuildState` (line 647), `Module` enum (line 572), `SourceFileModule` (line 464) | Core types. `BuildState.modules: HashMap<String, Module>` contains all modules with their paths, deps, and compilation stage. |
| `rewatch/src/cli.rs` | CLI entry point, `Command` enum (line 388) | Where to add a `rescript sync` subcommand. |
| `rewatch/src/lsp/initial_build.rs` | Full `TypecheckOnly` build on startup | Runs before the queue starts. After this completes, the initial index sync would begin as a background task. |

### OCaml side (analysis binary)

| File | What's there | Relevance |
|------|-------------|-----------|
| `analysis/bin/main.ml` | CLI dispatch, `rewatch` subcommand routing (line 135) | Where to add the `"llmIndex"` match arm: `\| ["llmIndex"] -> CommandsRewatch.llmIndex ()` |
| `analysis/src/CommandsRewatch.ml` | `withRewatchContext` (line 145), all rewatch subcommand handlers | Pattern to follow: reads JSON from stdin via `withRewatchContext`, calls into analysis logic, prints JSON to stdout. The new `llmIndex` handler goes here and delegates to `LlmIndex.command ()`. |
| `analysis/src/DocumentSymbol.ml` | `command ~path ~source`, which extracts symbols from a single file | Existing per-file symbol extraction. The `llmIndex` implementation may reuse some of this logic but needs a different output shape. |
| `tools/src/tools.ml` | `extractDocs` (line 421), the function behind `rescript-tools doc` | This is what the current Python skill calls. Produces the generic doc JSON. The new `llmIndex` subcommand replaces this with a schema-tailored output. |
| `tools/bin/main.ml` | `rescript-tools` CLI, `"doc"` command (line 60) | Reference for how `extractDocs` is invoked today. Not modified by this work. |

### Current skill (reference implementation to replace)

| File | What's there | Relevance |
|------|-------------|-----------|
| `../relocation/.claude/skills/rescript/scripts/rescript-db.py` | Python sync/update/query CLI | The logic being replaced. Useful as reference for: schema DDL, `parse_module_documentation()` (the JSON→rows mapping), hash-based invalidation, auto-opened module detection. |
| `../relocation/.claude/skills/rescript/SKILL.md` | Skill documentation, schema docs, query patterns | The query patterns and schema documentation survive as-is. The sync/update sections go away. |

## Decisions Made

- **Analysis binary input format**: JSON via stdin, consistent with how all other analysis subcommands work (see `analysis.rs`: `build_context_json()` + `spawn()`).
- **Output format**: Tailored for SQLite insertion, not the generic `rescript-tools doc` shape. The OCaml side computes `paramCount`/`returnType` accurately from the typed tree. `detail` is pre-serialized as a JSON string.

## Open Questions

- **Database path configuration**: Should the LSP accept an initialization option for the database path, or always use a fixed location relative to the workspace/project root?
- **Dependency indexing frequency**: Dependencies only change on `bun install` / package updates. Should we track a hash of `node_modules` state to know when to re-index deps, or just re-index them on every full sync?
- **WAL mode and readers**: SQLite WAL mode allows concurrent reads while the LSP writes. Do we need any additional coordination, or is WAL sufficient?
- **Multi-root workspaces**: When multiple project roots exist in `ProjectMap.states`, should the database include a `project_root` column to disambiguate, or is the `packages` table sufficient?
- **Auto-opened modules**: The current skill detects these from compiler flags (`-open`) and hardcodes `Stdlib`/`Pervasives` for `@rescript/runtime`. Should the analysis binary report `is_auto_opened` per module, or should the Rust side keep this logic?

analysis/bin/main.ml

Lines changed: 1 addition & 0 deletions
@@ -148,6 +148,7 @@ let main () =
  | ["inlayHint"] -> CommandsRewatch.inlayHint ()
  | ["semanticTokens"] -> CommandsRewatch.semanticTokens ()
  | ["codeAction"] -> CommandsRewatch.codeAction ()
+ | ["llmIndex"] -> CommandsRewatch.llmIndex ()
  | _ -> prerr_endline "Unknown rewatch subcommand")
  | [_; "completion"; path; line; col; currentFile] ->
  printHeaderInfo path line col;

analysis/src/CommandsRewatch.ml

Lines changed: 2 additions & 0 deletions
@@ -433,6 +433,8 @@ let signatureHelp () =
  {signatures = []; activeSignature = None; activeParameter = None}
  | Some res -> Protocol.stringifySignatureHelp res)

+ let llmIndex () = LlmIndex.command ()
+
  let typeDefinition () =
  withRewatchContext ~name:"typeDefinition" ~default:Protocol.null (fun ctx ->
  let locationOpt =
