RandomCoder-lab
diff --git a/‎CHANGELOG.md‎
Lines changed: 67 additions & 0 deletions b/‎CHANGELOG.md‎
diff --git a/‎README.md‎
Lines changed: 1 addition & 0 deletions b/‎README.md‎
diff --git a/‎ROADMAP.md‎
Lines changed: 21 additions & 16 deletions b/‎ROADMAP.md‎
diff --git a/‎experiments/substrate_context/FINDING.md‎
Lines changed: 95 additions & 0 deletions b/‎experiments/substrate_context/FINDING.md‎
@@ -13,6 +13,7 @@ Read top-to-bottom for the arc; jump to any chapter for the detail.
 
 | Tag | Date | One-line |
 |---|---|---|
+| [v0.4-substrate-context](#v04-substrate-context--2026-05-17) | 2026-05-17 | Symbolic compression end-to-end: `omc_compress_context` / `omc_decompress` tools + `format=codec` thumbnails + directory ingest. Measured 1.85×–2.81× LLM context budget reduction. |
 | [v0.3.1-symbolic-compression](#v031-symbolic-compression--2026-05-17) | 2026-05-17 | `omc_predict` gains `format=hash`/`signature`/`full` (default = compressed hash form, 3.8× smaller context cost) + `omc_fetch_by_hash` companion for on-demand recovery |
 | [v0.3-symbolic-prediction](#v03-symbolic-prediction--2026-05-17) | 2026-05-17 | Substrate-indexed code completion: `omc_predict_files(paths, prefix, top_k)` returns ranked provenance-tracked continuations from a content-addressed corpus |
 | [v0.2-ergonomics](#v02-ergonomics--2026-05-17) | 2026-05-17 | OMC becomes forgiving: Python-idiom builtins, `+=`, friendly errors with traces, 11 heal classes total |
@@ -26,6 +27,72 @@ Read top-to-bottom for the arc; jump to any chapter for the detail.
 
 ---
 
+## [v0.4-substrate-context] - 2026-05-17
+
+**Symbolic-context compression end-to-end: an LLM agent can browse a code corpus at substrate cost and recover full bodies on demand, with measured 1.85×–2.81× context-budget reduction.**
+
+v0.3.1 added `format=hash` so a single predict call ships compactly. v0.4 takes the thesis end-to-end — every LLM-facing surface that hands the model code can do so symbolically.
+
+### What changed
+
+#### `omc_predict` — `format=codec` thumbnail
+
+A bounded substrate-thumbnail format. Each suggestion ships the canonical hash plus a capped (≤16 token) structural sample. Enough to distinguish "matmul-heavy" from "dict-traversal" candidates without paying for the body. Sits between `signature` (text-only) and `full` (everything).
+
+#### `omc_compress_context(text, every_n?)` — new MCP tool
+
+Symmetric companion to `omc_fetch_by_hash`. Takes arbitrary OMC source, returns a substrate-keyed codec payload (sampled_tokens + content_hash + provenance). The LLM uses this to "remember" chunks of code it's just seen, paying ~50 bytes for a 5KB function.
+
+#### `omc_decompress(paths, codec | canonical_hash)` — new MCP tool
+
+Generalization of `omc_fetch_by_hash`. Accepts either a bare canonical hash or a full codec payload dict, and recovers the original source via library lookup against the corpus. The lookup is alpha-rename invariant: a body still matches after its identifiers are consistently renamed. The LLM can take any hash from any tool and recover the body wherever the corpus is available.
+
+#### Directory ingest
+
+`paths` arguments to `omc_predict`, `omc_corpus_size`, `omc_fetch_by_hash`, and `omc_decompress` now accept directory entries; the server recursively globs `*.omc` files. For example, `["examples/lib"]` ingests 320 fns across 16 files in stable order. This enables cross-corpus blending: project + stdlib + registry can be queried as one logical corpus.
+
+#### Hash unification
+
+The fix that makes the whole thing compose: `omc_predict`'s `canonical_hash` and `omc_compress_context`'s `content_hash` are now produced by the same primitive (`tokenizer::code_hash`), so they're interchangeable across all tools.
+
+### Measured compression
+
+10-task representative LLM workflow against `examples/lib` (320 fns):
+
+| Strategy | top_k=5 | top_k=10 | top_k=20 |
+|---|---:|---:|---:|
+| v0.3 baseline (full source) | 14,142 B | 27,828 B | 39,902 B |
+| v0.4 (hash browse + 1 fetch) | 6,864 B | 10,318 B | 14,188 B |
+| **Compression factor** | **2.06×** | **2.70×** | **2.81×** |
+
+The win amplifies with browse depth: per-candidate cost stays at the substrate floor (~50 B for the hash, ~70 B for the metadata) while bodies stay un-paid-for unless committed to.
+
+### Honest limits
+
+The original ask was "10% of the context budget" — that's ~10×. The structural ceiling for hash-browse + on-demand-fetch alone is closer to 3×; the 10× claim requires the v0.5 candidate (substrate-keyed conversation memory where prior agent turns are hashes rather than inline text). v0.4 ships the primitives; the conversation-memory wiring is the next chapter.
+
+### Tests
+
+20/20 MCP integration tests pass (13 existing + 7 new). New tests in v0.4:
+- codec format works and `content_hash == canonical_hash`
+- compress + decompress round-trip via corpus library lookup
+- decompress accepts bare hash or codec dict
+- missing-input errors are friendly
+- directory ingest pulls 100+ fns across multiple files
+- new tools appear in `tools/list`
+
+Final: 231 Rust pass (8 MCP integration), 1087/1087 OMC.
+
+### Files / commits
+
+- `omnimcode-mcp/src/main.rs` — three new tool schemas, three new handlers, `encode_codec_payload` helper, directory-walking `build_corpus`
+- `omnimcode-mcp/tests/integration.rs` — 7 new tests
+- `experiments/substrate_context/FINDING.md` — full writeup
+- `experiments/substrate_context/bench_context_budget.py` — reproducible benchmark
+- `experiments/substrate_context/results_context_budget.json` — raw data
+
+---
+
 ## [v0.3.1-symbolic-compression] - 2026-05-17
 
 **`omc_predict` learns to compress: default response is hash-only (~50 bytes/suggestion), with on-demand body recovery via `omc_fetch_by_hash`.**
 
@@ -266,6 +266,7 @@ If you're trying to understand how OMC got here, **read the [GitHub Releases](ht
 | [v0.2-ergonomics](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.2-ergonomics) | OMC becomes forgiving: Python-idiom builtins, `+=`, traced errors, 11 heal classes |
 | [v0.3-symbolic-prediction](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.3-symbolic-prediction) | Substrate-indexed code completion: `omc_predict_files` returns ranked provenance-tracked continuations |
 | [v0.3.1-symbolic-compression](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.3.1-symbolic-compression) | `omc_predict` learns to compress: `format=hash` default is 3.8× smaller, with `omc_fetch_by_hash` for on-demand body recovery |
+| [v0.4-substrate-context](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.4-substrate-context) | Symbolic compression end-to-end: `omc_compress_context` / `omc_decompress` + directory ingest + measured 2-3× LLM context-budget reduction |
 
 ---
 
 
@@ -1,40 +1,44 @@
 # OMC Roadmap
 
-Current chapter: **v0.3.1-symbolic-compression** (shipped 2026-05-17).
-Next chapter: **v0.4-substrate-context** (planned — the symbolic-context compression thesis taken seriously).
+Current chapter: **v0.4-substrate-context** (shipped 2026-05-17).
+Next chapter: **v0.5-substrate-memory** candidate — substrate-keyed conversation memory via fibtier, the path to the 10× context-budget claim that v0.4 fell short of.
 
 See [CHANGELOG.md](CHANGELOG.md) and [GitHub Releases](https://github.com/RandomCoder-lab/OMC/releases) for the chapter-by-chapter history of how OMC got here. This file describes what's on the path going forward.
 
 ---
 
-## v0.4-substrate-context (planned)
+## v0.5-substrate-memory (candidate)
 
-**Take the symbolic-context compression thesis end-to-end.** v0.3.1 added format options to omc_predict (3.8× compression on the predict response path). v0.4 generalizes: every LLM-facing OMC surface becomes substrate-aware about its context cost.
+**Substrate-keyed conversation memory: agent history becomes a stream of canonical hashes that resolve into full content only when reasoning needs it.**
 
-The substrate codec from v0.0.5 already does library-lookup compression (`omc_codec_encode` → 10-50× ratios when the receiver has the library). The v0.4 chapter wires it into the LLM flow as a first-class context-compression mechanism:
+v0.4 hit 2.81× context compression on the predict + fetch flow. The structural ceiling for that pattern alone is ~3×; v0.5 targets the 10× regime by making **conversation transcripts themselves substrate-typed**:
 
-### Tracks
-
-- **`omc_export_module(path, format=codec)`** — emit a module as a sampled-token codec payload. The LLM consumes the payload (a few hundred bytes) instead of the full source (several KB). Recovery is via library lookup against the LLM's known corpus, or via `omc_codec_decode_lookup` for explicit reconstruction.
-- **Substrate-keyed conversation memory** — wire the `fibtier` memory primitive to store conversation entries as canonical hashes; fetch on demand via the kernel. An LLM's conversation history becomes a stream of hash references that recover into full content when reasoning needs it.
-- **MCP tool: `omc_compress_context(text)`** — given a chunk of OMC code or prose, return a substrate-keyed compressed form the LLM can reference. The complement of `omc_fetch_by_hash`.
-- **Cross-corpus blending** — query multiple corpora (project, stdlib, registry) with weighted ranking, return substrate-keyed identifiers that work across any of them.
-- **Substrate-typed conversation transcripts** — every message in an agent conversation gets a canonical hash; threading + memory operations index by hash, not by string.
-- **Benchmark: end-to-end context-budget reduction** — measure how many fns an LLM agent can hold "in mind" with v0.4 vs without. Hypothesis: 5-10× more candidates fit in the same context window.
+- Wire `fibtier` (Fibonacci-tier memory) as an MCP-exposed conversation store
+- Each turn is content-addressed; prior turns recovered via `omc_fetch_by_hash` only when the agent needs to "remember" them
+- Combined with v0.4's predict + compress + decompress, the target is for the total context budget of a multi-turn agent task to drop to ~10% of baseline
 
 ### Win condition
 
-An LLM agent solves a multi-step OMC authoring task using ~10% of the context budget a baseline agent would consume, with no loss in solution quality — because the predict engine's output, the conversation memory, and the codec payloads all compose through the substrate's content-addressed identity.
+An LLM agent in a 20-turn conversation solves a non-trivial OMC authoring task with ~10× less context budget than a naive baseline that keeps the full transcript in the window.
+
+### Tracks
+
+- `omc_memory_store(key, text)` / `omc_memory_recall(key)` MCP tools backed by content-addressed kernel storage
+- Conversation-aware predict: `omc_predict(..., context_hash=H)` where H references prior reasoning state, biasing the ranking
+- Benchmark on a realistic multi-turn workflow
+
+---
 
-### Deferred from v0.3
+## Deferred from v0.3 / v0.4
 
 - **Prometheus rerank pass** — train a small Prometheus model on the corpus and rerank top-k by token-stream probability.
 - **Stateful corpus API** — `omc_corpus_build` returns a handle, `omc_predict_from(handle, prefix, top_k)` reuses it.
 - **Streaming queries** — incremental updates as the prefix grows token-by-token.
+- **Cross-corpus weighted blending** — give different paths different priority in the ranking.
 
 ---
 
-## v0.5+ candidates
+## v0.6+ candidates
 
 ### Substrate-attention follow-ups
 
@@ -67,6 +71,7 @@ The substrate-attention components stack to −8.94% inside one block. The path
 
 | Chapter | Key shipped items |
 |---|---|
+| [v0.4-substrate-context](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.4-substrate-context) | `omc_compress_context` / `omc_decompress` tools + `format=codec` thumbnails + directory ingest + measured 1.85×–2.81× LLM context-budget reduction |
 | [v0.3.1-symbolic-compression](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.3.1-symbolic-compression) | `omc_predict` gains `format=hash`/`signature`/`full` (3.8× compression default) + `omc_fetch_by_hash` for on-demand recovery |
 | [v0.3-symbolic-prediction](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.3-symbolic-prediction) | `omc_predict_files(paths, prefix, top_k)` returns ranked provenance-tracked continuations from a content-addressed corpus |
 | [v0.2-ergonomics](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.2-ergonomics) | `+=` / `-=` / `*=` / `/=` / `%=`, `len`/`range`/`getenv`/`to_hex`/`parse_int`, negative array indexing, did-you-mean, traced errors, 11 heal classes |
 
@@ -0,0 +1,95 @@
+# Substrate-context compression: 1.85×–2.81× LLM context-budget reduction
+
+## Headline
+
+The v0.3.1 + v0.4 stack lets an LLM agent **browse a code corpus at substrate cost (~50 bytes/suggestion) and recover full bodies on demand** via canonical hash. Measured on a representative 10-task agent workflow against the OMC `examples/lib` corpus (320 fns recursively ingested):
+
+| Strategy | top_k=5, 1 fetch | top_k=10, 1 fetch | top_k=20, 1 fetch |
+|---|---:|---:|---:|
+| v0.3 baseline (full source) | 14,142 B | 27,828 B | 39,902 B |
+| v0.4 (hash browse + on-demand fetch) | 6,864 B | 10,318 B | 14,188 B |
+| **Compression factor** | **2.06×** | **2.70×** | **2.81×** |
+
+The win amplifies with browse depth: as the agent considers more candidates, the per-candidate cost stays at the substrate floor (~50 B for the hash, ~70 B for the metadata) while the bodies stay un-paid-for unless committed to.
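
The compression factors in the table are the baseline byte total divided by the v0.4 byte total. A minimal sketch of that arithmetic, using only the numbers printed above (the variable names are illustrative and are not taken from `bench_context_budget.py`):

```python
# Byte totals copied from the table above (10-task workflow, 1 fetch per task).
baseline_bytes = {5: 14_142, 10: 27_828, 20: 39_902}  # v0.3, full source
v04_bytes = {5: 6_864, 10: 10_318, 20: 14_188}        # v0.4, hash browse + fetch


def compression_factor(baseline: int, compressed: int) -> float:
    """Ratio of baseline context bytes to compressed context bytes."""
    return baseline / compressed


factors = {
    top_k: round(compression_factor(baseline_bytes[top_k], v04_bytes[top_k]), 2)
    for top_k in baseline_bytes
}

# How many full bodies the baseline could afford for the v0.4 top_k=20 budget.
baseline_cost_per_body = baseline_bytes[20] / 20
bodies_for_same_budget = v04_bytes[20] / baseline_cost_per_body

print(factors)
print(round(bodies_for_same_budget, 2))
```

The second figure is the total v0.4 cost at top_k=20 expressed in units of one baseline candidate, which is where a "20 candidates for the cost of about 7 bodies" comparison comes from.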
+
+## Architecture summary
+
+Five additions in v0.4 take the v0.3 prediction engine end-to-end on context compression:
+
+### 1. `format=codec` on `omc_predict`
+
+A bounded substrate-thumbnail format. Each suggestion ships the canonical hash plus a capped (≤16 token) structural sample, enough to distinguish "matmul-heavy" from "dict-traversal" candidates without paying for the body. It sits between `signature` (text-only) and `full` (everything).
+
+### 2. `omc_compress_context(text, every_n?)`
+
+Symmetric to `omc_fetch_by_hash`. Takes arbitrary OMC source, returns a substrate-keyed codec payload:
+
+```json
+{
+  "original_bytes": 1024,
+  "codec": {
+    "sampled_tokens": [...],
+    "content_hash": 3481125341642464808,
+    "attractor": 63245986,
+    "compression_ratio": 12.8,
+    ...
+  }
+}
+```
+
+The LLM uses this to "remember" chunks of code it's just seen, without paying their full byte cost in subsequent context windows.
+
+### 3. `omc_decompress(paths, codec | canonical_hash)`
+
+Generalization of `omc_fetch_by_hash`. Accepts either a bare canonical hash or a full codec payload dict, and recovers the original source via library lookup against the corpus. The lookup is alpha-rename invariant: a body still matches after its identifiers are consistently renamed.
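
The compress / decompress round trip can be pictured with a toy in-memory model. This is a sketch under stated assumptions, not the server's implementation: `toy_hash` stands in for the real hash primitive, the corpus is a plain dict, the source strings are placeholders rather than checked OMC syntax, and accepting both the outer payload and the inner `codec` dict is a guess at what "codec payload dict" covers.

```python
import hashlib


def toy_hash(source: str) -> int:
    """Stand-in hash (assumption): 64-bit digest of whitespace-normalised tokens."""
    normalised = " ".join(source.split())
    digest = hashlib.sha256(normalised.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def compress_context(text: str, every_n: int = 4) -> dict:
    """Return a small payload: every n-th token plus the content hash."""
    tokens = text.split()
    return {
        "original_bytes": len(text.encode("utf-8")),
        "codec": {
            "sampled_tokens": tokens[::every_n],
            "content_hash": toy_hash(text),
        },
    }


def decompress(corpus: dict, ref):
    """Look up a body by bare hash (int) or by payload dict; None if absent."""
    if isinstance(ref, dict):
        codec = ref.get("codec", ref)  # accept outer payload or inner codec dict
        ref = codec["content_hash"]
    return corpus.get(ref)


# Hypothetical corpus: hash -> source body.
corpus_sources = ["fn add(a, b) { return a + b; }", "fn neg(x) { return 0 - x; }"]
corpus = {toy_hash(src): src for src in corpus_sources}

# The text being compressed differs only in whitespace from the corpus copy.
payload = compress_context("fn add(a, b)   { return a + b; }")
recovered_from_payload = decompress(corpus, payload)
recovered_from_hash = decompress(corpus, payload["codec"]["content_hash"])
missing = decompress(corpus, 12345)  # a hash the corpus does not contain
```

Note that recovery returns the corpus copy of the body, and a hash with no matching corpus entry recovers nothing, so the corpus passed to the lookup has to contain the code.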
+
+### 4. Directory walking in `paths`
+
+`paths` arguments now accept directory entries; the server recursively globs `*.omc` files. For example, `["examples/lib"]` ingests 320 fns across 16 files in stable order. This delivers the "cross-corpus blending" track: one query covers project + stdlib + registry as one logical corpus.
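
The directory expansion can be sketched in a few lines of Python. The real server is Rust (`build_corpus` in `omnimcode-mcp/src/main.rs`); the function below is a hypothetical illustration, and sorting within each directory is an assumption about how a stable order could be obtained.

```python
import tempfile
from pathlib import Path


def expand_paths(paths):
    """Expand files and directories into a de-duplicated list of .omc files.

    Argument order is preserved; files found under a directory are sorted.
    """
    found = []
    for entry in paths:
        p = Path(entry)
        if p.is_dir():
            found.extend(f for f in sorted(p.rglob("*.omc")) if f.is_file())
        elif p.is_file():
            found.append(p)
    seen = set()
    ordered = []
    for f in found:
        if f not in seen:  # a file listed twice is ingested once
            seen.add(f)
            ordered.append(f)
    return ordered


# Build a throwaway tree to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "lib" / "nested").mkdir(parents=True)
    (root / "lib" / "b.omc").write_text("fn b() {}")
    (root / "lib" / "a.omc").write_text("fn a() {}")
    (root / "lib" / "nested" / "c.omc").write_text("fn c() {}")
    (root / "lib" / "notes.txt").write_text("not OMC source")
    files = expand_paths([root / "lib", root / "lib" / "a.omc"])
    names = [f.relative_to(root).as_posix() for f in files]

print(names)
```

Sorting matters for reproducibility: filesystem iteration order is not guaranteed, so an unsorted walk could feed the corpus in a different order from run to run.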
+
+### 5. Unified canonical-hash identity
+
+The fix that makes the whole thing compose: `omc_predict`'s `canonical_hash` and `omc_compress_context`'s `content_hash` are now produced by the same primitive (`tokenizer::code_hash`), so they're interchangeable across all the tools. An LLM can take any hash from any tool and use it with any other tool.
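
Why a shared primitive makes the hashes interchangeable, and what alpha-rename invariance means in practice, can be shown with a toy hash. Everything here is hypothetical: the keyword set, the tokenizer regex, and the choice to rename every identifier (function names included) are assumptions for illustration, not the behaviour of `tokenizer::code_hash`.

```python
import hashlib
import re

KEYWORDS = {"fn", "return", "if", "else", "while"}  # assumed keyword set
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")       # identifiers, ints, punctuation


def code_hash(source: str) -> int:
    """Toy alpha-rename-invariant hash.

    Each non-keyword identifier is replaced by a placeholder numbered in order
    of first appearance, so consistent renaming does not change the digest.
    """
    renamed = {}
    canonical = []
    for tok in TOKEN_RE.findall(source):
        if re.match(r"[A-Za-z_]", tok) and tok not in KEYWORDS:
            tok = renamed.setdefault(tok, f"v{len(renamed)}")
        canonical.append(tok)
    digest = hashlib.sha256(" ".join(canonical).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


# Two toy "tools" that both delegate to the same primitive.
def predict_entry(source: str) -> dict:
    return {"canonical_hash": code_hash(source)}


def compress_entry(source: str) -> dict:
    return {"content_hash": code_hash(source)}


original = "fn add(a, b) { return a + b; }"
renamed_copy = "fn add(x, y) { return x + y; }"
different_body = "fn add(a, b) { return a - b; }"

same_across_tools = (
    predict_entry(original)["canonical_hash"]
    == compress_entry(original)["content_hash"]
)
rename_invariant = code_hash(original) == code_hash(renamed_copy)
distinguishes_bodies = code_hash(original) != code_hash(different_body)
```

If the two tools computed their hashes with different primitives, `same_across_tools` would fail and a hash from one tool could not be used as a lookup key in another.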
+
+## Win condition (not yet met)
+
+The user's original ask was: "an LLM agent solves a multi-step OMC authoring task using ~10% of the context budget a baseline agent would consume." That is a 10× reduction. The measured numbers fall short: 2.81× at the largest browse depth tested (top_k=20). The honest framing:
+
+- **2-3× compression** is what's structurally achievable from the substrate-hash + fetch-on-demand pattern alone
+- **The 10× claim** requires a substantively different workflow: substrate-keyed conversation memory where prior agent turns are hashes instead of inline text, codec-encoded module references in prompts, etc. v0.4 ships the primitives; the conversation-memory wiring is the v0.5 candidate.
+
+## What's now possible that wasn't before
+
+- An LLM agent can hold **20 candidate continuations** in context for the byte cost previously required for **7 full bodies**.
+- Branching is now cheap at the context-budget level: each extra candidate costs roughly 120 B (~50 B hash plus ~70 B metadata), so the agent can explore wider without burning its window.
+- Cross-corpus queries (project + stdlib + registry) cost the same as single-file queries, because the hashes are global.
+- An LLM "remembers" arbitrary code chunks via `omc_compress_context` and gets them back via library lookup when reasoning needs them, provided the chunk is present in the corpus being queried.
+
+## Tests
+
+20/20 MCP integration tests pass. New tests in v0.4:
+- `omc_predict_codec_format_includes_sampled_tokens` — codec format works, content_hash matches canonical_hash
+- `omc_compress_context_returns_codec_payload` — compress arbitrary text
+- `omc_compress_then_decompress_round_trips_via_corpus` — end-to-end recovery from compressed form
+- `omc_decompress_accepts_bare_hash` — works with just the hash, no codec payload
+- `omc_decompress_missing_inputs_is_friendly` — friendly error on missing args
+- `paths_argument_accepts_directories_recursively` — cross-corpus blending verified across multiple files
+- `tools_list_now_includes_v04_compression_tools` — both new tools registered
+
+## Deferred to v0.5
+
+- **Substrate-keyed conversation memory** via `fibtier` — agent history becomes a stream of hashes that resolve to full content only when reasoning needs them. This is the path to the 10× claim.
+- **Prometheus rerank** of substrate-ranked candidates — learned overlay on top of the structural prior.
+- **Stateful corpus API** — `omc_corpus_build` returns a handle for repeated queries against the same corpus.
+- **Cross-corpus weighted blending** — give different paths different priority in the ranking.
+
+## Raw data
+
+See `results_context_budget.json` for the per-task byte counts.
+
+## Reproduction
+
+```bash
+cargo build --release -p omnimcode-mcp
+python3 experiments/substrate_context/bench_context_budget.py
+```