RandomCoder-lab
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 58 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 1 addition & 0 deletions
diff --git a/ROADMAP.md b/ROADMAP.md
Lines changed: 12 additions & 31 deletions
diff --git a/examples/tests/test_predict.omc b/examples/tests/test_predict.omc
Lines changed: 117 additions & 0 deletions
@@ -13,6 +13,7 @@ Read top-to-bottom for the arc; jump to any chapter for the detail.
 
 | Tag | Date | One-line |
 |---|---|---|
+| [v0.3-symbolic-prediction](#v03-symbolic-prediction--2026-05-17) | 2026-05-17 | Substrate-indexed code completion: `omc_predict_files(paths, prefix, top_k)` returns ranked provenance-tracked continuations from a content-addressed corpus |
 | [v0.2-ergonomics](#v02-ergonomics--2026-05-17) | 2026-05-17 | OMC becomes forgiving: Python-idiom builtins, `+=`, friendly errors with traces, 11 heal classes total |
 | [v0.1-substrate-attention](#v01-substrate-attention--2026-05-17) | 2026-05-17 | Three substrate-component swaps inside transformer attention (K, S-MOD softmax, V) stack to −8.94% val on TinyShakespeare |
 | [v0.0.6-prometheus](#v006-prometheus--2026-05-16) | 2026-05-16 | Substrate-native ML framework in pure OMC: tape autograd, AdamW, attention, multi-block transformer. First substrate-K (L1) wins land. Ends with the two-agent demo |
@@ -24,6 +25,63 @@ Read top-to-bottom for the arc; jump to any chapter for the detail.
 
 ---
 
+## [v0.3-symbolic-prediction] - 2026-05-17
+
+**Substrate-indexed code completion: `omc_predict_files(paths, prefix, top_k)` returns ranked provenance-tracked continuations from a content-addressed corpus.**
+
+This release synthesizes two earlier substrates, `tokenizer::encode` (symbol stream) and `canonical_hash` + `attractor_distance` (substrate metric), into one primitive that LLM agents (and humans) can use while writing OMC to ask "what could come next here?" Each result carries a substrate-distance score and a pointer back to the source function it came from. Branching is first-class: every result is a viable continuation.
+
+### What changed
+
+- **New module `omnimcode-core/src/predict.rs`** (~370 lines):
+  - `CorpusEntry { fn_name, source, file, symbol_stream, canonical_hash, attractor }`
+  - `PrefixTrie` — each node accumulates corpus indices whose stream passes through it
+  - `CodeCorpus` — entries + trie; `ingest_fn` and `ingest_file`
+  - `predict_continuations(corpus, prefix_source, top_k) -> Vec<Suggestion>`
+  - Ranking: `(longest prefix match, smallest substrate distance, corpus index)`
+- **Two new builtins** in interpreter.rs:
+  - `omc_predict_files(paths_array, prefix_source, top_k) -> array of dicts` — stateless
+  - `omc_corpus_size(paths_array) -> int` — diagnostic
+- **Result dict fields**: `fn_name`, `source`, `file`, `canonical_hash`, `attractor`, `prefix_match_len`, `substrate_distance`, `query_attractor`.
+- **10 Rust unit tests** (`predict::tests`) cover trie semantics, ingestion, ranking, top_k cap, empty inputs, provenance.
+- **11 OMC end-to-end tests** (`test_predict.omc`) exercise the builtins against the real Prometheus corpus.
+
+### Win condition (verified)
+
+Prefix `fn prom_linear_` against the Prometheus corpus (70 fns) returns exactly the three `prom_linear_*` functions, ranked by substrate distance:
+
+```
+prom_linear_forward  (substrate_distance=1.4e18, prefix_match_len=24)
+prom_linear_new      (substrate_distance=2.4e18, prefix_match_len=24)
+prom_linear_params   (substrate_distance=5.5e18, prefix_match_len=24)
+```
+
+All three share `prefix_match_len=24`: the canonicalized prefix matched 24 tokens before diverging into the function-specific suffix. The wider `fn prom_attention_` prefix surfaces 5 attention-related fns with substrate distances roughly 3 orders of magnitude tighter than in the linear namespace, which suggests that substrate distance tracks code-shape similarity inside a namespace.
+
+### Why it matters
+
+Three primitives already in OMC — `canonicalize` (alpha-rename invariance), `tokenizer::encode` (substrate-aware symbol stream), `code_hash` (substrate-routed identity) — combine without modification. The trie is a 50-line data structure on top. No embedding model, no neural inference. Deterministic: same corpus + same prefix → same top-k, every run.
+
+### What's now possible that wasn't before
+
+- An LLM agent can ask "which existing code continues from this shape?" in a single tool call.
+- Branching is first-class: each result is a viable continuation, not a single "best guess."
+- Provenance is content-addressed: every suggestion includes its source file path AND its canonical hash, so a downstream agent can verify integrity by recomputing the hash from the returned source.
+- The corpus is just file paths; no index-build step, no maintenance overhead.
+
+### Deferred (post-v0.3)
+
+- **Prometheus rerank pass** — train a small Prometheus model on the corpus and rerank top-k by token-stream probability. Substrate is the structural prior; Prometheus would be the learned overlay.
+- **Stateful corpus API** — `omc_corpus_build` returns a handle, `omc_predict_from(handle, prefix, top_k)` reuses it. Current stateless API rebuilds per call.
+- **MCP tool surface** — wrap as an MCP tool so LLM clients can query during code generation without launching a subprocess.
+- **Streaming / cross-corpus** — incremental updates as the prefix grows; weighted blending across project + stdlib + registry corpora.
+
+### Tests
+
+10 Rust + 11 OMC end-to-end. Final: **223 Rust pass, 1087/1087 OMC pass** (1087 = 1076 from previous + 11 new).
+
+---
+
 ## [v0.2-ergonomics] - 2026-05-17
 
 **OMC becomes forgiving: a Python user can sit down and write code without a manual.**
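Review sketch for the v0.3 entry above: the "What changed" list describes a `PrefixTrie` whose nodes accumulate corpus indices, and a ranking of `(longest prefix match, smallest substrate distance, corpus index)`. The Rust below is a minimal illustration of that shape, not the code in `predict.rs`. The names `TrieNode`, `longest_match`, `rank`, and `predict`, the reduced `Suggestion` fields, and the use of `u64` for substrate distance are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical trie node: children keyed by symbol ID, plus the index of
/// every corpus entry whose symbol stream passes through this node.
#[derive(Default)]
struct TrieNode {
    children: HashMap<i64, TrieNode>,
    entries: Vec<usize>,
}

/// Prefix trie over symbol streams (an illustration, not the real module).
#[derive(Default)]
struct PrefixTrie {
    root: TrieNode,
}

impl PrefixTrie {
    /// Record `index` at the root and at every node along `stream`.
    fn insert(&mut self, stream: &[i64], index: usize) {
        let mut node = &mut self.root;
        node.entries.push(index);
        for &sym in stream {
            node = node.children.entry(sym).or_default();
            node.entries.push(index);
        }
    }

    /// Walk `prefix` as far as the trie allows. Returns how many symbols
    /// matched and the entries recorded at the deepest node reached.
    fn longest_match(&self, prefix: &[i64]) -> (usize, &[usize]) {
        let mut node = &self.root;
        let mut depth = 0;
        for sym in prefix {
            match node.children.get(sym) {
                Some(next) => {
                    node = next;
                    depth += 1;
                }
                None => break,
            }
        }
        (depth, node.entries.as_slice())
    }
}

/// Reduced, assumed view of a suggestion: only the fields used for ranking.
#[derive(Debug, Clone, PartialEq)]
struct Suggestion {
    corpus_index: usize,
    prefix_match_len: usize,
    substrate_distance: u64, // assumed integer distance
}

/// Sort by longest prefix match, then smallest distance, then corpus index,
/// and keep at most `top_k` results.
fn rank(mut hits: Vec<Suggestion>, top_k: usize) -> Vec<Suggestion> {
    hits.sort_by(|a, b| {
        b.prefix_match_len
            .cmp(&a.prefix_match_len)
            .then(a.substrate_distance.cmp(&b.substrate_distance))
            .then(a.corpus_index.cmp(&b.corpus_index))
    });
    hits.truncate(top_k);
    hits
}

/// Query the trie and rank the surviving entries. `distances[i]` is the
/// precomputed (assumed) substrate distance of corpus entry `i` to the query.
fn predict(trie: &PrefixTrie, distances: &[u64], prefix: &[i64], top_k: usize) -> Vec<Suggestion> {
    let (depth, indices) = trie.longest_match(prefix);
    let hits = indices
        .iter()
        .map(|&i| Suggestion {
            corpus_index: i,
            prefix_match_len: depth,
            substrate_distance: distances[i],
        })
        .collect();
    rank(hits, top_k)
}
```

Because the final tie-break is the corpus index, the ordering is total, which is one way to get the "same corpus + same prefix → same top-k" property the entry claims. In this sketch every hit from one query shares the same `prefix_match_len`, which is consistent with the three `prom_linear_*` results all reporting 24.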
 
@@ -264,6 +264,7 @@ If you're trying to understand how OMC got here, **read the [GitHub Releases](ht
 | [v0.0.6-prometheus](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.0.6-prometheus) | Pure-OMC ML framework, multi-block transformer, first substrate-K (L1) wins |
 | [v0.1-substrate-attention](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.1-substrate-attention) | Three substrate components (K, S-MOD, V) stack inside attention for −8.94% val |
 | [v0.2-ergonomics](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.2-ergonomics) | OMC becomes forgiving: Python-idiom builtins, `+=`, traced errors, 11 heal classes |
+| [v0.3-symbolic-prediction](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.3-symbolic-prediction) | Substrate-indexed code completion: `omc_predict_files` returns ranked provenance-tracked continuations |
 
 ---
 
 
@@ -1,53 +1,33 @@
 # OMC Roadmap
 
-Current chapter: **v0.2-ergonomics** (shipped 2026-05-17).
-Next chapter: **v0.3-symbolic-prediction** (in flight).
+Current chapter: **v0.3-symbolic-prediction** (shipped 2026-05-17).
+Next chapter: open — candidates listed below.
 
 See [CHANGELOG.md](CHANGELOG.md) and [GitHub Releases](https://github.com/RandomCoder-lab/OMC/releases) for the chapter-by-chapter history of how OMC got here. This file describes what's on the path going forward.
 
 ---
 
-## v0.3-symbolic-prediction (in flight)
+## Post-v0.3 candidates (none committed yet)
 
-**Substrate-indexed code completion: given a partial OMC prefix, return ranked provenance-tracked continuations from a content-addressed corpus.**
+### v0.4 candidate A — predict engine grows up
 
-The synthesis of two earlier threads — substrate codec (symbolic context) and Prometheus (text prediction) — into a single primitive that LLM agents (and humans) can use to navigate "what could come next here?" while writing OMC. Branching is first-class: each result is a viable continuation with a substrate-distance score and a pointer back to the source function it came from.
+The v0.3 engine ships a stateless predictor with substrate ranking. Natural extensions:
 
-### Architecture
-
-- `omnimcode-core/src/predict.rs` — `CodeCorpus`, `PrefixTrie`, `predict_continuations`.
-- Builtins: `omc_corpus_build(paths)` → handle, `omc_predict(prefix_source, corpus_handle, top_k)` → ranked dict.
-- CLI subcommand: `omc --predict --files DIR --prefix "fn ..." --top-k 5 --json`.
-- Win condition: prefix `fn prom_linear_` against the Prometheus corpus returns `prom_linear_new`, `prom_linear_forward`, `prom_linear_params` ranked by substrate distance, with provenance pointers to the source files.
-
-### Phases
-
-1. Symbol-stream encoding wrapper over the existing `tokenizer::encode` — already produces `Vec<i64>` symbol IDs; just expose a clean ingestion API.
-2. `CodeCorpus` builder: parse each file in a path list, extract top-level fns via `extract_top_level_fns`, build entries `{fn_name, source, symbol_stream, canonical_hash, attractor}`.
-3. `PrefixTrie` over symbol streams: insert each stream once, query a prefix to get matching corpus indices in O(prefix length).
-4. `predict_continuations(corpus, trie, prefix_source, top_k)` — tokenize prefix, query trie, rank surviving matches by `(longest prefix match, smallest substrate distance)`.
-5. Rust tests + OMC tests against the lib/ corpus.
-6. CLI demo + writeup as `experiments/symbolic_prediction/FINDING.md`.
-7. Tag as `v0.3-symbolic-prediction` with chapter release notes.
-
-### Deferred (post-v0.3)
-
-- **Prometheus rerank pass** — once the trie-based candidate list is solid, train a small Prometheus model on the corpus and rerank top-k by token-stream probability.
-- **MCP tool surface** — expose `predict_omc_continuation(prefix, top_k)` as an MCP tool so LLM clients can query during code generation.
+- **Prometheus rerank pass** — train a small Prometheus model on the corpus and rerank top-k by token-stream probability. Substrate ranking is the structural prior; Prometheus would be the learned overlay.
+- **Stateful corpus API** — `omc_corpus_build` returns a handle, `omc_predict_from(handle, prefix, top_k)` reuses it. The current API rebuilds per call (fine for interactive use; slow in tight loops).
+- **MCP tool surface** — wrap `omc_predict_files` as an MCP tool so LLM clients can query during code generation without launching a subprocess.
 - **Streaming queries** — incremental updates as the prefix grows token-by-token.
 - **Cross-corpus blending** — query multiple corpora (project, stdlib, registry) with weighted ranking.
 
----
-
-## Beyond v0.3 (rough)
-
-### Substrate-attention follow-ups
+### v0.4 candidate B — substrate-attention follow-ups
 
 - Substrate-modulated Q projection. Q hasn't been swapped yet; the V resample recipe (post-projection modulation) may generalize.
 - Substrate FF: dampen off-attractor activations in the feed-forward residual.
 - Substrate LayerNorm: substrate-distance-weighted variance computation.
 - Larger-scale validation: every substrate-attention claim was made at TinyShakespeare scale (1.1MB). Need to verify the stack holds at 10-100MB corpora.
 
+## Beyond (rough)
+
 ### Transformerless LLM
 
 The substrate-attention components stack to −8.94% inside one block. The path forward is a top-to-bottom harmonic-only architecture trained competitively. Open: how to handle non-integer-coherent quantities at this scale (the substrate metric only applies to integer-valued quantities, per the rule derived from the HBit-gate falsification).
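Review sketch for the "Stateful corpus API" item under candidate A in this hunk: the roadmap proposes that `omc_corpus_build` return a handle which `omc_predict_from(handle, prefix, top_k)` then reuses, instead of rebuilding the corpus per call. One possible shape, sketched in Rust under heavy assumptions: `CorpusRegistry` is a hypothetical name, the corpus is reduced to a list of function sources passed in memory, and plain `starts_with` stands in for the trie lookup and substrate ranking.

```rust
use std::collections::HashMap;

/// Hypothetical registry that owns built corpora and hands out integer handles.
#[derive(Default)]
struct CorpusRegistry {
    next_handle: u64,
    corpora: HashMap<u64, Vec<String>>,
}

impl CorpusRegistry {
    /// Stand-in for the proposed `omc_corpus_build`: ingest once, return a handle.
    /// Takes function sources directly so the sketch needs no file I/O.
    fn build(&mut self, fn_sources: &[&str]) -> u64 {
        let handle = self.next_handle;
        self.next_handle += 1;
        self.corpora
            .insert(handle, fn_sources.iter().map(|s| s.to_string()).collect());
        handle
    }

    /// Stand-in for the proposed `omc_predict_from`: reuse the stored corpus.
    /// `starts_with` replaces the real trie lookup and ranking.
    /// Returns `None` for an unknown handle.
    fn predict_from(&self, handle: u64, prefix: &str, top_k: usize) -> Option<Vec<&str>> {
        let corpus = self.corpora.get(&handle)?;
        Some(
            corpus
                .iter()
                .map(String::as_str)
                .filter(|src| src.starts_with(prefix))
                .take(top_k)
                .collect(),
        )
    }
}
```

The point of the handle is that the ingestion cost is paid once in `build`, while each `predict_from` call only reads the stored corpus. That matches the roadmap's note that the stateless API is fine interactively but slow in tight loops.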
@@ -70,6 +50,7 @@ The substrate-attention components stack to −8.94% inside one block. The path
 
 | Chapter | Key shipped items |
 |---|---|
+| [v0.3-symbolic-prediction](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.3-symbolic-prediction) | `omc_predict_files(paths, prefix, top_k)` returns ranked provenance-tracked continuations from a content-addressed corpus |
 | [v0.2-ergonomics](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.2-ergonomics) | `+=` / `-=` / `*=` / `/=` / `%=`, `len`/`range`/`getenv`/`to_hex`/`parse_int`, negative array indexing, did-you-mean, traced errors, 11 heal classes |
 | [v0.1-substrate-attention](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.1-substrate-attention) | Substrate-K + S-MOD softmax + substrate-V resample → −8.94% val on TinyShakespeare |
 | [v0.0.6-prometheus](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.0.6-prometheus) | Tape autograd, AdamW, Embedding, LayerNorm, multi-block transformer, first substrate-K wins |
 
@@ -0,0 +1,117 @@
+# Tests for the substrate-indexed code completion engine.
+#
+# Each test runs the predictor against examples/lib/prometheus.omc as
+# a real corpus and asserts ranking properties.
+
+fn assert_true(cond, msg) { if !cond { test_record_failure(msg); } }
+fn assert_eq(actual, expected, msg) {
+    if actual != expected {
+        test_record_failure(msg + ": expected " + to_string(expected)
+                            + " got " + to_string(actual));
+    }
+}
+
+fn _corpus() {
+    return ["examples/lib/prometheus.omc"];
+}
+
+# ---- Corpus ingestion ----
+
+fn test_corpus_size_matches_real_corpus() {
+    h size = omc_corpus_size(_corpus());
+    # Prometheus has ~70 top-level fns currently. Lower bound is the
+    # only stable assertion (the corpus grows over time).
+    assert_true(size > 30, "corpus has > 30 fns (got " + to_string(size) + ")");
+}
+
+# ---- Predict shape ----
+
+fn test_predict_returns_array_of_dicts() {
+    h hits = omc_predict_files(_corpus(), "fn prom_linear_", 3);
+    assert_eq(type_of(hits), "array", "result is array");
+    assert_true(arr_len(hits) > 0, "got at least one hit");
+    h first = arr_get(hits, 0);
+    assert_eq(type_of(first), "dict", "elements are dicts");
+}
+
+fn test_predict_top_k_caps_results() {
+    h hits = omc_predict_files(_corpus(), "fn prom_", 2);
+    assert_true(arr_len(hits) <= 2, "respects top_k=2");
+}
+
+# ---- Entry fields ----
+
+fn test_each_hit_has_required_fields() {
+    h hits = omc_predict_files(_corpus(), "fn prom_linear_", 1);
+    h h0 = arr_get(hits, 0);
+    assert_true(dict_has(h0, "fn_name"), "fn_name");
+    assert_true(dict_has(h0, "source"), "source");
+    assert_true(dict_has(h0, "file"), "file");
+    assert_true(dict_has(h0, "canonical_hash"), "canonical_hash");
+    assert_true(dict_has(h0, "attractor"), "attractor");
+    assert_true(dict_has(h0, "prefix_match_len"), "prefix_match_len");
+    assert_true(dict_has(h0, "substrate_distance"), "substrate_distance");
+}
+
+# ---- Ranking ----
+
+fn test_prom_linear_prefix_returns_the_three_prom_linear_fns() {
+    # The prefix `fn prom_linear_` should surface all three prom_linear_*
+    # defs in Prometheus (new, forward, params); membership is asserted, not an exact count.
+    h hits = omc_predict_files(_corpus(), "fn prom_linear_", 10);
+    assert_true(arr_len(hits) >= 3,
+                "at least 3 results for fn prom_linear_ prefix");
+    # Collect the names.
+    h names = [];
+    for entry in hits {
+        arr_push(names, dict_get(entry, "fn_name"));
+    }
+    assert_true(arr_contains(names, "prom_linear_new"),
+                "prom_linear_new present in " + to_string(names));
+    assert_true(arr_contains(names, "prom_linear_forward"),
+                "prom_linear_forward present in " + to_string(names));
+    assert_true(arr_contains(names, "prom_linear_params"),
+                "prom_linear_params present in " + to_string(names));
+}
+
+fn test_prefix_match_len_nonzero_when_prefix_exists_in_corpus() {
+    h hits = omc_predict_files(_corpus(), "fn prom_linear_", 3);
+    h h0 = arr_get(hits, 0);
+    assert_true(dict_get(h0, "prefix_match_len") > 0,
+                "prefix_match_len > 0 when prefix tokens are in corpus");
+}
+
+fn test_substrate_distance_is_nonneg() {
+    h hits = omc_predict_files(_corpus(), "fn prom_linear_", 5);
+    for entry in hits {
+        assert_true(dict_get(entry, "substrate_distance") >= 0,
+                    "substrate_distance is non-negative");
+    }
+}
+
+# ---- Provenance ----
+
+fn test_file_provenance_points_at_source() {
+    h hits = omc_predict_files(_corpus(), "fn prom_linear_", 1);
+    h h0 = arr_get(hits, 0);
+    assert_eq(dict_get(h0, "file"), "examples/lib/prometheus.omc",
+              "file is the source path we passed in");
+}
+
+fn test_source_contains_fn_keyword() {
+    h hits = omc_predict_files(_corpus(), "fn prom_linear_", 1);
+    h source = dict_get(arr_get(hits, 0), "source");
+    assert_true(str_contains(source, "fn "), "source has fn keyword");
+}
+
+# ---- Empty inputs ----
+
+fn test_empty_paths_returns_empty() {
+    h hits = omc_predict_files([], "fn anything", 5);
+    assert_eq(arr_len(hits), 0, "empty paths → empty hits");
+}
+
+fn test_top_k_zero_returns_empty() {
+    h hits = omc_predict_files(_corpus(), "fn prom_", 0);
+    assert_eq(arr_len(hits), 0, "top_k=0 → empty hits");
+}
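Review sketch for the provenance tests above: they check the `file` and `source` fields, and the CHANGELOG entry in this PR adds that a downstream agent can verify integrity by recomputing the canonical hash. A minimal Rust illustration of that check follows. It assumes a `u64` hash and uses a whitespace-collapsing `DefaultHasher` as a stand-in; OMC's real canonical hash is described as alpha-rename invariant, which this stand-in is not. `Hit`, `stand_in_canonical_hash`, and `verify_provenance` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in for OMC's canonical hash. It only collapses whitespace before
/// hashing; the real canonicalization is not reproduced here.
fn stand_in_canonical_hash(source: &str) -> u64 {
    let normalized: String = source.split_whitespace().collect::<Vec<_>>().join(" ");
    let mut hasher = DefaultHasher::new();
    normalized.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical view of one result dict, reduced to the fields used here.
struct Hit {
    source: String,
    canonical_hash: u64,
}

/// True when the hash recomputed from `source` matches the hash the hit carries.
fn verify_provenance(hit: &Hit) -> bool {
    stand_in_canonical_hash(&hit.source) == hit.canonical_hash
}
```

A consumer would run a check like this on each returned suggestion before trusting its `source`, swapping the stand-in for whatever hash routine the interpreter actually exposes.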