Commit fe87baf

🥂 v0.3 symbolic prediction: substrate-indexed code completion
Given a partial OMC prefix, omc_predict_files(paths, prefix, top_k) returns ranked provenance-tracked continuations from a content-addressed corpus. The synthesis of two earlier substrates — tokenizer::encode (symbol stream) and canonical_hash + attractor_distance (substrate metric) — into one primitive.

## Win condition (verified)

Prefix `fn prom_linear_` against examples/lib/prometheus.omc (70 fns) returns exactly the three prom_linear_* fns ranked by substrate distance:

  prom_linear_forward (substrate_distance=1.4e18, prefix_match_len=24)
  prom_linear_new (substrate_distance=2.4e18, prefix_match_len=24)
  prom_linear_params (substrate_distance=5.5e18, prefix_match_len=24)

A wider prefix `fn prom_attention_` surfaces 5 attention-related fns with substrate distances ~3 orders of magnitude tighter than the linear namespace — substrate distance reflects code-shape similarity inside a namespace.

## Architecture

- omnimcode-core/src/predict.rs (~370 lines):
  - CorpusEntry { fn_name, source, file, symbol_stream, canonical_hash, attractor }
  - PrefixTrie { children: HashMap<i64, PrefixTrie>, matches: Vec<usize> }
  - CodeCorpus { entries, trie } with ingest_fn / ingest_file
  - predict_continuations(corpus, prefix_source, top_k) -> Vec<Suggestion>
  - Ranking: (longest prefix match desc, smallest substrate distance asc, corpus index asc)
- Two new builtins in interpreter.rs:
  - omc_predict_files(paths_array, prefix_source, top_k) -> array of dicts
  - omc_corpus_size(paths_array) -> int (diagnostic)
- Suggestion dicts carry fn_name, source, file, canonical_hash, attractor, prefix_match_len, substrate_distance, query_attractor.

## Why this composes well

Three primitives already in OMC — canonicalize (alpha-rename invariance), tokenizer::encode (substrate-aware symbol stream), code_hash (substrate-routed identity) — combine without modification. The trie is a 50-line data structure on top. No embedding model, no neural inference.
Deterministic: same corpus + same prefix → same top-k, every run.

## Tests

- 10 Rust unit tests in predict::tests (trie semantics, ingestion, ranking, top_k cap, empty inputs, provenance)
- 11 OMC end-to-end tests in test_predict.omc against real Prometheus corpus

Final: 223 Rust pass, 1087/1087 OMC pass (was 213/1076).

## Deferred

- Prometheus rerank pass (structural substrate ranking + learned probability overlay)
- Stateful corpus API (omc_corpus_build + omc_predict_from for repeated queries)
- MCP tool surface for LLM clients
- Streaming + cross-corpus blending

Full chapter notes: CHANGELOG.md, experiments/symbolic_prediction/FINDING.md.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 4fec2a4 commit fe87baf
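
The `PrefixTrie` shape named in the commit message can be illustrated with a short Rust sketch. This is not the contents of `predict.rs`: the two struct fields are taken from the commit message, while the `insert` and `longest_match` method names, their signatures, and the choice to record an index on every node along the path are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Trie over symbol IDs. Field layout follows the commit message;
/// the methods below are illustrative assumptions, not the real API.
#[derive(Default)]
struct PrefixTrie {
    children: HashMap<i64, PrefixTrie>,
    matches: Vec<usize>,
}

impl PrefixTrie {
    /// Record `corpus_index` on every node along `stream`, so each node
    /// accumulates the entries whose stream passes through it.
    fn insert(&mut self, stream: &[i64], corpus_index: usize) {
        let mut node = self;
        for &sym in stream {
            node = node.children.entry(sym).or_default();
            node.matches.push(corpus_index);
        }
    }

    /// Walk `prefix` as far as the trie allows. Returns how many symbols
    /// matched and the corpus indices stored at the deepest node reached.
    /// With zero matched symbols the root is returned, which stores nothing.
    fn longest_match(&self, prefix: &[i64]) -> (usize, &[usize]) {
        let mut node = self;
        let mut depth = 0;
        for sym in prefix {
            match node.children.get(sym) {
                Some(child) => {
                    node = child;
                    depth += 1;
                }
                None => break,
            }
        }
        (depth, node.matches.as_slice())
    }
}

fn main() {
    let mut trie = PrefixTrie::default();
    trie.insert(&[1, 2, 3], 0);
    trie.insert(&[1, 2, 4], 1);
    trie.insert(&[9], 2);
    // The query diverges at its third symbol, so two symbols match.
    let (depth, hits) = trie.longest_match(&[1, 2, 7]);
    println!("matched {depth} symbols, candidates {hits:?}");
}
```

Under these assumptions a lookup costs one hash-map probe per prefix symbol, which is consistent with the O(prefix length) query described in the earlier ROADMAP phases.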

9 files changed

Lines changed: 787 additions & 31 deletions


CHANGELOG.md

Lines changed: 58 additions & 0 deletions
@@ -13,6 +13,7 @@ Read top-to-bottom for the arc; jump to any chapter for the detail.
 
 | Tag | Date | One-line |
 |---|---|---|
+| [v0.3-symbolic-prediction](#v03-symbolic-prediction--2026-05-17) | 2026-05-17 | Substrate-indexed code completion: `omc_predict_files(paths, prefix, top_k)` returns ranked provenance-tracked continuations from a content-addressed corpus |
 | [v0.2-ergonomics](#v02-ergonomics--2026-05-17) | 2026-05-17 | OMC becomes forgiving: Python-idiom builtins, `+=`, friendly errors with traces, 11 heal classes total |
 | [v0.1-substrate-attention](#v01-substrate-attention--2026-05-17) | 2026-05-17 | Three substrate-component swaps inside transformer attention (K, S-MOD softmax, V) stack to −8.94% val on TinyShakespeare |
 | [v0.0.6-prometheus](#v006-prometheus--2026-05-16) | 2026-05-16 | Substrate-native ML framework in pure OMC: tape autograd, AdamW, attention, multi-block transformer. First substrate-K (L1) wins land. Ends with the two-agent demo |
@@ -24,6 +25,63 @@ Read top-to-bottom for the arc; jump to any chapter for the detail.
 
 ---
 
+## [v0.3-symbolic-prediction] - 2026-05-17
+
+**Substrate-indexed code completion: `omc_predict_files(paths, prefix, top_k)` returns ranked provenance-tracked continuations from a content-addressed corpus.**
+
+The synthesis of two earlier substrates — `tokenizer::encode` (symbol stream) and `canonical_hash` + `attractor_distance` (substrate metric) — into one primitive that LLM agents (and humans) can use while writing OMC to find out "what could come next here?" with each result carrying a substrate-distance score and a pointer back to the source function it came from. Branching is first-class: every result is a viable continuation.
+
+### What changed
+
+- **New module `omnimcode-core/src/predict.rs`** (~370 lines):
+  - `CorpusEntry { fn_name, source, file, symbol_stream, canonical_hash, attractor }`
+  - `PrefixTrie` — each node accumulates corpus indices whose stream passes through it
+  - `CodeCorpus` — entries + trie; `ingest_fn` and `ingest_file`
+  - `predict_continuations(corpus, prefix_source, top_k) -> Vec<Suggestion>`
+  - Ranking: `(longest prefix match, smallest substrate distance, corpus index)`
+- **Two new builtins** in interpreter.rs:
+  - `omc_predict_files(paths_array, prefix_source, top_k) -> array of dicts` — stateless
+  - `omc_corpus_size(paths_array) -> int` — diagnostic
+- **Result dict fields**: `fn_name`, `source`, `file`, `canonical_hash`, `attractor`, `prefix_match_len`, `substrate_distance`, `query_attractor`.
+- **10 Rust unit tests** (`predict::tests`) cover trie semantics, ingestion, ranking, top_k cap, empty inputs, provenance.
+- **11 OMC end-to-end tests** (`test_predict.omc`) exercise the builtins against the real Prometheus corpus.
+
+### Win condition (verified)
+
+Prefix `fn prom_linear_` against the Prometheus corpus (70 fns) returns exactly the three `prom_linear_*` functions, ranked by substrate distance:
+
+```
+prom_linear_forward (substrate_distance=1.4e18, prefix_match_len=24)
+prom_linear_new (substrate_distance=2.4e18, prefix_match_len=24)
+prom_linear_params (substrate_distance=5.5e18, prefix_match_len=24)
+```
+
+All three share `prefix_match_len=24` (the canonicalized prefix matched 24 tokens before diverging into the function-specific suffix). The wider `fn prom_attention_` prefix surfaces 5 attention-related fns with substrate distances ~3 orders of magnitude tighter than the linear namespace — substrate distance reflects code-shape similarity inside a namespace.
+
+### Why it matters
+
+Three primitives already in OMC — `canonicalize` (alpha-rename invariance), `tokenizer::encode` (substrate-aware symbol stream), `code_hash` (substrate-routed identity) — combine without modification. The trie is a 50-line data structure on top. No embedding model, no neural inference. Deterministic: same corpus + same prefix → same top-k, every run.
+
+### What's now possible that wasn't before
+
+- An LLM agent can query "what previous code came next at this shape?" as a single tool call.
+- Branching is first-class — each result is a viable continuation, not a "best guess."
+- Provenance is content-addressed: every suggestion includes its source file path AND its canonical hash, so a downstream agent can verify integrity by recompute.
+- The corpus is just file paths; no index-build step, no maintenance overhead.
+
+### Deferred (post-v0.3)
+
+- **Prometheus rerank pass** — train a small Prometheus model on the corpus and rerank top-k by token-stream probability. Substrate is the structural prior; Prometheus would be the learned overlay.
+- **Stateful corpus API** — `omc_corpus_build` returns a handle, `omc_predict_from(handle, prefix, top_k)` reuses it. Current stateless API rebuilds per call.
+- **MCP tool surface** — wrap as an MCP tool so LLM clients can query during code generation without launching a subprocess.
+- **Streaming / cross-corpus** — incremental updates as the prefix grows; weighted blending across project + stdlib + registry corpora.
+
+### Tests
+
+10 Rust + 11 OMC end-to-end. Final: **223 Rust pass, 1087/1087 OMC pass** (1087 = 1076 from previous + 11 new).
+
+---
+
 ## [v0.2-ergonomics] - 2026-05-17
 
 **OMC becomes forgiving: a Python user can sit down and write code without a manual.**

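The ranking rule in this changelog entry, spelled out in the commit message as longest prefix match descending, then substrate distance ascending, then corpus index ascending, can be sketched as a three-key comparator in Rust. The `Ranked` struct, the `rank` function, the `f64` distance type, the corpus indices, and the extra `hypothetical_other_fn` candidate are all assumptions for illustration; only the three `prom_linear_*` names and their reported distances come from the changelog.

```rust
/// Hypothetical candidate record; the real `Suggestion` carries more fields.
#[derive(Debug, Clone)]
struct Ranked {
    fn_name: &'static str,
    prefix_match_len: usize,
    substrate_distance: f64, // assumed numeric type
    corpus_index: usize,     // assumed values, used only as the final tie-break
}

/// Sort by (prefix match desc, substrate distance asc, corpus index asc),
/// then keep at most `top_k` entries.
fn rank(mut candidates: Vec<Ranked>, top_k: usize) -> Vec<Ranked> {
    candidates.sort_by(|a, b| {
        b.prefix_match_len
            .cmp(&a.prefix_match_len)
            .then(a.substrate_distance.total_cmp(&b.substrate_distance))
            .then(a.corpus_index.cmp(&b.corpus_index))
    });
    candidates.truncate(top_k);
    candidates
}

fn main() {
    let candidates = vec![
        Ranked { fn_name: "prom_linear_params", prefix_match_len: 24, substrate_distance: 5.5e18, corpus_index: 2 },
        Ranked { fn_name: "prom_linear_new", prefix_match_len: 24, substrate_distance: 2.4e18, corpus_index: 0 },
        Ranked { fn_name: "prom_linear_forward", prefix_match_len: 24, substrate_distance: 1.4e18, corpus_index: 1 },
        // Invented entry: a shorter prefix match loses even with a tiny distance.
        Ranked { fn_name: "hypothetical_other_fn", prefix_match_len: 5, substrate_distance: 1.0, corpus_index: 3 },
    ];
    for r in rank(candidates, 3) {
        println!("{} (substrate_distance={:e}, prefix_match_len={})",
                 r.fn_name, r.substrate_distance, r.prefix_match_len);
    }
}
```

With these inputs the sketch lists `prom_linear_forward`, `prom_linear_new`, `prom_linear_params` in that order, matching the order reported in the win condition. Because every key is totally ordered, the same inputs always give the same top-k, which is the determinism property the entry claims.
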
README.md

Lines changed: 1 addition & 0 deletions
@@ -264,6 +264,7 @@ If you're trying to understand how OMC got here, **read the [GitHub Releases](ht
 | [v0.0.6-prometheus](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.0.6-prometheus) | Pure-OMC ML framework, multi-block transformer, first substrate-K (L1) wins |
 | [v0.1-substrate-attention](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.1-substrate-attention) | Three substrate components (K, S-MOD, V) stack inside attention for −8.94% val |
 | [v0.2-ergonomics](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.2-ergonomics) | OMC becomes forgiving: Python-idiom builtins, `+=`, traced errors, 11 heal classes |
+| [v0.3-symbolic-prediction](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.3-symbolic-prediction) | Substrate-indexed code completion: `omc_predict_files` returns ranked provenance-tracked continuations |
 
 ---
 
ROADMAP.md

Lines changed: 12 additions & 31 deletions
@@ -1,53 +1,33 @@
 # OMC Roadmap
 
-Current chapter: **v0.2-ergonomics** (shipped 2026-05-17).
-Next chapter: **v0.3-symbolic-prediction** (in flight).
+Current chapter: **v0.3-symbolic-prediction** (shipped 2026-05-17).
+Next chapter: open — candidates listed below.
 
 See [CHANGELOG.md](CHANGELOG.md) and [GitHub Releases](https://github.com/RandomCoder-lab/OMC/releases) for the chapter-by-chapter history of how OMC got here. This file describes what's on the path going forward.
 
 ---
 
-## v0.3-symbolic-prediction (in flight)
+## Post-v0.3 candidates (none committed yet)
 
-**Substrate-indexed code completion: given a partial OMC prefix, return ranked provenance-tracked continuations from a content-addressed corpus.**
+### v0.4 candidate A — predict engine grows up
 
-The synthesis of two earlier threads — substrate codec (symbolic context) and Prometheus (text prediction) — into a single primitive that LLM agents (and humans) can use to navigate "what could come next here?" while writing OMC. Branching is first-class: each result is a viable continuation with a substrate-distance score and a pointer back to the source function it came from.
+The v0.3 engine ships a stateless predictor with substrate ranking. Natural extensions:
 
-### Architecture
-
-- `omnimcode-core/src/predict.rs` — `CodeCorpus`, `PrefixTrie`, `predict_continuations`.
-- Builtins: `omc_corpus_build(paths)` → handle, `omc_predict(prefix_source, corpus_handle, top_k)` → ranked dict.
-- CLI subcommand: `omc --predict --files DIR --prefix "fn ..." --top-k 5 --json`.
-- Win condition: prefix `fn prom_linear_` against the Prometheus corpus returns `prom_linear_new`, `prom_linear_forward`, `prom_linear_params` ranked by substrate distance, with provenance pointers to the source files.
-
-### Phases
-
-1. Symbol-stream encoding wrapper over the existing `tokenizer::encode` — already produces `Vec<i64>` symbol IDs; just expose a clean ingestion API.
-2. `CodeCorpus` builder: parse each file in a path list, extract top-level fns via `extract_top_level_fns`, build entries `{fn_name, source, symbol_stream, canonical_hash, attractor}`.
-3. `PrefixTrie` over symbol streams: insert each stream once, query a prefix to get matching corpus indices in O(prefix length).
-4. `predict_continuations(corpus, trie, prefix_source, top_k)` — tokenize prefix, query trie, rank surviving matches by `(longest prefix match, smallest substrate distance)`.
-5. Rust tests + OMC tests against the lib/ corpus.
-6. CLI demo + writeup as `experiments/symbolic_prediction/FINDING.md`.
-7. Tag as `v0.3-symbolic-prediction` with chapter release notes.
-
-### Deferred (post-v0.3)
-
-- **Prometheus rerank pass** — once the trie-based candidate list is solid, train a small Prometheus model on the corpus and rerank top-k by token-stream probability.
-- **MCP tool surface** — expose `predict_omc_continuation(prefix, top_k)` as an MCP tool so LLM clients can query during code generation.
+- **Prometheus rerank pass** — train a small Prometheus model on the corpus and rerank top-k by token-stream probability. Substrate ranking is the structural prior; Prometheus would be the learned overlay.
+- **Stateful corpus API** — `omc_corpus_build` returns a handle, `omc_predict_from(handle, prefix, top_k)` reuses it. The current API rebuilds per call (fine for interactive use; slow in tight loops).
+- **MCP tool surface** — wrap `omc_predict_files` as an MCP tool so LLM clients can query during code generation without launching a subprocess.
 - **Streaming queries** — incremental updates as the prefix grows token-by-token.
 - **Cross-corpus blending** — query multiple corpora (project, stdlib, registry) with weighted ranking.
 
----
-
-## Beyond v0.3 (rough)
-
-### Substrate-attention follow-ups
+### v0.4 candidate B — substrate-attention follow-ups
 
 - Substrate-modulated Q projection. Q hasn't been swapped yet; the V resample recipe (post-projection modulation) may generalize.
 - Substrate FF: dampen off-attractor activations in the feed-forward residual.
 - Substrate LayerNorm: substrate-distance-weighted variance computation.
 - Larger-scale validation: every substrate-attention claim was made at TinyShakespeare scale (1.1MB). Need to verify the stack holds at 10-100MB corpora.
 
+### Beyond (rough)
+
 ### Transformerless LLM
 
 The substrate-attention components stack to −8.94% inside one block. The path forward is a top-to-bottom harmonic-only architecture trained competitively. Open: how to handle non-integer-coherent quantities at this scale (the substrate metric only applies to integer-valued quantities, per the rule derived from the HBit-gate falsification).
@@ -70,6 +50,7 @@ The substrate-attention components stack to −8.94% inside one block. The path
 
 | Chapter | Key shipped items |
 |---|---|
+| [v0.3-symbolic-prediction](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.3-symbolic-prediction) | `omc_predict_files(paths, prefix, top_k)` returns ranked provenance-tracked continuations from a content-addressed corpus |
 | [v0.2-ergonomics](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.2-ergonomics) | `+=` / `-=` / `*=` / `/=` / `%=`, `len`/`range`/`getenv`/`to_hex`/`parse_int`, negative array indexing, did-you-mean, traced errors, 11 heal classes |
 | [v0.1-substrate-attention](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.1-substrate-attention) | Substrate-K + S-MOD softmax + substrate-V resample → −8.94% val on TinyShakespeare |
 | [v0.0.6-prometheus](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.0.6-prometheus) | Tape autograd, AdamW, Embedding, LayerNorm, multi-block transformer, first substrate-K wins |
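
The stateful corpus API listed under candidate A is deferred, so nothing below exists in the commit. As a rough sketch of the build-once, query-many idea, a handle registry in Rust might look like this. `CorpusRegistry`, `Corpus`, `build`, and `predict_from` are hypothetical names, and a plain name-prefix filter stands in for the real trie lookup and substrate ranking.

```rust
use std::collections::HashMap;

/// Stand-in for a built corpus. The real `CodeCorpus` holds entries plus a
/// trie; function names alone are enough to show the handle mechanics.
struct Corpus {
    fn_names: Vec<String>,
}

/// Hypothetical registry mapping opaque integer handles to built corpora.
#[derive(Default)]
struct CorpusRegistry {
    next_handle: u64,
    corpora: HashMap<u64, Corpus>,
}

impl CorpusRegistry {
    /// Build once and hand back a handle (the role `omc_corpus_build` would play).
    fn build(&mut self, fn_names: Vec<String>) -> u64 {
        let handle = self.next_handle;
        self.next_handle += 1;
        self.corpora.insert(handle, Corpus { fn_names });
        handle
    }

    /// Query an existing corpus without rebuilding it (the role
    /// `omc_predict_from` would play). Returns None for an unknown handle.
    fn predict_from(&self, handle: u64, prefix: &str, top_k: usize) -> Option<Vec<&str>> {
        let corpus = self.corpora.get(&handle)?;
        Some(
            corpus
                .fn_names
                .iter()
                .map(String::as_str)
                .filter(|name| name.starts_with(prefix)) // stand-in for trie + ranking
                .take(top_k)
                .collect(),
        )
    }
}

fn main() {
    let mut registry = CorpusRegistry::default();
    let handle = registry.build(vec![
        "prom_linear_new".to_string(),
        "prom_linear_forward".to_string(),
        "prom_linear_params".to_string(),
    ]);
    // Repeated queries reuse the same built corpus.
    println!("{:?}", registry.predict_from(handle, "prom_linear_", 2));
    println!("{:?}", registry.predict_from(handle, "prom_linear_p", 5));
}
```

The point of the handle is that ingestion cost is paid once in `build`, while each `predict_from` call only pays for the lookup, which is the tight-loop case the roadmap calls out as slow today.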
Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
+# Tests for the substrate-indexed code completion engine.
+#
+# Each test runs the predictor against examples/lib/prometheus.omc as
+# a real corpus and asserts ranking properties.
+
+fn assert_true(cond, msg) { if !cond { test_record_failure(msg); } }
+fn assert_eq(actual, expected, msg) {
+    if actual != expected {
+        test_record_failure(msg + ": expected " + to_string(expected)
+            + " got " + to_string(actual));
+    }
+}
+
+fn _corpus() {
+    return ["examples/lib/prometheus.omc"];
+}
+
+# ---- Corpus ingestion ----
+
+fn test_corpus_size_matches_real_corpus() {
+    h size = omc_corpus_size(_corpus());
+    # Prometheus has ~70 top-level fns currently. Lower bound is the
+    # only stable assertion (the corpus grows over time).
+    assert_true(size > 30, "corpus has > 30 fns (got " + to_string(size) + ")");
+}
+
+# ---- Predict shape ----
+
+fn test_predict_returns_array_of_dicts() {
+    h hits = omc_predict_files(_corpus(), "fn prom_linear_", 3);
+    assert_eq(type_of(hits), "array", "result is array");
+    assert_true(arr_len(hits) > 0, "got at least one hit");
+    h first = arr_get(hits, 0);
+    assert_eq(type_of(first), "dict", "elements are dicts");
+}
+
+fn test_predict_top_k_caps_results() {
+    h hits = omc_predict_files(_corpus(), "fn prom_", 2);
+    assert_true(arr_len(hits) <= 2, "respects top_k=2");
+}
+
+# ---- Entry fields ----
+
+fn test_each_hit_has_required_fields() {
+    h hits = omc_predict_files(_corpus(), "fn prom_linear_", 1);
+    h h0 = arr_get(hits, 0);
+    assert_true(dict_has(h0, "fn_name"), "fn_name");
+    assert_true(dict_has(h0, "source"), "source");
+    assert_true(dict_has(h0, "file"), "file");
+    assert_true(dict_has(h0, "canonical_hash"), "canonical_hash");
+    assert_true(dict_has(h0, "attractor"), "attractor");
+    assert_true(dict_has(h0, "prefix_match_len"), "prefix_match_len");
+    assert_true(dict_has(h0, "substrate_distance"), "substrate_distance");
+}
+
+# ---- Ranking ----
+
+fn test_prom_linear_prefix_returns_the_three_prom_linear_fns() {
+    # The prefix `fn prom_linear_` should surface exactly the three
+    # prom_linear_* defs in Prometheus (new, forward, params).
+    h hits = omc_predict_files(_corpus(), "fn prom_linear_", 10);
+    assert_true(arr_len(hits) >= 3,
+        "at least 3 results for fn prom_linear_ prefix");
+    # Collect the names.
+    h names = [];
+    for entry in hits {
+        arr_push(names, dict_get(entry, "fn_name"));
+    }
+    assert_true(arr_contains(names, "prom_linear_new"),
+        "prom_linear_new present in " + to_string(names));
+    assert_true(arr_contains(names, "prom_linear_forward"),
+        "prom_linear_forward present in " + to_string(names));
+    assert_true(arr_contains(names, "prom_linear_params"),
+        "prom_linear_params present in " + to_string(names));
+}
+
+fn test_prefix_match_len_nonzero_when_prefix_exists_in_corpus() {
+    h hits = omc_predict_files(_corpus(), "fn prom_linear_", 3);
+    h h0 = arr_get(hits, 0);
+    assert_true(dict_get(h0, "prefix_match_len") > 0,
+        "prefix_match_len > 0 when prefix tokens are in corpus");
+}
+
+fn test_substrate_distance_is_nonneg() {
+    h hits = omc_predict_files(_corpus(), "fn prom_linear_", 5);
+    for entry in hits {
+        assert_true(dict_get(entry, "substrate_distance") >= 0,
+            "substrate_distance is non-negative");
+    }
+}
+
+# ---- Provenance ----
+
+fn test_file_provenance_points_at_source() {
+    h hits = omc_predict_files(_corpus(), "fn prom_linear_", 1);
+    h h0 = arr_get(hits, 0);
+    assert_eq(dict_get(h0, "file"), "examples/lib/prometheus.omc",
+        "file is the source path we passed in");
+}
+
+fn test_source_contains_fn_keyword() {
+    h hits = omc_predict_files(_corpus(), "fn prom_linear_", 1);
+    h source = dict_get(arr_get(hits, 0), "source");
+    assert_true(str_contains(source, "fn "), "source has fn keyword");
+}
+
+# ---- Empty inputs ----
+
+fn test_empty_paths_returns_empty() {
+    h hits = omc_predict_files([], "fn anything", 5);
+    assert_eq(arr_len(hits), 0, "empty paths → empty hits");
+}
+
+fn test_top_k_zero_returns_empty() {
+    h hits = omc_predict_files(_corpus(), "fn prom_", 0);
+    assert_eq(arr_len(hits), 0, "top_k=0 → empty hits");
+}
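
The provenance tests above only check that `file` and `source` are present and plausible. The changelog's further claim, that a downstream agent can verify a suggestion by recomputing its hash, could look roughly like the Rust sketch below. The real `canonical_hash` algorithm is not shown in this commit, so FNV-1a is used purely as a stand-in; the trimmed `Suggestion` struct, the `u64` hash type, `stand_in_hash`, `verify`, and the example source strings are all assumptions.

```rust
/// Only the suggestion fields needed for verification (types assumed).
struct Suggestion {
    source: String,
    canonical_hash: u64,
}

/// FNV-1a 64-bit. A stand-in only: OMC's canonical hash first canonicalizes
/// the source (alpha-rename invariance), which this does not attempt.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Hypothetical hash-of-source function a verifier would plug in.
fn stand_in_hash(source: &str) -> u64 {
    fnv1a64(source.as_bytes())
}

/// Recompute the hash from the carried source and compare it with the
/// hash the suggestion claims.
fn verify(s: &Suggestion, hash_fn: impl Fn(&str) -> u64) -> bool {
    hash_fn(&s.source) == s.canonical_hash
}

fn main() {
    // Invented example source, not taken from the Prometheus corpus.
    let original = "fn example() { }";
    let intact = Suggestion {
        source: original.to_string(),
        canonical_hash: stand_in_hash(original),
    };
    let tampered = Suggestion {
        source: "fn example() { return 1; }".to_string(),
        canonical_hash: intact.canonical_hash,
    };
    println!("intact verifies: {}", verify(&intact, stand_in_hash));
    println!("tampered verifies: {}", verify(&tampered, stand_in_hash));
}
```

A real verifier would pass OMC's own canonical hash function as `hash_fn`, so that alpha-renamed but otherwise identical functions still verify.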
