| 1 | +--- |
| 2 | +name: substrate-synth |
| 3 | +description: Generate code that is VALID BY CONSTRUCTION and content-grounded, using the substrate address→generate→VERIFY→accept loop. Use when asked to synthesize or improve code with an execution guarantee, to retrieve content-relevant functions from a codebase, or to run a self-improvement loop. Pairs an agnostic scaffold (content-addressing + grammar/execution verification) with you as the generator — nothing invalid is ever accepted. |
| 4 | +--- |
| 5 | + |
| 6 | +# Substrate-Synth — verified, address-grounded code synthesis |
| 7 | + |
| 8 | +**The principle:** the substrate is an agnostic *scaffold* — content-addressing + grammar-validity + |
| 9 | +execution-eval — **not** a generator. Pair it with a strong generator (YOU, the model) and a verify |
| 10 | +gate, and nothing invalid gets through. Validity is guaranteed by construction; correctness is |
| 11 | +checked by executing the candidate against tests. It works on any codebase with a grammar + interpreter + corpus. |
| 12 | + |
| 13 | +Reference implementation + the full math: **OMNIcode (OMC)**, MIT, public at |
| 14 | +<https://github.com/RandomCoder-lab/OMC> — see [`SUBSTRATE.md`](https://github.com/RandomCoder-lab/OMC/blob/master/SUBSTRATE.md) |
| 15 | +and `examples/harmonic_mind.omc`. |
| 16 | + |
| 17 | +## When to use |
| 18 | +- Synthesize code with an **execution/validity guarantee** (not "looks right" — *verified*). |
| 19 | +- Retrieve **content-relevant** functions from a codebase by similarity. |
| 20 | +- Run a **self-improvement loop**: address the target's own code → generate an improvement → verify |
| 21 | + by execution before accepting. |
| 22 | + |
| 23 | +## The loop — always: address → generate → VERIFY → accept |
| 24 | +1. **Address** the need: find the content-relevant existing code (by similarity), or the slot a new |
| 25 | + function belongs at (by content address). This grounds generation in real, working parts. |
| 26 | +2. **Generate** the candidate (you are the generator) using that grounding. |
| 27 | +3. **VERIFY** by execution — parse + run + check against tests. Accept ONLY if it passes; on failure, |
| 28 | + revise using the error and retry. Never return code that hasn't passed the gate. |
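The loop above can be sketched in a few lines of Python. This is a minimal illustration, not code from the OMC repo: `synthesize`, the three hook signatures, and the toy hooks are all hypothetical names chosen for the example, and the "generator" here is a stub that only produces a correct answer once it has seen an error message.

```python
# Hypothetical hook signatures (assumptions for this sketch, not an OMC API):
#   address(need)                       -> grounding text for the generator
#   generate(need, grounding, feedback) -> candidate source code (str)
#   verify(candidate)                   -> (passed: bool, error: str | None)

def synthesize(need, address, generate, verify, max_attempts=3):
    """Run address -> generate -> verify; return a candidate only if it passed."""
    grounding = address(need)
    feedback = None
    for _ in range(max_attempts):
        candidate = generate(need, grounding, feedback)
        passed, error = verify(candidate)
        if passed:
            return candidate      # accepted: it went through the gate
        feedback = error          # revise using the error, then retry
    return None                   # nothing unverified is ever returned


def toy_address(need):
    # Stand-in for a similarity lookup over a real corpus.
    return "def square(n): ..."

def toy_generate(need, grounding, feedback):
    # Stand-in for the model: wrong on the first try, fixed after feedback.
    if feedback is None:
        return "def square(n):\n    return n + n\n"
    return "def square(n):\n    return n * n\n"

def toy_verify(candidate):
    namespace = {}
    try:
        exec(compile(candidate, "<candidate>", "exec"), namespace)
        got = namespace["square"](5)
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    if got != 25:
        return False, f"square(5) returned {got!r}, expected 25"
    return True, None


accepted = synthesize("square a number", toy_address, toy_generate, toy_verify)
print(accepted)
```

The design point is the return path: the only way a candidate leaves `synthesize` is through the `passed` branch, so a weak generator costs retries rather than correctness.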
| 29 | + |
| 30 | +## In OMC, the loop is in the language (v1.8+) — fastest path |
| 31 | +If your target is OMC, build the public repo (`cargo build -p omnimcode-cli --release` → |
| 32 | +`omnimcode-standalone`) and use these core builtins directly — no external scaffold needed: |
| 33 | +- `gen_omc([seed])` / `gen_at(addr)` — valid-by-construction OMC (parse/run 1.000 over 300 seeds). |
| 34 | +- `code_parse_check(src)` → `{ok, error}` and `eval_omc(src)` — the verify gate. |
| 35 | +- `fn_swap_verified(name, new_src, test_src)` → `{accepted, error, result}` — the WHOLE loop in one |
| 36 | + call: install a candidate, test it in a sandbox, keep it only if it passes, else roll back. |
| 37 | +- `@memo` — transparent, persistent, cross-run memoization (`@memo fn fib(n)...` makes `fib(90)` |
| 38 | + instant). Pure-only; impure functions are refused at definition. |
| 39 | +- `haddr`/`haddr_face` (uniform content keys), `locality_sim`/`locality_nearest`/`nearest_fn` |
| 40 | + (content-similarity retrieval + dispatch), `cas_put`/`cas_get`/`same_value` (content-addressed heap |
| 41 | + + O(1) semantic equality), `value_addr`/`value_hash`. |
| 42 | +- Dual-band coherence: `phi_shadow(v)`, `bands(v)` → `[α, β]`, `value_divergence(v)`, `@dualband`. |
| 43 | + |
| 44 | +```omc |
| 45 | +fn target(n) { return 0 - 1; } // a stub to improve |
| 46 | +h cand = "fn target(n) { return n * n; }"; |
| 47 | +h r = fn_swap_verified("target", cand, "target(5) == 25"); |
| 48 | +print(r["accepted"]); // true — verified, installed; else rolled back |
| 49 | +``` |
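For readers working outside OMC, the keep-only-if-it-passes behaviour described for `fn_swap_verified` can be approximated in Python. The sketch below is an assumption-laden analogue, not the OMC implementation: `swap_verified` and `registry` are hypothetical names, the candidate is evaluated in a scratch copy of the registry and committed only on success (which has the same observable effect as a rollback), and `exec`/`eval` are not a security sandbox.

```python
def swap_verified(registry, name, new_src, test_src):
    """Try new_src as a replacement for registry[name]; commit only if test_src is True."""
    report = {"accepted": False, "error": None, "result": None}
    scratch = dict(registry)  # the candidate runs here, never in the live registry
    try:
        exec(compile(new_src, "<candidate>", "exec"), scratch)
        if scratch.get(name) is registry.get(name):
            raise NameError(f"candidate does not define {name!r}")
        report["result"] = eval(test_src, scratch)
    except Exception as exc:
        report["error"] = f"{type(exc).__name__}: {exc}"
        return report             # registry untouched
    if report["result"] is True:
        registry[name] = scratch[name]   # commit the verified candidate
        report["accepted"] = True
    else:
        report["error"] = "test expression did not evaluate to True"
    return report


registry = {}
exec("def target(n):\n    return -1\n", registry)   # a stub to improve

bad = swap_verified(registry, "target",
                    "def target(n):\n    return n + n\n", "target(5) == 25")
still_stub = registry["target"](5)    # the failed candidate left the stub in place

good = swap_verified(registry, "target",
                     "def target(n):\n    return n * n\n", "target(5) == 25")

print(bad["accepted"], still_stub, good["accepted"], registry["target"](5))
# prints: False -1 True 25
```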
| 50 | + |
| 51 | +## For non-OMC targets (e.g. Python) — the agnostic scaffold |
| 52 | +The same loop runs on any language: (1) a validity function (parse/typecheck), (2) an executor for |
| 53 | +correctness, (3) a corpus to address against. The OMC repo's `experiments/transformerless_lm/` |
| 54 | +includes a worked Python instantiation (`super_loop.py`, `py_substrate.py`, `locality_fp.py`, |
| 55 | +`exec_eval.py`) and a learned NL→code retriever (`desc_encoder.pt`, held-out recall@5 0.89). Clone the |
| 56 | +repo to reuse them, or re-implement the three hooks for your stack — everything else is unchanged. |
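As a rough idea of what the first two hooks might look like for Python, here is a sketch built only on the standard library. It does not reproduce `py_substrate.py` or `exec_eval.py` from the repo; `parse_check` and `verify_level` are hypothetical names, and the return shape of `parse_check` merely mirrors the `{ok, error}` shape listed for `code_parse_check`. Test expectations are derived from a reference implementation (`math.gcd`), and the function reports which level of verification a candidate actually reached.

```python
import ast
import math

def parse_check(src):
    """Validity hook: does the source parse as Python?"""
    try:
        ast.parse(src)
        return {"ok": True, "error": None}
    except SyntaxError as exc:
        return {"ok": False, "error": f"SyntaxError: {exc.msg} (line {exc.lineno})"}

def verify_level(src, tests):
    """Return the strongest level reached: invalid, parses, runs, or passes_tests."""
    if not parse_check(src)["ok"]:
        return "invalid"
    namespace = {}
    try:
        exec(compile(src, "<candidate>", "exec"), namespace)   # executor hook
    except Exception:
        return "parses"            # valid syntax, but the module body fails to run
    try:
        for expr, expected in tests:
            if eval(expr, namespace) != expected:
                return "runs"      # runs, but gives a wrong answer
    except Exception:
        return "runs"              # runs, but a test call raised
    return "passes_tests"

# Expected values come from running a trusted reference implementation.
tests = [(f"gcd({a}, {b})", math.gcd(a, b)) for a, b in [(12, 18), (17, 5), (0, 9)]]

euclid = "def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a\n"
plausible_but_wrong = "def gcd(a, b):\n    return min(a, b)\n"

print(verify_level(plausible_but_wrong, tests))   # runs
print(verify_level(euclid, tests))                # passes_tests
```

Only `passes_tests` should count as acceptance; the lower levels are useful as feedback for the next generation attempt.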
| 57 | + |
| 58 | +## Hard rules (earned the hard way — these are load-bearing) |
| 59 | +- **VERIFY before accept.** Never return code that hasn't passed the gate. The gate is the whole |
| 60 | + point: it lets even an imperfect generator be safe. |
| 61 | +- **Two fingerprints, two jobs.** Use *content-similarity* (locality/byte-histogram) for "find similar"; |
| 62 | + use *uniform content-addressing* (`haddr`) only for exact keys/buckets. A uniform hash has **no** |
| 63 | + content locality — φ-cosine similarity retrieval measured ≈ random; do not use it for similarity. |
| 64 | +- **Similarity ≠ semantics.** Character/locality similarity is typo/variant-tolerant (`"quicksrt"` → |
| 65 | + `quicksort`) but does NOT map a natural-language description to code (`"greatest common divisor"` |
| 66 | + will not find `gcd`) — that needs a learned encoder. |
| 67 | +- **Valid ≠ correct.** The grammar/exec gate guarantees the code parses and runs; *correctness* needs |
| 68 | + test cases (derive them by running reference implementations). Always state which you verified. |
| 69 | +- **Substrate is a detector/prior, not a computation path.** It helps on identity/addressing/position |
| 70 | + (attenuable); it does not belong on the learned-float scoring path (measured: it loses there). |
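The "two fingerprints" and "similarity ≠ semantics" rules can be seen in a small Python sketch. It is illustrative only: it is not `locality_fp.py` or OMC's `haddr`, the names `histogram`, `cosine`, `nearest`, and `exact_key` are hypothetical, and plain cosine over byte counts and SHA-256 are stand-ins chosen for the example.

```python
import hashlib
import math
from collections import Counter

def histogram(text):
    """Locality fingerprint: counts of each byte in the text."""
    return Counter(text.encode("utf-8"))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def nearest(query, names):
    """'Find similar': pick the name whose byte histogram is closest to the query's."""
    q = histogram(query)
    return max(names, key=lambda name: cosine(q, histogram(name)))

def exact_key(text):
    """Uniform content address: good for exact keys and buckets, useless for similarity."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

corpus = ["quicksort", "gcd", "binary_search", "fib"]
index = {exact_key(name): name for name in corpus}

print(nearest("quicksrt", corpus))                  # quicksort (typo-tolerant)
print(index.get(exact_key("quicksrt")))             # None (one byte off, different key)
print(index.get(exact_key("gcd")))                  # gcd (exact lookup works)
print(nearest("greatest common divisor", corpus))   # quicksort, not gcd
```

The last line shows the limit: character overlap ranks `quicksort` above `gcd` for a description of gcd, which is why mapping natural language to code needs a learned encoder.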
| 71 | + |
| 72 | +## Honest scope |
| 73 | +Capability scales by adding addressed content at flat per-query CPU cost (measured: correctness rises |
| 74 | +with coverage; exact-key lookup stays O(1); verification cost stays constant). The open frontier is generalizing |
| 75 | +*beyond* stored content — bounded by generator quality, not by GPU. |