LLM-onboarding section + omc_find_similar content-addressed lookup

RandomCoder-lab · claude · RandomCoder-lab · commit cb843e693728 · 2026-05-16T14:31:13.000-05:00
Piece 2 — Bootstrap doc gap (Hermes went grep instead of using the
introspection tools). OMC_REFERENCE.md now opens with a "🤖 For LLMs
reading this: first 5 calls to make" section right after the totals.
It calls out:

  1. omc_search_builtins(topic)  — discoverability search
  2. omc_help(name)              — full metadata
  3. omc_explain_error(msg)      — error catalog lookup (970+ patterns)
  4. omc_did_you_mean(typo)      — nearest-name suggestions
  5. omc_bootstrap_pack()        — 20KB session-start doc

Plus a "common gotcha" line warning not to re-define builtins from
scratch — that's literally the bug Hermes hit (re-defined is_prime
and recursed forever). With this prominent at the top of the doc, a
fresh LLM session shouldn't grep blind.
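
As a rough illustration of what a nearest-name lookup like omc_did_you_mean
could do internally, here is a minimal sketch that ranks known names by
Levenshtein edit distance. The function names `edit_distance` and
`did_you_mean`, the `limit` parameter, and the choice of plain Levenshtein
are assumptions for illustration; the commit only says suggestions are
ranked "by edit distance".

```rust
/// Classic Levenshtein edit distance, keeping only the previous row.
/// (Hypothetical helper; OMC's actual metric may differ.)
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the first i-1 chars of `a` and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for i in 1..=a.len() {
        // cur[0] = i: turning i chars into the empty string takes i deletions.
        let mut cur = vec![i; b.len() + 1];
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            cur[j] = (prev[j] + 1) // deletion
                .min(cur[j - 1] + 1) // insertion
                .min(prev[j - 1] + cost); // substitution or match
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Return up to `limit` names from `known`, nearest to `typo` first.
fn did_you_mean<'a>(typo: &str, known: &[&'a str], limit: usize) -> Vec<&'a str> {
    let mut scored: Vec<(usize, &'a str)> = known
        .iter()
        .map(|name| (edit_distance(typo, name), *name))
        .collect();
    // Sort by distance, then alphabetically so ties are deterministic.
    scored.sort();
    scored.into_iter().take(limit).map(|(_, name)| name).collect()
}
```

With this sketch, `did_you_mean("is_prim", &["arr_softmax", "is_prime", "omc_help"], 1)`
yields `["is_prime"]`, which is the kind of hint that would have steered
Hermes away from re-defining the builtin.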

Piece 3 — omc_find_similar(query, corpus, top_k?) for content-addressed
code lookup. Returns [{index, distance}] ranked closest-first by
canonical-hash distance.
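
The ranking step can be sketched independently of canonicalization. The
snippet below assumes the canonical hashes are already available as `i64`
values; `rank_by_hash_distance` and its `Option<usize>` top_k are
hypothetical names and shapes, not the signature added in this commit.

```rust
/// Rank corpus hashes by absolute distance from the query hash, closest first.
/// `top_k = None` returns the full ranking. (Illustrative sketch only.)
fn rank_by_hash_distance(query: i64, corpus: &[i64], top_k: Option<usize>) -> Vec<(usize, u64)> {
    let mut scored: Vec<(usize, u64)> = corpus
        .iter()
        .enumerate()
        // abs_diff cannot overflow, unlike (a - b).abs() on i64.
        .map(|(i, h)| (i, query.abs_diff(*h)))
        .collect();
    // Stable sort: entries with equal distance keep their corpus order.
    scored.sort_by_key(|&(_, d)| d);
    let k = top_k.unwrap_or(scored.len());
    scored.truncate(k);
    scored
}
```

A distance of 0 in the first slot signals a match; every other position
only says "not equal". Using `abs_diff` is a design choice made here to
keep the subtraction well-defined across the full `i64` range.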

**Honest framing in the docs**: distance == 0 means alpha-equivalent
(invariant under whitespace / comments / local renames). Distance > 0
means "not equivalent" — the *magnitude* is essentially noise because
fnv1a doesn't preserve nearness. Use as exact-match dedup; for fuzzy
similarity ordering use omc_code_similarity (Jaccard over canonical
tokens) instead. The honest split: find_similar for "is this in the
corpus", code_similarity for "how close are these two".
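
To make the "magnitude is noise" point concrete, here is a minimal FNV-1a
sketch. It assumes the standard 64-bit variant with the published offset
basis and prime; the commit does not say which width or post-processing
OMC's code_hash uses, so treat this as an illustration of the hash family
rather than of OMC's exact function.

```rust
/// 64-bit FNV-1a over raw bytes (standard offset basis and prime).
/// Assumption: OMC's internal hash may use a different width or folding.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for b in bytes {
        hash ^= u64::from(*b); // mix in the next byte
        hash = hash.wrapping_mul(0x100000001b3); // multiply by the FNV prime
    }
    hash
}

/// Numeric gap between two hashes; only a gap of 0 is informative.
fn hash_gap(a: &[u8], b: &[u8]) -> u64 {
    fnv1a64(a).abs_diff(fnv1a64(b))
}
```

Because each byte is XORed in and then multiplied through every later
step, a one-token change early in the input scrambles the whole value, so
`hash_gap` says little about how much of the source actually changed.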

The exact-match dedup case is still something Python's built-in hash()
can't do: hash(s) hashes the raw source string, so any change in
formatting, whitespace, comments, or names changes the result, and it
can't key a content-addressed code store on its own.
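
The difference between hashing raw text and hashing a canonical form can
be shown with a toy canonicalizer. Everything below is an assumption for
illustration: `toy_canonicalize`, its keyword list, and its renaming
scheme are not OMC's canonicalizer, and this toy renames every non-keyword
identifier (function names and builtins included), which a real
implementation would need to treat more carefully.

```rust
use std::collections::HashMap;

/// Toy canonical form: drop whitespace and `#` line comments, and rename each
/// non-keyword identifier to v0, v1, ... in order of first appearance.
fn toy_canonicalize(src: &str) -> String {
    const KEYWORDS: [&str; 5] = ["fn", "return", "if", "while", "h"];
    let mut names: HashMap<String, String> = HashMap::new();
    let mut out: Vec<String> = Vec::new();
    let chars: Vec<char> = src.chars().collect();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c.is_whitespace() {
            i += 1;
        } else if c == '#' {
            // Line comment: skip to end of line.
            while i < chars.len() && chars[i] != '\n' {
                i += 1;
            }
        } else if c.is_alphabetic() || c == '_' {
            let start = i;
            while i < chars.len() && (chars[i].is_alphanumeric() || chars[i] == '_') {
                i += 1;
            }
            let word: String = chars[start..i].iter().collect();
            if KEYWORDS.contains(&word.as_str()) {
                out.push(word);
            } else {
                // Reuse the placeholder if this name was seen before.
                let next = format!("v{}", names.len());
                out.push(names.entry(word).or_insert(next).clone());
            }
        } else {
            // Digits and punctuation pass through one character at a time.
            out.push(c.to_string());
            i += 1;
        }
    }
    out.join(" ")
}
```

Two sources that differ only in parameter names, layout, and comments
produce the same canonical string here, so any hash of that string is
identical for both, while the raw strings themselves compare unequal.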

Tests: 8 OMC cases (alpha-eq → distance 0, top_k limit, full ranking,
ascending order, empty corpus, singleton, self-match, rename-match)
+ 2 Rust unit tests.

Demo (examples/demos/code_retrieval.omc) runs 3 queries against an
8-function corpus + a bonus showing omc_code_similarity Jaccard
ordering (loss-vs-loss-renamed = 1.0, loss-vs-factorial = 0.52).
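
For contrast with hash distance, Jaccard similarity over token sets does
give a meaningful ordering. The sketch below is a generic Jaccard over
distinct string tokens; the function name `jaccard` and the use of plain
`&str` tokens are assumptions, since the commit only states that
omc_code_similarity works over canonical tokens.

```rust
use std::collections::HashSet;

/// Jaccard similarity |A ∩ B| / |A ∪ B| over the distinct tokens of two inputs.
/// Two empty inputs are treated as identical (1.0).
fn jaccard(a: &[&str], b: &[&str]) -> f64 {
    let sa: HashSet<&str> = a.iter().copied().collect();
    let sb: HashSet<&str> = b.iter().copied().collect();
    let union = sa.union(&sb).count();
    if union == 0 {
        return 1.0;
    }
    sa.intersection(&sb).count() as f64 / union as f64
}
```

Identical token sets score 1.0, disjoint sets score 0.0, and partial
overlap lands in between, which is why a renamed copy can score 1.0 while
an unrelated function scores lower.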

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
diff --git a/OMC_REFERENCE.md b/OMC_REFERENCE.md
@@ -2,9 +2,35 @@
 
 Auto-generated from `omnimcode-core/src/docs.rs`. Run `omc --gen-docs > OMC_REFERENCE.md` to regenerate.
 
-**Total documented builtins**: 628
+**Total documented builtins**: 629
 
-**OMC-unique**: 63 (no direct Python/NumPy equivalent — these are why you reach for OMC over numpy)
+**OMC-unique**: 64 (no direct Python/NumPy equivalent — these are why you reach for OMC over numpy)
+
+---
+
+## 🤖 For LLMs reading this: first 5 calls to make
+
+This reference is grep-able, but OMC also exposes runtime
+introspection — usually faster than scanning the doc:
+
+1. **`omc_search_builtins("<topic>")`** — substring search across name + description. 
+   Best first call when you know *what* but not *which name*.
+
+2. **`omc_help("<name>")`** — returns a dict with signature + description + example + category + unique_to_omc.
+   Use after `omc_search_builtins` narrows the field.
+
+3. **`omc_explain_error("<error message>")`** — pattern-match against the 970+ curated catalog. Returns explanation + cause + one-line fix.
+   ALWAYS call this when an OMC program errors. Don't guess.
+
+4. **`omc_did_you_mean("<typo>")`** — suggest the nearest known names by edit distance. Use when `omc_help` returns `found: 0`.
+
+5. **`omc_bootstrap_pack()`** — returns a ~20KB Markdown doc with categorized cheatsheets + Python → OMC translation table.
+   Load this once at session start instead of repeated grep.
+
+Other high-value calls: `omc_unique_builtins()` (the OMC-only surface), `omc_python_translation()` (Python↔OMC table),
+`omc_cheatsheet("<topic>")` (markdown per category), `omc_canonical_hash(code)` / `omc_id(code)` (semantic memory keys for code regions).
+
+**Common gotcha**: don't re-define OMC builtins from scratch — `is_prime`, `arr_softmax`, `arr_resonance_vec`, etc. all ship. Always `omc_search_builtins` first.
 
 ---
 
@@ -25,7 +51,7 @@ Auto-generated from `omnimcode-core/src/docs.rs`. Run `omc --gen-docs > OMC_REFE
 - [exceptions](#exceptions) (2 builtins)
 - [introspection](#introspection) (30 builtins)
 - [tokenizer](#tokenizer) (17 builtins)
-- [code_intel](#code_intel) (16 builtins)
+- [code_intel](#code_intel) (17 builtins)
 - [llm_workflow](#llm_workflow) (7 builtins)
 - [math](#math) (82 builtins)
 - [dicts](#dicts) (31 builtins)
@@ -4948,6 +4974,16 @@ Bulk metrics: {complexity, ast_size, ast_depth, source_bytes, token_count, compr
 omc_code_metrics(src)  // all stats at once
 ```
 
+### `omc_find_similar` 🔱 *OMC-unique*
+
+**Signature**: `(query: string, corpus: string[], top_k?: int) -> dict[]`
+
+Content-addressed code lookup. Distance 0 = alpha-equivalent (exact match modulo cosmetic edits). Distance > 0 means 'not equivalent' but the magnitude isn't a true similarity metric (fnv1a hashes don't preserve nearness). Use as exact-match dedup, not as fuzzy ranking. Python's hash() can't even do the exact-match case because it's formatting-sensitive.
+
+```omc
+omc_find_similar(q, corpus)  // [{index, distance}] — index of any distance-0 hit is the alpha-equiv match
+```
+
 ---
 
 ## llm_workflow
diff --git a/examples/demos/code_retrieval.omc b/examples/demos/code_retrieval.omc
@@ -0,0 +1,79 @@
+# Content-addressed code lookup — the dedup primitive Python's
+# hash() can't power.
+#
+# IMPORTANT: distance == 0 means "alpha-equivalent" (exact match
+# modulo cosmetic edits). Distance > 0 means "not equivalent" —
+# the magnitude is essentially noise. For true fuzzy similarity
+# (where you want ordering on non-equal entries), use
+# omc_code_similarity (Jaccard over canonical tokens) instead.
+
+fn show(label, v) { print(concat_many(label, " = ", to_string(v))); }
+fn section(name) { print(""); print(concat_many("=== ", name, " ===")); }
+
+fn main() {
+    print("=== Substrate-distance code retrieval ===");
+    print("Python's hash() is formatting-sensitive — it can't power semantic");
+    print("code search. OMC's canonical hash + omc_find_similar can.");
+
+    h corpus = [
+        "fn loss(pred, target) { return (pred - target) * (pred - target); }",
+        "fn sum_array(xs) { h s = 0; h i = 0; while i < arr_len(xs) { s = s + arr_get(xs, i); i = i + 1; } return s; }",
+        "fn min_max_normalize(xs) { h mn = arr_min_int(xs); h mx = arr_max_int(xs); h rng = mx - mn; return arr_map(xs, fn(v) { return (v - mn) / rng; }); }",
+        "fn dot_product(a, b) { h s = 0; h i = 0; while i < arr_len(a) { s = s + arr_get(a, i) * arr_get(b, i); i = i + 1; } return s; }",
+        "fn relu_arr(xs) { return arr_map(xs, fn(v) { if v > 0 { return v; } return 0; }); }",
+        "fn sigmoid_arr(xs) { return arr_map(xs, fn(v) { return 1 / (1 + exp(0 - v)); }); }",
+        "fn fibonacci(n) { if n <= 1 { return n; } return fibonacci(n - 1) + fibonacci(n - 2); }",
+        "fn factorial(n) { if n <= 1 { return 1; } return n * factorial(n - 1); }",
+    ];
+
+    section("Query 1: loss function with renamed params");
+    h q1 = "fn loss(p, t) { return (p - t) * (p - t); }";  # alpha-equivalent to corpus[0]
+    h r1 = omc_find_similar(q1, corpus, 3);
+    h i = 0;
+    while i < arr_len(r1) {
+        h hit = arr_get(r1, i);
+        print(concat_many("  rank ", to_string(i),
+            "  index=", to_string(dict_get(hit, "index")),
+            "  distance=", to_string(dict_get(hit, "distance"))));
+        i = i + 1;
+    }
+    print("  → top hit at index 0 (loss with rename), distance 0");
+
+    section("Query 2: novel summing loop");
+    h q2 = "fn sum_xs(items) { h acc = 0; h i = 0; while i < arr_len(items) { acc = acc + arr_get(items, i); i = i + 1; } return acc; }";  # alpha-equiv to corpus[1]
+    h r2 = omc_find_similar(q2, corpus, 3);
+    i = 0;
+    while i < arr_len(r2) {
+        h hit = arr_get(r2, i);
+        print(concat_many("  rank ", to_string(i),
+            "  index=", to_string(dict_get(hit, "index")),
+            "  distance=", to_string(dict_get(hit, "distance"))));
+        i = i + 1;
+    }
+    print("  → top hit at index 1 (sum_array), distance 0");
+
+    section("Query 3: novel recursion that has no alpha-twin");
+    h q3 = "fn power_of_two(n) { if n == 0 { return 1; } return 2 * power_of_two(n - 1); }";
+    h r3 = omc_find_similar(q3, corpus, 3);
+    i = 0;
+    while i < arr_len(r3) {
+        h hit = arr_get(r3, i);
+        print(concat_many("  rank ", to_string(i),
+            "  index=", to_string(dict_get(hit, "index")),
+            "  distance=", to_string(dict_get(hit, "distance"))));
+        i = i + 1;
+    }
+    print("  → no distance-0 hit, so nothing in the corpus is alpha-equivalent (nonzero distances carry no ranking signal)");
+
+    section("Bonus: fuzzy similarity via omc_code_similarity");
+    h s_loss = omc_code_similarity(q1, arr_get(corpus, 0));    # alpha-equiv → ~1.0
+    h s_unrelated = omc_code_similarity(q1, arr_get(corpus, 7));  # factorial
+    print(concat_many("  loss-vs-loss(renamed): ", to_string(s_loss)));
+    print(concat_many("  loss-vs-factorial    : ", to_string(s_unrelated)));
+    print("  → Jaccard ordering IS meaningful (unlike raw hash distance).");
+
+    print("");
+    print("=== End: dedup via find_similar (exact-match), ordering via code_similarity ===");
+}
+
+main();
diff --git a/examples/tests/test_find_similar.omc b/examples/tests/test_find_similar.omc
@@ -0,0 +1,91 @@
+# Substrate-distance code retrieval — content-addressed code search.
+
+fn assert_eq(actual, expected, msg) {
+    if actual != expected {
+        test_record_failure(msg + ": expected " + to_string(expected) + " got " + to_string(actual));
+    }
+}
+
+fn assert_true(cond, msg) { if !cond { test_record_failure(msg); } }
+
+# Distance 0 for exact alpha-match
+fn test_alpha_equivalent_matches_first() {
+    h corpus = [
+        "fn unrelated() { return 99; }",
+        "fn f(a) { return a + 1; }",
+        "fn another() { return 42; }",
+    ];
+    h ranked = omc_find_similar("fn f(x) { return x + 1; }", corpus);
+    h first = arr_get(ranked, 0);
+    assert_eq(dict_get(first, "index"), 1, "alpha-equivalent at index 1");
+    assert_eq(dict_get(first, "distance"), 0, "distance 0");
+}
+
+# Top-k limit respected
+fn test_top_k_limit() {
+    h corpus = [
+        "fn a() { return 1; }",
+        "fn b() { return 2; }",
+        "fn c() { return 3; }",
+        "fn d() { return 4; }",
+        "fn e() { return 5; }",
+    ];
+    h ranked = omc_find_similar("fn a() { return 1; }", corpus, 3);
+    assert_eq(arr_len(ranked), 3, "top-3 respected");
+}
+
+# All results when no top_k
+fn test_full_ranking() {
+    h corpus = [
+        "fn a() { return 1; }",
+        "fn b() { return 2; }",
+        "fn c() { return 3; }",
+    ];
+    h ranked = omc_find_similar("fn a() { return 1; }", corpus);
+    assert_eq(arr_len(ranked), 3, "full ranking");
+}
+
+# Ranking is ascending distance
+fn test_ascending_distance() {
+    h corpus = [
+        "fn similar(x) { return x + 1; }",
+        "fn totally_different() { return arr_softmax(arr_neg([1.0, 2.0, 3.0])); }",
+        "fn close(y) { return y + 1; }",
+    ];
+    h ranked = omc_find_similar("fn similar(x) { return x + 1; }", corpus);
+    h r0 = arr_get(ranked, 0);
+    h r1 = arr_get(ranked, 1);
+    h r2 = arr_get(ranked, 2);
+    assert_true(dict_get(r0, "distance") <= dict_get(r1, "distance"), "0 ≤ 1");
+    assert_true(dict_get(r1, "distance") <= dict_get(r2, "distance"), "1 ≤ 2");
+}
+
+# Empty corpus is fine
+fn test_empty_corpus() {
+    h ranked = omc_find_similar("fn f() {}", []);
+    assert_eq(arr_len(ranked), 0, "empty");
+}
+
+# Singleton corpus
+fn test_singleton_corpus() {
+    h ranked = omc_find_similar("fn f() {}", ["fn g() { return 1; }"]);
+    assert_eq(arr_len(ranked), 1, "one entry");
+    h r = arr_get(ranked, 0);
+    assert_eq(dict_get(r, "index"), 0, "index 0");
+}
+
+# Self-match in corpus is distance 0
+fn test_self_match() {
+    h q = "fn loss(p, t) { return (p - t) * (p - t); }";
+    h ranked = omc_find_similar(q, [q]);
+    h r = arr_get(ranked, 0);
+    assert_eq(dict_get(r, "distance"), 0, "self → 0");
+}
+
+# Rename-only match in corpus → distance 0
+fn test_rename_match() {
+    h q = "fn loss(p, t) { return (p - t) * (p - t); }";
+    h renamed = "fn loss(pred, target) { return (pred - target) * (pred - target); }";
+    h ranked = omc_find_similar(q, [renamed]);
+    assert_eq(dict_get(arr_get(ranked, 0), "distance"), 0, "rename → 0");
+}
diff --git a/omnimcode-core/src/code_intel.rs b/omnimcode-core/src/code_intel.rs
@@ -433,6 +433,44 @@ pub fn diff(a: &str, b: &str) -> Result<CodeDiff, String> {
     Ok(diff)
 }
 
+/// Match against a corpus of code chunks. Returns
+/// Vec<(index_into_corpus, distance)> sorted by ascending distance.
+///
+/// **Honest framing**: distance == 0 means the corpus entry is
+/// alpha-equivalent to `query` (same canonical form). Distance > 0
+/// means "not equivalent" — but the *magnitude* of that distance is
+/// essentially noise, because fnv1a hashes don't preserve a "nearness"
+/// metric. Two programs that are structurally close can have wildly
+/// different hash diffs; two programs that are structurally far apart
+/// can have a small one. Treat as exact-match dedup, not as fuzzy
+/// similarity ranking.
+///
+/// What Python's hash() can't do that this can: the *exact-match*
+/// case is invariant under renames / whitespace / comments. Python's
+/// hash(source) is sensitive to all three. For true fuzzy similarity,
+/// use `omc_code_similarity` (Jaccard over canonical token IDs).
+pub fn find_similar(query: &str, corpus: &[String]) -> Result<Vec<(usize, i64)>, String> {
+    let canon_q = crate::canonical::canonicalize(query)
+        .map_err(|e| format!("find_similar: query canonicalize: {}", e))?;
+    let (_, raw_q, _) = crate::tokenizer::code_hash(&canon_q);
+    let mut scored: Vec<(usize, i64)> = Vec::with_capacity(corpus.len());
+    for (i, c) in corpus.iter().enumerate() {
+        match crate::canonical::canonicalize(c) {
+            Ok(canon_c) => {
+                let (_, raw_c, _) = crate::tokenizer::code_hash(&canon_c);
+                let d = raw_q.saturating_sub(raw_c).saturating_abs();
+                scored.push((i, d));
+            }
+            Err(_) => {
+                // Unparseable corpus entries get worst-case distance.
+                scored.push((i, i64::MAX));
+            }
+        }
+    }
+    scored.sort_by_key(|(_, d)| *d);
+    Ok(scored)
+}
+
 /// Quick metrics: substrate score + complexity + size all in one shot.
 /// Computed in one parse-and-canonicalize pass each.
 pub fn quick_metrics(source: &str) -> Result<std::collections::BTreeMap<String, f64>, String> {
@@ -507,4 +545,22 @@ mod tests {
         let b = "fn f(x) { return arr_softmax(arr_neg(x)); }";
         assert!(similarity(a, b).unwrap() < 1.0);
     }
+
+    #[test]
+    fn find_similar_perfect_match_first() {
+        let q = "fn f(x) { return x + 1; }";
+        let corpus = vec![
+            "fn unrelated() { return 99; }".to_string(),
+            "fn f(a) { return a + 1; }".to_string(),
+        ];
+        let r = find_similar(q, &corpus).unwrap();
+        assert_eq!(r[0].0, 1);
+        assert_eq!(r[0].1, 0);
+    }
+
+    #[test]
+    fn find_similar_empty_corpus() {
+        let r = find_similar("fn f() {}", &[]).unwrap();
+        assert!(r.is_empty());
+    }
 }
diff --git a/omnimcode-core/src/compiler.rs b/omnimcode-core/src/compiler.rs
@@ -298,6 +298,7 @@ impl Compiler {
                         | "omc_completion_hint"
                         | "omc_memory_keys" | "omc_help_all_category"
                         | "omc_search_builtins"
+                        | "omc_find_similar"
                         // Forward-mode autograd duals (Track 2 — 2026-05-16)
                         | "dual" | "dual_add" | "dual_sub"
                         | "dual_mul" | "dual_div" | "dual_neg"
diff --git a/omnimcode-core/src/docs.rs b/omnimcode-core/src/docs.rs
@@ -1073,6 +1073,13 @@ pub const BUILTINS: &[BuiltinDoc] = &[
         example: "omc_code_metrics(src)  // all stats at once",
         unique_to_omc: false,
     },
+    BuiltinDoc {
+        name: "omc_find_similar", category: "code_intel",
+        signature: "(query: string, corpus: string[], top_k?: int) -> dict[]",
+        description: "Content-addressed code lookup. Distance 0 = alpha-equivalent (exact match modulo cosmetic edits). Distance > 0 means 'not equivalent' but the magnitude isn't a true similarity metric (fnv1a hashes don't preserve nearness). Use as exact-match dedup, not as fuzzy ranking. Python's hash() can't even do the exact-match case because it's formatting-sensitive.",
+        example: "omc_find_similar(q, corpus)  // [{index, distance}] — index of any distance-0 hit is the alpha-equiv match",
+        unique_to_omc: true,
+    },
     // ---- LLM workflow bundles ----
     BuiltinDoc {
         name: "omc_cheatsheet", category: "llm_workflow",
@@ -1749,6 +1756,26 @@ pub fn render_full_reference() -> String {
         "**OMC-unique**: {} (no direct Python/NumPy equivalent — these are why you reach for OMC over numpy)\n\n",
         unique_count
     ));
+
+    // ---- LLM-onboarding section. Front-and-center so a fresh
+    //      LLM session sees the right "first 5 calls" before
+    //      reaching for grep.
+    out.push_str("---\n\n");
+    out.push_str("## 🤖 For LLMs reading this: first 5 calls to make\n\n");
+    out.push_str("This reference is grep-able, but OMC also exposes runtime\n");
+    out.push_str("introspection — usually faster than scanning the doc:\n\n");
+    out.push_str("1. **`omc_search_builtins(\"<topic>\")`** — substring search across name + description. \n");
+    out.push_str("   Best first call when you know *what* but not *which name*.\n\n");
+    out.push_str("2. **`omc_help(\"<name>\")`** — returns a dict with signature + description + example + category + unique_to_omc.\n");
+    out.push_str("   Use after `omc_search_builtins` narrows the field.\n\n");
+    out.push_str("3. **`omc_explain_error(\"<error message>\")`** — pattern-match against the 970+ curated catalog. Returns explanation + cause + one-line fix.\n");
+    out.push_str("   ALWAYS call this when an OMC program errors. Don't guess.\n\n");
+    out.push_str("4. **`omc_did_you_mean(\"<typo>\")`** — suggest the nearest known names by edit distance. Use when `omc_help` returns `found: 0`.\n\n");
+    out.push_str("5. **`omc_bootstrap_pack()`** — returns a ~20KB Markdown doc with categorized cheatsheets + Python → OMC translation table.\n");
+    out.push_str("   Load this once at session start instead of repeated grep.\n\n");
+    out.push_str("Other high-value calls: `omc_unique_builtins()` (the OMC-only surface), `omc_python_translation()` (Python↔OMC table),\n");
+    out.push_str("`omc_cheatsheet(\"<topic>\")` (markdown per category), `omc_canonical_hash(code)` / `omc_id(code)` (semantic memory keys for code regions).\n\n");
+    out.push_str("**Common gotcha**: don't re-define OMC builtins from scratch — `is_prime`, `arr_softmax`, `arr_resonance_vec`, etc. all ship. Always `omc_search_builtins` first.\n\n");
     out.push_str("---\n\n");
     out.push_str("## Categories\n\n");
     for cat in categories() {
diff --git a/omnimcode-core/src/interpreter.rs b/omnimcode-core/src/interpreter.rs
@@ -7751,6 +7751,37 @@ impl Interpreter {
                     Err(e) => Err(format!("omc_id: {}", e)),
                 }
             }
+            "omc_find_similar" => {
+                // omc_find_similar(query, corpus[]) → [{index, distance}, ...]
+                // ranked closest-first by canonical-hash distance.
+                if args.len() < 2 {
+                    return Err("omc_find_similar requires (query, corpus[])".to_string());
+                }
+                let query = self.eval_expr(&args[0])?.to_display_string();
+                let corpus_v = self.eval_expr(&args[1])?;
+                let corpus: Vec<String> = if let Value::Array(arr) = corpus_v {
+                    arr.items.borrow().iter()
+                        .map(|x| x.to_display_string())
+                        .collect()
+                } else {
+                    return Err("omc_find_similar: corpus must be a string array".to_string());
+                };
+                let ranked = crate::code_intel::find_similar(&query, &corpus)
+                    .map_err(|e| format!("omc_find_similar: {}", e))?;
+                // Optional 3rd arg = top_k (default = full list).
+                let top_k = if args.len() >= 3 {
+                    self.eval_expr(&args[2])?.to_int().max(1) as usize
+                } else { ranked.len() };
+                let out: Vec<Value> = ranked.iter().take(top_k)
+                    .map(|(idx, dist)| {
+                        let mut map = std::collections::BTreeMap::new();
+                        map.insert("index".to_string(), Value::HInt(HInt::new(*idx as i64)));
+                        map.insert("distance".to_string(), Value::HInt(HInt::new(*dist)));
+                        Value::dict_from(map)
+                    })
+                    .collect();
+                Ok(Value::Array(HArray::from_vec(out)))
+            }
             "omc_code_diff" => {
                 // Structural diff: returns {added, removed, modified, unchanged}.
                 // Compared after canonicalization so renames don't show.