
Commit cb843e6

LLM-onboarding section + omc_find_similar content-addressed lookup
Piece 2: Bootstrap doc gap (Hermes fell back to grep instead of using the introspection tools).

OMC_REFERENCE.md now opens with a "🤖 For LLMs reading this: first 5 calls to make" section right after the totals. It calls out:

1. omc_search_builtins(topic): discoverability search
2. omc_help(name): full metadata
3. omc_explain_error(msg): error catalog lookup (970+ patterns)
4. omc_did_you_mean(typo): nearest-name suggestions
5. omc_bootstrap_pack(): 20KB session-start doc

Plus a "common gotcha" line warning not to re-define builtins from scratch. That is exactly the bug Hermes hit (re-defined is_prime and recursed forever). With this prominent at the top of the doc, a fresh LLM session shouldn't grep blind.

Piece 3: omc_find_similar(query, corpus, top_k?) for content-addressed code lookup. Returns [{index, distance}] ranked closest-first by canonical-hash distance.

**Honest framing in the docs**: distance == 0 means alpha-equivalent (invariant under whitespace / comments / local renames). Distance > 0 means "not equivalent"; the *magnitude* is essentially noise because fnv1a doesn't preserve nearness. Use as exact-match dedup; for fuzzy similarity ordering use omc_code_similarity (Jaccard over canonical tokens) instead.

The honest split: find_similar for "is this in the corpus", code_similarity for "how close are these two". The exact-match dedup case is still something Python's built-in hash() cannot do: hash(s) in Python is sensitive to formatting/whitespace/comments/renames, so on its own it can't power a content-addressed code store.

Tests: 8 OMC cases (alpha-eq → distance 0, top_k limit, full ranking, ascending order, empty corpus, singleton, self-match, rename-match) + 2 Rust unit tests.

Demo (examples/demos/code_retrieval.omc) runs 3 queries against an 8-function corpus + a bonus showing omc_code_similarity Jaccard ordering (loss-vs-loss-renamed = 1.0, loss-vs-factorial = 0.52).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
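The remark that fnv1a "doesn't preserve nearness" can be illustrated with a small standalone sketch. It assumes the standard 64-bit FNV-1a variant over raw bytes; the commit does not show which width or input encoding OMC's `code_hash` uses, so `fnv1a64` and `hash_distance` are hypothetical names for illustration only.

```rust
/// 64-bit FNV-1a over raw bytes, using the standard offset basis and prime.
/// Assumption: OMC's own hash may differ in width or in what it feeds in.
pub fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a 64-bit offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
    }
    hash
}

/// Absolute difference between the hashes of two strings.
/// Zero means the byte strings hash identically; any non-zero value
/// says only "different", not "how different".
pub fn hash_distance(a: &str, b: &str) -> u64 {
    fnv1a64(a.as_bytes()).abs_diff(fnv1a64(b.as_bytes()))
}
```

Every input byte is mixed into the running state by a multiplication, so a one-character edit scrambles the whole output. That is why only the zero / non-zero distinction carries meaning.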
1 parent 92d7d90 commit cb843e6

7 files changed

Lines changed: 324 additions & 3 deletions

OMC_REFERENCE.md

Lines changed: 39 additions & 3 deletions
@@ -2,9 +2,35 @@
 
 Auto-generated from `omnimcode-core/src/docs.rs`. Run `omc --gen-docs > OMC_REFERENCE.md` to regenerate.
 
-**Total documented builtins**: 628
+**Total documented builtins**: 629
 
-**OMC-unique**: 63 (no direct Python/NumPy equivalent — these are why you reach for OMC over numpy)
+**OMC-unique**: 64 (no direct Python/NumPy equivalent — these are why you reach for OMC over numpy)
+
+---
+
+## 🤖 For LLMs reading this: first 5 calls to make
+
+This reference is grep-able, but OMC also exposes runtime
+introspection — usually faster than scanning the doc:
+
+1. **`omc_search_builtins("<topic>")`** — substring search across name + description.
+   Best first call when you know *what* but not *which name*.
+
+2. **`omc_help("<name>")`** — returns a dict with signature + description + example + category + unique_to_omc.
+   Use after `omc_search_builtins` narrows the field.
+
+3. **`omc_explain_error("<error message>")`** — pattern-match against the 970+ curated catalog. Returns explanation + cause + one-line fix.
+   ALWAYS call this when an OMC program errors. Don't guess.
+
+4. **`omc_did_you_mean("<typo>")`** — suggest the nearest known names by edit distance. Use when `omc_help` returns `found: 0`.
+
+5. **`omc_bootstrap_pack()`** — returns a ~20KB Markdown doc with categorized cheatsheets + Python → OMC translation table.
+   Load this once at session start instead of repeated grep.
+
+Other high-value calls: `omc_unique_builtins()` (the OMC-only surface), `omc_python_translation()` (Python↔OMC table),
+`omc_cheatsheet("<topic>")` (markdown per category), `omc_canonical_hash(code)` / `omc_id(code)` (semantic memory keys for code regions).
+
+**Common gotcha**: don't re-define OMC builtins from scratch — `is_prime`, `arr_softmax`, `arr_resonance_vec`, etc. all ship. Always `omc_search_builtins` first.
 
 ---
 
@@ -25,7 +51,7 @@ Auto-generated from `omnimcode-core/src/docs.rs`. Run `omc --gen-docs > OMC_REFE
 - [exceptions](#exceptions) (2 builtins)
 - [introspection](#introspection) (30 builtins)
 - [tokenizer](#tokenizer) (17 builtins)
-- [code_intel](#code_intel) (16 builtins)
+- [code_intel](#code_intel) (17 builtins)
 - [llm_workflow](#llm_workflow) (7 builtins)
 - [math](#math) (82 builtins)
 - [dicts](#dicts) (31 builtins)
@@ -4948,6 +4974,16 @@ Bulk metrics: {complexity, ast_size, ast_depth, source_bytes, token_count, compr
 omc_code_metrics(src) // all stats at once
 ```
 
+### `omc_find_similar` 🔱 *OMC-unique*
+
+**Signature**: `(query: string, corpus: string[], top_k?: int) -> dict[]`
+
+Content-addressed code lookup. Distance 0 = alpha-equivalent (exact match modulo cosmetic edits). Distance > 0 means 'not equivalent' but the magnitude isn't a true similarity metric (fnv1a hashes don't preserve nearness). Use as exact-match dedup, not as fuzzy ranking. Python's hash() can't even do the exact-match case because it's formatting-sensitive.
+
+```omc
+omc_find_similar(q, corpus) // [{index, distance}] — index of any distance-0 hit is the alpha-equiv match
+```
+
 ---
 
 ## llm_workflow

examples/demos/code_retrieval.omc

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
+# Content-addressed code lookup — the dedup primitive Python's
+# hash() can't power.
+#
+# IMPORTANT: distance == 0 means "alpha-equivalent" (exact match
+# modulo cosmetic edits). Distance > 0 means "not equivalent" —
+# the magnitude is essentially noise. For true fuzzy similarity
+# (where you want ordering on non-equal entries), use
+# omc_code_similarity (Jaccard over canonical tokens) instead.
+
+fn show(label, v) { print(concat_many(label, " = ", to_string(v))); }
+fn section(name) { print(""); print(concat_many("=== ", name, " ===")); }
+
+fn main() {
+    print("=== Substrate-distance code retrieval ===");
+    print("Python's hash() is formatting-sensitive, so it can't power content-addressed");
+    print("code dedup. OMC's canonical hash + omc_find_similar can.");
+
+    h corpus = [
+        "fn loss(pred, target) { return (pred - target) * (pred - target); }",
+        "fn sum_array(xs) { h s = 0; h i = 0; while i < arr_len(xs) { s = s + arr_get(xs, i); i = i + 1; } return s; }",
+        "fn min_max_normalize(xs) { h mn = arr_min_int(xs); h mx = arr_max_int(xs); h rng = mx - mn; return arr_map(xs, fn(v) { return (v - mn) / rng; }); }",
+        "fn dot_product(a, b) { h s = 0; h i = 0; while i < arr_len(a) { s = s + arr_get(a, i) * arr_get(b, i); i = i + 1; } return s; }",
+        "fn relu_arr(xs) { return arr_map(xs, fn(v) { if v > 0 { return v; } return 0; }); }",
+        "fn sigmoid_arr(xs) { return arr_map(xs, fn(v) { return 1 / (1 + exp(0 - v)); }); }",
+        "fn fibonacci(n) { if n <= 1 { return n; } return fibonacci(n - 1) + fibonacci(n - 2); }",
+        "fn factorial(n) { if n <= 1 { return 1; } return n * factorial(n - 1); }",
+    ];
+
+    section("Query 1: loss function with renamed params");
+    h q1 = "fn loss(p, t) { return (p - t) * (p - t); }";  # alpha-equivalent to corpus[0]
+    h r1 = omc_find_similar(q1, corpus, 3);
+    h i = 0;
+    while i < arr_len(r1) {
+        h hit = arr_get(r1, i);
+        print(concat_many("  rank ", to_string(i),
+                          "  index=", to_string(dict_get(hit, "index")),
+                          "  distance=", to_string(dict_get(hit, "distance"))));
+        i = i + 1;
+    }
+    print("  → top hit at index 0 (loss with rename), distance 0");
+
+    section("Query 2: summing loop with renamed function and locals");
+    h q2 = "fn sum_xs(items) { h acc = 0; h i = 0; while i < arr_len(items) { acc = acc + arr_get(items, i); i = i + 1; } return acc; }";  # alpha-equiv to corpus[1]
+    h r2 = omc_find_similar(q2, corpus, 3);
+    i = 0;
+    while i < arr_len(r2) {
+        h hit = arr_get(r2, i);
+        print(concat_many("  rank ", to_string(i),
+                          "  index=", to_string(dict_get(hit, "index")),
+                          "  distance=", to_string(dict_get(hit, "distance"))));
+        i = i + 1;
+    }
+    print("  → top hit at index 1 (sum_array), distance 0");
+
+    section("Query 3: novel recursion that has no alpha-twin");
+    h q3 = "fn power_of_two(n) { if n == 0 { return 1; } return 2 * power_of_two(n - 1); }";
+    h r3 = omc_find_similar(q3, corpus, 3);
+    i = 0;
+    while i < arr_len(r3) {
+        h hit = arr_get(r3, i);
+        print(concat_many("  rank ", to_string(i),
+                          "  index=", to_string(dict_get(hit, "index")),
+                          "  distance=", to_string(dict_get(hit, "distance"))));
+        i = i + 1;
+    }
+    print("  → no distance-0 hit, so no alpha-twin in the corpus (order of non-zero distances is noise)");
+
+    section("Bonus: fuzzy similarity via omc_code_similarity");
+    h s_loss = omc_code_similarity(q1, arr_get(corpus, 0));  # alpha-equiv → ~1.0
+    h s_unrelated = omc_code_similarity(q1, arr_get(corpus, 7));  # factorial
+    print(concat_many("  loss-vs-loss(renamed): ", to_string(s_loss)));
+    print(concat_many("  loss-vs-factorial    : ", to_string(s_unrelated)));
+    print("  → Jaccard ordering IS meaningful (unlike raw hash distance).");
+
+    print("");
+    print("=== End: dedup via find_similar (exact-match), ordering via code_similarity ===");
+}
+
+main();
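The bonus section relies on Jaccard similarity over canonical tokens. As a rough sketch of that metric only, the hypothetical function below computes Jaccard over two token slices, assuming tokenization and canonicalization have already happened; it is not OMC's `omc_code_similarity` implementation.

```rust
use std::collections::HashSet;

/// Jaccard similarity |A ∩ B| / |A ∪ B| over the distinct tokens of two inputs.
/// Convention assumed here: two empty inputs count as identical (1.0).
pub fn jaccard(a: &[&str], b: &[&str]) -> f64 {
    let set_a: HashSet<&str> = a.iter().copied().collect();
    let set_b: HashSet<&str> = b.iter().copied().collect();
    let union = set_a.union(&set_b).count();
    if union == 0 {
        return 1.0;
    }
    let intersection = set_a.intersection(&set_b).count();
    intersection as f64 / union as f64
}
```

Unlike a hash difference, this value moves smoothly as tokens are shared or dropped, which is why it can order non-equal entries.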
Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
+# Substrate-distance code retrieval — content-addressed code search.
+
+fn assert_eq(actual, expected, msg) {
+    if actual != expected {
+        test_record_failure(msg + ": expected " + to_string(expected) + " got " + to_string(actual));
+    }
+}
+
+fn assert_true(cond, msg) { if !cond { test_record_failure(msg); } }
+
+# Distance 0 for exact alpha-match
+fn test_alpha_equivalent_matches_first() {
+    h corpus = [
+        "fn unrelated() { return 99; }",
+        "fn f(a) { return a + 1; }",
+        "fn another() { return 42; }",
+    ];
+    h ranked = omc_find_similar("fn f(x) { return x + 1; }", corpus);
+    h first = arr_get(ranked, 0);
+    assert_eq(dict_get(first, "index"), 1, "alpha-equivalent at index 1");
+    assert_eq(dict_get(first, "distance"), 0, "distance 0");
+}
+
+# Top-k limit respected
+fn test_top_k_limit() {
+    h corpus = [
+        "fn a() { return 1; }",
+        "fn b() { return 2; }",
+        "fn c() { return 3; }",
+        "fn d() { return 4; }",
+        "fn e() { return 5; }",
+    ];
+    h ranked = omc_find_similar("fn a() { return 1; }", corpus, 3);
+    assert_eq(arr_len(ranked), 3, "top-3 respected");
+}
+
+# All results when no top_k
+fn test_full_ranking() {
+    h corpus = [
+        "fn a() { return 1; }",
+        "fn b() { return 2; }",
+        "fn c() { return 3; }",
+    ];
+    h ranked = omc_find_similar("fn a() { return 1; }", corpus);
+    assert_eq(arr_len(ranked), 3, "full ranking");
+}
+
+# Ranking is ascending distance
+fn test_ascending_distance() {
+    h corpus = [
+        "fn similar(x) { return x + 1; }",
+        "fn totally_different() { return arr_softmax(arr_neg([1.0, 2.0, 3.0])); }",
+        "fn close(y) { return y + 1; }",
+    ];
+    h ranked = omc_find_similar("fn similar(x) { return x + 1; }", corpus);
+    h r0 = arr_get(ranked, 0);
+    h r1 = arr_get(ranked, 1);
+    h r2 = arr_get(ranked, 2);
+    assert_true(dict_get(r0, "distance") <= dict_get(r1, "distance"), "0 ≤ 1");
+    assert_true(dict_get(r1, "distance") <= dict_get(r2, "distance"), "1 ≤ 2");
+}
+
+# Empty corpus is fine
+fn test_empty_corpus() {
+    h ranked = omc_find_similar("fn f() {}", []);
+    assert_eq(arr_len(ranked), 0, "empty");
+}
+
+# Singleton corpus
+fn test_singleton_corpus() {
+    h ranked = omc_find_similar("fn f() {}", ["fn g() { return 1; }"]);
+    assert_eq(arr_len(ranked), 1, "one entry");
+    h r = arr_get(ranked, 0);
+    assert_eq(dict_get(r, "index"), 0, "index 0");
+}
+
+# Self-match in corpus is distance 0
+fn test_self_match() {
+    h q = "fn loss(p, t) { return (p - t) * (p - t); }";
+    h ranked = omc_find_similar(q, [q]);
+    h r = arr_get(ranked, 0);
+    assert_eq(dict_get(r, "distance"), 0, "self → 0");
+}
+
+# Rename-only match in corpus → distance 0
+fn test_rename_match() {
+    h q = "fn loss(p, t) { return (p - t) * (p - t); }";
+    h renamed = "fn loss(pred, target) { return (pred - target) * (pred - target); }";
+    h ranked = omc_find_similar(q, [renamed]);
+    assert_eq(dict_get(arr_get(ranked, 0), "distance"), 0, "rename → 0");
+}
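Several of these tests rest on renamed code canonicalizing to the same form. The toy sketch below shows one way that invariance could work: every non-keyword identifier is replaced by a placeholder numbered by first appearance, and whitespace is dropped. `toy_canonicalize` and its keyword list are assumptions for illustration; OMC's real `canonicalize` is not shown in this commit and presumably handles comments, scoping, and builtin names, which this sketch does not.

```rust
use std::collections::HashMap;

/// Words kept verbatim; assumed from the OMC snippets in this commit.
const KEYWORDS: &[&str] = &["fn", "return", "if", "while", "h"];

/// Toy alpha-normalizer: identifiers become v0, v1, ... in order of first
/// appearance; digit runs and punctuation are kept; whitespace is dropped.
pub fn toy_canonicalize(src: &str) -> String {
    let chars: Vec<char> = src.chars().collect();
    let mut names: HashMap<String, usize> = HashMap::new();
    let mut out: Vec<String> = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c.is_whitespace() {
            i += 1;
        } else if c.is_alphabetic() || c == '_' {
            // Consume a whole identifier or keyword.
            let start = i;
            while i < chars.len() && (chars[i].is_alphanumeric() || chars[i] == '_') {
                i += 1;
            }
            let word: String = chars[start..i].iter().collect();
            if KEYWORDS.contains(&word.as_str()) {
                out.push(word);
            } else {
                let next_id = names.len();
                let id = *names.entry(word).or_insert(next_id);
                out.push(format!("v{}", id));
            }
        } else if c.is_ascii_digit() {
            // Consume a whole integer literal as one token.
            let start = i;
            while i < chars.len() && chars[i].is_ascii_digit() {
                i += 1;
            }
            out.push(chars[start..i].iter().collect());
        } else {
            out.push(c.to_string());
            i += 1;
        }
    }
    out.join(" ")
}
```

Hashing the output of a normalizer like this, rather than the raw source, is what would make a rename-only variant land on distance 0.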

omnimcode-core/src/code_intel.rs

Lines changed: 56 additions & 0 deletions
@@ -433,6 +433,44 @@ pub fn diff(a: &str, b: &str) -> Result<CodeDiff, String> {
     Ok(diff)
 }
 
+/// Match against a corpus of code chunks. Returns
+/// Vec<(index_into_corpus, distance)> sorted by ascending distance.
+///
+/// **Honest framing**: distance == 0 means the corpus entry is
+/// alpha-equivalent to `query` (same canonical form). Distance > 0
+/// means "not equivalent" — but the *magnitude* of that distance is
+/// essentially noise, because fnv1a hashes don't preserve a "nearness"
+/// metric. Two programs that are structurally close can have wildly
+/// different hash diffs; two programs that are structurally far apart
+/// can have a small one. Treat as exact-match dedup, not as fuzzy
+/// similarity ranking.
+///
+/// What Python's hash() can't do that this can: the *exact-match*
+/// case is invariant under renames / whitespace / comments. Python's
+/// hash(source) is sensitive to all three. For true fuzzy similarity,
+/// use `omc_code_similarity` (Jaccard over canonical token IDs).
+pub fn find_similar(query: &str, corpus: &[String]) -> Result<Vec<(usize, i64)>, String> {
+    let canon_q = crate::canonical::canonicalize(query)
+        .map_err(|e| format!("find_similar: query canonicalize: {}", e))?;
+    let (_, raw_q, _) = crate::tokenizer::code_hash(&canon_q);
+    let mut scored: Vec<(usize, i64)> = Vec::with_capacity(corpus.len());
+    for (i, c) in corpus.iter().enumerate() {
+        match crate::canonical::canonicalize(c) {
+            Ok(canon_c) => {
+                let (_, raw_c, _) = crate::tokenizer::code_hash(&canon_c);
+                let d = raw_q.saturating_sub(raw_c).saturating_abs(); // saturate instead of overflowing on far-apart hashes
+                scored.push((i, d));
+            }
+            Err(_) => {
+                // Unparseable corpus entries get worst-case distance.
+                scored.push((i, i64::MAX));
+            }
+        }
+    }
+    scored.sort_by_key(|(_, d)| *d);
+    Ok(scored)
+}
+
 /// Quick metrics: substrate score + complexity + size all in one shot.
 /// Computed in one parse-and-canonicalize pass each.
 pub fn quick_metrics(source: &str) -> Result<std::collections::BTreeMap<String, f64>, String> {
@@ -507,4 +545,22 @@ mod tests {
         let b = "fn f(x) { return arr_softmax(arr_neg(x)); }";
         assert!(similarity(a, b).unwrap() < 1.0);
     }
+
+    #[test]
+    fn find_similar_perfect_match_first() {
+        let q = "fn f(x) { return x + 1; }";
+        let corpus = vec![
+            "fn unrelated() { return 99; }".to_string(),
+            "fn f(a) { return a + 1; }".to_string(),
+        ];
+        let r = find_similar(q, &corpus).unwrap();
+        assert_eq!(r[0].0, 1);
+        assert_eq!(r[0].1, 0);
+    }
+
+    #[test]
+    fn find_similar_empty_corpus() {
+        let r = find_similar("fn f() {}", &[]).unwrap();
+        assert!(r.is_empty());
+    }
 }
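`find_similar` scans the whole corpus on every query. Since only the distance-0 case is meaningful, the same exact-match dedup could in principle be served by a map keyed on the canonical form. The sketch below is a hypothetical illustration of that design, not part of this commit; `normalize_ws` is a stand-in that only collapses whitespace, where a real store would key on the canonical hash.

```rust
use std::collections::HashMap;

/// Stand-in canonicalizer (assumption): collapses runs of whitespace only.
fn normalize_ws(src: &str) -> String {
    src.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Content-addressed store: each distinct canonical form is kept once.
#[derive(Default)]
pub struct CodeStore {
    index_by_key: HashMap<String, usize>,
    entries: Vec<String>,
}

impl CodeStore {
    pub fn new() -> Self {
        Self::default()
    }

    /// Inserts `src` unless an equivalent entry exists.
    /// Returns (index, true) for a new entry, (existing index, false) for a duplicate.
    pub fn insert(&mut self, src: &str) -> (usize, bool) {
        let key = normalize_ws(src);
        if let Some(&idx) = self.index_by_key.get(&key) {
            return (idx, false);
        }
        let idx = self.entries.len();
        self.entries.push(src.to_string());
        self.index_by_key.insert(key, idx);
        (idx, true)
    }

    /// Index of the entry equivalent to `src`, if any.
    pub fn lookup(&self, src: &str) -> Option<usize> {
        self.index_by_key.get(&normalize_ws(src)).copied()
    }
}
```

The trade-off is that a map answers only "is this in the corpus"; it gives no ordering at all, which matches the commit's position that the ordering of non-zero distances carries no information.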

omnimcode-core/src/compiler.rs

Lines changed: 1 addition & 0 deletions
@@ -298,6 +298,7 @@ impl Compiler {
             | "omc_completion_hint"
             | "omc_memory_keys" | "omc_help_all_category"
             | "omc_search_builtins"
+            | "omc_find_similar"
             // Forward-mode autograd duals (Track 2 — 2026-05-16)
             | "dual" | "dual_add" | "dual_sub"
             | "dual_mul" | "dual_div" | "dual_neg"

omnimcode-core/src/docs.rs

Lines changed: 27 additions & 0 deletions
@@ -1073,6 +1073,13 @@ pub const BUILTINS: &[BuiltinDoc] = &[
         example: "omc_code_metrics(src) // all stats at once",
         unique_to_omc: false,
     },
+    BuiltinDoc {
+        name: "omc_find_similar", category: "code_intel",
+        signature: "(query: string, corpus: string[], top_k?: int) -> dict[]",
+        description: "Content-addressed code lookup. Distance 0 = alpha-equivalent (exact match modulo cosmetic edits). Distance > 0 means 'not equivalent' but the magnitude isn't a true similarity metric (fnv1a hashes don't preserve nearness). Use as exact-match dedup, not as fuzzy ranking. Python's hash() can't even do the exact-match case because it's formatting-sensitive.",
+        example: "omc_find_similar(q, corpus) // [{index, distance}] — index of any distance-0 hit is the alpha-equiv match",
+        unique_to_omc: true,
+    },
     // ---- LLM workflow bundles ----
     BuiltinDoc {
         name: "omc_cheatsheet", category: "llm_workflow",
@@ -1749,6 +1756,26 @@ pub fn render_full_reference() -> String {
         "**OMC-unique**: {} (no direct Python/NumPy equivalent — these are why you reach for OMC over numpy)\n\n",
         unique_count
     ));
+
+    // ---- LLM-onboarding section. Front-and-center so a fresh
+    // LLM session sees the right "first 5 calls" before
+    // reaching for grep.
+    out.push_str("---\n\n");
+    out.push_str("## 🤖 For LLMs reading this: first 5 calls to make\n\n");
+    out.push_str("This reference is grep-able, but OMC also exposes runtime\n");
+    out.push_str("introspection — usually faster than scanning the doc:\n\n");
+    out.push_str("1. **`omc_search_builtins(\"<topic>\")`** — substring search across name + description.  \n");
+    out.push_str("   Best first call when you know *what* but not *which name*.\n\n");
+    out.push_str("2. **`omc_help(\"<name>\")`** — returns a dict with signature + description + example + category + unique_to_omc.\n");
+    out.push_str("   Use after `omc_search_builtins` narrows the field.\n\n");
+    out.push_str("3. **`omc_explain_error(\"<error message>\")`** — pattern-match against the 970+ curated catalog. Returns explanation + cause + one-line fix.\n");
+    out.push_str("   ALWAYS call this when an OMC program errors. Don't guess.\n\n");
+    out.push_str("4. **`omc_did_you_mean(\"<typo>\")`** — suggest the nearest known names by edit distance. Use when `omc_help` returns `found: 0`.\n\n");
+    out.push_str("5. **`omc_bootstrap_pack()`** — returns a ~20KB Markdown doc with categorized cheatsheets + Python → OMC translation table.\n");
+    out.push_str("   Load this once at session start instead of repeated grep.\n\n");
+    out.push_str("Other high-value calls: `omc_unique_builtins()` (the OMC-only surface), `omc_python_translation()` (Python↔OMC table),\n");
+    out.push_str("`omc_cheatsheet(\"<topic>\")` (markdown per category), `omc_canonical_hash(code)` / `omc_id(code)` (semantic memory keys for code regions).\n\n");
+    out.push_str("**Common gotcha**: don't re-define OMC builtins from scratch — `is_prime`, `arr_softmax`, `arr_resonance_vec`, etc. all ship. Always `omc_search_builtins` first.\n\n");
     out.push_str("---\n\n");
     out.push_str("## Categories\n\n");
     for cat in categories() {

omnimcode-core/src/interpreter.rs

Lines changed: 31 additions & 0 deletions
@@ -7751,6 +7751,37 @@ impl Interpreter {
                     Err(e) => Err(format!("omc_id: {}", e)),
                 }
             }
+            "omc_find_similar" => {
+                // omc_find_similar(query, corpus[], top_k?) → [{index, distance}, ...]
+                // ranked closest-first by canonical-hash distance.
+                if args.len() < 2 {
+                    return Err("omc_find_similar requires (query, corpus[], top_k?)".to_string());
+                }
+                let query = self.eval_expr(&args[0])?.to_display_string();
+                let corpus_v = self.eval_expr(&args[1])?;
+                let corpus: Vec<String> = if let Value::Array(arr) = corpus_v {
+                    arr.items.borrow().iter()
+                        .map(|x| x.to_display_string())
+                        .collect()
+                } else {
+                    return Err("omc_find_similar: corpus must be a string array".to_string());
+                };
+                let ranked = crate::code_intel::find_similar(&query, &corpus)
+                    .map_err(|e| format!("omc_find_similar: {}", e))?;
+                // Optional 3rd arg = top_k (default = full list); values below 1 are clamped to 1.
+                let top_k = if args.len() >= 3 {
+                    self.eval_expr(&args[2])?.to_int().max(1) as usize
+                } else { ranked.len() };
+                let out: Vec<Value> = ranked.iter().take(top_k)
+                    .map(|(idx, dist)| {
+                        let mut map = std::collections::BTreeMap::new();
+                        map.insert("index".to_string(), Value::HInt(HInt::new(*idx as i64)));
+                        map.insert("distance".to_string(), Value::HInt(HInt::new(*dist)));
+                        Value::dict_from(map)
+                    })
+                    .collect();
+                Ok(Value::Array(HArray::from_vec(out)))
+            }
             "omc_code_diff" => {
                 // Structural diff: returns {added, removed, modified, unchanged}.
                 // Compared after canonicalization so renames don't show.
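The `top_k` handling in the `omc_find_similar` arm has two edge cases worth spelling out: a request below 1 is clamped to 1, and a request larger than the corpus simply returns everything. The standalone sketch below mirrors that `.max(1)` plus `.take(top_k)` logic; `effective_top_k` and `take_top` are hypothetical helper names, not functions in the interpreter.

```rust
/// Mirrors the interpreter's rule: no top_k means "all"; otherwise at least 1.
pub fn effective_top_k(requested: Option<i64>, ranked_len: usize) -> usize {
    match requested {
        Some(k) => k.max(1) as usize, // 0 and negatives are clamped to 1
        None => ranked_len,
    }
}

/// Returns the first `top_k` ranked items; `take` stops early on short input.
pub fn take_top<T: Clone>(ranked: &[T], requested: Option<i64>) -> Vec<T> {
    let k = effective_top_k(requested, ranked.len());
    ranked.iter().take(k).cloned().collect()
}
```

One consequence of the clamp is that `top_k = 0` cannot be used to ask for an empty result; a caller that wants that has to skip the call.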
