Commit 0d4ceb9

omc-grep: alpha-rename-invariant code archaeology CLI
A standalone binary that walks a tree, extracts every top-level fn, canonicalizes each one, and clusters by canonical hash. The new primitive is `--body-only` mode: hash only the fn body (drop name and signature) to find fns with IDENTICAL code under DIFFERENT names, something a text-grep, ast-grep, or tree-sitter query can't do without already knowing the body to look for.

Findings on OMC's own examples tree (151 files, 2388 fns):
- 31.7% redundancy (name-sensitive canonical hash)
- 33.0% redundancy (--body-only)
- largest cluster: assert_eq @ 64 copies (test helper)
- alpha-renamed clusters surfaced by --body-only that the name-sensitive pass missed include:
    is_digit / is_digit_b / is_digit_t (19 fns, 3 names)
    is_alpha / is_alpha_b (16 fns)
    tkind / tok_kind (15 fns, refactor leftover)
    arr_concat / arr_concat_b (14 fns)
    _bucket_discrete / endpoint_bucket / status_bucket (5 fns, 3 unrelated names)

The bucket-family cluster is the proof case: three domain-specific names that share only the generic word "bucket", but whose canonical bodies match exactly. That's only findable via substrate-canonical addressing.

Implementation:
- omnimcode-cli/src/bin/omc_grep.rs (new bin target)
- extract_top_level_fns made pub in omnimcode-core
- Skips target/, node_modules/, .git/, __pycache__/, omc_modules/
- Flags: --body-only, --near N, --min-cluster K
- docs/omc_grep.md with the findings table

Builds without JIT or Python deps (clean omnimcode-core only).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 84a4a16 commit 0d4ceb9

5 files changed

Lines changed: 502 additions & 3 deletions

README.md

Lines changed: 4 additions & 1 deletion
@@ -34,6 +34,8 @@ These are concrete, present-in-the-code features, not aspirations:
 
 - **Substrate-routed harmonic libraries.** `harmonic_anomaly` beats scikit-learn's IsolationForest **10/10 vs 7/10** on multi-dim credential-stuffing detection (the structural-anomaly regime).
 
+- **`omc-grep`: alpha-rename-invariant duplicate finder.** A standalone CLI ([`docs/omc_grep.md`](docs/omc_grep.md)) that walks a tree, extracts every top-level fn, canonicalizes each one, and clusters by canonical hash. `--body-only` mode strips the fn name and signature so duplicates with *different names* surface, something a text-grep, ast-grep, or tree-sitter query can't do without already knowing the body to look for. On OMC's own examples tree (151 files / 2388 fns): **31.7%** redundancy with name-sensitive hashing, **33.0%** with body-only, surfacing renamed-but-identical fns like `_bucket_discrete` / `endpoint_bucket` / `status_bucket`, whose names share only the generic word `bucket`.
+
 - **Substrate-keyed code codec + compressed substrate-signed messaging.** `omc_codec_encode` produces a sampled-token payload addressed by the canonical AST hash (invariant under whitespace, comments, alpha-rename). `omc_codec_decode_lookup` returns the exact library entry on hash match. `omc_msg_sign_compressed` / `omc_msg_recover_compressed` carry the codec payload inside the substrate-signed wire format with lossless library recovery and full signature integrity. **Wire-byte sizing is honest**: token-count compression is ~N×, but wire-byte savings only appear at payloads ≳500 B with N≥8 (single-message). The always-on value is **library-lookup recovery**: alpha-rename invariant content addressing on the receiver, with no shared key. 13 tests pass ([`test_codec.omc`](examples/tests/test_codec.omc), [`test_compressed_messaging.omc`](examples/tests/test_compressed_messaging.omc)). See [`experiments/seed_expansion/FINDINGS.md`](experiments/seed_expansion/FINDINGS.md).
 
 ---
@@ -262,7 +264,7 @@ fn coherent_loop(n) {
 |---|---|
 | `omnimcode-core/` | Parser, AST, interpreter, bytecode VM, substrate (`phi_pi_fib`), HBit, harmonic types, 50+ substrate builtins, substrate-routed heal pass |
 | `omnimcode-codegen/` | LLVM-backed JIT, dual-band lowerer, L1.6 array bridges, 22 harmonic-primitive intrinsics (table-driven) |
-| `omnimcode-cli/` | Standalone binary (`omnimcode-standalone`) + `omc-bench` |
+| `omnimcode-cli/` | Standalone binary (`omnimcode-standalone`) + `omc-bench` + `omc-grep` |
 | `omnimcode-wasm/` | WebAssembly target (no LLVM, no Python) |
 | `omnimcode-lsp/` | LSP server for editor integration |
 | `omnimcode-gdextension/` | Godot 4 GDExtension binding |
@@ -351,6 +353,7 @@ Submit a package: PR an entry to [`registry/index.json`](registry/index.json).
 | **Self-healing pass (7 classes, substrate-routed typo)** | shipped, `OMC_HEAL=1`, **10× typo lookup**, 16 tests, per-class pragmas |
 | **Substrate-keyed code codec + compressed messaging** | **shipped**, `omc_codec_encode/decode_lookup` + `omc_msg_sign_compressed/recover`, alpha-rename invariant, token-count ~N× (wire-byte breaks even at ≥500 B + N≥8); always-on win is library-lookup recovery; 13 tests, lossless on in-library content |
 | **Inline error-fix hints** | **shipped**, `Undefined function` errors now carry the suggested fn's signature inline (eliminates a separate `omc_help` round-trip after a typo) |
+| **`omc-grep`: alpha-rename-invariant code archaeology** | **shipped** ([docs/omc_grep.md](docs/omc_grep.md)): standalone CLI; on OMC's examples: 31.7% redundancy (name-sensitive), 33.0% (body-only); surfaces renamed-but-identical fns that text-grep and ast-grep can't catch |
 | Two-engine parity (tree-walk + VM) | shipped, 44/45 byte-identical |
 | Embedded CPython + callbacks | shipped, 6 wrapper libs |
 | WASM + LSP + GDExtension targets | shipped |

docs/omc_grep.md

Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
# omc-grep: canonical-hash code archaeology

> The new primitive: find duplicate fns under whitespace, comment,
> parameter-rename, **and (with `--body-only`) under entirely
> different fn names**. Text grep and name-based tools cannot do the last one.

## What it does

Walks a directory of `.omc` files, extracts every top-level fn,
canonicalizes each one (whitespace stripped, comments removed,
parameter binding normalized), and hashes the canonical form.

Reports:

- **EXACT clusters**: groups of 2+ fns with identical canonical
  hash. These are true duplicates regardless of whitespace, comment
  edits, or parameter renaming.
- **NEAR clusters** (with `--near N`): fn pairs sharing the same
  Fibonacci attractor whose canonical hashes differ by at most `N`.
  Use this to surface near-duplicates that diverged slightly.
- **Body-only mode** (with `--body-only`): drops the `fn NAME(...)`
  signature from the hash. This finds fns with identical bodies
  under DIFFERENT NAMES, the form of duplication that name-based
  tools and text grep can never catch.
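
The canonicalization described above can be sketched in a few lines of Rust. This is a toy stand-in, not the `omnimcode-core` implementation: the function name `canonicalize`, the `#` line-comment syntax, the `p0`, `p1`, ... placeholder names, and the sample fn bodies are all assumptions made for illustration, and only parameters (not locals) are renamed.

```rust
/// Toy canonicalizer for one `fn NAME(PARAMS) { BODY }` item.
/// Hypothetical sketch: not the omnimcode-core implementation.
fn canonicalize(src: &str, body_only: bool) -> Option<String> {
    // 1. Drop `#` line comments (the comment syntax is an assumption).
    let no_comments: String = src
        .lines()
        .map(|line| line.split('#').next().unwrap_or(""))
        .collect::<Vec<_>>()
        .join("\n");

    // 2. Tokenize into words and single-char punctuation; whitespace disappears.
    let mut tokens: Vec<String> = Vec::new();
    let mut word = String::new();
    for ch in no_comments.chars() {
        if ch.is_alphanumeric() || ch == '_' {
            word.push(ch);
        } else {
            if !word.is_empty() {
                tokens.push(std::mem::take(&mut word));
            }
            if !ch.is_whitespace() {
                tokens.push(ch.to_string());
            }
        }
    }
    if !word.is_empty() {
        tokens.push(word);
    }

    // 3. Expect the shape `fn NAME ( params ) ...`.
    if tokens.len() < 4 || tokens[0].as_str() != "fn" || tokens[2].as_str() != "(" {
        return None;
    }
    let close = tokens.iter().position(|t| t.as_str() == ")")?;
    if close < 3 {
        return None;
    }
    let params: Vec<String> = tokens[3..close]
        .iter()
        .filter(|t| t.as_str() != ",")
        .cloned()
        .collect();

    // 4. Rename each parameter to its position, so `x` and `code` both become `p0`.
    let rename = |t: &String| -> String {
        match params.iter().position(|p| p == t) {
            Some(i) => format!("p{i}"),
            None => t.clone(),
        }
    };
    let body: Vec<String> = tokens[close + 1..].iter().map(rename).collect();

    // 5. Name-sensitive mode keeps the fn name and arity; body-only mode drops them.
    let mut out: Vec<String> = Vec::new();
    if !body_only {
        out.push(format!("fn {}/{}", tokens[1], params.len()));
    }
    out.extend(body);
    Some(out.join(" "))
}
```

Under these assumptions, `fn endpoint_bucket(x) { return x % 3 }` and `fn status_bucket(code) { return code % 3 }` canonicalize to the same string when `body_only` is true and to different strings when it is false. Hashing the returned string then gives the cluster key.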

## What it found on OMC's own examples tree

```
omc-grep examples/
→ 151 files, 2388 fns, 1631 unique → 757 dupes (31.7% redundant)

omc-grep --body-only examples/
→ 151 files, 2388 fns, 1600 unique → 788 dupes (33.0% redundant)
```
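
The percentages in the output above follow from `dupes = fns - unique`. A minimal Rust sketch of that arithmetic, together with the grouping step behind `--min-cluster K`, is shown here; `exact_clusters` and `redundancy_pct` are hypothetical helper names, not functions taken from the tool.

```rust
use std::collections::HashMap;

/// Group fn names by canonical hash, keeping clusters with at least
/// `min_cluster` members. Largest clusters come first.
fn exact_clusters<'a>(fns: &[(&'a str, u64)], min_cluster: usize) -> Vec<Vec<&'a str>> {
    let mut by_hash: HashMap<u64, Vec<&'a str>> = HashMap::new();
    for &(name, hash) in fns {
        by_hash.entry(hash).or_default().push(name);
    }
    let mut clusters: Vec<Vec<&'a str>> = by_hash
        .into_values()
        .filter(|members| members.len() >= min_cluster)
        .collect();
    // Sort by size (descending), then by first member, for deterministic output.
    clusters.sort_by(|a, b| b.len().cmp(&a.len()).then_with(|| a[0].cmp(b[0])));
    clusters
}

/// Share of fns that duplicate an already-seen canonical hash, in percent.
fn redundancy_pct(total_fns: usize, unique_hashes: usize) -> f64 {
    let dupes = total_fns - unique_hashes;
    100.0 * dupes as f64 / total_fns as f64
}
```

With the reported counts, `redundancy_pct(2388, 1631)` is 757 / 2388, about 31.7%, and `redundancy_pct(2388, 1600)` is 788 / 2388, about 33.0%.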

Body-only mode flagged 31 more duplicate fns than the name-sensitive
pass (788 vs. 757). Clusters that only come together under body-only
hashing include:

| Cluster | Members | Distinct names |
|---|--:|---|
| `is_digit` family | 19 | `is_digit`, `is_digit_b`, `is_digit_t` |
| `is_alpha` family | 16 | `is_alpha`, `is_alpha_b` |
| `is_space` family | 16 | `is_space`, `is_space_b` |
| `tok_kind` / `tkind` | 15 | `tok_kind`, `tkind` (classic rename-during-refactor leftover) |
| `tok_value` / `tval` | 15 | `tok_value`, `tval` (same) |
| `arr_concat` / `arr_concat_b` | 14 | `arr_concat`, `arr_concat_b` (same) |
| 3-bucket family | 5 | `_bucket_discrete`, `endpoint_bucket`, `status_bucket` |
| counter family | 5 | `count_anom_hits`, `count_caught`, `count_hits` |

The 3-bucket family is the case that proves the value: three
domain-specific names (`_bucket_discrete`, `endpoint_bucket`,
`status_bucket`) wrapping the *same code*. The names share only the
generic word `bucket`, so a text-grep, ast-grep, or tree-sitter query
finds the family only if you already know the body to search for.
Hashing the canonical body surfaces it with no query at all.

## How the substrate makes this fast

The fnv1a → nearest-Fibonacci-attractor lookup gives every fn an
O(1) substrate address (`attractor_bucket`). Pre-bucketing all fns
by their attractor means near-duplicate detection probes only
within the same bucket, not the full corpus. Combined with the
`log_phi_pi_fibonacci(N)` substrate-search primitive available
inside OMC programs, the same architecture is intended to scale to
multi-million-fn corpora.
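
The bucketing idea can be sketched as follows. This is a sketch under stated assumptions, not the tool's code: it assumes the 64-bit FNV-1a variant, treats "nearest Fibonacci attractor" as the closest Fibonacci number to the hash, and measures distance as the absolute difference of two hashes. The names `fnv1a64`, `nearest_fibonacci`, and `near_pairs` are hypothetical; the real `attractor_bucket` may differ on every one of these points.

```rust
use std::collections::HashMap;

/// 64-bit FNV-1a over raw bytes (standard offset basis and prime).
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Nearest Fibonacci number to `x` (ties go to the smaller one).
/// Computed in u128 so the search can step past u64::MAX without overflow.
fn nearest_fibonacci(x: u64) -> u128 {
    let x = u128::from(x);
    let (mut a, mut b) = (0u128, 1u128);
    while b < x {
        let next = a + b;
        a = b;
        b = next;
    }
    // Here a <= x <= b, so both subtractions are safe.
    if x - a <= b - x { a } else { b }
}

/// Pairs of fns that fall in the same bucket and whose hashes differ by
/// at most `near`. Only fns inside one bucket are ever compared.
fn near_pairs<'a>(fns: &[(&'a str, u64)], near: u64) -> Vec<(&'a str, &'a str)> {
    let mut buckets: HashMap<u128, Vec<(&'a str, u64)>> = HashMap::new();
    for &(name, hash) in fns {
        buckets.entry(nearest_fibonacci(hash)).or_default().push((name, hash));
    }
    let mut pairs = Vec::new();
    for members in buckets.values() {
        for (i, &(name_a, hash_a)) in members.iter().enumerate() {
            for &(name_b, hash_b) in &members[i + 1..] {
                if hash_a.abs_diff(hash_b) <= near {
                    pairs.push((name_a, name_b));
                }
            }
        }
    }
    pairs.sort();
    pairs
}
```

The design point is the inner loop: pairwise comparison is quadratic only in the size of one bucket, so the cost depends on how evenly the hashes spread across attractors rather than on the size of the whole corpus.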

## Usage

```bash
omc-grep [OPTIONS] DIR

Options:
  --body-only       hash the fn body only (drop name + signature);
                    finds alpha-equivalent fns under DIFFERENT NAMES
  --near N          also report fn pairs within substrate distance N
                    (sharing same Fibonacci attractor) [default: 0]
  --min-cluster K   only report exact clusters with K+ members [default: 2]
  -h, --help        this help
```

Skips: `target/`, `node_modules/`, `.git/`, `__pycache__/`, `omc_modules/`.
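
A directory walk with this skip list might look like the sketch below. It uses only the Rust standard library; `collect_omc_files` and `SKIP_DIRS` are hypothetical names, and the actual traversal in `omc_grep.rs` may be structured differently.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Directory names pruned during the walk (the list documented above).
const SKIP_DIRS: [&str; 5] = ["target", "node_modules", ".git", "__pycache__", "omc_modules"];

/// Recursively collect `.omc` files under `root`, never descending into
/// a skipped directory.
fn collect_omc_files(root: &Path, out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(root)? {
        let entry = entry?;
        let path = entry.path();
        // `DirEntry::file_type` does not follow symlinks, which avoids link loops.
        let file_type = entry.file_type()?;
        if file_type.is_dir() {
            let name = entry.file_name();
            if SKIP_DIRS.iter().any(|skip| name.to_str() == Some(*skip)) {
                continue;
            }
            collect_omc_files(&path, out)?;
        } else if file_type.is_file()
            && path.extension().and_then(|ext| ext.to_str()) == Some("omc")
        {
            out.push(path);
        }
    }
    Ok(())
}
```

Pruning at the directory level, rather than filtering file paths afterwards, means large trees such as `target/` are never read at all.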
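
## Building

```bash
PYO3_USE_ABI3_FORWARD_COMPATIBILITY=1 cargo build --release --bin omc-grep
./target/release/omc-grep DIR
```

The tool itself has no JIT or Python dependencies: it is a pure
tree-walk over the canonical form. (The
`PYO3_USE_ABI3_FORWARD_COMPATIBILITY` variable presumably matters only
because the crate's default `python-embed` feature is still enabled
in this build.) Expect roughly 30 s to build and under 1 s to scan
150 files.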

## What it doesn't do (yet)

- **Non-OMC languages.** Phase 2 will add Python via the stdlib `ast`
  module (no tree-sitter dependency). After that: JS/TS via the
  tree-sitter bindings.
- **Refactor-suggest mode.** Currently reports clusters; it doesn't
  propose which member is the canonical form and doesn't generate
  rename/import-rewrite diffs. Easy to add, but it requires a
  per-cluster "winner" heuristic (oldest file? most-used name?
  shortest? linted highest?).
- **Cross-repo dedupe.** Walks one tree. Multi-tree mode
  (`omc-grep A B C/`) would need a per-root prefix for the file column.

These are all worth doing, but each is a separable extension on
top of the working core.

omnimcode-cli/Cargo.toml

Lines changed: 7 additions & 0 deletions
@@ -21,6 +21,13 @@ name = "omc-bench"
 path = "src/bench.rs"
 required-features = ["llvm-jit"]
 
+# Code-archaeology tool: walks a tree, extracts top-level fns, clusters
+# by canonical hash + substrate distance. The alpha-rename-invariant
+# duplicate finder. Doesn't depend on JIT or Python.
+[[bin]]
+name = "omc-grep"
+path = "src/bin/omc_grep.rs"
+
 [features]
 default = ["python-embed"]
 # CPython embedding for `py_*` builtins. Forwards to core.
