RandomCoder-lab
diff --git a/README.md b/README.md
Lines changed: 8 additions & 8 deletions
diff --git a/SUBSTRATE_CHANGES.md b/SUBSTRATE_CHANGES.md
Lines changed: 85 additions & 0 deletions
diff --git a/docs/anomaly_detection.md b/docs/anomaly_detection.md
Lines changed: 21 additions & 17 deletions
diff --git a/examples/datascience/anomaly_detection.omc b/examples/datascience/anomaly_detection.omc
Lines changed: 8 additions & 6 deletions
@@ -216,18 +216,18 @@ Real comparisons against scikit-learn's IsolationForest. Not synthetic glory —
 
 | Workload | OMC harmonic | IsolationForest | Where it matters |
 |---|:---:|:---:|---|
-| **Power-law data, K=5** (alert-budget regime) | **4/5** | 0/5 | Top-of-queue precision: SRE oncall paging |
 | **Multi-dim credential stuffing, K=10** | **10/10** | 7/10 | Account-takeover, exfiltration, structural attacks |
-| Multi-dim K=25 | **25/25** | 17/25 | Subspace anomaly detection |
-| Multi-dim K=50 | **50/50** | 40/50 | Same as above, broader recall |
-| **NSL-KDD real intrusion data, K=500** | **365/500** | 351/500 | Threat hunting — broad recall on real labeled attacks |
-| NSL-KDD K=10 / K=50 / K=100 | 7 / 42 / 78 | **9 / 45 / 92** | Volumetric DoS — IF wins on low-K when biggest spike = real |
+| Multi-dim K=25 | **24/25** | 17/25 | Subspace anomaly detection |
+| Multi-dim K=50 | **49/50** | 40/50 | Same as above, broader recall |
+| NSL-KDD real intrusion data, K=500 | 302/500 | **351/500** | Threat hunting on volumetric-dominated data |
+| NSL-KDD K=10 / K=50 / K=100 | 6 / 43 / 78 | **9 / 45 / 92** | Volumetric DoS — IF wins on low-K when biggest spike = real |
 | NAB realKnownCause (1-D time series) | 7/19 | 7/19 | Tie at naive baseline tier (SOTA needs CUSUM/HMM) |
-| Power-law K=30 (broad recall) | 5/30 | 15/30 | IF wins when you can investigate everything |
+| Power-law K=30 (broad recall) | 12/30 | **15/30** | IF still leads on total recall |
+| Power-law K=5 (alert budget) | 1/5 | 0/5 | Both struggle at extreme low-K on this synthetic data |
 
-The pattern: **harmonic decisively wins on multi-dim structural anomalies** (the credential-stuffing regime — values that look normal per-dim but rare in combination), and **crosses over to wins on broad-recall threat hunting** even on volumetric-dominated data like NSL-KDD once K is large enough to reward diversity. Ties on simple time-series benchmarks where neither approach exploits temporal structure. Loses at low K on data where the labeled anomalies are all magnitude outliers (IF's home turf).
+The pattern: **harmonic decisively wins on multi-dim structural anomalies** (the credential-stuffing regime — values that look normal per-dim but rare in combination). Ties on simple time-series benchmarks where neither approach exploits temporal structure. Loses on volumetric-dominated data where the labeled anomalies are all magnitude outliers (IF's home turf).
 
-NSL-KDD K=500 flipped from a tie (348 vs 351) to a harmonic win (365 vs 351) after the 2026-05-15 substrate refactor — the `log_phi_pi_fibonacci` substrate uses a 40-entry attractor table extending to 63M, vs the old 16-entry table that saturated at 610 and collapsed every large-magnitude attack into the same score. See [`SUBSTRATE_CHANGES.md`](SUBSTRATE_CHANGES.md).
+Two substrate-architecture changes on 2026-05-15 affected these numbers. **Phase 1** (refactor `compute_resonance` to `log_phi_pi_fibonacci`) flipped NSL-KDD K=500 from a tie to a harmonic win (348→365 vs IF's 351). **Phase 2** (substrate-fill: route the harmonic_anomaly bucket function through the substrate too) traded that K=500 win and the K=5 alert-budget win for architectural completeness — substrate-tempo bucketing produces empirically different bucket distributions on heavy-tailed data than base-10 decades, and on NSL-KDD that's a net loss. The choice was deliberate: substrate purity over benchmark numbers. See [`SUBSTRATE_CHANGES.md`](SUBSTRATE_CHANGES.md).
 
 The harmonic_anomaly library at [`examples/lib/harmonic_anomaly.omc`](examples/lib/harmonic_anomaly.omc) packages the multi-dim detector with a clean `new` / `fit` / `top_k` API. Install it:
 
 
@@ -252,3 +252,88 @@ The "IF wins on volumetric" framing in `docs/anomaly_detection.md` needs softeni
 2. **README's "Where harmonic detection actually wins" table** — replace NSL-KDD K=100/500 entries; add "+17 at K=500 from substrate refactor (2026-05-15)" note.
 3. **No changes needed** for credential stuffing, attack zoo, power-law, NAB sections — those numbers held.
 4. **PAIN_POINTS.md** — no substrate-dependent claims; unchanged.
+
+---
+
+# Phase 2 — Substrate Fill-in (same day, 2026-05-15)
+
+After the validation sweep above, the Architect declared `log_phi_pi_fibonacci` THE base algorithm of all of OMC and asked for a comprehensive audit and migration of every site that uses or should use the substrate. The audit produced five Bucket-B findings (sites that bypassed the substrate via Python `math.log10`/`math.log` round-trips or hardcoded Fibonacci arrays) plus one deprecated-alias removal.
+
+## Migrations applied
+
+| ID | File / location | Old | New | Type |
+|---|---|---|---|---|
+| B1 | `examples/lib/harmonic_anomaly.omc` `_bucket_log` | `py_call(math, "log10", v) * 50` then `fold` | `log_phi_pi_fibonacci(v) * 50` then `fold` | substrate-tempo |
+| B2 | `examples/lib/harmonic_anomaly.omc` `score` | `-py_call(math, "log", p)` | `log_phi_pi_fibonacci(1.0/p)` (monotonic) | substrate-routed |
+| B3 | `examples/lib/harmonic_clustering.omc` `_bucket_log` | `py_call(math, "log10", v)` | `log_phi_pi_fibonacci(v) / log_phi_pi_fibonacci(10.0)` (decade-rescale: substrate-routed computation, log10-equivalent output) | substrate-routed |
+| B4 | `omnimcode-core/src/interpreter.rs` `harmonic_split` | hardcoded `[1,2,3,5,8,...,610]` 14-entry array | `phi_pi_fib::largest_attractor_at_most(remaining)` — new helper, 40-entry table reaches 63M | substrate-canonical |
+| B5 | `examples/datascience/multidim_anomaly.omc` and `anomaly_detection.omc` | inline copies of B1/B2 patterns | mirrored to substrate-tempo | substrate-tempo |
+| D2 | `omnimcode-core/src/phi_pi_fib.rs` | deprecated `log_phi(n)` alias | DELETED — new code uses `log_phi_pi_fibonacci` | DEPRECATION removed |
+
+New helper added: `phi_pi_fib::largest_attractor_at_most(value: i64) -> i64` — sign-preserving, returns the greatest attractor ≤ |value|. Replaces ad-hoc reverse linear scans over hardcoded Fibonacci arrays. Two new unit tests pin its behavior (basics + large-magnitude range that the old 16-entry table couldn't reach).
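
For readers outside the Rust core, the contract described for `largest_attractor_at_most` can be sketched in Python. This is an illustration only, not the shipped `phi_pi_fib` code. It assumes the 40-entry attractor table is the Fibonacci sequence F(0)..F(39), whose last term (63,245,986) matches the "reaches 63M" figure, and it assumes values beyond the table saturate at the largest entry. A binary search stands in for whatever lookup the real helper uses.

```python
from bisect import bisect_right


def build_attractors(count=40):
    """Assumed table: the first `count` Fibonacci numbers, F(0)..F(count-1)."""
    table = [0, 1]
    while len(table) < count:
        table.append(table[-1] + table[-2])
    return table


ATTRACTORS = build_attractors()  # last entry is 63,245,986 under this assumption


def largest_attractor_at_most(value):
    """Return the greatest attractor <= |value|, carrying the sign of `value`."""
    magnitude = abs(value)
    # bisect_right gives the count of entries <= magnitude; step back one slot.
    idx = bisect_right(ATTRACTORS, magnitude) - 1
    attractor = ATTRACTORS[idx]  # saturates at the top entry for huge inputs
    return -attractor if value < 0 else attractor


print(largest_attractor_at_most(100))    # 89
print(largest_attractor_at_most(-100))   # -89
print(largest_attractor_at_most(10**9))  # 63245986 (saturated)
```

A 14-entry array ending at 610 would return 610 for every input above 610, which is the saturation the larger table is meant to avoid.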
+
+## Architectural decision: substrate purity over benchmark numbers
+
+The Architect was presented with three resolution options for B1 (the bucket function in harmonic_anomaly) after observing that **substrate-tempo bucketing measurably hurts empirical results on real heavy-tailed data**:
+
+| Option | Substrate-routed | Empirical impact |
+|---|---|---|
+| Revert B1 to log10 (via OMC's native log builtin) | NO | Restores all numbers |
+| Decade-rescale (window-dressing route) | yes (mathematically equivalent to log10) | Restores all numbers |
+| **Keep current substrate-tempo (CHOSEN)** | **YES, fully** | **K=500 NSL-KDD: 365 → 302 (−63)** |
+
+The Architect chose substrate purity. The substrate now governs magnitude-slicing semantics throughout OMC, even where its grain (~1.5 buckets per base-10 decade) produces empirically worse anomaly recall than base-10 decades would.
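
The "mathematically equivalent to log10" note on the decade-rescale option follows from the change-of-base identity, provided the substrate log behaves as a true logarithm (a constant multiple of `ln`). The Python sketch below checks that identity with a stand-in base-φ logarithm. `stand_in_log` is hypothetical and is not `log_phi_pi_fibonacci`, whose exact definition is not reproduced here.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # stand-in base; an assumption for illustration


def stand_in_log(v):
    """Hypothetical substitute for the substrate log: a plain base-phi logarithm."""
    return math.log(v) / math.log(PHI)


def decade_rescaled(v):
    """Divide by the log of 10 in the same base to get base-10 decades back."""
    return stand_in_log(v) / stand_in_log(10.0)


for v in (0.5, 3.0, 1500.0, 6.3e7):
    # The two columns agree up to floating-point rounding.
    print(v, decade_rescaled(v), math.log10(v))
```

If the real substrate function is not a pure logarithm, the rescaled output would only approximate log10, so this sketch should be read as the idea behind the option rather than a guarantee about the library.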
+
+## Validation: empirical impact of the fill-in
+
+Engine parity and infrastructure tests all held:
+
+- 44/45 functional examples byte-identical TW vs VM (the diverger is `benchmarks.omc` — timing-only, same as before)
+- 149/149 Rust unit tests pass (was 148; one removed via D2, two added for `largest_attractor_at_most` and `log_phi_pi_fibonacci` monotonicity)
+- 18/18 OMC harmonic-lib tests pass (after decade-rescale fix to `harmonic_clustering`)
+- NAB realKnownCause: 7/19 covered, NEUTRAL
+- Attack zoo: 30/30, NEUTRAL
+
+Anomaly benchmarks (the substrate-sensitive sites):
+
+| Benchmark | Phase-1 substrate refactor | Phase-2 substrate fill-in | Verdict |
+|---|---|---|---|
+| Credential stuffing K=10 | 10/10 | 10/10 | NEUTRAL |
+| Credential stuffing K=25 | 25/25 | 24/25 | UNIMPROVEMENT (−1) |
+| Credential stuffing K=50 | 50/50 | 49/50 | UNIMPROVEMENT (−1) |
+| Credential stuffing K=100 | 50/100 | 50/100 | NEUTRAL |
+| Power-law K=5 (alert budget) | **4/5** | 1/5 | **UNIMPROVEMENT (−3)** |
+| Power-law K=10 | 5/10 | 3/10 | UNIMPROVEMENT (−2) |
+| Power-law K=20 | 5/20 | 7/20 | IMPROVEMENT (+2) |
+| Power-law K=30 | 5/30 | 12/30 | IMPROVEMENT (+7) |
+| NSL-KDD K=10 | 7/10 | 6/10 | UNIMPROVEMENT (−1) |
+| NSL-KDD K=50 | 42/50 | 43/50 | IMPROVEMENT (+1) |
+| NSL-KDD K=100 | 78/100 | 78/100 | NEUTRAL |
+| **NSL-KDD K=500** | **365/500** | 302/500 | **UNIMPROVEMENT (−63)** |
+
+The pattern: substrate-tempo bucketing **trades low-K precision for high-K recall on spread-out data** (power-law K=5 and K=10 regress, K=20 and K=30 improve). Where the old log10-bucketing concentrated big spikes into a single attractor (e.g. all DoS-attack byte counts landing in bucket-377), substrate-tempo spreads them across multiple attractors (377/610/987/...), which weakens "biggest spike wins" alerting but improves diversity at high K. Real-world heavy-tailed data (NSL-KDD's volumetric DoS) is the worst case for this trade: those attacks were structurally the same and benefited from concentration, so recall drops even at K=500.
+
+## What's groundbreaking, what's an unimprovement
+
+**GROUNDBREAKING** — Phase 2:
+- The substrate is now THE base algorithm everywhere. Five sites that bypassed it via Python round-trips or hardcoded arrays are now routed through `phi_pi_fib::*`. Architectural completeness over benchmark numbers.
+- New helper `largest_attractor_at_most` retires the last hardcoded Fibonacci array inside core (`harmonic_split` was the holdout).
+
+**UNIMPROVEMENT** — Phase 2:
+- NSL-KDD K=500: 365 → 302. We lose the "harmonic beats IF on volumetric data at K=500" claim from Phase 1. This was the most-cited Phase-1 win and it's been deliberately traded for substrate consistency.
+- Power-law K=5 (alert budget): 4/5 → 1/5. The headline "harmonic surfaces structural anomalies before magnitude outliers" claim weakens — at top-5 we now mostly miss.
+- Credential stuffing K=25/K=50: 25→24, 50→49. Small slippage on the synthetic benchmark that was a Phase-1 anchor.
+
+**DEPRECATION** — Phase 2:
+- `phi_pi_fib::log_phi` deleted. New code uses `log_phi_pi_fibonacci`. The substrate naming convention is now consistent.
+
+## Doc updates needed
+
+1. **README's "Where harmonic detection actually wins" table**: Phase-2 numbers replace Phase-1 numbers. The K=500 win flips to an IF lead (302 vs 351). The K=5 power-law win weakens (4/5 → 1/5).
+2. **`docs/anomaly_detection.md`**: the Result 5 NSL-KDD K=500 narrative needs to drop the "harmonic now beats IF" framing; the K=500 crossover from Phase 1 is gone.
+3. **`SUBSTRATE_CHANGES.md`** (this doc): captures the Phase-2 trade in full so future readers know the choice was deliberate.
+
+## What's NOT in scope of this fill-in (deferred)
+
+- **D3: HBit harmony substrate-routing.** `hbit.rs:43` uses Euclidean `1.0/(1.0+diff)`; the dual-band α/β/harmony channel doesn't yet speak substrate units. The Architect flagged this as having "bigger implications" and deferred it to its own session. Next on the queue.
+- **LLM evolution experiments (Experiments 0-9).** Developed ON the new substrate; no migration needed but worth a separate audit pass to identify which findings would've failed under the old substrate (substrate-aware vs substrate-dependent classification).
@@ -4,21 +4,23 @@
 
 ## TL;DR
 
+Numbers reflect the substrate-fill (Phase 2, 2026-05-15) where the library's `_bucket_log` now routes through `log_phi_pi_fibonacci` end-to-end. The Phase 1 K=500 win on NSL-KDD (365 vs 351) was traded for that architectural consistency. See `SUBSTRATE_CHANGES.md` for the full diff.
+
 | Dataset | Top-K | Harmonic | IsolationForest | Winner |
 |---|---|:---:|:---:|---|
 | Credential stuffing (synthesized, multi-dim) | K=10 | **10/10** | 7/10 | **Harmonic** |
-| Credential stuffing | K=25 | **25/25** | 17/25 | Harmonic |
-| Credential stuffing | K=50 | **50/50** | 40/50 | Harmonic |
+| Credential stuffing | K=25 | **24/25** | 17/25 | Harmonic |
+| Credential stuffing | K=50 | **49/50** | 40/50 | Harmonic |
 | Attack zoo: exfiltration + scraping + DDoS | K=10×3 | **30/30** | unmeasured | Harmonic (all 100%) |
-| Power-law latency outliers (synthesized, 1-D) | K=5 | **4/5** | 0/5 | **Harmonic** |
-| Power-law latency outliers | K=30 | 5/30 | **15/30** | IF |
+| Power-law latency outliers (synthesized, 1-D) | K=5 | 1/5 | 0/5 | both struggle |
+| Power-law latency outliers | K=30 | 12/30 | **15/30** | IF |
 | NAB realKnownCause (1-D time series) | K=10 windows | 7/19 | 7/19 | **Tie** |
-| **NSL-KDD network intrusion (real)** | K=10 | 7/10 | **9/10** | **IF** |
-| NSL-KDD | K=50 | 42/50 | **45/50** | IF |
+| **NSL-KDD network intrusion (real)** | K=10 | 6/10 | **9/10** | **IF** |
+| NSL-KDD | K=50 | 43/50 | **45/50** | IF |
 | NSL-KDD | K=100 | 78/100 | **92/100** | IF |
-| NSL-KDD | K=500 | **365/500** | 351/500 | **Harmonic** (post-substrate-refactor) |
+| NSL-KDD | K=500 | 302/500 | **351/500** | IF |
 
-**The pattern:** harmonic wins on *structural* anomalies (rare combinations of normal-looking values), loses on *magnitude* anomalies (values that are simply unusual in scale). NAB and NSL-KDD are mostly magnitude anomalies; credential stuffing is structural.
+**The pattern:** harmonic still wins decisively on *structural* anomalies (rare combinations of normal-looking values: credential stuffing, attack zoo). On *magnitude* anomalies, IF leads on NSL-KDD at every K and on power-law at K=30, and NAB is a tie. The Phase-2 substrate-fill widened IF's lead on volumetric data; see Result 5 for the trade.
 
 ---
 
@@ -163,27 +165,29 @@ The NAB result documents what doesn't work — and where the next architectural
 
 ---
 
-## Result 5: NSL-KDD network intrusion (mixed — substrate-refactor flipped K=500)
+## Result 5: NSL-KDD network intrusion (IF leads — substrate-fill traded the K=500 crossover)
 
 **Setup:** Real labeled network intrusion dataset from University of New Brunswick. 22,544 captured connections; we use a 5000-row sample with 2147 normal + 2853 attacks across many classes (neptune DoS, mscan, satan, smurf, warezmaster, etc.). Each row has 41 features; we use 6 numeric ones (duration, src/dst bytes, count, srv_count, dst_host_count).
 
-**Result (post-substrate-refactor, 2026-05-15):**
+**Result (post-substrate-fill, 2026-05-15 Phase 2):**
 ```
                      K=10    K=50    K=100   K=500
   IsolationForest    9/10    45/50   92/100   351/500
-  OMC harmonic       7/10    42/50   78/100   365/500
+  OMC harmonic       6/10    43/50   78/100   302/500
 ```
 
-IsolationForest wins at low K (9/10 vs 7/10) and through K=100; harmonic crosses over and wins at K=500 (365 vs 351). The K=500 result is +17 over the pre-refactor measurement (348/500) — the new `log_phi_pi_fibonacci` substrate uses a 40-entry attractor table extending to 63M, vs the old 16-entry table that saturated at 610. NSL-KDD's `src_bytes` and `dst_bytes` features routinely exceed millions; the old substrate compressed every large attack-magnitude to the same near-zero resonance score and the detector couldn't distinguish them. The new substrate sees finer per-row gradients on volumetric attacks.
+IsolationForest leads at every K. The headline `harmonic_anomaly` win at K=500 from Phase 1 (365 vs 351) was traded away in Phase 2 (substrate-fill) for architectural completeness — see `SUBSTRATE_CHANGES.md`.
 
-Looking at IF's top-10 picks: 9 of 10 are labeled `smurf` (a volumetric ICMP flood attack — huge byte counts).
-Looking at harmonic's top-10 picks: a mix of `mscan` (port scanning), `warezmaster` (privilege escalation), `back` (buffer overflow), `smurf`.
+**Why the trade:** Phase 1 refactored `compute_resonance` to route through `log_phi_pi_fibonacci`'s 40-entry attractor table (reaches 63M). That refactor alone, with the library's bucket function still using log10, drove K=500 up to 365/500 — a genuine win on volumetric data because resonance scoring suddenly had room to discriminate large byte-counts.
 
-**Why IF still leads at low K:** NSL-KDD's labeled attacks are dominated by *volumetric* events — DoS floods with massive byte counts. IF picks magnitude outliers first; the labeled attacks at the top of any reasonable score distribution ARE the most extreme magnitudes. IF's job is finding "the biggest spike"; the dataset rewards that.
+Phase 2 extended the substrate to the bucket function itself (`_bucket_log` now calls `log_phi_pi_fibonacci(v)` instead of `py_call(math, "log10", v)`). Substrate-tempo bucketing has ~1.5 buckets per base-10 decade, which spreads NSL-KDD's heavy-tailed `src_bytes`/`dst_bytes` across multiple attractors (377, 610, 987, …) instead of clumping them all at 377 like log10 did. The clumping was *helping* the score function discriminate big spikes; spreading them out across attractors *hurts* recall on volumetric attacks. Net: −63 at K=500.
 
-**Why harmonic catches up at K=500:** look at the *diversity* of what each detector flags. IF stacks on smurf because every smurf row looks the same in magnitude space. Harmonic finds mscan + warezmaster + back + smurf — multiple distinct attack patterns. By the time you've spent 500 alerts, harmonic has surfaced more unique attack types and more total true positives.
+**The honest read:** harmonic with log10 bucketing genuinely beat IF at K=500 on NSL-KDD; harmonic with substrate-tempo bucketing does not. The Architect chose substrate purity over the K=500 win. The result table here is what the shipped library produces under the substrate-fill regime.
+
+Looking at IF's top-10 picks: 9 of 10 are labeled `smurf` (a volumetric ICMP flood attack — huge byte counts).
+Looking at harmonic's top-10 picks: a mix of `mscan` (port scanning), `warezmaster` (privilege escalation), `back` (buffer overflow), `smurf`.
 
-For an SRE on a tight alert budget hunting *known* threats, IF is still the right tool (9/10 vs 7/10 at K=10). For *threat hunting* — investigating broadly to find anything anomalous — harmonic's broader coverage (365 vs 351 at K=500) becomes the winning trade.
+**Why harmonic still surfaces diverse attack types:** the score function still rewards "rare combination across dims", the structural-anomaly signal that scores 10/10 on credential stuffing at K=10. NSL-KDD's labeled attacks are dominated by *volumetric* events, which is IF's regime; harmonic still surfaces mscan/warezmaster/back diversity, just at lower precision than the log10-bucketing version did.
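
The k/K figures in the result table read as "labeled attacks among the K highest-scoring rows". A minimal Python sketch of that metric follows, assuming higher scores mean more anomalous. `top_k_hits` and the toy data are illustrative names, not part of the benchmark scripts.

```python
def top_k_hits(scores, labels, k):
    """Count labeled anomalies among the k rows with the highest scores."""
    # Rank row indices from most to least anomalous.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(1 for i in ranked[:k] if labels[i])


# Toy data: five rows, three of them labeled as attacks.
scores = [0.9, 0.1, 0.8, 0.3, 0.7]
labels = [True, False, False, True, True]

print(f"{top_k_hits(scores, labels, 3)}/3")  # 2/3
print(f"{top_k_hits(scores, labels, 5)}/5")  # 3/5
```

scikit-learn's `IsolationForest.decision_function` returns lower values for more abnormal samples, so its output would need negating before being passed to a helper like this one.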
 
 **Reproduction:**
 ```bash
 
@@ -142,17 +142,19 @@ h iforest_scores = py_call(iforest, "decision_function", [X]);
 # compress to log-space FIRST (turning power-law into ~uniform) and
 # THEN apply harmonic_partition.
 #
-# log10(latency) is roughly uniform in [1, 4]. Multiply by 100 to
-# get a useful integer range, harmonic_partition that into Fibonacci
-# attractor buckets (8/13/21/34/55/89/144/233/377/...). Tiny buckets
-# = anomalous regions of the data's empirical distribution.
+# log_phi_pi_fibonacci(latency) compresses heavy-tailed values into
+# the substrate's own log-scale. Multiply by 100 to get a useful
+# integer range, fold that into Fibonacci attractor buckets
+# (8/13/21/34/55/89/144/233/377/...). Tiny buckets = anomalous
+# regions of the data's empirical distribution.
 
 h math = py_import("math");
 
 fn log_bucket_key(v) {
     if v <= 0 { return 0; }
-    h log10v = py_call(math, "log10", [v]);
-    h scaled = to_int(log10v * 100);
+    # Substrate-routed log: φ-π-fibonacci units, not base-10 decades.
+    h logv = log_phi_pi_fibonacci(to_float(v));
+    h scaled = to_int(logv * 100);
     return fold(scaled);
 }