Substrate fill-in: log_phi_pi_fibonacci as THE base algorithm everywhere

After the substrate refactor (Phase 1) routed compute_resonance through
log_phi_pi_fibonacci, an audit found five sites still bypassing the
substrate via Python math.log10/math.log round-trips or hardcoded
Fibonacci arrays. Phase 2 closes those gaps.

Migrations:
- B1+B2: harmonic_anomaly.omc bucket + score now substrate-routed
  (log_phi_pi_fibonacci instead of py_call(math, log10/log))
- B3: harmonic_clustering.omc decade-rescale through the substrate
  (log_phi_pi_fibonacci(v) / log_phi_pi_fibonacci(10) — substrate
  computation, log10-equivalent decade output)
- B4: interpreter.rs harmonic_split drops its hardcoded 14-entry
  fibs[] array and routes through new phi_pi_fib helper
  largest_attractor_at_most(n) — sign-preserving, 40-entry table
  reaches 63M instead of saturating at 610
- B5: datascience demos mirror the library updates
- D2: deprecated log_phi alias deleted; new code uses
  log_phi_pi_fibonacci

Validation: 44/45 byte-identical TW vs VM (benchmarks.omc timing-only
diff, same as before), 149/149 Rust unit tests pass (net +1 from D2
removal + 2 new substrate tests), 18/18 OMC harmonic-lib tests pass.

Empirical impact (Architect chose substrate purity over benchmark
numbers): NSL-KDD K=500 went 365 -> 302, losing the Phase-1
"harmonic beats IF on volumetric data" claim. Power-law K=5 went
4/5 -> 1/5. Cred-stuffing K=25/K=50 each -1. Phase 2 traded those
wins for architectural completeness — substrate-tempo bucketing
(~1.5 buckets per base-10 decade) spreads heavy-tailed bytes across
multiple attractors instead of clumping at 377. README and
docs/anomaly_detection.md updated with the new numbers and an honest
explanation of the trade.

SUBSTRATE_CHANGES.md gets a "Phase 2 — Substrate Fill-in" section
capturing the audit, migrations, and the deliberate empirical
trade-off.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
| Multi-dim K=50 | **49/50** | 40/50 | Same as above, broader recall |
+| NSL-KDD real intrusion data, K=500 | 302/500 | **351/500** | Threat hunting on volumetric-dominated data |
+| NSL-KDD K=10 / K=50 / K=100 | 6 / 43 / 78 | **9 / 45 / 92** | Volumetric DoS — IF wins on low-K when biggest spike = real |
| NAB realKnownCause (1-D time series) | 7/19 | 7/19 | Tie at naive baseline tier (SOTA needs CUSUM/HMM) |
-| Power-law K=30 (broad recall) | 5/30 | 15/30 | IF wins when you can investigate everything |
+| Power-law K=30 (broad recall) | 12/30 | **15/30** | IF still leads on total recall |
+| Power-law K=5 (alert budget) | 1/5 | 0/5 | Both struggle at extreme low-K on this synthetic data |

-The pattern: **harmonic decisively wins on multi-dim structural anomalies** (the credential-stuffing regime — values that look normal per-dim but rare in combination), and **crosses over to wins on broad-recall threat hunting** even on volumetric-dominated data like NSL-KDD once K is large enough to reward diversity. Ties on simple time-series benchmarks where neither approach exploits temporal structure. Loses at low K on data where the labeled anomalies are all magnitude outliers (IF's home turf).
+The pattern: **harmonic decisively wins on multi-dim structural anomalies** (the credential-stuffing regime — values that look normal per-dim but rare in combination). Ties on simple time-series benchmarks where neither approach exploits temporal structure. Loses on volumetric-dominated data where the labeled anomalies are all magnitude outliers (IF's home turf).

-NSL-KDD K=500 flipped from a tie (348 vs 351) to a harmonic win (365 vs 351) after the 2026-05-15 substrate refactor — the `log_phi_pi_fibonacci` substrate uses a 40-entry attractor table extending to 63M, vs the old 16-entry table that saturated at 610 and collapsed every large-magnitude attack into the same score. See [`SUBSTRATE_CHANGES.md`](SUBSTRATE_CHANGES.md).
+Two substrate-architecture changes on 2026-05-15 affected these numbers. **Phase 1** (refactor `compute_resonance` to `log_phi_pi_fibonacci`) flipped NSL-KDD K=500 from a tie to a harmonic win (348→365 vs IF's 351). **Phase 2** (substrate-fill: route the harmonic_anomaly bucket function through the substrate too) traded that K=500 win and the K=5 alert-budget win for architectural completeness — substrate-tempo bucketing produces empirically different bucket distributions on heavy-tailed data than base-10 decades, and on NSL-KDD that's a net loss. The choice was deliberate: substrate purity over benchmark numbers. See [`SUBSTRATE_CHANGES.md`](SUBSTRATE_CHANGES.md).

The harmonic_anomaly library at [`examples/lib/harmonic_anomaly.omc`](examples/lib/harmonic_anomaly.omc) packages the multi-dim detector with a clean `new` / `fit` / `top_k` API. Install it:
After the validation sweep above, the Architect declared `log_phi_pi_fibonacci` THE base algorithm of all of OMC and asked for a comprehensive audit + migration of every site that uses or should use the substrate. Five Bucket-B findings (sites that bypassed the substrate via Python `math.log10`/`math.log` round-trips or hardcoded Fibonacci arrays) plus one deprecated alias removal.

## Migrations applied

| ID | File / location | Old | New | Type |
|---|---|---|---|---|
| B1 | `examples/lib/harmonic_anomaly.omc` `_bucket_log` | `py_call(math, "log10", v) * 50` then `fold` | `log_phi_pi_fibonacci(v) * 50` then `fold` | substrate-tempo |
| B5 | `examples/datascience/multidim_anomaly.omc` and `anomaly_detection.omc` | inline copies of B1/B2 patterns | mirrored to substrate-tempo | substrate-tempo |
| D2 | `omnimcode-core/src/phi_pi_fib.rs` | deprecated `log_phi(n)` alias | DELETED — new code uses `log_phi_pi_fibonacci` | DEPRECATION removed |

New helper added: `phi_pi_fib::largest_attractor_at_most(value: i64) -> i64` — sign-preserving, returns the greatest attractor ≤ |value|. Replaces ad-hoc reverse linear scans over hardcoded Fibonacci arrays. Two new unit tests pin its behavior (basics + large-magnitude range that the old 16-entry table couldn't reach).
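
A minimal sketch of what such a helper might look like. It assumes the attractor table is the first 40 Fibonacci numbers, F(0) through F(39); that is an assumption, although it matches the stated reach (F(39) = 63,245,986, while a 16-entry table F(0)..F(15) tops out at 610). The `attractor_table` builder and the binary search are illustrative choices, not the actual `phi_pi_fib` implementation.

```rust
/// Illustrative table builder: the first `n` Fibonacci numbers, F(0)..F(n-1).
/// ASSUMPTION: the real phi_pi_fib table may be built or indexed differently.
fn attractor_table(n: usize) -> Vec<i64> {
    let mut table = Vec::with_capacity(n);
    let (mut a, mut b) = (0i64, 1i64);
    for _ in 0..n {
        table.push(a);
        let next = a + b;
        a = b;
        b = next;
    }
    table
}

/// Sketch of the helper: greatest attractor <= |value|, carrying the sign of `value`.
fn largest_attractor_at_most(value: i64) -> i64 {
    let table = attractor_table(40);
    // unsigned_abs avoids overflow on i64::MIN.
    let magnitude = value.unsigned_abs();
    // The table is non-decreasing, so partition_point binary-searches for the
    // number of entries <= magnitude. F(0) = 0 guarantees that count is >= 1.
    let count = table.partition_point(|&f| (f as u64) <= magnitude);
    let hit = table[count - 1];
    if value < 0 { -hit } else { hit }
}
```

Under these assumptions, `largest_attractor_at_most(1_000_000)` returns 832040 and `largest_attractor_at_most(-10)` returns -8. Inputs above 63,245,986 saturate at the top table entry instead of at 610.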

## Architectural decision: substrate purity over benchmark numbers

The Architect was presented with three resolution options for B1 (the bucket function in harmonic_anomaly) after observing that **substrate-tempo bucketing measurably hurts empirical results on real heavy-tailed data**:

| Option | Substrate-routed | Empirical impact |
|---|---|---|
| Revert B1 to log10 (via OMC's native log builtin) | NO | Restores all numbers |
| Decade-rescale (window-dressing route) | yes (mathematically equivalent to log10) | Restores all numbers |
| **Keep current substrate-tempo (CHOSEN)** | **YES, fully** | **K=500 NSL-KDD: 365 → 302 (−63)** |

The Architect chose substrate purity. The substrate now governs magnitude-slicing semantics throughout OMC, even where its grain (~1.5 buckets per base-10 decade) produces empirically worse anomaly recall than base-10 decades would.
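
The ~1.5 figure is what a logarithm in base φ^π would give, since ln 10 / (π ln φ) ≈ 1.52. The sketch below assumes that definition. The source does not spell out the formula, so `substrate_log` is a hypothetical stand-in for `log_phi_pi_fibonacci`, not its real implementation. It also shows why the decade-rescale option is log10 in disguise: dividing any logarithm by its own value at 10 cancels the base.

```rust
use std::f64::consts::PI;

/// Golden ratio, (1 + sqrt(5)) / 2.
fn phi() -> f64 {
    (1.0 + 5.0_f64.sqrt()) / 2.0
}

/// HYPOTHETICAL stand-in for log_phi_pi_fibonacci: log base phi^pi.
/// ASSUMPTION: inferred from the "~1.5 buckets per decade" figure only.
fn substrate_log(v: f64) -> f64 {
    v.ln() / (PI * phi().ln())
}

/// Substrate units spanned by one base-10 decade.
fn buckets_per_decade() -> f64 {
    substrate_log(10.0)
}

/// Decade rescale as described for B3: substrate_log(v) / substrate_log(10).
/// The base cancels, so the result equals log10(v) up to rounding.
fn decade_rescale(v: f64) -> f64 {
    substrate_log(v) / substrate_log(10.0)
}
```

With this definition `buckets_per_decade()` is about 1.523, and `decade_rescale(v)` agrees with `v.log10()` for positive `v`, which is why that option restores the log10 numbers while still calling the substrate.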

## Validation: empirical impact of the fill-in

Engine parity and infrastructure tests all held:

- 44/45 functional examples byte-identical TW vs VM (the diverger is `benchmarks.omc` — timing-only, same as before)
- 149/149 Rust unit tests pass (was 148; one removed via D2, two added for `largest_attractor_at_most` and `log_phi_pi_fibonacci` monotonicity)

The pattern: substrate-tempo bucketing **trades low-K precision for high-K-on-spread-data**. Where the old log10-bucketing concentrated big spikes into a single attractor (e.g. all DoS-attack byte counts landing in bucket-377), substrate-tempo spreads them across multiple attractors (377/610/987/...), which weakens "biggest spike wins" alerting but improves diversity at high K. Real-world heavy-tailed data (NSL-KDD's volumetric DoS) is the worst case for this trade — those attacks were structurally the same and benefited from concentration.

## What's groundbreaking, what's an unimprovement

**GROUNDBREAKING** — Phase 2:
- The substrate is now THE base algorithm everywhere. Five sites that bypassed it via Python round-trips or hardcoded arrays are now routed through `phi_pi_fib::*`. Architectural completeness over benchmark numbers.
- New helper `largest_attractor_at_most` retires the last hardcoded Fibonacci array inside core (`harmonic_split` was the holdout).

**UNIMPROVEMENT** — Phase 2:
- NSL-KDD K=500: 365 → 302. We lose the "harmonic beats IF on volumetric data at K=500" claim from Phase 1. This was the most-cited Phase-1 win and it's been deliberately traded for substrate consistency.
- Power-law K=5 (alert budget): 4/5 → 1/5. The headline "harmonic surfaces structural anomalies before magnitude outliers" claim weakens — at top-5 we now mostly miss.
- Credential stuffing K=25/K=50: 25→24, 50→49. Small slippage on the synthetic benchmark that was a Phase-1 anchor.

**DEPRECATION** — Phase 2:
- `phi_pi_fib::log_phi` deleted. New code uses `log_phi_pi_fibonacci`. The substrate naming convention is now consistent.

## Doc updates needed

1. **README's "Where harmonic detection actually wins" table** — Phase-2 numbers replace Phase-1 numbers. The Phase-1 K=500 win is gone: IF now leads (302 vs 351). The K=5 power-law win weakens.
2. **`docs/anomaly_detection.md`** — Result 5 NSL-KDD K=500 narrative needs to drop the "harmonic now beats IF" framing; the K=500 crossover from Phase 1 is gone.
3. **`SUBSTRATE_CHANGES.md`** (this doc) — captures the Phase-2 trade in full so future readers know the choice was deliberate.

## What's NOT in scope of this fill-in (deferred)

- **D3: HBit harmony substrate-routing.** `hbit.rs:43` uses Euclidean `1.0/(1.0+diff)`; the dual-band α/β/harmony channel doesn't yet speak substrate units. The Architect flagged this as having "bigger implications" and deferred it to its own session. Next on the queue.
- **LLM evolution experiments (Experiments 0-9).** Developed ON the new substrate; no migration needed but worth a separate audit pass to identify which findings would've failed under the old substrate (substrate-aware vs substrate-dependent classification).
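
To make the D3 point concrete, here is a small sketch of a closeness score with the quoted `1.0/(1.0+diff)` shape. It assumes `diff` is the absolute difference between the two band values; the function name and signature are hypothetical and are not taken from `hbit.rs`.

```rust
/// Closeness score of the form quoted for hbit.rs: 1 / (1 + diff).
/// ASSUMPTION: `diff` is the absolute difference of the two band values.
fn euclidean_harmony(alpha: f64, beta: f64) -> f64 {
    let diff = (alpha - beta).abs();
    1.0 / (1.0 + diff)
}
```

Under that assumption the score depends only on the absolute gap: `euclidean_harmony(1.0, 2.0)` and `euclidean_harmony(1000.0, 1001.0)` both give 0.5. A substrate-routed version would presumably measure the gap in logarithmic units instead, which is the part deferred to a later session.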
docs/anomaly_detection.md (21 additions, 17 deletions)
@@ -4,21 +4,23 @@

## TL;DR

+Numbers reflect the substrate-fill (Phase 2, 2026-05-15) where the library's `_bucket_log` now routes through `log_phi_pi_fibonacci` end-to-end. The Phase 1 K=500 win on NSL-KDD (365 vs 351) was traded for that architectural consistency. See `SUBSTRATE_CHANGES.md` for the full diff.

-**The pattern:** harmonic wins on *structural* anomalies (rare combinations of normal-looking values), loses on *magnitude* anomalies (values that are simply unusual in scale). NAB and NSL-KDD are mostly magnitude anomalies; credential stuffing is structural.
+**The pattern:** harmonic still wins decisively on *structural* anomalies (rare combinations of normal-looking values — credential stuffing, attack zoo). On *magnitude* anomalies (NAB, NSL-KDD, power-law top-K), IF leads. The Phase-2 substrate-fill widened IF's lead on volumetric data — see Result 5 for the trade.

---
@@ -163,27 +165,29 @@ The NAB result documents what doesn't work — and where the next architectural
+## Result 5: NSL-KDD network intrusion (IF leads — substrate-fill traded the K=500 crossover)

**Setup:** Real labeled network intrusion dataset from University of New Brunswick. 22,544 captured connections; we use a 5000-row sample with 2147 normal + 2853 attacks across many classes (neptune DoS, mscan, satan, smurf, warezmaster, etc.). Each row has 41 features; we use 6 numeric ones (duration, src/dst bytes, count, srv_count, dst_host_count).

-IsolationForest wins at low K (9/10 vs 7/10) and through K=100; harmonic crosses over and wins at K=500 (365 vs 351). The K=500 result is +17 over the pre-refactor measurement (348/500) — the new `log_phi_pi_fibonacci` substrate uses a 40-entry attractor table extending to 63M, vs the old 16-entry table that saturated at 610. NSL-KDD's `src_bytes` and `dst_bytes` features routinely exceed millions; the old substrate compressed every large attack-magnitude to the same near-zero resonance score and the detector couldn't distinguish them. The new substrate sees finer per-row gradients on volumetric attacks.
+IsolationForest leads at every K. The headline `harmonic_anomaly` win at K=500 from Phase 1 (365 vs 351) was traded away in Phase 2 (substrate-fill) for architectural completeness — see `SUBSTRATE_CHANGES.md`.

-Looking at IF's top-10 picks: 9 of 10 are labeled `smurf` (a volumetric ICMP flood attack — huge byte counts).
-Looking at harmonic's top-10 picks: a mix of `mscan` (port scanning), `warezmaster` (privilege escalation), `back` (buffer overflow), `smurf`.
+**Why the trade:** Phase 1 refactored `compute_resonance` to route through `log_phi_pi_fibonacci`'s 40-entry attractor table (reaches 63M). That refactor alone, with the library's bucket function still using log10, drove K=500 up to 365/500 — a genuine win on volumetric data because resonance scoring suddenly had room to discriminate large byte-counts.

-**Why IF still leads at low K:** NSL-KDD's labeled attacks are dominated by *volumetric* events — DoS floods with massive byte counts. IF picks magnitude outliers first; the labeled attacks at the top of any reasonable score distribution ARE the most extreme magnitudes. IF's job is finding "the biggest spike"; the dataset rewards that.
+Phase 2 extended the substrate to the bucket function itself (`_bucket_log` now calls `log_phi_pi_fibonacci(v)` instead of `py_call(math, "log10", v)`). Substrate-tempo bucketing has ~1.5 buckets per base-10 decade, which spreads NSL-KDD's heavy-tailed `src_bytes`/`dst_bytes` across multiple attractors (377, 610, 987, …) instead of clumping them all at 377 like log10 did. The clumping was *helping* the score function discriminate big spikes; spreading them out across attractors *hurts* recall on volumetric attacks. Net: −63 at K=500.

-**Why harmonic catches up at K=500:** look at the *diversity* of what each detector flags. IF stacks on smurf because every smurf row looks the same in magnitude space. Harmonic finds mscan + warezmaster + back + smurf — multiple distinct attack patterns. By the time you've spent 500 alerts, harmonic has surfaced more unique attack types and more total true positives.
+**The honest read:** harmonic with log10 bucketing genuinely beat IF at K=500 on NSL-KDD; harmonic with substrate-tempo bucketing does not. The Architect chose substrate purity over the K=500 win. The result table here is what the shipped library produces under the substrate-fill regime.
+
+Looking at IF's top-10 picks: 9 of 10 are labeled `smurf` (a volumetric ICMP flood attack — huge byte counts).
+Looking at harmonic's top-10 picks: a mix of `mscan` (port scanning), `warezmaster` (privilege escalation), `back` (buffer overflow), `smurf`.

-For an SRE on a tight alert budget hunting *known* threats, IF is still the right tool (9/10 vs 7/10 at K=10). For *threat hunting* — investigating broadly to find anything anomalous — harmonic's broader coverage (365 vs 351 at K=500) becomes the winning trade.
+**Why harmonic still surfaces diverse attack types:** the score function still rewards "rare combination across dims" — the structural-anomaly signal that picks credential stuffing perfectly. NSL-KDD's labeled attacks are dominated by *volumetric* events, which is structurally IF's regime; harmonic still surfaces mscan/warezmaster/back diversity, just at lower precision than the log10-bucketing version did.
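
The "hits at K" figures quoted throughout (302/500, 9/10, and so on) count how many of the K highest-scored rows carry an attack label. A minimal sketch of that metric follows. It assumes higher scores mean "more anomalous" and that labels arrive as a parallel boolean slice; `hits_at_k` is an illustrative name, not part of the `harmonic_anomaly` API.

```rust
/// Count how many of the `k` highest-scored rows are labeled anomalies.
/// ASSUMPTION: a higher score means "more anomalous"; ties keep input order.
fn hits_at_k(scores: &[f64], is_anomaly: &[bool], k: usize) -> usize {
    assert_eq!(scores.len(), is_anomaly.len(), "one label per score");
    let mut order: Vec<usize> = (0..scores.len()).collect();
    // Sort row indices by descending score. total_cmp is a total order on f64,
    // so NaN scores cannot panic the sort.
    order.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]));
    order
        .into_iter()
        .take(k)
        .filter(|&row| is_anomaly[row])
        .count()
}
```

Calling `hits_at_k(&scores, &labels, 500)` once per detector on the same labeled sample would give the two sides of a comparison such as 302 vs 351.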