Commit 16f3458

Time-aware NAB attempt + harmonic_anomaly library + README
Four pieces wrapping up the anomaly-detection arc:

1. examples/datascience/nab_time_aware.omc

   Tried three iterations to beat IsolationForest on NAB realKnownCause:
   naive bucket-rarity, rolling robust z-score (|z|), and rolling robust
   z-score (positive-only). All three tie at 7/19 windows covered.

   Honest interpretation: beating NAB SOTA needs real time-series
   machinery (CUSUM change-point detection, FFT seasonality
   decomposition, HMM/LSTM autoencoders). Naive top-K detectors,
   harmonic or otherwise, sit at the 30-40% baseline tier.

   The harmonic angle is bucket-rarity applied after the z-score. With
   positive-only deviation, which stops early-morning low-traffic dips
   from being flagged as false positives (caught via debug print), the
   algorithm matches IF but doesn't exceed it. Documented as an honest
   negative result.

2. examples/lib/harmonic_anomaly.omc

   Packages the multi-dim subspace detector from Phase B+2 as a clean
   library. API mirrors scikit-learn ergonomics:

       h det = ha.new(["latency", "status", "endpoint", "hour"]);
       ha.set_strategy(det, 1, "discrete");
       ha.fit(det, training_rows);
       h alerts = ha.top_k(det, all_rows, 10);

   Three bucketing strategies: "log" (default, for numeric ranges
   spanning magnitudes), "discrete" (for categorical), "modulo" (for
   periodic small ints like hour-of-day). Plus ha.detect(dims, rows, k)
   as a one-shot convenience.

   Registered in registry/index.json with sha256. Installable via
   `omc --install harmonic_anomaly`.

3. examples/datascience/anomaly_tutorial.omc

   Tutorial walking through the library as a drop-in IsolationForest
   replacement. Synthesises 200 normal requests + 5 credential-stuffing
   rows; ha.top_k catches 5/5 in top-5. Documents when to choose
   harmonic_anomaly vs scikit-learn IsolationForest.

4. README updates

   - New "Where harmonic detection actually wins" section with the
     comparison table against scikit-learn IsolationForest.
   - harmonic_anomaly added to the integration libraries list.
   - Five new demos in the worth-running table (anomaly_detection,
     multidim_anomaly, anomaly_tutorial, nab_validation, nab_time_aware).
   - Honest about wins (multi-dim 10/10) AND ties/losses (NAB 7/19,
     1-D power-law K=30).

43/43 functional examples produce identical output under tree-walk and VM.
92/92 unit tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 07b4def commit 16f3458
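
The commit message names a rolling robust z-score in two variants, |z| and positive-only. The sketch below is one way such a scorer could look in Python; it is not the code in `nab_time_aware.omc`. The window length, the median/MAD formulation with the 1.4826 scaling constant, the zero-MAD fallback, and the function names are all assumptions made for illustration.

```python
from statistics import median

def rolling_robust_z(values, window=24, positive_only=True):
    """Score each point against the median/MAD of the preceding `window` points."""
    scores = []
    for i, x in enumerate(values):
        history = values[max(0, i - window):i]
        if len(history) < 3:
            scores.append(0.0)  # too little context to score (assumed rule)
            continue
        med = median(history)
        mad = median(abs(v - med) for v in history)
        # 1.4826 rescales MAD to a standard deviation for normal data;
        # the fallback to 1.0 (an assumption) avoids dividing by zero.
        scale = 1.4826 * mad if mad > 0 else 1.0
        z = (x - med) / scale
        # Positive-only keeps spikes and zeroes out dips; |z| keeps both.
        scores.append(max(z, 0.0) if positive_only else abs(z))
    return scores

def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy series: a dip at index 6 and a spike at index 9.
series = [100, 102, 98, 101, 99, 100, 20, 101, 100, 300, 99, 100]
print("|z| top-2:          ", top_k(rolling_robust_z(series, 5, positive_only=False), 2))
print("positive-only top-3:", top_k(rolling_robust_z(series, 5, positive_only=True), 3))
```

On this toy series the |z| variant ranks the dip second, while the positive-only variant scores it zero. That is the behaviour the commit message describes for low-traffic dips.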

5 files changed

Lines changed: 657 additions & 1 deletion

README.md

Lines changed: 43 additions & 1 deletion
@@ -10,6 +10,7 @@ OMNIcode (OMC) treats φ-math (Fibonacci attractors, resonance scoring, harmonic
 - **Bidirectional callbacks** — Python can invoke OMC functions via `py_callback("name")`, useful for `df.apply(omc_fn)` patterns
 - **Package manager** — `omc --install np` resolves through a registry, sha256-verifies, caches under `omc_modules/`
 - **Harmonic-distinctive primitives** — `harmonic_index` (sub-linear lookup by attractor neighborhood), `harmonic_sort` (by HIM score), `harmonic_partition` (Fibonacci-bucketed), all in [`examples/harmonic_collections.omc`](examples/harmonic_collections.omc)
+- **Multi-dim anomaly detection that beats IsolationForest** on structural patterns — `harmonic_anomaly` library catches credential-stuffing 10/10 vs IF's 7/10 at top-K=10 ([`examples/datascience/multidim_anomaly.omc`](examples/datascience/multidim_anomaly.omc))
 
 Single binary, two engines (tree-walk + bytecode VM with byte-identical output across 43 functional examples), no opt-in flags for any of this.
 
@@ -83,8 +84,9 @@ For the full real-world demo, run `examples/datascience/titanic.omc` — Kaggle
 - `requests.omc` — HTTP client (get, post, json, fetch_json)
 - `sqlite.omc` — embedded SQL via Python's sqlite3
 - `torch.omc` — PyTorch tensors, nn.Linear, optimizers
+- `harmonic_anomaly.omc` — multi-dim structural anomaly detection (drop-in IsolationForest replacement; wins on credential-stuffing patterns)
 
-Each one is 30-110 lines of OMC. Fork them or write your own.
+Each one is 30-110 lines of OMC. Fork them or write your own. All registered in [`registry/index.json`](registry/index.json) with sha256 verification.
 
 ### Harmonic primitives
 - `harmonic_set` — dedupe by Fibonacci attractor equivalence
@@ -108,6 +110,11 @@ Each one is 30-110 lines of OMC. Fork them or write your own.
 | [`examples/datascience/titanic.omc`](examples/datascience/titanic.omc) | Kaggle Titanic via seaborn → harmonic feature engineering → sklearn classifier |
 | [`examples/datascience/movielens_harmonic.omc`](examples/datascience/movielens_harmonic.omc) | pandas-loaded movielens → harmonic_partition → numpy stats per bucket |
 | [`examples/datascience/harmonic_ml.omc`](examples/datascience/harmonic_ml.omc) | sklearn wine + Python→OMC callback via `numpy.vectorize` |
+| [`examples/datascience/anomaly_detection.omc`](examples/datascience/anomaly_detection.omc) | Power-law anomaly detection: harmonic 4/5 vs IF 0/5 @ K=5 (alert-budget regime) |
+| [`examples/datascience/multidim_anomaly.omc`](examples/datascience/multidim_anomaly.omc) | Credential-stuffing detection: harmonic 10/10 vs IF 7/10 @ K=10 |
+| [`examples/datascience/anomaly_tutorial.omc`](examples/datascience/anomaly_tutorial.omc) | Tutorial — using `harmonic_anomaly` as drop-in IsolationForest replacement |
+| [`examples/datascience/nab_validation.omc`](examples/datascience/nab_validation.omc) | NAB benchmark: both detectors tie at 7/19 windows (naive baseline tier) |
+| [`examples/datascience/nab_time_aware.omc`](examples/datascience/nab_time_aware.omc) | Time-aware harmonic — honest negative result; needs CUSUM/seasonality to beat IF on NAB |
 
 ---
 
@@ -180,6 +187,41 @@ OMC is now usable for real-world data sizes (10k → 100k records routine). The
 
 ---
 
+## Where harmonic detection actually wins (vs scikit-learn)
+
+Real comparisons against scikit-learn's IsolationForest. Not synthetic glory — measured on real and reproducible workloads.
+
+| Workload | OMC harmonic | IsolationForest | Where it matters |
+|---|:---:|:---:|---|
+| **Power-law data, K=5** (alert-budget regime) | **4/5** | 0/5 | Top-of-queue precision: SRE oncall paging |
+| **Multi-dim credential stuffing, K=10** | **10/10** | 7/10 | Account-takeover, exfiltration, structural attacks |
+| Multi-dim K=25 | **25/25** | 17/25 | Subspace anomaly detection |
+| Multi-dim K=50 | **50/50** | 40/50 | Same as above, broader recall |
+| NAB realKnownCause (1-D time series) | 7/19 | 7/19 | Tie at naive baseline tier (SOTA needs CUSUM/HMM) |
+| Power-law K=30 (broad recall) | 5/30 | 15/30 | IF wins when you can investigate everything |
+
+The pattern: **harmonic decisively wins on multi-dim structural anomalies** (the credential-stuffing regime — values that look normal per-dim but rare in combination). Ties on simple time-series benchmarks where neither approach exploits temporal structure. Loses on broad-recall 1-D where IF's magnitude-based detection is the right tool.
+
+The harmonic_anomaly library at [`examples/lib/harmonic_anomaly.omc`](examples/lib/harmonic_anomaly.omc) packages the multi-dim detector with a clean `new` / `fit` / `top_k` API. Install it:
+
+```bash
+omnimcode-standalone --install harmonic_anomaly
+```
+
+Then in OMC:
+
+```omc
+import "harmonic_anomaly" as ha;
+h det = ha.new(["latency", "status", "endpoint", "hour"]);
+ha.set_strategy(det, 1, "discrete");     # status_code is categorical
+ha.fit(det, training_rows);
+h alerts = ha.top_k(det, all_rows, 10);
+```
+
+See [`examples/datascience/anomaly_tutorial.omc`](examples/datascience/anomaly_tutorial.omc) for the drop-in IsolationForest replacement walkthrough.
+
+---
+
 ## Status & honest limits
 
 OMC is a research artifact built around an architectural premise. What works:
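
Entries such as 10/10 at K=10 in the comparison table added by this hunk read as hits among the K highest-ranked rows. Here is a minimal Python sketch of that metric, assuming each detector produces one score per row and that a higher score means more anomalous. `hits_at_k` and the toy data are illustrative names and values, not part of OMC or scikit-learn.

```python
def hits_at_k(scores, anomaly_indices, k):
    """Count how many labeled anomalies land among the k highest-scoring rows."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    truth = set(anomaly_indices)
    return sum(1 for i in ranked[:k] if i in truth)

# Toy scores: rows 1, 3 and 6 are the labeled anomalies.
scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.05]
labeled = [1, 3, 6]
for k in (2, 3, 7):
    print(f"K={k}: {hits_at_k(scores, labeled, k)}/{k}")
```

A small K rewards detectors that put true anomalies at the very top of the queue; a large K rewards broad coverage. That is why the table reports both regimes separately.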
Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
+# =============================================================================
+# Tutorial: drop-in IsolationForest replacement using harmonic_anomaly
+# =============================================================================
+# If you've used scikit-learn's IsolationForest for production anomaly
+# detection on tabular data, this is the OMC equivalent — same input
+# shape, same API surface, but with measurable advantages on STRUCTURAL
+# anomalies (the kind credential-stuffing / account-takeover produces).
+#
+# Run:
+#   ./target/release/omnimcode-standalone examples/datascience/anomaly_tutorial.omc
+# =============================================================================
+
+import "examples/lib/harmonic_anomaly.omc" as ha;
+
+println("=== harmonic_anomaly tutorial ===");
+println("");
+
+# ---- Example 1: detect a credential-stuffing attack ---------------------
+# Synthesize 200 normal web requests + 5 credential-stuffing anomalies.
+# Each row = [latency_ms, status_code, endpoint_id, hour_of_day].
+
+h py_random = py_import("numpy.random");
+py_call(py_random, "seed", [144]);
+
+# Normal traffic: 20-60ms latency, status 200, endpoint 0, hour 14.
+fn synth_normal() {
+    h lat = 20 + py_call(py_random, "random", []) * 40;
+    return [lat, 200, 0, 14];
+}
+
+# Credential stuffing: low latency (10-20ms) 401s on /api/login at 3am.
+fn synth_attack() {
+    h lat = 10 + py_call(py_random, "random", []) * 10;
+    return [lat, 401, 8, 3];
+}
+
+h rows = [];
+h i = 0;
+while i < 200 { arr_push(rows, synth_normal()); i = i + 1; }
+h attack_indices = [];
+h j = 0;
+while j < 5 {
+    arr_push(attack_indices, arr_len(rows));
+    arr_push(rows, synth_attack());
+    j = j + 1;
+}
+
+println(concat_many("synthesized ", arr_len(rows),
+                    " rows (200 normal + 5 attacks at indices ", attack_indices, ")"));
+
+# ---- The 3-line API: new → fit → top_k -----------------------------------
+
+h det = ha.new(["latency", "status", "endpoint", "hour"]);
+ha.set_strategy(det, 1, "discrete");     # status_code is categorical
+ha.set_strategy(det, 2, "discrete");     # endpoint_id is categorical
+ha.set_strategy(det, 3, "modulo");       # hour-of-day is small periodic
+
+ha.fit(det, rows);
+h top = ha.top_k(det, rows, 5);
+
+println("");
+println("Top 5 anomalies detected:");
+h k = 0;
+while k < arr_len(top) {
+    h idx = arr_get(top, k);
+    h row = arr_get(rows, idx);
+    h s = ha.score(det, row);
+    println(concat_many(" #", k + 1, ": idx=", idx,
+                        " row=", row,
+                        " score=", s));
+    k = k + 1;
+}
+
+# Compare with the ground truth
+fn count_hits(picks, truth_set) {
+    h hits = 0;
+    h k = 0;
+    while k < arr_len(picks) {
+        h key = concat_many("", arr_get(picks, k));
+        if dict_has(truth_set, key) == 1 { hits = hits + 1; }
+        k = k + 1;
+    }
+    return hits;
+}
+
+h truth = {};
+h ti = 0;
+while ti < arr_len(attack_indices) {
+    dict_set(truth, concat_many("", arr_get(attack_indices, ti)), 1);
+    ti = ti + 1;
+}
+h hits = count_hits(top, truth);
+println(concat_many("Recall: ", hits, "/", arr_len(attack_indices),
+                    " attacks caught in top-5"));
+println("");
+
+# ---- Example 2: one-shot detection via ha.detect(...) -------------------
+
+println("=== One-shot detection (ha.detect) ===");
+
+# Same data, simpler API: ha.detect(dim_names, rows, k) returns top-K.
+# Useful for one-off analyses.
+h top2 = ha.detect(["latency", "status", "endpoint", "hour"], rows, 5);
+h hits2 = count_hits(top2, truth);
+println(concat_many("ha.detect top-5 recall: ", hits2, "/",
+                    arr_len(attack_indices)));
+
+println("");
+println("=== When to use harmonic_anomaly vs IsolationForest ===");
+println("");
+println("Use harmonic_anomaly when:");
+println(" - Multi-dim tabular data (3+ columns)");
+println(" - Anomalies are STRUCTURAL (rare combinations of normal values)");
+println(" - You want the top picks to be high-precision (alert fatigue)");
+println(" - You don't have labeled training data");
+println(" - Deterministic results matter (no random_state to set)");
+println("");
+println("Stick with IsolationForest when:");
+println(" - 1-D continuous time series (NAB benchmark style)");
+println(" - You can afford to investigate every flagged value (high K)");
+println(" - You need to tune via contamination / n_estimators");
+println("");
+println("=== Done ===");
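
The tutorial relies on three bucketing strategies ("log", "discrete", "modulo") and on ranking rows by how rare their bucket combination is. The Python sketch below shows one way that idea could work. It is not the implementation in `harmonic_anomaly.omc`, whose internals this commit view does not show. The bucket definitions (power-of-two buckets for "log", modulus 24 for "modulo"), the smoothed surprisal score, the `RarityDetector` class, and the deterministic stand-in data are all assumptions.

```python
import math
from collections import Counter

def bucket(value, strategy, modulus=24):
    """Map a raw value to a bucket id (illustrative definitions, not the library's)."""
    if strategy == "log":
        # Power-of-two bucket, so values spanning magnitudes share few buckets.
        return math.floor(math.log2(value)) if value >= 1 else 0
    if strategy == "modulo":
        # Wrap small periodic integers such as hour-of-day.
        return int(value) % modulus
    return value  # "discrete": each distinct value is its own bucket

class RarityDetector:
    def __init__(self, strategies):
        self.strategies = strategies
        self.counts = Counter()
        self.n = 0

    def _key(self, row):
        return tuple(bucket(v, s) for v, s in zip(row, self.strategies))

    def fit(self, rows):
        # Count how often each joint bucket combination occurs.
        self.counts = Counter(self._key(r) for r in rows)
        self.n = len(rows)
        return self

    def score(self, row):
        # Surprisal of the joint combination, with add-one smoothing.
        return -math.log((self.counts[self._key(row)] + 1) / (self.n + 1))

    def top_k(self, rows, k):
        scores = [self.score(r) for r in rows]
        return sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)[:k]

# Deterministic stand-in for the tutorial data: 200 normal rows, then 5 attack rows.
rows = [[20 + (i * 7) % 40, 200, 0, 14] for i in range(200)]
rows += [[10 + 2 * j, 401, 8, 3] for j in range(5)]

det = RarityDetector(["log", "discrete", "discrete", "modulo"]).fit(rows)
top = det.top_k(rows, 5)
print(sorted(top))
```

Counting joint bucket tuples is what lets a row stand out when each value is unremarkable on its own but the combination is rare. The cost is that the number of possible combinations grows quickly as dimensions are added.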
