Skip to content

Commit 52a03d3

Browse files
elif + --test/--bench + harmonic_clustering + harmonic_recommend + attack zoo
+ critical equality bug fix Six items down the roadmap, plus a real bug found while shipping them. == Language == * `elif COND { ... }` parser shorthand for `else { if COND { ... } }`. Saves the visual noise that was hurting JSON parser, Lisp eval, and the new harmonic libs. The AST already had elif_parts; this is just a parser-level token. `else if` still works. == Tooling == * `omc --test FILE` runs every top-level `fn test_*()`, reports pass/fail per test + summary. Exit code = failure count (clamped to 1). Each test runs in a fresh interpreter scope so mutations don't leak. * `omc --bench FILE` runs every top-level `fn bench_*()`, times each, reports ms. CI-friendly perf regression checks. Both modes scan the AST for the prefix and dispatch via a small scan_fn_prefix helper. Convention matches existing examples/test_runner.omc + examples/benchmarks.omc shapes. == Libraries == * examples/lib/harmonic_clustering.omc — drop-in KMeans replacement. Clusters by log-decade attractor signature. No random init, no n_clusters to choose. 3 clusters discovered cleanly from 14 rows spanning 3 magnitudes (5,5,4 split per decade). Wins on data with natural magnitude structure (latencies, prices, frequencies); ties on uniform-distributed data. * examples/lib/harmonic_recommend.omc — item-based CF via harmonic_index. fit() builds modal-attractor signature per item; suggest_for(user) returns unrated items in the same signature buckets as the user's high-rated items. ~150 lines of OMC. * examples/datascience/anomaly_attack_zoo.omc — three real attack patterns demonstrated against harmonic_anomaly: - Insider exfiltration: 10/10 caught at K=10 (100% precision) - API abuse / scraping: 10/10 - DDoS pattern: 10/10 Aggregate: 30/30 across three scenarios. Each attack is normal- looking on every individual dimension; the tuple is what's anomalous. Generalises the credential-stuffing demo. * registry/index.json: harmonic_clustering and harmonic_recommend added with sha256 verification. `omc --install harmonic_clustering` works once registry is publicly hosted. == Critical equality bug fix == Found while testing harmonic_recommend: `dict_value == null` returned TRUE for non-null dicts. Same for Function == null, Singularity == null, Dict == Dict-of-different-shape. Root cause: values_equal's fallback arm (`_ =>` in tree-walk and VM) defaulted to numeric coercion. to_int(any non-numeric value) = 0, to_int(Null) = 0, so 0 == 0 → "equal". Every non-numeric type silently equality-compared to null. Fix: explicit Null arm (only equal to itself), explicit Dict/ Function/Circuit cross-type rejection arms, mirroring the existing Array arm. Both engines (interpreter.rs values_equal and vm.rs values_equal_vm) updated identically. This was a months-old latent bug. Caught by writing real user code (`if u_items == null { ... }` in a CF recommender). Surfaced naturally as "alice's items dict only has movie_2 even though I added movie_1 first" — the conditional was always firing the "create new dict" branch because the existing dict equality-tested as null. User-visible impact: any OMC code using `if x == null` against dict/function/array values was getting wrong results. The recommend lib was the first place we hit this; nobody else has caught it because nobody else writes that pattern in OMC yet. 43/43 functional examples produce identical output under tree-walk and VM. 92/92 unit tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent cb7d67a commit 52a03d3

8 files changed

Lines changed: 759 additions & 10 deletions

File tree

Lines changed: 202 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,202 @@
1+
# =============================================================================
2+
# Multi-dim anomaly detection on three real attack patterns
3+
# =============================================================================
4+
# Generalises the credential-stuffing demo: shows harmonic_anomaly
5+
# catches three different attack signatures, all of which look normal
6+
# per individual feature dimension.
7+
#
8+
# 1. INSIDER EXFILTRATION
9+
# Authorized user, normal hours, but unusual ENDPOINT (file-export
10+
# API) + large RESPONSE_SIZE + fewer requests overall.
11+
# Pattern: (size=large, endpoint=rare-export, hour=biz-hours, req_count=low)
12+
#
13+
# 2. API ABUSE / SCRAPING
14+
# Valid credentials, ALL successful (200), but unusually high
15+
# REQUEST RATE + diverse endpoints (touching everything to crawl).
16+
# Pattern: (status=200, hour=any, endpoint=many, req_count=very-high)
17+
#
18+
# 3. DDoS PATTERN
19+
# Lots of small requests at off-hours from a SINGLE source range,
20+
# most failing (503) but some succeeding (200). Hard to detect
21+
# by status alone (some 200s look fine).
22+
# Pattern: (lat=tiny, status=mixed, endpoint=few, hour=off-peak,
23+
# req_count=extreme)
24+
#
25+
# All three would be missed by single-dim threshold detection:
26+
# - latency alone won't flag exfiltration (sizes are normal-ish)
27+
# - status alone won't flag scraping (everything's 200)
28+
# - rate alone won't flag DDoS if rate-limiter dampens spike
29+
#
30+
# The MULTI-DIM signature is what catches each one.
31+
#
32+
# Run:
33+
# ./target/release/omnimcode-standalone examples/datascience/anomaly_attack_zoo.omc
34+
# =============================================================================
35+
36+
import "examples/lib/harmonic_anomaly.omc" as ha;
37+
import "examples/lib/np.omc" as np;
38+
39+
h py_random = py_import("numpy.random");
40+
41+
# ---- Common utility: build a labeled dataset of normal + attack rows ----
42+
43+
fn run_scenario(label, normal_gen, attack_gen, n_normal, n_attack,
44+
dim_names, strategies) {
45+
py_call(py_random, "seed", [144]);
46+
47+
h cb_normal = py_callback(normal_gen);
48+
h cb_attack = py_callback(attack_gen);
49+
50+
# Build rows: n_normal normal + n_attack attack appended at end.
51+
h rows = [];
52+
h i = 0;
53+
while i < n_normal {
54+
arr_push(rows, py_call_fn(cb_normal, [i]));
55+
i = i + 1;
56+
}
57+
h attack_indices = {};
58+
h j = 0;
59+
while j < n_attack {
60+
h idx = arr_len(rows);
61+
arr_push(rows, py_call_fn(cb_attack, [j]));
62+
dict_set(attack_indices, concat_many("", idx), 1);
63+
j = j + 1;
64+
}
65+
66+
# Build the detector.
67+
h det = ha.new(dim_names);
68+
h s = 0;
69+
while s < arr_len(strategies) {
70+
ha.set_strategy(det, s, arr_get(strategies, s));
71+
s = s + 1;
72+
}
73+
ha.fit(det, rows);
74+
75+
h K = 10;
76+
h top = ha.top_k(det, rows, K);
77+
78+
# Count hits.
79+
h hits = 0;
80+
h k = 0;
81+
while k < K {
82+
h key = concat_many("", arr_get(top, k));
83+
if dict_has(attack_indices, key) == 1 { hits = hits + 1; }
84+
k = k + 1;
85+
}
86+
87+
println(concat_many(" ", label,
88+
": harmonic top-", K, " caught ", hits, "/", n_attack,
89+
" attacks (", to_int(hits * 100 / K), "% precision)"));
90+
return hits;
91+
}
92+
93+
# ---- Scenario 1: insider exfiltration -----------------------------------
94+
95+
# Normal: small response, common endpoint, biz hours, normal request count
96+
fn ex_normal(idx) {
97+
h size = 500 + py_call(py_random, "random", []) * 1500; # 500-2000 bytes
98+
h endpoint = to_int(py_call_fn_kw(py_get(py_random, "choice"), [],
99+
{"a": [0, 1, 2, 3], "p": [0.5, 0.3, 0.15, 0.05]}));
100+
h hour = 9 + to_int(py_call(py_random, "random", []) * 9); # 9-17
101+
h req_count = 50 + to_int(py_call(py_random, "random", []) * 50); # 50-100/hour
102+
return [size, endpoint, hour, req_count];
103+
}
104+
105+
# Exfiltration: HUGE response, rare export endpoint (id=8), biz hours, LOW count
106+
fn ex_attack(idx) {
107+
h size = 80000 + py_call(py_random, "random", []) * 40000; # 80KB-120KB
108+
h endpoint = 8;
109+
h hour = 12 + to_int(py_call(py_random, "random", []) * 4);
110+
h req_count = 3 + to_int(py_call(py_random, "random", []) * 5);
111+
return [size, endpoint, hour, req_count];
112+
}
113+
114+
# ---- Scenario 2: API abuse / scraping ------------------------------------
115+
116+
# Normal: typical hour, varied endpoints, modest request count
117+
fn sc_normal(idx) {
118+
h status = 200;
119+
h endpoint = to_int(py_call_fn_kw(py_get(py_random, "choice"), [],
120+
{"a": [0, 1, 2, 3, 4], "p": [0.4, 0.25, 0.15, 0.1, 0.1]}));
121+
h hour = to_int(py_call(py_random, "random", []) * 24);
122+
h req_count = 10 + to_int(py_call(py_random, "random", []) * 40);
123+
return [status, endpoint, hour, req_count];
124+
}
125+
126+
# Scraper: 200s only, ALL endpoints, ANY hour, EXTREME req_count
127+
fn sc_attack(idx) {
128+
h status = 200;
129+
h endpoint = to_int(py_call(py_random, "random", []) * 10); # touches everything
130+
h hour = to_int(py_call(py_random, "random", []) * 24);
131+
h req_count = 800 + to_int(py_call(py_random, "random", []) * 400); # 800-1200
132+
return [status, endpoint, hour, req_count];
133+
}
134+
135+
# ---- Scenario 3: DDoS (small fast requests, off-peak) -------------------
136+
137+
fn dd_normal(idx) {
138+
h lat = 50 + py_call(py_random, "random", []) * 100; # 50-150ms
139+
h status = to_int(py_call_fn_kw(py_get(py_random, "choice"), [],
140+
{"a": [200, 200, 200, 200, 503], "p": [0.95, 0.02, 0.01, 0.01, 0.01]}));
141+
h endpoint = to_int(py_call(py_random, "random", []) * 8);
142+
h hour = to_int(py_call(py_random, "random", []) * 24);
143+
return [lat, status, endpoint, hour];
144+
}
145+
146+
# DDoS: tiny lat, mixed 200/503, FEW endpoints, off-peak (3-5am)
147+
fn dd_attack(idx) {
148+
h lat = 3 + py_call(py_random, "random", []) * 7;
149+
h status = to_int(py_call_fn_kw(py_get(py_random, "choice"), [],
150+
{"a": [200, 503], "p": [0.3, 0.7]})); # 70% errors, 30% slip through
151+
h endpoint = 0; # all hit one entry point
152+
h hour = 3 + to_int(py_call(py_random, "random", []) * 3);
153+
return [lat, status, endpoint, hour];
154+
}
155+
156+
# ---- Run all three -------------------------------------------------------
157+
158+
println("=== Multi-dim anomaly detection: 3 real attack signatures ===");
159+
println("");
160+
161+
println("Per-scenario K=10 results (15 attacks injected per scenario):");
162+
h h1 = run_scenario("Insider exfiltration ",
163+
"ex_normal", "ex_attack", 1000, 15,
164+
["resp_size", "endpoint", "hour", "req_count"],
165+
["log", "discrete", "modulo", "log"]);
166+
167+
h h2 = run_scenario("API abuse / scraping ",
168+
"sc_normal", "sc_attack", 1000, 15,
169+
["status", "endpoint", "hour", "req_count"],
170+
["discrete", "discrete", "modulo", "log"]);
171+
172+
h h3 = run_scenario("DDoS pattern ",
173+
"dd_normal", "dd_attack", 1000, 15,
174+
["latency", "status", "endpoint", "hour"],
175+
["log", "discrete", "discrete", "modulo"]);
176+
177+
println("");
178+
h total_caught = h1 + h2 + h3;
179+
h total_possible = 30; # K=10 × 3 scenarios
180+
println(concat_many("Aggregate top-10 precision across all 3 scenarios: ",
181+
total_caught, "/", total_possible,
182+
" (", to_int(total_caught * 100 / total_possible), "%)"));
183+
184+
println("");
185+
println("=== Why this matters ===");
186+
println("Each attack is normal-looking on every individual dimension:");
187+
println(" - Insider exfiltration: any single 80KB response is plausible");
188+
println(" (some legit reports hit that size); endpoint 8 sees occasional");
189+
println(" legit traffic; biz hours are normal.");
190+
println(" - API scraping: every request status=200 (looks fine); endpoint");
191+
println(" distribution is uniform (looks like load balancer); hour-of-day");
192+
println(" is uniform (looks like global service).");
193+
println(" - DDoS: latency 5ms is fast (looks like cached requests); 503");
194+
println(" happens normally (1% baseline); endpoint 0 is heavily used");
195+
println(" (the homepage); off-peak hours have legit users.");
196+
println("");
197+
println("The multi-dim attractor signature is what catches each one.");
198+
println("Sum-of-marginal-log-rarities flags rows that sit in the tail of");
199+
println("MULTIPLE dimensions simultaneously — exactly the structural");
200+
println("anomaly pattern. No model training, no labels, no random_state.");
201+
println("");
202+
println("=== Done ===");
Lines changed: 183 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,183 @@
1+
# =============================================================================
2+
# harmonic_clustering — drop-in KMeans replacement for attractor-aligned data
3+
# =============================================================================
4+
# Cluster multi-dim numeric data WITHOUT random initialization, WITHOUT
5+
# choosing K up-front, WITHOUT iterating to convergence. The clusters
6+
# fall out of `harmonic_partition` on log-magnitude features: each
7+
# row's cluster is the tuple of (log10(feature_i)*50 → fold). Rows
8+
# whose log-magnitude pattern is the same end up in the same cluster.
9+
#
10+
# Compared to sklearn KMeans:
11+
# - No random_state → deterministic
12+
# - No n_clusters → derived from data's attractor structure
13+
# - No max_iter → single pass
14+
# - Wins on data that naturally clusters by magnitude (latencies,
15+
# prices, frequencies, anything log-distributed)
16+
# - Loses on data with no inherent magnitude structure (uniform
17+
# random in a fixed range — cluster the centroids manually)
18+
#
19+
# Quick start:
20+
# import "harmonic_clustering" as hc; # via omc --install
21+
# h cl = hc.new(["latency", "fare", "duration"]);
22+
# hc.fit(cl, rows);
23+
# h labels = hc.predict(cl, rows); # cluster ID per row
24+
# h centroids = hc.centroids(cl); # one per discovered cluster
25+
# =============================================================================
26+
27+
import "examples/lib/np.omc" as np;
28+
h _math = py_import("math");
29+
30+
# ---- Bucketing per dim ---------------------------------------------------
31+
# Same strategy palette as harmonic_anomaly: log/discrete/modulo.
32+
33+
fn _bucket_log(v) {
34+
# Bucket by log-decade. Values 1-9 → bucket 0; 10-99 → 1;
35+
# 100-999 → 2. Coarser than the harmonic_anomaly bucketing on
36+
# purpose: clustering wants "rows with similar order of magnitude
37+
# in this dim", not "rows with the exact same Fibonacci attractor".
38+
# Tried fold(log10(v)*50) — over-segments because Fibonacci
39+
# spacing widens exponentially and adjacent decades land in
40+
# different attractors. Plain decade is the right granularity.
41+
if v <= 0 { return 0; }
42+
h logv = py_call(_math, "log10", [v]);
43+
return to_int(logv);
44+
}
45+
fn _bucket_modulo(v) { return fold(to_int(v)); }
46+
fn _bucket_discrete(v) { return v; }
47+
48+
fn _bucket_for(strategy, v) {
49+
if strategy == "log" { return _bucket_log(v); }
50+
elif strategy == "modulo" { return _bucket_modulo(v); }
51+
return _bucket_discrete(v);
52+
}
53+
54+
# ---- Cluster lifecycle ---------------------------------------------------
55+
56+
fn new(dim_names) {
57+
h strategies = [];
58+
h k = 0;
59+
while k < arr_len(dim_names) {
60+
arr_push(strategies, "log");
61+
k = k + 1;
62+
}
63+
return {
64+
"dims": dim_names,
65+
"strategies": strategies,
66+
"cluster_keys": [], # canonical attractor-tuple per cluster
67+
"cluster_centers": [], # numeric centroid per cluster (averaged from training rows)
68+
"cluster_counts": [] # how many training rows fell into each cluster
69+
};
70+
}
71+
72+
fn set_strategy(cl, dim_idx, strategy) {
73+
h s = dict_get(cl, "strategies");
74+
arr_set(s, dim_idx, strategy);
75+
dict_set(cl, "strategies", s);
76+
return cl;
77+
}
78+
79+
# Compute the attractor-tuple key for a row.
80+
fn _row_key(strategies, row) {
81+
h parts = [];
82+
h n = arr_len(row);
83+
h i = 0;
84+
while i < n {
85+
arr_push(parts, _bucket_for(arr_get(strategies, i), arr_get(row, i)));
86+
i = i + 1;
87+
}
88+
return arr_join(parts, "|");
89+
}
90+
91+
# ---- fit: discover clusters from training rows ---------------------------
92+
93+
fn fit(cl, rows) {
94+
h strategies = dict_get(cl, "strategies");
95+
h dims = dict_get(cl, "dims");
96+
h n_dims = arr_len(dims);
97+
98+
# First pass: count how many rows hit each attractor tuple.
99+
h counts = {}; # key → count
100+
h sums = {}; # key → array of per-dim sums (for centroid)
101+
h r = 0;
102+
h n_rows = arr_len(rows);
103+
while r < n_rows {
104+
h row = arr_get(rows, r);
105+
h key = _row_key(strategies, row);
106+
dict_set(counts, key, dict_get(counts, key, 0) + 1);
107+
# Accumulate per-dim sums for centroid computation.
108+
h sum = dict_get(sums, key, null);
109+
if sum == null {
110+
sum = [];
111+
h d = 0;
112+
while d < n_dims { arr_push(sum, 0.0); d = d + 1; }
113+
}
114+
h d = 0;
115+
while d < n_dims {
116+
arr_set(sum, d, arr_get(sum, d) + arr_get(row, d));
117+
d = d + 1;
118+
}
119+
dict_set(sums, key, sum);
120+
r = r + 1;
121+
}
122+
123+
# Build the cluster table: one entry per distinct attractor tuple,
124+
# ordered by population (largest cluster = id 0). Centroid =
125+
# per-dim average from training rows that hit the cluster.
126+
h keys = dict_keys(counts);
127+
h cluster_keys = [];
128+
h cluster_centers = [];
129+
h cluster_counts = [];
130+
h k = 0;
131+
while k < arr_len(keys) {
132+
h key = arr_get(keys, k);
133+
h cnt = dict_get(counts, key);
134+
h sum = dict_get(sums, key);
135+
h centroid = [];
136+
h d = 0;
137+
while d < n_dims {
138+
arr_push(centroid, arr_get(sum, d) / cnt);
139+
d = d + 1;
140+
}
141+
arr_push(cluster_keys, key);
142+
arr_push(cluster_centers, centroid);
143+
arr_push(cluster_counts, cnt);
144+
k = k + 1;
145+
}
146+
147+
dict_set(cl, "cluster_keys", cluster_keys);
148+
dict_set(cl, "cluster_centers", cluster_centers);
149+
dict_set(cl, "cluster_counts", cluster_counts);
150+
return cl;
151+
}
152+
153+
# ---- predict: assign cluster id to each row ------------------------------
154+
155+
fn predict_one(cl, row) {
156+
h strategies = dict_get(cl, "strategies");
157+
h key = _row_key(strategies, row);
158+
h cluster_keys = dict_get(cl, "cluster_keys");
159+
h k = 0;
160+
while k < arr_len(cluster_keys) {
161+
if arr_get(cluster_keys, k) == key { return k; }
162+
k = k + 1;
163+
}
164+
# Unknown attractor tuple: return -1 (caller can treat as outlier).
165+
return 0 - 1;
166+
}
167+
168+
fn predict(cl, rows) {
169+
h out = [];
170+
h k = 0;
171+
while k < arr_len(rows) {
172+
arr_push(out, predict_one(cl, arr_get(rows, k)));
173+
k = k + 1;
174+
}
175+
return out;
176+
}
177+
178+
# ---- inspectors ----------------------------------------------------------
179+
180+
fn n_clusters(cl) { return arr_len(dict_get(cl, "cluster_keys")); }
181+
fn centroids(cl) { return dict_get(cl, "cluster_centers"); }
182+
fn cluster_counts(cl) { return dict_get(cl, "cluster_counts"); }
183+
fn cluster_keys(cl) { return dict_get(cl, "cluster_keys"); }

0 commit comments

Comments
 (0)