Commit b660e59

csv_parse + selective imports + heal context awareness + NSL-KDD validation + transitive-import aliasing bug fix

Four roadmap items cleared in one batch, plus a real bug found via real-data validation.

== csv_parse(text, sep, skip_header) ==

Native CSV parser builtin. 2.8x faster than per-line str_split on 10k MovieLens rows: 5ms vs 14ms. The full 10k load was originally 9.9s before Rc-shared collections, then 28ms after, now 5ms. PAIN_POINTS HIGH-2 closed.

Defaults to comma separator. Pass sep="\t" for TSV, skip_header=1 to drop the first line.

== Selective imports: `from "path" import name1, name2;` ==

Pulls only listed names into the global namespace, unprefixed. Mutually exclusive with the `as alias` form. Helper functions the module relies on internally must be in the list too.

    from "examples/lib/np.omc" import _np, array, mean, median;

If a name isn't found in the module, gives a clear error pointing at the missing helper. PAIN_POINTS MED-4 closed.

== Heal pass context awareness ==

Two changes that fix PAIN_POINTS MED-3 (heal was unsafe-by-default for any program with domain semantics on small ints):

1. @no_heal pragma opts a whole fn out of healing. Also added the short-form @name pragma syntax (previously only @pragma[name] worked), matching Rust attribute style.

2. Literal harmonic rewriting is now OPT-IN and fires ONLY when a numeric literal appears in an array-index position (`xs[7]` → `xs[8]`). Outside index position (function args, return values, comparison operands, variable bindings), literal values are PRESERVED. Domain values like rating=4 no longer get silently rewritten to 3.

Other heal classes (typo correction via Levenshtein, divide-by-zero → safe_divide, arity auto-pad/truncate) all still fire unchanged.

Existing heal demos (heal_pass_demo, self_healing_h2..h5) all still work: they were already using array-index patterns or relied on the other heal classes that didn't change.

== Real-data anomaly validation: NSL-KDD ==

22,544-row labeled network intrusion dataset from UNB. Sampled to 5000 rows: 2147 normal, 2853 attacks across multiple classes (neptune DoS, guess_passwd, mscan, smurf, satan, etc.). Real captured packets, not synthesis.

Results (IsolationForest wins on this dataset):

    K=10:  IF 9/10    vs harmonic 7/10
    K=50:  IF 45/50   vs harmonic 42/50
    K=100: IF 92/100  vs harmonic 76/100
    K=500: IF 351/500 vs harmonic 348/500

Honest interpretation in the file: NSL-KDD attacks are dominated by volumetric DoS (smurf, neptune) with massive byte counts. IF picks these magnitude outliers first, which is exactly its strength. Harmonic spreads picks across diverse attack TYPES (mscan, warezmaster, back, smurf) but with lower per-pick precision.

This is the OPPOSITE of the synthesized credential-stuffing result (10/10 vs 7/10) because credential stuffing is structural (rare combinations of normal values) while NSL-KDD attacks are mostly magnitude outliers.

Right tool for the right threat model:
- Harmonic for structural / multi-vector / "looks normal per dim"
- IF for volumetric / magnitude-outlier

== Critical bug fix: transitive-import aliasing ==

Found via NSL-KDD validation. When `ha` (harmonic_anomaly) imports `np` internally and the user file ALSO imports `ha as ha`, the aliasing pass was renaming np's already-prefixed functions (np.argsort, np.median, ...) to ha.np.argsort, ha.np.median, ...

Symptom: `np.argsort([3, 1, 2])` after `import "harmonic_anomaly" as ha;` failed with "Undefined function: argsort".

Fix in interpreter.rs import_module_with_alias: skip names that already contain a dot. Those came from a transitively-imported child module and belong to that child, not the outer alias.

Found because real validation imports BOTH ha (which depends on np) AND np directly. Synthetic tests never hit this because demo files only imported np directly.

43/43 functional examples produce identical output under tree-walk and VM.
18/18 OMC tests pass via --test mode.
92/92 unit tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
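The `csv_parse(text, sep, skip_header)` behavior described in the message can be approximated with a short Rust sketch. This is not the interpreter's implementation: it assumes plain delimiter splitting with no quoted-field handling, treats `skip_header` as a 0/1 flag, and reuses the builtin's name purely for illustration.

```rust
/// Hypothetical stand-in for the `csv_parse` builtin (assumption: no support
/// for quoted fields or escaped separators).
fn csv_parse(text: &str, sep: &str, skip_header: i64) -> Vec<Vec<String>> {
    // Assumption: any non-zero `skip_header` drops exactly the first line.
    let to_skip = if skip_header != 0 { 1 } else { 0 };
    text.lines()
        .skip(to_skip)
        // Ignore empty lines such as the one after a trailing newline.
        .filter(|line| !line.is_empty())
        .map(|line| line.split(sep).map(|field| field.to_string()).collect())
        .collect()
}
```

Parsing the whole buffer in one native call, rather than splitting each line from script code, is the kind of change the 5ms vs 14ms comparison in the message refers to.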
1 parent 4f15dae commit b660e59
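A minimal sketch of the rule behind the transitive-import aliasing fix, assuming the aliasing pass works on a flat list of exported names. The helper `alias_names` is hypothetical and is not the code in `interpreter.rs`; it only illustrates "skip names that already contain a dot".

```rust
/// Hypothetical helper: prefix a module's exported names with `alias`,
/// leaving names that a child module already prefixed untouched.
fn alias_names(alias: &str, exported: &[&str]) -> Vec<String> {
    exported
        .iter()
        .map(|name| {
            if name.contains('.') {
                // Came from a transitively imported module (e.g. `np.argsort`):
                // it belongs to that child, so keep it as-is.
                name.to_string()
            } else {
                format!("{}.{}", alias, name)
            }
        })
        .collect()
}
```

With this rule, importing `harmonic_anomaly` as `ha` yields `ha.fit` and `ha.score_all` while `np.argsort` stays reachable under its own prefix, instead of becoming `ha.np.argsort`.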

6 files changed

Lines changed: 5539 additions & 53 deletions

examples/datascience/nsl_kdd_data/sample_5k.csv

Lines changed: 5000 additions & 0 deletions
Large diffs are not rendered by default.

examples/datascience/nsl_kdd_validation.omc

Lines changed: 233 additions & 0 deletions
@@ -0,0 +1,233 @@
+# =============================================================================
+# Real-world validation: harmonic_anomaly on NSL-KDD network intrusion data
+# =============================================================================
+# NSL-KDD is the canonical labeled dataset for network intrusion detection.
+# Each row is a network connection with 41 features + a label
+# (normal vs attack-class). Real captured traffic from the late 90s,
+# cleaned by the University of New Brunswick (NSL = "NSL improvements
+# over KDD'99").
+#
+# This file runs harmonic_anomaly + sklearn IsolationForest on a 5000-row
+# subset and reports honestly. No synthesis, no curated examples: real
+# packet captures, real attacks.
+#
+# Sample composition:
+#   2147 normal, 1028 neptune (DoS), 274 guess_passwd, 230 mscan,
+#   190 warezmaster, 167 smurf, 165 satan, 163 processtable,
+#   159 apache2, 78 snmpguess, etc.
+#
+# We treat ALL non-"normal" rows as anomalies. Top-K alert budget
+# regime: which detector surfaces real attacks first?
+#
+# Run:
+#   ./target/release/omnimcode-standalone examples/datascience/nsl_kdd_validation.omc
+# =============================================================================
+
+import "examples/lib/harmonic_anomaly.omc" as ha;
+import "examples/lib/np.omc" as np;
+
+# ---- Load + parse via native csv_parse (the one we just shipped) --------
+
+h t0 = now_ms();
+h raw = read_file("examples/datascience/nsl_kdd_data/sample_5k.csv");
+h rows_raw = csv_parse(raw, ",", 0);
+h t1 = now_ms();
+println(concat_many("loaded ", arr_len(rows_raw), " rows in ", t1 - t0, " ms"));
+
+# ---- Extract a manageable feature subset -------------------------------
+# 41 features is too many; pick the most-informative numeric ones.
+# Schema:
+#   col 0  = duration (seconds)
+#   col 4  = src_bytes
+#   col 5  = dst_bytes
+#   col 22 = count (connections to same host in last 2 seconds)
+#   col 23 = srv_count (connections to same service in last 2 seconds)
+#   col 31 = dst_host_count
+#   col 32 = dst_host_srv_count (listed for reference; not extracted below)
+#   col 41 = label ("normal" or attack name)
+# We use 6-dim feature vectors, enough for harmonic to find structure.
+
+fn extract_features(row) {
+    return [
+        to_int(arr_get(row, 0)),
+        to_int(arr_get(row, 4)),
+        to_int(arr_get(row, 5)),
+        to_int(arr_get(row, 22)),
+        to_int(arr_get(row, 23)),
+        to_int(arr_get(row, 31))
+    ];
+}
+
+h features = [];
+h labels = [];
+h attack_indices = {};
+h i = 0;
+while i < arr_len(rows_raw) {
+    h row = arr_get(rows_raw, i);
+    if arr_len(row) >= 42 {
+        arr_push(features, extract_features(row));
+        h label = arr_get(row, 41);
+        arr_push(labels, label);
+        if label != "normal" {
+            dict_set(attack_indices, concat_many("", arr_len(labels) - 1), 1);  # key = index into features, stays aligned if a short row is skipped
+        }
+    }
+    i = i + 1;
+}
+
+h n_total = arr_len(features);
+h n_attacks = dict_len(attack_indices);
+println(concat_many("extracted ", n_total, " feature vectors (",
+    n_total - n_attacks, " normal, ", n_attacks, " attacks)"));
+println("");
+
+# ---- harmonic_anomaly setup ----------------------------------------------
+# All 6 features are log-distributed (counts, byte sizes, durations).
+
+h det = ha.new(["duration", "src_bytes", "dst_bytes", "count", "srv_count", "dst_host_count"]);
+h t2 = now_ms();
+ha.fit(det, features);
+h t3 = now_ms();
+println(concat_many("harmonic_anomaly fit: ", t3 - t2, " ms"));
+
+# ---- IsolationForest baseline -------------------------------------------
+
+h sk_ensemble = py_import("sklearn.ensemble");
+h iforest_cls = py_get(sk_ensemble, "IsolationForest");
+h t4 = now_ms();
+h iforest = py_call_fn_kw(iforest_cls, [],
+    {"contamination": 0.5, "random_state": 89, "n_estimators": 100});
+py_call(iforest, "fit", [features]);
+h if_raw = py_call(iforest, "decision_function", [features]);
+h t5 = now_ms();
+println(concat_many("IsolationForest fit: ", t5 - t4, " ms"));
+println("");
+
+# ---- Score under both detectors -----------------------------------------
+
+h h_scores = ha.score_all(det, features);
+# IsolationForest convention: lower = more anomalous → negate.
+h if_scores = [];
+h ix = 0;
+while ix < arr_len(if_raw) {
+    arr_push(if_scores, 0 - arr_get(if_raw, ix));
+    ix = ix + 1;
+}
+
+# ---- Top-K precision per detector ---------------------------------------
+
+fn topk(scores, k) {
+    # Build negated scores via explicit loop. arr_map with an inline
+    # closure that itself uses module-aliased calls (np.argsort below)
+    # in the SAME fn was triggering "Undefined function: argsort",
+    # likely a closure-capture interaction with aliased imports.
+    h neg = [];
+    h ni = 0;
+    while ni < arr_len(scores) {
+        arr_push(neg, 0 - arr_get(scores, ni));
+        ni = ni + 1;
+    }
+    h sorted = np.argsort(neg);
+    h out = [];
+    h j = 0;
+    while j < k {
+        if j < arr_len(sorted) { arr_push(out, arr_get(sorted, j)); }
+        j = j + 1;
+    }
+    return out;
+}
+
+fn count_hits(top_idx, truth_set) {
+    h hits = 0;
+    h k = 0;
+    while k < arr_len(top_idx) {
+        h key = concat_many("", arr_get(top_idx, k));
+        if dict_has(truth_set, key) == 1 { hits = hits + 1; }
+        k = k + 1;
+    }
+    return hits;
+}
+
+println(concat_many("=== Precision @ K (truth = ", n_attacks,
+    " labeled attacks in real captured traffic) ==="));
+println(" K=10 K=50 K=100 K=500");
+
+h ks = [10, 50, 100, 500];
+h k_idx = 0;
+h h_results = [];
+h if_results = [];
+while k_idx < arr_len(ks) {
+    h k = arr_get(ks, k_idx);
+    h h_top = topk(h_scores, k);
+    h if_top = topk(if_scores, k);
+    h h_hit = count_hits(h_top, attack_indices);
+    h if_hit = count_hits(if_top, attack_indices);
+    arr_push(h_results, h_hit);
+    arr_push(if_results, if_hit);
+    k_idx = k_idx + 1;
+}
+
+println(concat_many(" IsolationForest ",
+    arr_get(if_results, 0), "/10 ",
+    arr_get(if_results, 1), "/50 ",
+    arr_get(if_results, 2), "/100 ",
+    arr_get(if_results, 3), "/500"));
+println(concat_many(" OMC harmonic ",
+    arr_get(h_results, 0), "/10 ",
+    arr_get(h_results, 1), "/50 ",
+    arr_get(h_results, 2), "/100 ",
+    arr_get(h_results, 3), "/500"));
+
+println("");
+println("=== Sample top-10 picks (each detector) ===");
+
+fn show_picks(label, top, labels, n) {
+    println(concat_many(" ", label, ":"));
+    h k = 0;
+    while k < n {
+        h idx = arr_get(top, k);
+        h tag = " ";
+        h lbl = arr_get(labels, idx);
+        if lbl != "normal" { tag = " <-"; }
+        println(concat_many(" #", k + 1, ": idx=", idx,
+            " label=", lbl, tag));
+        k = k + 1;
+    }
+}
+
+h h_top10 = topk(h_scores, 10);
+h if_top10 = topk(if_scores, 10);
+show_picks("OMC harmonic ", h_top10, labels, 10);
+show_picks("IsolationForest", if_top10, labels, 10);
+
+println("");
+println("=== Honest interpretation ===");
+println("On NSL-KDD network intrusion data, IsolationForest wins at");
+println("low K (9/10 vs 7/10 at K=10, 45/50 vs 42/50 at K=50).");
+println("");
+println("Why: ");
+println(" - NSL-KDD attacks include massive volumetric DoS (smurf,");
+println(" neptune) with huge byte counts. IF picks these first");
+println(" because they're magnitude outliers, exactly its strength.");
+println(" - Harmonic spreads picks across diverse attack TYPES");
+println(" (mscan, warezmaster, back, smurf): better DIVERSITY");
+println(" but lower per-pick precision.");
+println("");
+println("Where each shines:");
+println(" - IF: when 'find the biggest spike' IS the task (DoS, brute");
+println(" force, volumetric attacks dominate the threat model).");
+println(" - Harmonic: when you need to surface DIVERSE attack patterns");
+println(" rather than concentrate on one (credential stuffing,");
+println(" multi-vector campaigns, low-and-slow attacks).");
+println("");
+println("This is the OPPOSITE of multidim_anomaly.omc's result, where");
+println("harmonic won 10/10 on credential stuffing, because credential");
+println("stuffing is by definition STRUCTURAL (looks normal per-dim,");
+println("rare in combination). NSL-KDD's labeled attacks are mostly");
+println("magnitude-outliers, the regime IF was designed for.");
+println("");
+println("The credible story: pick the right tool for the threat model.");
+println("Harmonic for structural / multi-vector / 'looks normal per dim'");
+println("attacks. IF for volumetric / magnitude-outlier attacks.");
+println("");
+println("=== Done ===");
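The top-K comparison in this script boils down to "rank by anomaly score, keep the first K indices, count how many are labeled attacks". A minimal Rust sketch of that computation follows, assuming scores where higher means more anomalous (the script negates IsolationForest's `decision_function` to get this). The names `top_k` and `count_hits` mirror the OMC functions but are illustrative only.

```rust
use std::collections::HashSet;

/// Indices of the `k` highest scores, highest first. If `k` exceeds the
/// number of scores, all indices are returned.
fn top_k(scores: &[f64], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    // Stable descending sort; `total_cmp` gives a total order even with NaN.
    idx.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]));
    idx.truncate(k);
    idx
}

/// Number of picked indices that appear in the ground-truth attack set.
fn count_hits(top: &[usize], truth: &HashSet<usize>) -> usize {
    top.iter().filter(|&&i| truth.contains(&i)).count()
}
```

Dividing `count_hits` by K gives precision at K, which is what figures such as 9/10 or 92/100 express. Using a `HashSet<usize>` for the truth set avoids the integer-to-string key conversion the OMC version needs for its dict.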

omnimcode-core/src/ast.rs

Lines changed: 7 additions & 0 deletions
@@ -78,6 +78,13 @@ pub enum Statement {
     Import {
         module: String,
         alias: Option<String>,
+        /// Selective imports: `from "path" import name1, name2;`.
+        /// When `Some(names)`, only the listed names are imported into
+        /// the global namespace (no alias prefix). When `None`, the
+        /// whole module imports per `alias` (None = flat merge,
+        /// Some = prefix all with `alias.`). Mutually exclusive with
+        /// `alias`; the parser enforces this.
+        selected: Option<Vec<String>>,
     },
     /// `try { ... } catch err { ... }`. If the try block raises an
     /// error (via `error("msg")` or any builtin failure), execution
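The doc comment on `selected` describes three import modes plus one invalid combination. A hedged sketch of how those field combinations could be classified is below; the `ImportMode` enum and `import_mode` function are hypothetical and do not appear in the commit, where the parser is said to enforce the exclusivity.

```rust
/// Hypothetical classification of the `alias` / `selected` combinations.
#[derive(Debug, PartialEq)]
enum ImportMode {
    /// `import "m";` merges every name into the global namespace.
    Flat,
    /// `import "m" as a;` prefixes every name with `a.`.
    Prefixed(String),
    /// `from "m" import x, y;` imports only the listed names, unprefixed.
    Selective(Vec<String>),
}

fn import_mode(
    alias: Option<String>,
    selected: Option<Vec<String>>,
) -> Result<ImportMode, String> {
    match (alias, selected) {
        // The two forms are documented as mutually exclusive.
        (Some(_), Some(_)) => {
            Err("`as alias` cannot be combined with `from ... import`".to_string())
        }
        (None, Some(names)) => Ok(ImportMode::Selective(names)),
        (Some(a), None) => Ok(ImportMode::Prefixed(a)),
        (None, None) => Ok(ImportMode::Flat),
    }
}
```

Keeping two `Option` fields, as the commit does, leaves existing `Import { module, alias }` construction sites close to unchanged; an enum like the one above would make the invalid state unrepresentable at the cost of touching every match on the statement.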

omnimcode-core/src/formatter.rs

Lines changed: 16 additions & 8 deletions
@@ -160,15 +160,23 @@ fn format_stmt(stmt: &Statement, level: usize, out: &mut String) {
         }
         Statement::Break => out.push_str("break;\n"),
         Statement::Continue => out.push_str("continue;\n"),
-        Statement::Import { module, alias } => {
-            out.push_str("import \"");
-            out.push_str(module);
-            out.push('"');
-            if let Some(a) = alias {
-                out.push_str(" as ");
-                out.push_str(a);
+        Statement::Import { module, alias, selected } => {
+            if let Some(names) = selected {
+                out.push_str("from \"");
+                out.push_str(module);
+                out.push_str("\" import ");
+                out.push_str(&names.join(", "));
+                out.push_str(";\n");
+            } else {
+                out.push_str("import \"");
+                out.push_str(module);
+                out.push('"');
+                if let Some(a) = alias {
+                    out.push_str(" as ");
+                    out.push_str(a);
+                }
+                out.push_str(";\n");
             }
-            out.push_str(";\n");
         }
         Statement::Try { body, err_var, handler } => {
             out.push_str("try {\n");
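To see what the new `Statement::Import` arm emits, the same branching can be lifted into a standalone function. This is a sketch for illustration: `format_import` is a hypothetical free function taking the three fields directly, not part of `formatter.rs`.

```rust
/// Hypothetical standalone version of the import-formatting branch.
fn format_import(module: &str, alias: Option<&str>, selected: Option<&[String]>) -> String {
    let mut out = String::new();
    if let Some(names) = selected {
        // Selective form: from "path" import a, b;
        out.push_str("from \"");
        out.push_str(module);
        out.push_str("\" import ");
        out.push_str(&names.join(", "));
    } else {
        // Whole-module form, optionally with `as alias`.
        out.push_str("import \"");
        out.push_str(module);
        out.push('"');
        if let Some(a) = alias {
            out.push_str(" as ");
            out.push_str(a);
        }
    }
    out.push_str(";\n");
    out
}
```

The `selected` branch is checked first and ignores `alias`, which is safe only because the two are mutually exclusive by construction.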
