|
| 1 | +# Ontology-Augment Eval — in-flight data ledger (2026-06-14) |
| 2 | + |
| 3 | +## Objective |
| 4 | +Measure the value of the PRD-020/ADR-112 ontology binding on LLM recall/accuracy. |
| 5 | +A/B: ontology-**augmented** vs **control** (parametric-only), across **opus** & **sonnet**. |
| 6 | + |
| 7 | +## KG-as-oracle ground truth |
| 8 | +Gold derived from the live KG (the oracle), not human judgement: |
| 9 | +- concept→IRI via `POST /api/ontology-agent/discover` |
| 10 | +- neighbours/subclasses via SPARQL over `GRAPH <urn:ngm:graph:ontology:assert>` |
| 11 | + (predicates: enables 11450, relatedTo 10016, requires 9156, hasPart 9071, |
| 12 | + uses 7509, subClassOf 6056, supports 4350, implements, dependsOn, …) |
| 13 | + |
| 14 | +## Design |
| 15 | +- 16 questions: 8 neighbour-recall, 4 subclass, 4 existence. Mix of familiar tech |
| 16 | + (control can partly guess) + DreamLab-niche (control cannot). |
| 17 | +- Isolation: each cell = a fresh `general-purpose` subagent given ONLY the question. |
| 18 | + AUG arm grounds via the `ontology-augment` CLI (`ontology_ask`); control uses |
| 19 | + parametric knowledge only (no tools/lookup). 3 reps/cell → 16×2×2×3 = 192 runs. |
| 20 | +- Grader (deterministic): token-set greedy 1:1 match (singularised), recall@12 for |
| 21 | + neighbour/subclass, any-match recall for existence, precision, F1, hallucination. |
| 22 | + Smoke-validated: simulated AUG F1 0.83 vs ctl 0.50; non-saturated. |
| 23 | + |
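The grader description above can be sketched as follows. This is a minimal illustration, not the contents of `grade.py`: the Jaccard similarity, the 0.5 match threshold, the naive trailing-"s" singulariser, and the definition of hallucination as the share of unmatched predictions are all assumptions made for the example.

```python
import re


def tokens(term: str) -> frozenset[str]:
    """Lowercase, split on non-alphanumerics, naively singularise each token."""
    parts = re.split(r"[^a-z0-9]+", term.lower())
    # Assumption: singularise by dropping one trailing "s" (but not "ss").
    return frozenset(
        p[:-1] if len(p) > 3 and p.endswith("s") and not p.endswith("ss") else p
        for p in parts
        if p
    )


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def grade(predicted: list[str], gold: list[str], k: int = 12,
          threshold: float = 0.5) -> dict[str, float]:
    """Greedy 1:1 matching of the top-k predictions against the gold set."""
    preds = [tokens(p) for p in predicted[:k]]
    golds = [tokens(g) for g in gold]
    unmatched = set(range(len(golds)))
    hits = 0
    for p in preds:
        if not unmatched:
            break
        # Pick the best still-unmatched gold item; sorted() keeps ties deterministic.
        best = max(sorted(unmatched), key=lambda i: jaccard(p, golds[i]))
        if jaccard(p, golds[best]) >= threshold:
            unmatched.remove(best)  # each gold item can be matched only once
            hits += 1
    recall = hits / len(golds) if golds else 0.0
    precision = hits / len(preds) if preds else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "hallucination": 1.0 - precision if preds else 0.0,
        "any_match": 1.0 if hits else 0.0,  # the existence-style score
    }


result = grade(
    ["zk-SNARKs", "zk-rollups", "prover"],
    ["zk-snarks", "zk-starks", "bulletproofs", "trusted-setup", "zk-rollup"],
)
print(result)
```

In this toy call two of three predictions match gold localnames after singularisation, so precision is 2/3 and recall is 2/5.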
| 24 | +## Qualitative pattern (observed across waves 1–3) |
| 25 | +AUG recovers the KG's *exact* localnames; control gives plausible-but-generic terms |
| 26 | +that miss the KG vocabulary (→ high hallucination): |
| 27 | +- ZKP: AUG {zk-snarks, zk-starks, bulletproofs, trusted-setup, zk-rollup} vs |
| 28 | + ctl {fiat-shamir-heuristic, prover, verifier, soundness, completeness}. |
| 29 | +- Gaussian splatting: AUG {gpu-rasterisation, adaptive-density-control, |
| 30 | + ssim-loss, photorealistic-telepresence} vs ctl {neural-radiance-field, |
| 31 | + multi-view-stereo, volumetric-rendering}. |
| 32 | +- multi-agent-system: AUG {sandboxed-code-execution, model-context-protocol, |
| 33 | + orchestration-protocol} vs ctl {agent-spawning, swarm-topology, message-passing}. |
| 34 | + |
| 35 | +## Cost (in-flight timing sample, n=41 → timing_sample.csv) |
| 36 | +AUG mean ≈ 36.8k tokens / 41.0s (5–10 tool calls); |
| 37 | +ctl mean ≈ 29.8k tokens / 14.8s (3–4 tool calls). |
| 38 | +Grounding costs ~24% more tokens and ~2.8× wall-clock — the cost side of the delta. |
| 39 | + |
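The overhead figures follow directly from the two sample means quoted above. A quick check of the arithmetic, using only those four numbers (the variable names are illustrative):

```python
# Means from the in-flight timing sample (n=41), as quoted in the ledger.
aug_tokens_k, aug_seconds = 36.8, 41.0
ctl_tokens_k, ctl_seconds = 29.8, 14.8

# Relative token overhead of the augmented arm, in percent.
token_overhead_pct = (aug_tokens_k / ctl_tokens_k - 1.0) * 100.0
# Wall-clock ratio of augmented to control.
wallclock_ratio = aug_seconds / ctl_seconds

print(f"token overhead: {token_overhead_pct:.1f}%")
print(f"wall-clock ratio: {wallclock_ratio:.2f}x")
```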
| 40 | +## Known nuance (a Phase A target) |
| 41 | +`ontology_ask` expand returns OUTGOING triples only. Subclass **children** are |
| 42 | +incoming edges (`?child subClassOf seed`), so the CLI may under-serve subclass |
| 43 | +questions unless the agent surfaces children via the discover seed list. Expect a |
| 44 | +weaker AUG delta on the 4 subclass items → candidate skill improvement: document |
| 45 | +a `graph_query` SPARQL pattern for incoming edges. |
| 46 | + |
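The outgoing/incoming distinction above comes down to which side of the triple pattern the seed sits on. The sketch below only builds the query text; it does not call `graph_query` or any endpoint. It assumes the KG's `subClassOf` predicate is `rdfs:subClassOf` (the ledger does not state its namespace), and the seed IRI in the usage line is hypothetical.

```python
GRAPH_IRI = "urn:ngm:graph:ontology:assert"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"  # assumption: subClassOf is rdfs:subClassOf


def subclass_query(seed_iri: str, direction: str = "incoming", limit: int = 50) -> str:
    """Build a SPARQL query for subclass edges touching seed_iri.

    incoming: ?node subClassOf seed  (children of the seed)
    outgoing: seed subClassOf ?node  (parents of the seed)
    """
    if direction == "incoming":
        pattern = f"?node rdfs:subClassOf <{seed_iri}> ."
    elif direction == "outgoing":
        pattern = f"<{seed_iri}> rdfs:subClassOf ?node ."
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    return (
        f"PREFIX rdfs: <{RDFS}>\n"
        f"SELECT DISTINCT ?node WHERE {{\n"
        f"  GRAPH <{GRAPH_IRI}> {{ {pattern} }}\n"
        f"}} LIMIT {limit}"
    )


# Hypothetical seed IRI, for illustration only.
print(subclass_query("http://example.org/onto#zero-knowledge-proof"))
```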
| 47 | +## Artefacts |
| 48 | +- gold.json (16 Q + gold), evals/evals.json |
| 49 | +- workspace/iteration-1/<name>/<model>/{with_skill,without_skill}/r<rep>/output.json |
| 50 | +- grade.py, batch_grade.py, timing_sample.csv |
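The output path template above, combined with the 16×2×2×3 design, can be enumerated as a sanity check on the expected run count. The question names (`q01`…`q16`) and 1-indexed reps are placeholders. It is also an assumption that `with_skill` is the AUG arm and `without_skill` the control arm.

```python
from itertools import product
from pathlib import PurePosixPath

QUESTIONS = [f"q{i:02d}" for i in range(1, 17)]  # hypothetical names for the 16 questions
MODELS = ["opus", "sonnet"]
ARMS = ["with_skill", "without_skill"]  # assumed: AUG and control respectively
REPS = [1, 2, 3]  # assumed 1-indexed


def run_paths(root: str = "workspace/iteration-1") -> list[PurePosixPath]:
    """List every expected output.json, one per (question, model, arm, rep) cell."""
    return [
        PurePosixPath(root, name, model, arm, f"r{rep}", "output.json")
        for name, model, arm, rep in product(QUESTIONS, MODELS, ARMS, REPS)
    ]


paths = run_paths()
print(len(paths))
print(paths[0])
```

Comparing this list against the files actually on disk gives a simple way to spot missing cells before grading.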