Commit 19292ee

docs(eval): ontology-augment KG-as-oracle A/B evaluation (PRD-020)
192-run isolated evaluation measuring the value of the ontology binding on LLM recall/accuracy. KG-as-its-own-oracle ground truth; augmented vs control x {Opus,Sonnet} x 3 reps.

Headline: F1 0.352 -> 0.714 (+0.362), hallucination 0.655 -> 0.309; model-agnostic; gains largest on project-specific concepts.

- ontology-augment-evaluation.pdf (5pp, PGFPlots charts) + report.tex
- summary.json, results.csv, gold.json, timing_sample.csv, REPORT_DATA.md
- identifies subclass-children retrieval gap (outgoing-only expand) as the next skill-tuning target.

Co-Authored-By: jjohare <github@thedreamlab.uk>
1 parent 1ce89c0 commit 19292ee

7 files changed

Lines changed: 1019 additions & 0 deletions

docs/eval/REPORT_DATA.md

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
# Ontology-Augment Eval: in-flight data ledger (2026-06-14)

## Objective
Measure the value of the PRD-020/ADR-112 ontology binding on LLM recall and accuracy.
A/B: ontology-**augmented** vs **control** (parametric-only), across **opus** and **sonnet**.

## KG-as-oracle ground truth
Gold is derived from the live KG (the oracle), not from human judgement:
- concept→IRI via `POST /api/ontology-agent/discover`
- neighbours/subclasses via SPARQL over `GRAPH <urn:ngm:graph:ontology:assert>`
  (predicates: enables 11450, relatedTo 10016, requires 9156, hasPart 9071,
  uses 7509, subClassOf 6056, supports 4350, implements, dependsOn, …)

## Design
- 16 questions: 8 neighbour-recall, 4 subclass, 4 existence. Mix of familiar tech
  (control can partly guess) and DreamLab-niche concepts (control cannot).
- Isolation: each cell = a fresh `general-purpose` subagent given ONLY the question.
  AUG arm grounds via the `ontology-augment` CLI (`ontology_ask`); control uses
  parametric knowledge only (no tools/lookup). 3 reps/cell →
  16 questions × 2 arms × 2 models × 3 reps = 192 runs.
- Grader (deterministic): token-set greedy 1:1 match (singularised), recall@12 for
  neighbour/subclass, any-match recall for existence, precision, F1, hallucination.
  Smoke-validated: simulated AUG F1 0.83 vs ctl 0.50; non-saturated.
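
The grader bullet above can be sketched roughly as follows. This is an illustration only, not the contents of `grade.py`. It makes several assumptions: terms match when their singularised token sets are equal, the singularisation rule is a naive trailing-`s` strip, recall@12 truncates the prediction list to its first 12 items, and hallucination is taken as 1 − precision. All function names are hypothetical.

```python
import re


def singularise(token: str) -> str:
    """Naive singularisation (assumption): drop one trailing 's', but not 'ss'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def token_set(term: str) -> frozenset:
    """Lower-case, split on non-alphanumerics, singularise each token."""
    return frozenset(
        singularise(t) for t in re.split(r"[^a-z0-9]+", term.lower()) if t
    )


def grade(predicted, gold, k=12, any_match=False):
    """Greedy 1:1 matching of predictions against gold terms.

    Each of the first k predictions claims the first unused gold term whose
    token set is equal (assumed match rule). With any_match=True, recall is
    1.0 as soon as one prediction matches (existence-style questions).
    """
    top = list(predicted)[:k]
    gold_sets = [token_set(g) for g in gold]
    used = set()
    hits = 0
    for p in top:
        p_set = token_set(p)
        for i, g_set in enumerate(gold_sets):
            if i not in used and p_set == g_set:
                used.add(i)  # a gold term can be matched only once
                hits += 1
                break
    precision = hits / len(top) if top else 0.0
    if any_match:
        recall = 1.0 if hits else 0.0
    else:
        recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    hallucination = 1.0 - precision if top else 0.0  # assumed definition
    return {"precision": precision, "recall": recall,
            "f1": f1, "hallucination": hallucination}


if __name__ == "__main__":
    gold = ["zk-snarks", "zk-starks", "bulletproofs", "trusted-setup", "zk-rollup"]
    answer = ["zk-SNARKs", "bulletproof", "prover", "verifier"]
    print(grade(answer, gold))
```

With this rule, two of the four predicted terms match the five gold terms, so the generic terms `prover` and `verifier` lower precision and count toward hallucination.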

## Qualitative pattern (observed across waves 1–3)
AUG recovers the KG's *exact* localnames; control gives plausible-but-generic terms
that miss the KG vocabulary (→ high hallucination):
- ZKP: AUG {zk-snarks, zk-starks, bulletproofs, trusted-setup, zk-rollup} vs
  ctl {fiat-shamir-heuristic, prover, verifier, soundness, completeness}.
- Gaussian splatting: AUG {gpu-rasterisation, adaptive-density-control,
  ssim-loss, photorealistic-telepresence} vs ctl {neural-radiance-field,
  multi-view-stereo, volumetric-rendering}.
- multi-agent-system: AUG {sandboxed-code-execution, model-context-protocol,
  orchestration-protocol} vs ctl {agent-spawning, swarm-topology, message-passing}.

## Cost (in-flight timing sample, n=41 → timing_sample.csv)
AUG mean ≈ 36.8k tokens / 41.0 s (5–10 tool calls);
ctl mean ≈ 29.8k tokens / 14.8 s (3–4 tool calls).
Grounding costs ~24% more tokens and ~2.8× wall-clock time: the cost side of the delta.
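
The overhead figures can be re-derived from the rounded means quoted above. The snippet below is a quick arithmetic check with illustrative variable names. Because it starts from rounded means rather than the rows of `timing_sample.csv`, it gives about 23.5% and 2.77×, in line with the ~24% and ~2.8× stated.

```python
# Rounded means as quoted in the ledger (not recomputed from timing_sample.csv).
aug_tokens, ctl_tokens = 36.8e3, 29.8e3
aug_secs, ctl_secs = 41.0, 14.8

token_overhead = aug_tokens / ctl_tokens - 1.0  # fractional extra tokens for AUG
time_ratio = aug_secs / ctl_secs                # AUG wall-clock relative to control

print(f"token overhead: {token_overhead:.1%}")
print(f"wall-clock ratio: {time_ratio:.2f}x")
```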

## Known nuance (a Phase A target)
`ontology_ask` expand returns OUTGOING triples only. Subclass-**children** are
incoming (`?child subClassOf seed`), so the CLI may under-serve subclass questions
unless the agent surfaces children via the discover seed list. Expect a smaller/weaker
AUG delta on the 4 subclass items → candidate skill improvement (document
`graph_query` SPARQL for incoming edges).
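
As a sketch of what the incoming-edge lookup might look like, the helper below builds a SPARQL query with the seed in the object position, scoped to the ontology graph named earlier. Two things are assumptions: the function name, and the use of `rdfs:subClassOf` as the predicate IRI (the ledger only gives the local name `subClassOf`). How the query string is passed to `graph_query` is not shown, since that interface is not described here.

```python
ONTOLOGY_GRAPH = "urn:ngm:graph:ontology:assert"  # graph name from the ledger

# Assumption: the KG's subClassOf predicate is rdfs:subClassOf.
# Replace with the KG's actual predicate IRI if it differs.
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"


def children_query(seed_iri: str,
                   predicate_iri: str = RDFS_SUBCLASS_OF,
                   limit: int = 50) -> str:
    """Build a SPARQL query for INCOMING edges: subjects that point at the seed."""
    return (
        "SELECT DISTINCT ?child WHERE {\n"
        f"  GRAPH <{ONTOLOGY_GRAPH}> {{\n"
        f"    ?child <{predicate_iri}> <{seed_iri}> .\n"
        "  }\n"
        "}\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # Hypothetical seed IRI, for illustration only.
    print(children_query("https://example.org/onto#zero-knowledge-proof"))
```

An outgoing expand would instead put the seed in the subject position (`<seed> ?p ?o`), which is why children never appear in it.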

## Artefacts
- gold.json (16 Q + gold), evals/evals.json
- `workspace/iteration-1/<name>/<model>/{with_skill,without_skill}/r<rep>/output.json`
- grade.py, batch_grade.py, timing_sample.csv
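
The workspace layout above implies one `output.json` per run. The sketch below enumerates that grid to confirm it yields 192 paths. It assumes the model directories are named `opus` and `sonnet`, that reps are numbered `r1` to `r3`, and that `with_skill` is the AUG arm. The question names `q01` to `q16` are placeholders, not the real `<name>` values.

```python
from itertools import product
from pathlib import PurePosixPath

ARMS = ("with_skill", "without_skill")  # assumed: with_skill = AUG, without_skill = control
MODELS = ("opus", "sonnet")             # assumed directory names
REPS = (1, 2, 3)                        # assumed 1-based rep numbering


def run_paths(question_names, root="workspace/iteration-1"):
    """List the expected output.json path for every (question, model, arm, rep) cell."""
    return [
        PurePosixPath(root) / name / model / arm / f"r{rep}" / "output.json"
        for name, model, arm, rep in product(question_names, MODELS, ARMS, REPS)
    ]


names = [f"q{i:02d}" for i in range(1, 17)]  # placeholder question names
paths = run_paths(names)

print(len(paths))  # 16 x 2 x 2 x 3 = 192
print(paths[0])
```

A list like this could be checked against the files actually on disk before batch grading, to catch missing or duplicated runs.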
