 % =============================================================================
 \section*{Abstract}
 We measure whether grounding a language model in DreamLab's formal knowledge graph (the
-\texttt{ontology-augment} skill, PRD-020) improves factual recall---and whether that improvement
+\texttt{ontology-augment} skill, PRD-020) improves factual recall, and whether that improvement
 lets a \emph{local} model outperform cloud frontier models running without grounding.
 Using the knowledge graph as its own oracle, we auto-generate 16 questions with exact ground truth
 and run a $2 \times 7 \times 3$ design: \{augmented, control\} $\times$ 7 models $\times$ 3 reps,
@@ -97,7 +97,7 @@ \section{Related Work}
 \section{Method}

 \begin{samepage}
-\textbf{KG-as-oracle.} Ground truth is generated from the live KG, not human judgement:
+\textbf{KG-as-oracle.} We generate ground truth from the live KG, not human judgement:
 concept\,$\to$\,IRI via the discovery endpoint, then neighbours / subclass edges via read-only
 SPARQL over the asserted graph (predicates \texttt{enables, requires, uses, supports, hasPart,
 implements, dependsOn, subClassOf}). This makes scoring objective by construction.
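As an illustration of the read-only lookup described in this hunk, the following is a minimal sketch of how a neighbour query might be assembled. The namespace `DREAMLAB_NS`, the function name `neighbour_query`, and the choice of `rdfs:subClassOf` for the subclass edge are assumptions made for the example; the source does not give the actual IRIs or the discovery endpoint.

```python
# Hypothetical namespace: the real DreamLab predicate IRIs are not given in the source.
DREAMLAB_NS = "https://example.org/dreamlab#"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"

# Predicate names as listed in the Method paragraph (subClassOf handled separately).
PREDICATES = ["enables", "requires", "uses", "supports",
              "hasPart", "implements", "dependsOn"]


def neighbour_query(concept_iri: str) -> str:
    """Build a read-only SELECT that lists a concept's outgoing edges.

    Only a query string is produced; sending it to an endpoint is out of scope.
    """
    # Assumption: subClassOf in the paper corresponds to rdfs:subClassOf.
    values = " ".join(f"dl:{p}" for p in PREDICATES) + " rdfs:subClassOf"
    return (
        f"PREFIX dl: <{DREAMLAB_NS}>\n"
        f"PREFIX rdfs: <{RDFS_NS}>\n"
        "SELECT ?predicate ?neighbour WHERE {\n"
        f"  VALUES ?predicate {{ {values} }}\n"
        f"  <{concept_iri}> ?predicate ?neighbour .\n"
        "}"
    )


if __name__ == "__main__":
    print(neighbour_query("https://example.org/dreamlab#ExampleConcept"))
```

Because the query is a plain `SELECT` over asserted triples, the returned neighbour set can serve directly as the answer key for a generated question.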
@@ -258,7 +258,7 @@ \section{Cost, Privacy, and Inference Economics}
 \textbf{DG 26B AUG (local)} & \textbf{0.505} & \textbf{No} & \textbf{\$\,0\rlap{*}} & \textbf{2--4\,s} \\
 \bottomrule
 \end{tabular}
-\caption{Cost and privacy comparison. DiffusionGemma runs entirely on-premise---no data leaves
+\caption{Cost and privacy comparison. DiffusionGemma runs entirely on-premise: no data leaves
 the organisation, no per-query API cost, lowest latency. The ontology condense pass is a one-off
 6-hour batch (7{,}445 classes), not per-query.\\
 {\footnotesize *Amortised electricity/hardware only; no API billing.}}
@@ -278,11 +278,11 @@ \section{Cost, Privacy, and Inference Economics}
 \section{Caveats on Lift Interpretation}
 \label{sec:caveats}

-The headline finding---that a local model with ontology grounding outperforms frontier models
-without it---requires careful interpretation. We detail the caveats below.
+The headline finding, that a local model with ontology grounding outperforms frontier models
+without it, requires careful interpretation.

 \begin{samepage}
-\subsection{The comparison is grounded-local vs.\ bare-frontier, not like-for-like}
+\subsection{Grounded-local vs.\ bare-frontier: not like-for-like}
 DiffusionGemma+ontology (0.505) beats Opus bare (0.350), but this is not an apples-to-apples
 capability comparison. The local model receives structured KG triples as input context; the
 frontier models receive only the question. If frontier models were also grounded, they would
@@ -295,15 +295,15 @@ \subsection{The comparison is grounded-local vs.\ bare-frontier, not like-for-li
 \subsection{Frontier models benefit far more from grounding}
 The ontology amplifies capability rather than substituting for it. Claude models gain
 +0.38--0.46 F1 from grounding; DiffusionGemma gains only +0.108. This asymmetry suggests that
-larger models are better at \emph{exploiting} structured context---they extract more from the
+larger models are better at \emph{exploiting} structured context: they extract more from the
 same triples. The ontology is a capability multiplier, and the multiplier scales with the base.
 \end{samepage}

 \begin{samepage}
 \subsection{DiffusionGemma's own augmentation lift is modest}
 At +0.108 F1, the local model's lift is statistically meaningful but operationally small compared
 to the frontier lifts. The primary driver of DiffusionGemma's competitive bare score (0.397) is
-surprisingly strong parametric knowledge---it already knows many of the domain concepts. The
+surprisingly strong parametric knowledge; it already knows many of the domain concepts. The
 ontology's contribution to the local model is more about \emph{precision} (hallucination drops
 from 0.498 to 0.299) than about \emph{recall}.
 \end{samepage}
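The lift figures quoted in these two subsections can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers that appear in the text (bare and augmented scores for DiffusionGemma and Opus, and the two hallucination rates); the dictionary layout and the helper name `lift` are illustrative choices, not part of the paper's tooling.

```python
# Scores quoted in the paper text: bare (control) vs. ontology-augmented.
SCORES = {
    "DiffusionGemma": {"bare": 0.397, "aug": 0.505},
    "Opus": {"bare": 0.350, "aug": 0.770},
}

# Hallucination rates quoted for DiffusionGemma, before and after grounding.
HALLUCINATION = {"bare": 0.498, "aug": 0.299}


def lift(model: str) -> float:
    """Absolute gain from grounding for one model, rounded to 3 decimals."""
    s = SCORES[model]
    return round(s["aug"] - s["bare"], 3)


def relative_drop(before: float, after: float) -> float:
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before


if __name__ == "__main__":
    for name in SCORES:
        print(f"{name}: lift = {lift(name):+.3f}")
    drop = relative_drop(HALLUCINATION["bare"], HALLUCINATION["aug"])
    print(f"hallucination drop = {drop:.1%}")
```

Under these numbers the Opus lift falls inside the stated +0.38 to +0.46 range, and the local model's hallucination rate falls by roughly two fifths, which is consistent with the claim that its gain is mostly one of precision.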
@@ -318,8 +318,8 @@ \subsection{Performance varies sharply by question type}
 \end{samepage}

 \begin{samepage}
-\subsection{The condense pass is a one-off cost, not free}
-The ontology alias index was built by running DiffusionGemma over all 7{,}445 OWL classes
+\subsection{Condense pass: a one-off cost}
+We built the ontology alias index by running DiffusionGemma over all 7{,}445 OWL classes
 ($\sim$6 hours on a single GPU). This is amortised over all future queries, but it means the
 system requires upfront investment. Re-indexing is needed when the KG changes materially.
 \end{samepage}
@@ -354,11 +354,11 @@ \section{Findings}
  validates the thesis that structured retrieval compensates for raw parameter
  count~\cite{lewis2020rag,liu2024chatqa}.

-\item \textbf{Grounding amplifies capability---frontier models gain the most.} Opus AUG (0.770)
+\item \textbf{Grounding amplifies capability: frontier models gain the most.} Opus AUG (0.770)
  is the highest score observed. The ontology is a force multiplier, not an
  equaliser~(\S\ref{sec:caveats}).

-\item \textbf{Subclass recall is the largest relative win.} From 0.099 (CTL) to 0.671 (AUG)---a
+\item \textbf{Subclass recall is the largest relative win.} From 0.099 (CTL) to 0.671 (AUG), a
  6.8$\times$ improvement. Models have almost no parametric knowledge of ontology hierarchies;
  the KG supplies it directly.

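For reference, the 6.8$\times$ figure in the subclass item follows from the two quoted scores. The symbols $s_{\mathrm{CTL}}$ and $s_{\mathrm{AUG}}$ are introduced here only for the worked check and do not appear in the paper.

```latex
\[
  \frac{s_{\mathrm{AUG}}}{s_{\mathrm{CTL}}}
    = \frac{0.671}{0.099}
    \approx 6.78
    \approx 6.8,
  \qquad
  s_{\mathrm{AUG}} - s_{\mathrm{CTL}} = 0.671 - 0.099 = 0.572 .
\]
```

The ratio is large mainly because the control score is close to zero; the absolute gain of 0.572 is the more stable quantity to compare across question types.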