Commit 2ba80dc

style(eval): prose sanitiser pass on report, deck, and related-work
B1: all em-dashes replaced (colons, semicolons, commas, periods).
B2: "The Question" → "One Question", "The Numbers" → "Full Results", "The comparison..." → "Grounded-local vs...", "The condense..." → "Condense pass:...".
B4: "comprehensiveness" → "answer breadth", "extraordinary" → "exceptional".
B6: throat-clearing preview sentence deleted from caveats intro.
B11: passive → active in KG-as-oracle and condense-pass subsections.

Co-Authored-By: jjohare <github@thedreamlab.uk>
1 parent bd5bab5 commit 2ba80dc
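
A pass like B1 can be re-checked mechanically. The following is a minimal sketch in Python, not the tool used for this commit: the function name `find_em_dashes` and the sample lines are illustrative assumptions. It flags LaTeX em-dashes (exactly three hyphens) and literal Unicode em-dashes, and leaves en-dash ranges such as `0.38--0.46` alone.

```python
import re

# Matches a LaTeX em-dash (exactly three hyphens) or a literal Unicode em-dash.
# En-dashes ("--", as in "0.38--0.46") are deliberately not matched.
EM_DASH = re.compile(r"(?<!-)---(?!-)|\u2014")


def find_em_dashes(text):
    """Return (line_number, stripped_line) pairs for lines containing an em-dash."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if EM_DASH.search(line):
            hits.append((lineno, line.strip()))
    return hits


# Sample input: one en-dash range, one pre-commit line, one post-commit line.
sample = "\n".join([
    r"Claude +0.38--0.46 vs.\ DiffusionGemma",
    r"The ontology is a \emph{capability multiplier}---it scales with the base model.",
    r"The ontology is a \emph{capability multiplier}; it scales with the base model.",
])

for lineno, line in find_em_dashes(sample):
    print(lineno, line)
```

Run over the `.tex` files after the commit, an empty result would suggest that no em-dashes remain. Runs of four or more hyphens, such as comment rules, are skipped by the lookarounds.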

5 files changed

Lines changed: 22 additions & 22 deletions

docs/eval/deck.pdf

-367 Bytes
Binary file not shown.

docs/eval/deck.tex

Lines changed: 7 additions & 7 deletions
@@ -51,7 +51,7 @@
 \end{frame}

 % SLIDE 2: THE QUESTION
-\begin{frame}{The Question}
+\begin{frame}{One Question}
 \Large
 \begin{center}
 Can a \textcolor{softgreen}{\textbf{local model}} running on one GPU,\\[6pt]
@@ -85,7 +85,7 @@
 \end{frame}

 % SLIDE 4: THE NUMBERS
-\begin{frame}{The Numbers}
+\begin{frame}{Full Results}
 \centering
 \small
 \begin{tabular}{lcccc}
@@ -191,10 +191,10 @@
 question. If frontier models were also grounded, they score much higher (Opus AUG: 0.770).

 \item \textbf{Frontier models gain more from grounding.} Claude +0.38--0.46 vs.\ DiffusionGemma
-+0.108. The ontology is a \emph{capability multiplier}---it scales with the base model.
++0.108. The ontology is a \emph{capability multiplier}; it scales with the base model.

 \item \textbf{DG's own lift is modest.} +0.108 F1. Its strength is a surprisingly competitive
-bare score (0.397), not extraordinary use of the ontology.
+bare score (0.397), not exceptional ontology use.

 \item \textbf{Performance varies by type.} DG excels on existence (0.875) but struggles on
 neighbour recall (0.323) due to the 256-token block decode limit.
@@ -207,15 +207,15 @@
 \begin{frame}{Key Takeaways}
 \Large
 \begin{enumerate}
-\item \textcolor{teal0}{\textbf{Ontology grounding works}} --- universally, across 7 models\\
+\item \textcolor{teal0}{\textbf{Ontology grounding works.}} Universally, across 7 models.\\
 {\normalsize +0.265 F1, hallucination halved}

 \vspace{4mm}
-\item \textcolor{softgreen}{\textbf{Local + ontology $>$ frontier bare}} --- for domain recall\\
+\item \textcolor{softgreen}{\textbf{Local + ontology $>$ frontier bare.}} For domain recall.\\
 {\normalsize DiffusionGemma 0.505 beats Opus bare 0.350}

 \vspace{4mm}
-\item \textcolor{burnt}{\textbf{Data sovereignty is achievable}} --- without sacrificing accuracy\\
+\item \textcolor{burnt}{\textbf{Data sovereignty without sacrificing accuracy.}}\\
 {\normalsize Zero API cost, zero data exfiltration, GDPR-ready}
 \end{enumerate}

docs/eval/related-work.tex

Lines changed: 3 additions & 3 deletions
@@ -4,15 +4,15 @@
 knowledge into sequence models on knowledge-intensive tasks~\cite{lewis2020rag}, and the
 approach has since matured from flat vector retrieval into structured, graph-based variants.
 Microsoft's GraphRAG constructs an entity knowledge graph over a corpus and reports
-substantial gains over conventional vector RAG in answer comprehensiveness and diversity for
+substantial gains over conventional vector RAG in answer breadth and diversity for
 global ``sensemaking'' queries spanning million-token datasets~\cite{edge2024graphrag}, with
 subsequent work formalising graph-augmented retrieval~\cite{guo2024lightrag,hu2024grag} and
 automated evaluation harnesses~\cite{microsoft2025benchmarkqed}. A complementary literature
 shows that retrieval narrows or eliminates the gap to frontier cloud models: ChatQA, built on
 open weights, reports surpassing GPT-4 on conversational QA and RAG benchmarks~\cite{liu2024chatqa},
 while open-domain QA studies confirm retrieval as the dominant lever for factual
 accuracy~\cite{li2024improvingrag}. This matters because small open-weight models have
-themselves closed much of the capability gap --- the Phi-3~\cite{abdin2024phi3},
+themselves closed much of the capability gap. The Phi-3~\cite{abdin2024phi3},
 Qwen2~\cite{yang2024qwen2}, Gemma~3~\cite{gemma2025gemma3} and Llama~3~\cite{dubey2024llama3}
 families demonstrate strong reasoning at parameter counts deployable on local hardware.
 Independently, structured knowledge has been shown to suppress hallucination and improve
@@ -21,7 +21,7 @@
 dedicated factuality benchmarks such as FACTS Grounding~\cite{jacovi2025facts}. A parallel
 architectural shift further strengthens the local-inference case: masked diffusion language
 models such as LLaDA now rival LLaMA3-8B in in-context learning~\cite{nie2025llada}, and
-diffusion decoders like Mercury achieve over 1{,}000 tokens/second on a single H100 --- up to
+diffusion decoders like Mercury achieve over 1{,}000 tokens/second on a single H100, up to
 10$\times$ faster than speed-optimised autoregressive models at comparable
 quality~\cite{labs2025mercury}, with commercial systems including Gemini
 Diffusion~\cite{deepmind2025geminidiffusion}. Together with evidence on inference unit

docs/eval/report.pdf

-165 Bytes
Binary file not shown.

docs/eval/report.tex

Lines changed: 12 additions & 12 deletions
@@ -67,7 +67,7 @@
 % =============================================================================
 \section*{Abstract}
 We measure whether grounding a language model in DreamLab's formal knowledge graph (the
-\texttt{ontology-augment} skill, PRD-020) improves factual recall---and whether that improvement
+\texttt{ontology-augment} skill, PRD-020) improves factual recall, and whether that improvement
 lets a \emph{local} model outperform cloud frontier models running without grounding.
 Using the knowledge graph as its own oracle, we auto-generate 16 questions with exact ground truth
 and run a $2 \times 7 \times 3$ design: \{augmented, control\} $\times$ 7 models $\times$ 3 reps,
@@ -97,7 +97,7 @@ \section{Related Work}
 \section{Method}

 \begin{samepage}
-\textbf{KG-as-oracle.} Ground truth is generated from the live KG, not human judgement:
+\textbf{KG-as-oracle.} We generate ground truth from the live KG, not human judgement:
 concept\,$\to$\,IRI via the discovery endpoint, then neighbours / subclass edges via read-only
 SPARQL over the asserted graph (predicates \texttt{enables, requires, uses, supports, hasPart,
 implements, dependsOn, subClassOf}). This makes scoring objective by construction.
@@ -258,7 +258,7 @@ \section{Cost, Privacy, and Inference Economics}
 \textbf{DG 26B AUG (local)} & \textbf{0.505} & \textbf{No} & \textbf{\$\,0\rlap{*}} & \textbf{2--4\,s} \\
 \bottomrule
 \end{tabular}
-\caption{Cost and privacy comparison. DiffusionGemma runs entirely on-premise---no data leaves
+\caption{Cost and privacy comparison. DiffusionGemma runs entirely on-premise: no data leaves
 the organisation, no per-query API cost, lowest latency. The ontology condense pass is a one-off
 6-hour batch (7{,}445 classes), not per-query.\\
 {\footnotesize *Amortised electricity/hardware only; no API billing.}}
@@ -278,11 +278,11 @@ \section{Cost, Privacy, and Inference Economics}
 \section{Caveats on Lift Interpretation}
 \label{sec:caveats}

-The headline finding---that a local model with ontology grounding outperforms frontier models
-without it---requires careful interpretation. We detail the caveats below.
+The headline finding, that a local model with ontology grounding outperforms frontier models
+without it, requires careful interpretation.

 \begin{samepage}
-\subsection{The comparison is grounded-local vs.\ bare-frontier, not like-for-like}
+\subsection{Grounded-local vs.\ bare-frontier: not like-for-like}
 DiffusionGemma+ontology (0.505) beats Opus bare (0.350), but this is not an apples-to-apples
 capability comparison. The local model receives structured KG triples as input context; the
 frontier models receive only the question. If frontier models were also grounded, they would
@@ -295,15 +295,15 @@ \subsection{The comparison is grounded-local vs.\ bare-frontier, not like-for-li
 \subsection{Frontier models benefit far more from grounding}
 The ontology amplifies capability rather than substituting for it. Claude models gain
 +0.38--0.46 F1 from grounding; DiffusionGemma gains only +0.108. This asymmetry suggests that
-larger models are better at \emph{exploiting} structured context---they extract more from the
+larger models are better at \emph{exploiting} structured context: they extract more from the
 same triples. The ontology is a capability multiplier, and the multiplier scales with the base.
 \end{samepage}

 \begin{samepage}
 \subsection{DiffusionGemma's own augmentation lift is modest}
 At +0.108 F1, the local model's lift is statistically meaningful but operationally small compared
 to the frontier lifts. The primary driver of DiffusionGemma's competitive bare score (0.397) is
-surprisingly strong parametric knowledge---it already knows many of the domain concepts. The
+surprisingly strong parametric knowledge; it already knows many of the domain concepts. The
 ontology's contribution to the local model is more about \emph{precision} (hallucination drops
 from 0.498 to 0.299) than about \emph{recall}.
 \end{samepage}
@@ -318,8 +318,8 @@ \subsection{Performance varies sharply by question type}
 \end{samepage}

 \begin{samepage}
-\subsection{The condense pass is a one-off cost, not free}
-The ontology alias index was built by running DiffusionGemma over all 7{,}445 OWL classes
+\subsection{Condense pass: a one-off cost}
+We built the ontology alias index by running DiffusionGemma over all 7{,}445 OWL classes
 ($\sim$6 hours on a single GPU). This is amortised over all future queries, but it means the
 system requires upfront investment. Re-indexing is needed when the KG changes materially.
 \end{samepage}
@@ -354,11 +354,11 @@ \section{Findings}
 validates the thesis that structured retrieval compensates for raw parameter
 count~\cite{lewis2020rag,liu2024chatqa}.

-\item \textbf{Grounding amplifies capability---frontier models gain the most.} Opus AUG (0.770)
+\item \textbf{Grounding amplifies capability: frontier models gain the most.} Opus AUG (0.770)
 is the highest score observed. The ontology is a force multiplier, not an
 equaliser~(\S\ref{sec:caveats}).

-\item \textbf{Subclass recall is the largest relative win.} From 0.099 (CTL) to 0.671 (AUG)---a
+\item \textbf{Subclass recall is the largest relative win.} From 0.099 (CTL) to 0.671 (AUG), a
 6.8$\times$ improvement. Models have almost no parametric knowledge of ontology hierarchies;
 the KG supplies it directly.
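
The figures quoted in the report hunks can be cross-checked with a few lines of arithmetic. This is a reader-side sketch in Python, not part of the commit; the variable names are hypothetical. The total-run count assumes that all 16 questions are asked in every cell of the design, which the abstract suggests but does not state outright.

```python
import math

# Values copied from the report.tex hunks; names are illustrative only.
dg_bare, dg_lift, dg_aug = 0.397, 0.108, 0.505
subclass_ctl, subclass_aug = 0.099, 0.671
halluc_ctl, halluc_aug = 0.498, 0.299

# Bare score plus lift should reproduce the augmented score (to rounding).
assert math.isclose(dg_bare + dg_lift, dg_aug, abs_tol=5e-4)

# Subclass recall ratio, reported in the text as 6.8x.
subclass_ratio = subclass_aug / subclass_ctl
assert round(subclass_ratio, 1) == 6.8

# Relative drop in the local model's hallucination rate.
halluc_drop = (halluc_ctl - halluc_aug) / halluc_ctl

# 2 x 7 x 3 design: conditions x models x repetitions.
cells = 2 * 7 * 3
# Assumption: every one of the 16 questions is run in every cell.
total_runs = cells * 16

print(f"subclass ratio: {subclass_ratio:.2f}x")
print(f"hallucination drop: {halluc_drop:.1%}")
print(f"cells: {cells}, runs (assumed): {total_runs}")
```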
