DreamLab-AI
diff --git a/docs/eval/deck.pdf b/docs/eval/deck.pdf
-367 Bytes
diff --git a/docs/eval/deck.tex b/docs/eval/deck.tex
Lines changed: 7 additions & 7 deletions
diff --git a/docs/eval/related-work.tex b/docs/eval/related-work.tex
Lines changed: 3 additions & 3 deletions
diff --git a/docs/eval/report.pdf b/docs/eval/report.pdf
-165 Bytes
diff --git a/docs/eval/report.tex b/docs/eval/report.tex
Lines changed: 12 additions & 12 deletions
@@ -51,7 +51,7 @@
 \end{frame}
 
 % SLIDE 2: THE QUESTION
-\begin{frame}{The Question}
+\begin{frame}{One Question}
 \Large
 \begin{center}
 Can a \textcolor{softgreen}{\textbf{local model}} running on one GPU,\\[6pt]
@@ -85,7 +85,7 @@
 \end{frame}
 
 % SLIDE 4: THE NUMBERS
-\begin{frame}{The Numbers}
+\begin{frame}{Full Results}
 \centering
 \small
 \begin{tabular}{lcccc}
@@ -191,10 +191,10 @@
       question. If frontier models were also grounded, they would score much higher (Opus AUG: 0.770).
 
 \item \textbf{Frontier models gain more from grounding.} Claude +0.38--0.46 vs.\ DiffusionGemma
-      +0.108. The ontology is a \emph{capability multiplier}---it scales with the base model.
+      +0.108. The ontology is a \emph{capability multiplier}; it scales with the base model.
 
 \item \textbf{DG's own lift is modest.} +0.108 F1. Its strength is a surprisingly competitive
-      bare score (0.397), not extraordinary use of the ontology.
+      bare score (0.397), not exceptional ontology use.
 
 \item \textbf{Performance varies by type.} DG excels on existence (0.875) but struggles on
       neighbour recall (0.323) due to the 256-token block decode limit.
@@ -207,15 +207,15 @@
 \begin{frame}{Key Takeaways}
 \Large
 \begin{enumerate}
-\item \textcolor{teal0}{\textbf{Ontology grounding works}} --- universally, across 7 models\\
+\item \textcolor{teal0}{\textbf{Ontology grounding works.}} Universally, across 7 models.\\
       {\normalsize +0.265 F1, hallucination halved}
 
 \vspace{4mm}
-\item \textcolor{softgreen}{\textbf{Local + ontology $>$ frontier bare}} --- for domain recall\\
+\item \textcolor{softgreen}{\textbf{Local + ontology $>$ frontier bare.}} For domain recall.\\
       {\normalsize DiffusionGemma 0.505 beats Opus bare 0.350}
 
 \vspace{4mm}
-\item \textcolor{burnt}{\textbf{Data sovereignty is achievable}} --- without sacrificing accuracy\\
+\item \textcolor{burnt}{\textbf{Data sovereignty without sacrificing accuracy.}}\\
       {\normalsize Zero API cost, zero data exfiltration, GDPR-ready}
 \end{enumerate}
 
 
@@ -4,15 +4,15 @@
 knowledge into sequence models on knowledge-intensive tasks~\cite{lewis2020rag}, and the
 approach has since matured from flat vector retrieval into structured, graph-based variants.
 Microsoft's GraphRAG constructs an entity knowledge graph over a corpus and reports
-substantial gains over conventional vector RAG in answer comprehensiveness and diversity for
+substantial gains over conventional vector RAG in answer breadth and diversity for
 global ``sensemaking'' queries spanning million-token datasets~\cite{edge2024graphrag}, with
 subsequent work formalising graph-augmented retrieval~\cite{guo2024lightrag,hu2024grag} and
 automated evaluation harnesses~\cite{microsoft2025benchmarkqed}. A complementary literature
 shows that retrieval narrows or eliminates the gap to frontier cloud models: ChatQA, built on
 open weights, reports surpassing GPT-4 on conversational QA and RAG benchmarks~\cite{liu2024chatqa},
 while open-domain QA studies confirm retrieval as the dominant lever for factual
 accuracy~\cite{li2024improvingrag}. This matters because small open-weight models have
-themselves closed much of the capability gap --- the Phi-3~\cite{abdin2024phi3},
+themselves closed much of the capability gap. The Phi-3~\cite{abdin2024phi3},
 Qwen2~\cite{yang2024qwen2}, Gemma~3~\cite{gemma2025gemma3} and Llama~3~\cite{dubey2024llama3}
 families demonstrate strong reasoning at parameter counts deployable on local hardware.
 Independently, structured knowledge has been shown to suppress hallucination and improve
@@ -21,7 +21,7 @@
 dedicated factuality benchmarks such as FACTS Grounding~\cite{jacovi2025facts}. A parallel
 architectural shift further strengthens the local-inference case: masked diffusion language
 models such as LLaDA now rival LLaMA3-8B in in-context learning~\cite{nie2025llada}, and
-diffusion decoders like Mercury achieve over 1{,}000 tokens/second on a single H100 --- up to
+diffusion decoders like Mercury achieve over 1{,}000 tokens/second on a single H100, up to
 10$\times$ faster than speed-optimised autoregressive models at comparable
 quality~\cite{labs2025mercury}, with commercial systems including Gemini
 Diffusion~\cite{deepmind2025geminidiffusion}. Together with evidence on inference unit
 
@@ -67,7 +67,7 @@
 % =============================================================================
 \section*{Abstract}
 We measure whether grounding a language model in DreamLab's formal knowledge graph (the
-\texttt{ontology-augment} skill, PRD-020) improves factual recall---and whether that improvement
+\texttt{ontology-augment} skill, PRD-020) improves factual recall, and whether that improvement
 lets a \emph{local} model outperform cloud frontier models running without grounding.
 Using the knowledge graph as its own oracle, we auto-generate 16 questions with exact ground truth
 and run a $2 \times 7 \times 3$ design: \{augmented, control\} $\times$ 7 models $\times$ 3 reps,
@@ -97,7 +97,7 @@ \section{Related Work}
 \section{Method}
 
 \begin{samepage}
-\textbf{KG-as-oracle.} Ground truth is generated from the live KG, not human judgement:
+\textbf{KG-as-oracle.} We generate ground truth from the live KG, not human judgement:
 concept\,$\to$\,IRI via the discovery endpoint, then neighbours / subclass edges via read-only
 SPARQL over the asserted graph (predicates \texttt{enables, requires, uses, supports, hasPart,
 implements, dependsOn, subClassOf}). This makes scoring objective by construction.
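The hunk above says scoring is objective by construction but does not show the scoring formula. As a hedged sketch only, one standard way to score a model answer against KG-derived ground truth is set-based precision, recall and F1. The symbols $A$ (set of entities the model names) and $G$ (set of neighbours or subclasses returned by the SPARQL query) are introduced here for illustration and are assumptions, not notation from the report.

```latex
% Illustrative sketch; A and G are hypothetical symbols, not from report.tex.
% A = entities named in the model's answer, G = ground-truth set from the KG.
% Requires amsmath for align*.
\begin{align*}
P   &= \frac{|A \cap G|}{|A|}, &
R   &= \frac{|A \cap G|}{|G|}, &
F_1 &= \frac{2PR}{P + R}.
\end{align*}
```

Under this assumed definition, an answer that lists entities absent from $G$ lowers $P$, and an answer that misses entities in $G$ lowers $R$. Whether the report computes its F1 and hallucination figures exactly this way is not stated in the diff.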
@@ -258,7 +258,7 @@ \section{Cost, Privacy, and Inference Economics}
 \textbf{DG 26B AUG (local)} & \textbf{0.505} & \textbf{No} & \textbf{\$\,0\rlap{*}} & \textbf{2--4\,s} \\
 \bottomrule
 \end{tabular}
-\caption{Cost and privacy comparison. DiffusionGemma runs entirely on-premise---no data leaves
+\caption{Cost and privacy comparison. DiffusionGemma runs entirely on-premise: no data leaves
 the organisation, no per-query API cost, lowest latency. The ontology condense pass is a one-off
 6-hour batch (7{,}445 classes), not per-query.\\
 {\footnotesize *Amortised electricity/hardware only; no API billing.}}
@@ -278,11 +278,11 @@ \section{Cost, Privacy, and Inference Economics}
 \section{Caveats on Lift Interpretation}
 \label{sec:caveats}
 
-The headline finding---that a local model with ontology grounding outperforms frontier models
-without it---requires careful interpretation. We detail the caveats below.
+The headline finding, that a local model with ontology grounding outperforms frontier models
+without it, requires careful interpretation.
 
 \begin{samepage}
-\subsection{The comparison is grounded-local vs.\ bare-frontier, not like-for-like}
+\subsection{Grounded-local vs.\ bare-frontier: not like-for-like}
 DiffusionGemma+ontology (0.505) beats Opus bare (0.350), but this is not an apples-to-apples
 capability comparison. The local model receives structured KG triples as input context; the
 frontier models receive only the question. If frontier models were also grounded, they would
@@ -295,15 +295,15 @@ \subsection{The comparison is grounded-local vs.\ bare-frontier, not like-for-li
 \subsection{Frontier models benefit far more from grounding}
 The ontology amplifies capability rather than substituting for it. Claude models gain
 +0.38--0.46 F1 from grounding; DiffusionGemma gains only +0.108. This asymmetry suggests that
-larger models are better at \emph{exploiting} structured context---they extract more from the
+larger models are better at \emph{exploiting} structured context: they extract more from the
 same triples. The ontology is a capability multiplier, and the multiplier scales with the base.
 \end{samepage}
 
 \begin{samepage}
 \subsection{DiffusionGemma's own augmentation lift is modest}
 At +0.108 F1, the local model's lift is statistically meaningful but operationally small compared
 to the frontier lifts. The primary driver of DiffusionGemma's competitive bare score (0.397) is
-surprisingly strong parametric knowledge---it already knows many of the domain concepts. The
+surprisingly strong parametric knowledge; it already knows many of the domain concepts. The
 ontology's contribution to the local model is more about \emph{precision} (hallucination drops
 from 0.498 to 0.299) than about \emph{recall}.
 \end{samepage}
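The lift figures quoted in this hunk can be cross-checked against the scores that appear elsewhere in the diff. The following is a worked check, not part of the changed files. It assumes that lift is the plain difference between augmented and bare F1, and it uses the Opus scores (0.770 augmented, 0.350 bare) as the one Claude data point visible here. The symbols $\Delta_{\mathrm{DG}}$, $\Delta_{\mathrm{Opus}}$ and $r_{\mathrm{halluc}}$ are introduced for illustration.

```latex
% Worked check (illustrative; symbol names are not from the report).
% Assumes lift = augmented F1 - bare F1. Requires amsmath for align*.
\begin{align*}
\Delta_{\mathrm{DG}}   &= 0.505 - 0.397 = 0.108, \\
\Delta_{\mathrm{Opus}} &= 0.770 - 0.350 = 0.420, \\
r_{\mathrm{halluc}}    &= \frac{0.498 - 0.299}{0.498} \approx 0.40.
\end{align*}
```

Under these assumptions the Opus lift falls inside the quoted range of +0.38 to +0.46 and is roughly four times the DiffusionGemma lift, while DiffusionGemma's hallucination rate falls by about 40\% in relative terms.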
@@ -318,8 +318,8 @@ \subsection{Performance varies sharply by question type}
 \end{samepage}
 
 \begin{samepage}
-\subsection{The condense pass is a one-off cost, not free}
-The ontology alias index was built by running DiffusionGemma over all 7{,}445 OWL classes
+\subsection{Condense pass: a one-off cost}
+We built the ontology alias index by running DiffusionGemma over all 7{,}445 OWL classes
 ($\sim$6 hours on a single GPU). This is amortised over all future queries, but it means the
 system requires upfront investment. Re-indexing is needed when the KG changes materially.
 \end{samepage}
@@ -354,11 +354,11 @@ \section{Findings}
       validates the thesis that structured retrieval compensates for raw parameter
       count~\cite{lewis2020rag,liu2024chatqa}.
 
-\item \textbf{Grounding amplifies capability---frontier models gain the most.} Opus AUG (0.770)
+\item \textbf{Grounding amplifies capability: frontier models gain the most.} Opus AUG (0.770)
       is the highest score observed. The ontology is a force multiplier, not an
       equaliser~(\S\ref{sec:caveats}).
 
-\item \textbf{Subclass recall is the largest relative win.} From 0.099 (CTL) to 0.671 (AUG)---a
+\item \textbf{Subclass recall is the largest relative win.} From 0.099 (CTL) to 0.671 (AUG), a
       6.8$\times$ improvement. Models have almost no parametric knowledge of ontology hierarchies;
       the KG supplies it directly.
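The 6.8$\times$ figure follows from the two subclass scores quoted in the item above. A worked check, assuming that "improvement" means the ratio of the augmented score to the control score (the symbols $s_{\mathrm{AUG}}$ and $s_{\mathrm{CTL}}$ are introduced here for illustration):

```latex
% Worked check (illustrative; symbol names are not from the report).
\[
\frac{s_{\mathrm{AUG}}}{s_{\mathrm{CTL}}} = \frac{0.671}{0.099} \approx 6.78 \approx 6.8,
\qquad
s_{\mathrm{AUG}} - s_{\mathrm{CTL}} = 0.671 - 0.099 = 0.572.
\]
```

The ratio is large mainly because the control score is close to zero, which is why the item describes this as the largest relative win rather than the largest absolute one.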