
Commit f38188d

apartsin and claude committed
Add 9 frontier sections: reasoning, memory, interpretability, agency, tool orchestration, reliability, observability, adaptive agents, human-AI collaboration
Phase 4c: Engineering and foundational frontier topics for Part 10

- 34.5: A Theory of Reasoning in LLMs
- 34.6: Memory as a Computational Primitive
- 34.7: Mechanistic Interpretability at Scale
- 34.8: The Nature of Agency
- 35.5: Reliability Engineering for Agents
- 35.6: Observability, Testing, CI/CD for Agents
- 35.7: Memory Architectures That Improve Execution
- 35.8: Self-Improving and Adaptive Agents
- 35.9: The Future of Human-AI Collaboration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 26850cc commit f38188d

19 files changed: 3,201 additions & 39 deletions

part-10-frontiers/index.html
Lines changed: 0 additions & 2 deletions

@@ -69,8 +69,6 @@ <h3>What's Next?</h3>
 <p>You have reached the final part of the book. Return to the <a href="../index.html">Book Home</a> to revisit any topic, or explore the <a href="../toc.html">Appendices</a> for additional reference material.</p>
 </div>
 
-
-
 <footer>
 <p class="footer-title">Building Conversational AI with LLMs and Agents, Fifth Edition</p>
 <p>&copy; 2026 Alexander Apartsin &amp; Yehudit Aperstein &middot; <a href="../toc.html">Contents</a></p>

part-10-frontiers/module-34-emerging-architectures/index.html
Lines changed: 0 additions & 2 deletions

@@ -116,8 +116,6 @@ <h3>What's Next?</h3>
 <a class="next" href="section-34.1.html">Emergent Abilities: Real or Mirage?</a>
 </nav>
 
-
-
 <footer>
 <p class="footer-title">Building Conversational AI with LLMs and Agents, Fifth Edition</p>
 <p>&copy; 2026 Alexander Apartsin &amp; Yehudit Aperstein &middot; <a href="../../toc.html">Contents</a></p>

part-10-frontiers/module-34-emerging-architectures/section-34.1.html
Lines changed: 0 additions & 2 deletions

@@ -275,13 +275,11 @@ <h2>Exercises <span class="level-badge intermediate" title="INTERMEDIATE">INTERM
 </details>
 </div>
 
-
 <div class="whats-next">
 <h2>What Comes Next</h2>
 <p>In the next section, <a href="section-34.2.html" class="cross-ref">Section 34.2: Scaling Frontiers</a>, we examine the data wall, compute economics, and architectural alternatives that will shape the next generation of frontier models.</p>
 </div>
 
-
 <div class="callout tip">
 <div class="callout-title">Tip: Prototype with Established Models, Ship with New Ones</div>
 <p>Build your prototype and evaluation pipeline using a well-understood model. Once you have a working system with clear metrics, swap in newer architectures and compare. This way, the novelty of the architecture does not mask pipeline bugs.</p>
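The tip in this hunk's context describes a concrete workflow: freeze an eval set, measure a baseline, then swap in the new model and compare. A minimal sketch of that harness (the `prompt -> str` model wrappers and the substring-match scoring below are illustrative assumptions, not code from the book):

```python
def compare_models(eval_cases, baseline_model, candidate_model):
    """Score two model callables on the same frozen eval set.

    eval_cases: list of (prompt, expected_substring) pairs.
    baseline_model / candidate_model: hypothetical `prompt -> str` wrappers.
    """
    def accuracy(model):
        hits = sum(expected in model(prompt) for prompt, expected in eval_cases)
        return hits / len(eval_cases)

    return {"baseline": accuracy(baseline_model),
            "candidate": accuracy(candidate_model)}

# Usage: the pipeline is identical for both models, so a metric regression
# points at the new architecture rather than at pipeline bugs.
cases = [("2+2=", "4"), ("Capital of France?", "Paris")]
scores = compare_models(cases, lambda p: "4 Paris", lambda p: "4")
print(scores)  # → {'baseline': 1.0, 'candidate': 0.5}
```

Keeping the eval cases fixed across swaps is the point: only one variable (the model) changes between runs.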

part-10-frontiers/module-34-emerging-architectures/section-34.2.html
Lines changed: 0 additions & 1 deletion

@@ -325,7 +325,6 @@ <h2>What Comes Next</h2>
 <p>In the next section, <a href="section-34.3.html" class="cross-ref">Section 34.3: Alignment Research Frontiers</a>, we explore the open problems in aligning AI systems with human values, including scalable oversight, weak-to-strong generalization, and reward hacking.</p>
 </div>
 
-
 <div class="callout tip">
 <div class="callout-title">Tip: Watch for State Space Model Deployments</div>
 <p>Models like Mamba offer O(n) inference instead of O(<span class="math">$n^{2}$</span>) for transformers, making them promising for very long sequences. If your application processes documents over 100K tokens, benchmark an SSM variant alongside your transformer baseline.</p>
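The O(n) versus O(n²) contrast in the tip above can be made concrete with back-of-envelope arithmetic (a sketch with unit costs, not measured FLOPs):

```python
def attention_token_interactions(n: int) -> int:
    """Full self-attention: every token attends to every token."""
    return n * n

def ssm_scan_steps(n: int) -> int:
    """Linear recurrence: one constant-size state update per token."""
    return n

for n in (1_000, 10_000, 100_000):
    ratio = attention_token_interactions(n) // ssm_scan_steps(n)
    print(f"n={n:>7,}: attention does {ratio:,}x more pairwise work")
# The ratio equals n itself, which is why the gap only matters
# at long context: at 100K tokens it is a 100,000x difference.
```

Real-world speedups are smaller than this ratio (attention kernels are highly optimized, and SSMs have their own constants), but the asymptotic trend is what the tip's 100K-token threshold reflects.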

part-10-frontiers/module-34-emerging-architectures/section-34.3.html
Lines changed: 0 additions & 9 deletions

@@ -78,7 +78,6 @@ <h2>1. The Scaling Problem with Self-Attention <span class="level-badge basic" t
 for choosing the right architecture for a given deployment scenario.
 </p>
 
-
 <h2>2. State Space Models: S4, Mamba, and Mamba-2 <span class="level-badge advanced" title="Advanced">Advanced</span></h2>
 
 <p>
@@ -430,7 +429,6 @@ <h3>3.3 Griffin and RecurrentGemma</h3>
 outperforms pure SSM or pure attention models of the same size.
 </p>
 
-
 <div class="callout key-insight">
 <div class="callout-title">Key Insight</div>
 <p><strong>The attention versus efficiency tradeoff is not all-or-nothing.</strong> The research trajectory is moving away from "replace attention entirely" toward "use attention surgically." Pure SSM models sacrifice recall precision on tasks that require exact matching or retrieval from earlier in the context. Pure attention models pay quadratic cost for every token, even when most tokens do not need to attend to most other tokens. The emerging consensus is that hybrid architectures (attention for precision-critical layers, linear recurrence for everything else) may dominate both pure approaches. For practitioners, this means that the <a class="cross-ref" href="../../part-2-understanding-llms/module-09-inference-optimization/section-9.1.html">inference optimization techniques from Chapter 09</a> (KV-cache management, continuous batching) will remain relevant even as architectures evolve, because attention layers will likely persist in some form.</p>
@@ -553,7 +551,6 @@ <h3>4.2 Design Principles for Hybrid Architectures</h3>
 keeping memory costs bounded. The SSM layers handle global context propagation.
 </p>
 
-
 <h2>5. Efficiency Comparisons and Benchmarks <span class="level-badge intermediate" title="Intermediate">Intermediate</span></h2>
 
 <p>
@@ -667,7 +664,6 @@ <h2>5. Efficiency Comparisons and Benchmarks <span class="level-badge intermedia
 <p><strong>The "Needle in a Haystack" test reveals the retrieval gap.</strong> In this test, a specific fact is inserted at a random position in a long document, and the model must retrieve it from a distant point. Transformers with full attention achieve near-perfect accuracy at all positions and context lengths. Pure SSM models show degradation for facts placed in the middle of very long sequences (the "lost in the middle" effect is amplified by compressed state). Hybrid models split the difference, achieving strong retrieval when the fact falls within an attention window and moderate retrieval otherwise. This test remains the clearest diagnostic for evaluating alternative architectures.</p>
 </div>
 
-
 <h2>6. When to Consider Non-Transformer Architectures <span class="level-badge intermediate" title="Intermediate">Intermediate</span></h2>
 
 <p>
@@ -711,7 +707,6 @@ <h2>6. When to Consider Non-Transformer Architectures <span class="level-badge i
 longer sequence lengths, and the model availability gap is significant.
 </p>
 
-
 <h2>7. Neuromorphic and Event-Driven Approaches <span class="level-badge advanced" title="Advanced">Advanced</span></h2>
 
 <p>
@@ -737,13 +732,11 @@ <h2>7. Neuromorphic and Event-Driven Approaches <span class="level-badge advance
 remain in the research stage and are not yet practical for production deployment.
 </p>
 
-
 <div class="callout research-frontier">
 <div class="callout-title">Research Frontier</div>
 <p><strong>The convergence of architectures.</strong> Mamba-2's state space duality theorem suggests that SSMs and attention may be endpoints on a spectrum rather than fundamentally different approaches. Recent work on "linear attention" (Katharopoulos et al., Yang et al.) and "gated linear attention" further blurs the boundary. The research community is moving toward a unified framework where the architectural choice is a hyperparameter (how much to compress the state) rather than a philosophical commitment. Watch for architectures that can dynamically adjust their compression ratio per layer and per input, spending full attention on tokens that need it and using compressed state for the rest.</p>
 </div>
 
-
 <div class="callout exercise">
 <div class="callout-title">Exercise</div>
 <span class="exercise-type conceptual">Conceptual</span>
@@ -789,7 +782,6 @@ <h2>What Comes Next</h2>
 </p>
 </div>
 
-
 <section class="bibliography">
 <h2>Bibliography and Further Reading</h2>
 <ul>
@@ -810,7 +802,6 @@ <h2>Bibliography and Further Reading</h2>
 <a class="next" href="section-34.4.html">World Models: Video Generation, Simulation, and Embodied Reasoning</a>
 </nav>
 
-
 <footer>
 <p class="footer-title">Building Conversational AI with LLMs and Agents, Fifth Edition</p>
 <p>&copy; 2026 Alexander Apartsin &amp; Yehudit Aperstein &middot; <a href="../../toc.html">Contents</a></p>
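The "Needle in a Haystack" diagnostic described in this section's context can be sketched as a small harness. The filler sentence, needle fact, and the `query_model` callable below are assumptions for illustration (a real harness would wrap an actual model API):

```python
def make_haystack(filler: str, needle: str, n_fillers: int, position: float) -> str:
    """Build a long document with `needle` inserted at a relative depth in [0, 1]."""
    lines = [filler] * n_fillers
    lines.insert(int(position * n_fillers), needle)
    return "\n".join(lines)

def needle_in_haystack_eval(query_model, needle="The vault code is 4417.",
                            answer="4417",
                            positions=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Check retrieval as the needle moves through the context.

    query_model: hypothetical `prompt -> str` wrapper around a model call.
    Returns {depth: retrieved?} so mid-context degradation is visible.
    """
    results = {}
    for pos in positions:
        doc = make_haystack("The sky was grey over the harbor.", needle, 2000, pos)
        prompt = f"{doc}\n\nQuestion: What is the vault code? Answer briefly."
        results[pos] = answer in query_model(prompt)
    return results

# With a stub that always finds the needle, accuracy is perfect at every depth;
# a pure-SSM model would typically show False values at middle depths.
perfect = lambda prompt: "4417" if "4417" in prompt else "unknown"
print(needle_in_haystack_eval(perfect))
```

Sweeping `positions` (and the haystack length) per architecture is what produces the depth-versus-context-length accuracy grids commonly reported for this test.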

part-10-frontiers/module-34-emerging-architectures/section-34.4.html
Lines changed: 0 additions & 18 deletions

@@ -47,7 +47,6 @@ <h3>Prerequisites</h3>
 <p><strong>Language models predict the next token. World models predict the next state of reality.</strong> A new class of models is learning to simulate physical environments: generating photorealistic video, modeling driving scenarios, and building interactive 3D worlds from single images. These world models bridge the gap between language understanding and physical reasoning, enabling agents to plan by imagining future outcomes rather than relying solely on textual inference. This section surveys the architectures, applications, and open problems in this rapidly evolving frontier.</p>
 </div>
 
-
 <h2>1. What Are World Models? <span class="level-badge basic" title="Basic">Basic</span></h2>
 
 <p>
@@ -96,7 +95,6 @@ <h2>1. What Are World Models? <span class="level-badge basic" title="Basic">Basi
 <p>The model is trained to minimize the prediction error between $\hat{z}_{t+1}$ and $z_{t+1}$ (the actual next latent state), enabling the agent to "imagine" trajectories without environment interaction.</p>
 </div>
 
-
 <h2>2. Video Generation as World Simulation <span class="level-badge intermediate" title="Intermediate">Intermediate</span></h2>
 
 <p>
@@ -194,7 +192,6 @@ <h3>2.4 Evaluating Physical Plausibility</h3>
 move closer or farther from the camera?
 </p>
 
-
 <h2>3. Autonomous Driving World Models <span class="level-badge intermediate" title="Intermediate">Intermediate</span></h2>
 
 <p>
@@ -306,7 +303,6 @@ <h3>3.3 Sim-to-Real Transfer</h3>
 
 <p class="caption"><strong>Table 34.4.1:</strong> Comparison of major autonomous driving world models. Parameter counts are approximate and reflect the world model component only.</p>
 
-
 <h2>4. Interactive World Models <span class="level-badge advanced" title="Advanced">Advanced</span></h2>
 
 <p>
@@ -358,7 +354,6 @@ <h3>4.2 Game Engine Replacement</h3>
 <p><strong>From image to playable world.</strong> Consider a game designer who has a concept painting of a medieval village. With a traditional engine, turning this into a playable environment requires weeks of 3D modeling, texturing, physics setup, and scripting. With an interactive world model like Genie 2, the designer uploads the painting and immediately gets a navigable environment: they can walk through the village streets, enter buildings, and interact with objects. The environment is not geometrically precise (it is hallucinated from the image), but it serves as a rapid prototype that can be refined. This workflow compresses the concepting phase from weeks to minutes.</p>
 </div>
 
-
 <h2>5. Agent Planning with World Models <span class="level-badge advanced" title="Advanced">Advanced</span></h2>
 
 <p>
@@ -442,7 +437,6 @@ <h3>5.3 LLM-Scale World Models for Agent Planning</h3>
 <p><strong>World models as the missing link for embodied AI.</strong> Current LLM-based agents (covered in <a class="cross-ref" href="../../part-6-agentic-ai/module-22-ai-agents/index.html">Chapter 22</a>) operate primarily in digital environments: browsing the web, writing code, managing files. Extending these agents to physical environments (robots, autonomous vehicles, manufacturing) requires grounding their reasoning in physical reality. World models provide this grounding by giving the agent an internal simulator it can query. The convergence of LLM reasoning and world model simulation is widely considered one of the most important research directions in AI as of 2026.</p>
 </div>
 
-
 <h2>6. Limitations and Open Problems <span class="level-badge intermediate" title="Intermediate">Intermediate</span></h2>
 
 <p>
@@ -546,7 +540,6 @@ <h3>6.5 When World Models Fail</h3>
 <p>A world model that is 99% accurate may still be catastrophically wrong in the 1% of cases that matter most. In safety-critical applications, world models should augment (not replace) traditional physics simulators and real-world testing. The convenience of unlimited synthetic data does not eliminate the need for rigorous validation against ground truth.</p>
 </div>
 
-
 <h2>7. Practical Lab: Video Prediction with a Diffusion Transformer <span class="level-badge advanced" title="Advanced">Advanced</span></h2>
 
 <p>
@@ -571,7 +564,6 @@ <h2>7. Practical Lab: Video Prediction with a Diffusion Transformer <span class=
 from einops import rearrange, repeat
 import math
 
-
 class SinusoidalPosEmb(nn.Module):
     """Sinusoidal positional embedding for diffusion timesteps."""
     def __init__(self, dim):
@@ -586,7 +578,6 @@ <h2>7. Practical Lab: Video Prediction with a Diffusion Transformer <span class=
         emb = t[:, None].float() * emb[None, :]
         return torch.cat([emb.sin(), emb.cos()], dim=-1)
 
-
 class PatchEmbed(nn.Module):
     """Convert image frames into patch tokens."""
     def __init__(self, img_size=64, patch_size=8, in_channels=3, embed_dim=256):
@@ -602,7 +593,6 @@ <h2>7. Practical Lab: Video Prediction with a Diffusion Transformer <span class=
         x = self.proj(x)
         return rearrange(x, "b c h w -> b (h w) c")
 
-
 class WorldModelBlock(nn.Module):
     """Transformer block with cross-attention for action conditioning."""
     def __init__(self, dim=256, heads=8, mlp_ratio=4):
@@ -629,7 +619,6 @@ <h2>7. Practical Lab: Video Prediction with a Diffusion Transformer <span class=
         x = x + self.mlp(self.norm3(x))
         return x
 
-
 class SimpleWorldModel(nn.Module):
     """
     A minimal diffusion-transformer world model.
@@ -757,7 +746,6 @@ <h2>7. Practical Lab: Video Prediction with a Diffusion Transformer <span class=
 from torch.utils.data import DataLoader, Dataset
 import numpy as np
 
-
 class SyntheticPhysicsDataset(Dataset):
     """
     Generate simple 2D physics scenarios: a ball bouncing in a box.
@@ -811,7 +799,6 @@ <h2>7. Practical Lab: Video Prediction with a Diffusion Transformer <span class=
             torch.from_numpy(action),
         )
 
-
 def train_world_model(epochs=50, batch_size=32, lr=1e-4):
     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 
@@ -847,7 +834,6 @@ <h2>7. Practical Lab: Video Prediction with a Diffusion Transformer <span class=
 
     return model
 
-
 if __name__ == "__main__":
     model = train_world_model()
     torch.save(model.state_dict(), "world_model_bouncing_ball.pt")
@@ -859,7 +845,6 @@ <h2>7. Practical Lab: Video Prediction with a Diffusion Transformer <span class=
 <p><strong>The bouncing ball is a microcosm of world modeling.</strong> Even this simple scenario exercises the core challenges: the model must learn that the ball persists between frames (object permanence), that it moves in a consistent direction (momentum), that it reverses direction at boundaries (collision physics), and that the applied action influences its trajectory (action conditioning). Scaling these same principles to photorealistic video of complex environments is an engineering challenge, not a conceptual one.</p>
 </div>
 
-
 <div class="callout exercise">
 <div class="callout-title">Exercise</div>
 <span class="exercise-type conceptual">Conceptual</span>
@@ -895,7 +880,6 @@ <h2>7. Practical Lab: Video Prediction with a Diffusion Transformer <span class=
 <p>Explain the "drift" problem in autoregressive world models: why do prediction errors compound over time? Describe three approaches to mitigating drift (re-anchoring, hierarchical generation, explicit memory) and analyze the tradeoffs of each. Calculate: if a model introduces an average positional error of 0.5 pixels per frame for a moving object, how far off will the predicted position be after 100 frames? After 1,000 frames? What does this imply about the maximum useful planning horizon?</p>
 </div>
 
-
 <div class="whats-next">
 <h2>What Comes Next</h2>
 <p>
@@ -906,7 +890,6 @@ <h2>What Comes Next</h2>
 </p>
 </div>
 
-
 <section class="bibliography">
 <h2>Bibliography and Further Reading</h2>
 <ul>
@@ -930,7 +913,6 @@ <h2>Bibliography and Further Reading</h2>
 <a class="next" href="../module-35-ai-society/index.html">Chapter 35: AI, Society &amp; Open Problems</a>
 </nav>
 
-
 <footer>
 <p class="footer-title">Building Conversational AI with LLMs and Agents, Fifth Edition</p>
 <p>&copy; 2026 Alexander Apartsin &amp; Yehudit Aperstein &middot; <a href="../../toc.html">Contents</a></p>
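The lab code in this file's hunks references a `SyntheticPhysicsDataset` that renders a ball bouncing in a box. The rollout it describes can be approximated with a standalone NumPy sketch; the frame size, radius, and velocity ranges below are illustrative assumptions, not the book's exact values:

```python
import numpy as np

def simulate_bouncing_ball(n_frames=16, size=64, radius=3, seed=0):
    """Roll out a ball bouncing elastically inside a square box.

    Returns (frames, positions): frames has shape (n_frames, size, size),
    positions has shape (n_frames, 2). Dynamics: constant velocity, with
    the velocity component reversed when the ball reaches a wall.
    """
    rng = np.random.default_rng(seed)
    pos = rng.uniform(radius, size - radius, 2)   # start inside the box
    vel = rng.uniform(-2.0, 2.0, 2)               # pixels per frame
    ys, xs = np.mgrid[0:size, 0:size]
    frames, positions = [], []
    for _ in range(n_frames):
        pos = pos + vel
        for axis in (0, 1):
            # Elastic collision: reverse velocity and stay inside the walls.
            if pos[axis] < radius or pos[axis] > size - radius:
                vel[axis] = -vel[axis]
                pos[axis] = np.clip(pos[axis], radius, size - radius)
        # Rasterize the ball as a filled disc on a blank frame.
        frame = ((xs - pos[1]) ** 2 + (ys - pos[0]) ** 2 <= radius ** 2)
        frames.append(frame.astype(np.float32))
        positions.append(pos.copy())
    return np.stack(frames), np.stack(positions)

frames, positions = simulate_bouncing_ball()
print(frames.shape, positions.shape)  # → (16, 64, 64) (16, 2)
```

Even this toy generator encodes the properties the lab's Key Insight lists: object permanence (the disc is drawn every frame), momentum (constant velocity), and collision physics (the sign flip at the walls).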
