ArmDeveloperEcosystem
diff --git a/‎content/learning-paths/laptops-and-desktops/dgx_persistent_agent/1_introduce.md‎
Lines changed: 31 additions & 90 deletions b/‎content/learning-paths/laptops-and-desktops/dgx_persistent_agent/1_introduce.md‎
Lines changed: 31 additions & 90 deletions
@@ -1,20 +1,20 @@
 ---
-title: Understand Persistent AI Runtime Architecture
+title: Understand persistent AI runtime architecture
 weight: 2
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
-## Understand Persistent AI Runtime Architecture
+## Understand persistent AI runtime architecture
 
-In this Learning Path, you will build a ***persistent local AI runtime*** on NVIDIA [DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/). The implementation is validated on DGX Spark, but the architecture also applies to other ***Arm Cortex-A platforms*** that can run containerized services and local AI runtimes.
+You will build a persistent local AI runtime on NVIDIA [DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/). The implementation is validated on DGX Spark, but the architecture also applies to other Arm Cortex-A platforms that can run containerized services and local AI runtimes.
 
 The final system is not a single chatbot process. It is a set of local services that run continuously, share a workspace, react to file events, generate summaries, create embeddings, store vector memory, retrieve context, and periodically reason about the state of the workspace.
 
-The core idea is: ***AI systems are orchestration systems, not just inference systems.***
+The core idea is: **AI systems are orchestration systems, not just inference systems.**
 
-DGX Spark is well suited to this type of workload because it combines ***Arm CPU orchestration*** with local GPU acceleration. In the [Grace Blackwell architecture](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction/), the Arm Grace CPU coordinates background services, filesystem events, scheduling, document processing, metadata handling, and service-to-service communication. The Blackwell GPU accelerates ***local LLM inference***, token generation, summarization, and embedding generation.
+DGX Spark is well suited to this type of workload because it combines Arm CPU orchestration with local GPU acceleration. In the [Grace Blackwell architecture](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction/), the Arm Grace CPU coordinates background services, filesystem events, scheduling, document processing, metadata handling, and service-to-service communication. The Blackwell GPU accelerates local LLM inference, token generation, summarization, and embedding generation.
 
 By the end of this Learning Path, you will have a local runtime with these capabilities:
 
@@ -27,7 +27,7 @@ By the end of this Learning Path, you will have a local runtime with these capab
 | Semantic retrieval | Hermes Agent + Qdrant + Ollama |
 | Autonomous workspace cognition | Hermes Agent + Ollama |
 
-## Runtime Architecture Overview
+## Runtime architecture overview
 
 The runtime uses four containerized services:
 
@@ -61,7 +61,7 @@ These services communicate over a local Docker network and share a persistent wo
                                       +---------------+   +---------------+
 ```
 
-The important architectural pattern is ***separation of responsibilities***. Each service has a narrow role, and Hermes coordinates the overall workflow.
+The important architectural pattern is separation of responsibilities. Each service has a narrow role, and Hermes coordinates the overall workflow.
 
 | Layer | Service | Purpose |
 |---|---|---|
@@ -70,85 +70,45 @@ The important architectural pattern is ***separation of responsibilities***. Eac
 | Memory layer | Qdrant | Stores and searches vector memory |
 | Orchestration layer | Hermes Agent | Watches files, schedules work, coordinates services |
 
-## Runtime Components
+## Runtime components
 
-### Hermes Runtime
+### Hermes runtime
 
-Hermes is the ***orchestration runtime*** you will build in this Learning Path.
+Hermes is the orchestration runtime you will build.
 
-It runs as a persistent Python service inside a container. It watches the shared workspace, detects new files, reads documents, sends requests to Ollama, stores memory in Qdrant, performs semantic retrieval, and later generates autonomous workspace summaries.
-
-Hermes is responsible for:
-
-- Filesystem monitoring
-- Workflow orchestration
-- Runtime scheduling
-- Document parsing
-- Prompt preparation
-- Inference coordination
-- Memory coordination
-- Autonomous cognition
-
-Hermes does not run the language model itself. Instead, it coordinates AI workflows across local services.
+It runs as a persistent Python service inside a container. It watches the shared workspace, detects new files, reads documents, sends requests to Ollama, stores memory in Qdrant, performs semantic retrieval, and later generates autonomous workspace summaries. Hermes doesn't run the language model itself. It coordinates AI workflows across local services.
 
 This is the main CPU-side workload in the system. The Arm CPU keeps the runtime alive, schedules background loops, tracks file events, moves data between services, and manages runtime state.
 
-### Ollama Runtime
+### Ollama runtime
 
-Ollama provides the local inference runtime in this Learning Path. It is used because it is a convenient way to run local models and expose a simple API, but the architecture is not limited to Ollama.
+Ollama provides the local inference runtime. It's a convenient way to run local models and expose a simple API, but the architecture isn't limited to Ollama.
 
-Conceptually, Ollama is one possible ***inference backend***. Hermes can orchestrate any local or remote inference service that exposes a compatible API, such as llama.cpp server, vLLM, a custom PyTorch service, or another model runtime.
+Conceptually, Ollama is one possible inference backend. Hermes can orchestrate any local or remote inference service that exposes a compatible API, such as llama.cpp server, vLLM, a custom PyTorch service, or another model runtime.
 
-In this Learning Path, Hermes uses Ollama for two types of model calls:
+Hermes uses Ollama for two types of model calls:
 
 - Chat completion, using [`qwen2.5:7b`](https://huggingface.co/Qwen/Qwen2.5-7B)
 - Embedding generation, using [`nomic-embed-text`](https://ollama.com/library/nomic-embed-text)
 
 
-The chat model is used to summarize files, answer questions over retrieved memory, and generate workspace-level insights. The embedding model converts text into vectors so Qdrant can store and search semantic memory.
-
-Ollama is responsible for:
-
-- Local LLM inference
-- Token generation
-- AI summarization
-- Embedding generation
-
-Ollama does not watch files, manage memory, or decide when work should happen. It provides model execution, and Hermes calls it when the workflow requires inference.
-
-### Qdrant Memory Service
-
-Qdrant provides ***persistent vector memory***.
+The chat model summarizes files, answers questions over retrieved memory, and generates workspace-level insights. The embedding model converts text into vectors so Qdrant can store and search semantic memory. Ollama doesn't watch files, manage memory, or decide when work happens. Hermes calls it when the workflow requires inference.
 
-Hermes stores document embeddings in a Qdrant collection named `workspace_memory`. Each stored point includes a vector and payload metadata, such as the document path, generated summary, and source content excerpt.
+### Qdrant memory service
 
-Qdrant is responsible for:
+Qdrant provides persistent vector memory.
 
-- Vector storage
-- Semantic indexing
-- Similarity search
-- Long-term memory persistence
-- Contextual retrieval
-
-Qdrant does not perform LLM inference. It stores vectors and returns semantically similar memories when Hermes performs a retrieval query.
+Hermes stores document embeddings in a Qdrant collection named `workspace_memory`. Each stored point includes a vector and payload metadata, such as the document path, generated summary, and source content excerpt. Qdrant stores vectors and returns semantically similar memories when Hermes performs a retrieval query. It doesn't run inference or decide when retrieval happens.
 
 ### Open WebUI
 
 Open WebUI provides a local browser interface for interacting with the Ollama runtime.
 
-It is useful for validating that local models are available, testing prompts, and giving users a simple interface to local inference. In this Learning Path, Open WebUI is not the orchestration layer and it is not the memory system.
-
-Open WebUI is responsible for:
-
-- Browser-based access
-- Local chat interaction
-- Model testing and exploration
-
-The persistent AI runtime is still coordinated by Hermes.
+Use it to confirm that local models are available, test prompts, or explore the inference runtime in a browser. The persistent AI runtime is still coordinated by Hermes.
 
-## Shared Workspace
+## Shared workspace
 
-The services use a ***shared workspace*** mounted into the containers.
+The services use a shared workspace mounted into the containers.
 
 The workspace structure is:
 
@@ -173,11 +133,11 @@ Each directory has a specific purpose:
 
 The shared workspace is what turns isolated containers into a coordinated local AI runtime. Hermes can observe files created on the host, use Ollama to process them, store memory in Qdrant, and write results back to persistent storage.
 
-## Event-driven AI Workflows
+## Event-driven AI workflows
 
 Persistent AI systems are long-running systems. They do not wait for a single prompt and then exit. They monitor runtime state and react when something changes.
 
-In this Learning Path, Hermes starts with a filesystem watcher:
+Hermes starts with a filesystem watcher:
 
 ```text
 [New document] -> [Filesystem event] -> [Hermes orchestration] -> [Document processing]
@@ -196,9 +156,9 @@ As you add capabilities, the workflow grows:
 
 This event-driven design is important because it shows how AI systems become continuous local runtimes. The model is only one part of the system. The surrounding runtime decides when to call the model, what context to provide, where to store results, and how later workflows can reuse those results.
 
-## Semantic Memory and Retrieval
+## Semantic memory and retrieval
 
-***Semantic memory*** gives the runtime a way to retain information over time.
+Semantic memory gives the runtime a way to retain information over time.
 
 | Flow | Runtime path |
 |---|---|
@@ -207,7 +167,7 @@ This event-driven design is important because it shows how AI systems become con
 
 This is different from storing plain text files and searching for keywords. Vector search allows the runtime to retrieve content based on semantic similarity. For example, a question about "CPU scheduling" can retrieve a document that discusses "runtime orchestration" even if the exact words are different.
 
-## Autonomous Workspace Cognition
+## Autonomous workspace cognition
 
 The final stage of this Learning Path adds autonomous workspace cognition.
 
@@ -227,33 +187,14 @@ Runtime behavior is controlled by a configuration file:
 
 This allows the runtime to adjust settings such as supported file extensions, retrieval depth, summary interval, and summary output path without hardcoding every behavior into the agent.
 
-## CPU and GPU Responsibilities
+## CPU and GPU responsibilities
 
 This Learning Path highlights heterogeneous AI computing. The CPU and GPU both matter, but they perform different roles.
 
-The Arm Grace CPU coordinates persistent runtime work:
-
-- Filesystem monitoring
-- Event scheduling
-- Runtime orchestration
-- Background service coordination
-- Document parsing
-- Metadata management
-- Vector database coordination
-- Runtime policy loading
-- Long-running process lifecycle management
-
-The Blackwell GPU accelerates model execution:
-
-- Local LLM inference
-- Token generation
-- AI summarization
-- Embedding generation
-- Contextual reasoning
-- Workspace summary generation
+The Arm Grace CPU handles the persistent runtime side: watching the filesystem, scheduling background loops, coordinating services, parsing documents, managing metadata, and keeping long-running processes alive. The Blackwell GPU handles model execution, including LLM inference, token generation, summarization, embedding generation, and contextual reasoning.
 
-This separation is central to the architecture. The GPU accelerates model-heavy operations, while the CPU keeps the distributed AI runtime organized and continuously operating.
+This separation is central to the architecture. The GPU accelerates model-heavy operations, while the CPU keeps the runtime organized and continuously operating.
 
-## Next Step
+## Next step
 
 Next, you will build the DGX Spark runtime foundation: Docker, GPU-enabled containers, the shared workspace, and the initial Ollama, Qdrant, and Open WebUI services.