Skip to content

Commit 863e886

Browse files
Merge pull request #3357 from jasonrandrews/review
Review Hermes agent on DGX Spark Learning Path
2 parents f11113c + 89128f5 commit 863e886

8 files changed

Lines changed: 298 additions & 836 deletions

File tree

Lines changed: 31 additions & 90 deletions
Original file line numberDiff line numberDiff line change
@@ -1,20 +1,20 @@
11
---
2-
title: Understand Persistent AI Runtime Architecture
2+
title: Understand persistent AI runtime architecture
33
weight: 2
44

55
### FIXED, DO NOT MODIFY
66
layout: learningpathall
77
---
88

9-
## Understand Persistent AI Runtime Architecture
9+
## Understand persistent AI runtime architecture
1010

11-
In this Learning Path, you will build a ***persistent local AI runtime*** on NVIDIA [DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/). The implementation is validated on DGX Spark, but the architecture also applies to other ***Arm Cortex-A platforms*** that can run containerized services and local AI runtimes.
11+
You will build a persistent local AI runtime on NVIDIA [DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/). The implementation is validated on DGX Spark, but the architecture also applies to other Arm Cortex-A platforms that can run containerized services and local AI runtimes.
1212

1313
The final system is not a single chatbot process. It is a set of local services that run continuously, share a workspace, react to file events, generate summaries, create embeddings, store vector memory, retrieve context, and periodically reason about the state of the workspace.
1414

15-
The core idea is: ***AI systems are orchestration systems, not just inference systems.***
15+
The core idea is: **AI systems are orchestration systems, not just inference systems.**
1616

17-
DGX Spark is well suited to this type of workload because it combines ***Arm CPU orchestration*** with local GPU acceleration. In the [Grace Blackwell architecture](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction/), the Arm Grace CPU coordinates background services, filesystem events, scheduling, document processing, metadata handling, and service-to-service communication. The Blackwell GPU accelerates ***local LLM inference***, token generation, summarization, and embedding generation.
17+
DGX Spark is well suited to this type of workload because it combines Arm CPU orchestration with local GPU acceleration. In the [Grace Blackwell architecture](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction/), the Arm Grace CPU coordinates background services, filesystem events, scheduling, document processing, metadata handling, and service-to-service communication. The Blackwell GPU accelerates local LLM inference, token generation, summarization, and embedding generation.
1818

1919
By the end of this Learning Path, you will have a local runtime with these capabilities:
2020

@@ -27,7 +27,7 @@ By the end of this Learning Path, you will have a local runtime with these capab
2727
| Semantic retrieval | Hermes Agent + Qdrant + Ollama |
2828
| Autonomous workspace cognition | Hermes Agent + Ollama |
2929

30-
## Runtime Architecture Overview
30+
## Runtime architecture overview
3131

3232
The runtime uses four containerized services:
3333

@@ -61,7 +61,7 @@ These services communicate over a local Docker network and share a persistent wo
6161
+---------------+ +---------------+
6262
```
6363

64-
The important architectural pattern is ***separation of responsibilities***. Each service has a narrow role, and Hermes coordinates the overall workflow.
64+
The important architectural pattern is separation of responsibilities. Each service has a narrow role, and Hermes coordinates the overall workflow.
6565

6666
| Layer | Service | Purpose |
6767
|---|---|---|
@@ -70,85 +70,45 @@ The important architectural pattern is ***separation of responsibilities***. Eac
7070
| Memory layer | Qdrant | Stores and searches vector memory |
7171
| Orchestration layer | Hermes Agent | Watches files, schedules work, coordinates services |
7272

73-
## Runtime Components
73+
## Runtime components
7474

75-
### Hermes Runtime
75+
### Hermes runtime
7676

77-
Hermes is the ***orchestration runtime*** you will build in this Learning Path.
77+
Hermes is the orchestration runtime you will build.
7878

79-
It runs as a persistent Python service inside a container. It watches the shared workspace, detects new files, reads documents, sends requests to Ollama, stores memory in Qdrant, performs semantic retrieval, and later generates autonomous workspace summaries.
80-
81-
Hermes is responsible for:
82-
83-
- Filesystem monitoring
84-
- Workflow orchestration
85-
- Runtime scheduling
86-
- Document parsing
87-
- Prompt preparation
88-
- Inference coordination
89-
- Memory coordination
90-
- Autonomous cognition
91-
92-
Hermes does not run the language model itself. Instead, it coordinates AI workflows across local services.
79+
It runs as a persistent Python service inside a container. It watches the shared workspace, detects new files, reads documents, sends requests to Ollama, stores memory in Qdrant, performs semantic retrieval, and later generates autonomous workspace summaries. Hermes doesn't run the language model itself. It coordinates AI workflows across local services.
9380

9481
This is the main CPU-side workload in the system. The Arm CPU keeps the runtime alive, schedules background loops, tracks file events, moves data between services, and manages runtime state.
9582

96-
### Ollama Runtime
83+
### Ollama runtime
9784

98-
Ollama provides the local inference runtime in this Learning Path. It is used because it is a convenient way to run local models and expose a simple API, but the architecture is not limited to Ollama.
85+
Ollama provides the local inference runtime. It's a convenient way to run local models and expose a simple API, but the architecture isn't limited to Ollama.
9986

100-
Conceptually, Ollama is one possible ***inference backend***. Hermes can orchestrate any local or remote inference service that exposes a compatible API, such as llama.cpp server, vLLM, a custom PyTorch service, or another model runtime.
87+
Conceptually, Ollama is one possible inference backend. Hermes can orchestrate any local or remote inference service that exposes a compatible API, such as llama.cpp server, vLLM, a custom PyTorch service, or another model runtime.
10188

102-
In this Learning Path, Hermes uses Ollama for two types of model calls:
89+
Hermes uses Ollama for two types of model calls:
10390

10491
- Chat completion, using [`qwen2.5:7b`](https://huggingface.co/Qwen/Qwen2.5-7B)
10592
- Embedding generation, using [`nomic-embed-text`](https://ollama.com/library/nomic-embed-text)
10693

10794

108-
The chat model is used to summarize files, answer questions over retrieved memory, and generate workspace-level insights. The embedding model converts text into vectors so Qdrant can store and search semantic memory.
109-
110-
Ollama is responsible for:
111-
112-
- Local LLM inference
113-
- Token generation
114-
- AI summarization
115-
- Embedding generation
116-
117-
Ollama does not watch files, manage memory, or decide when work should happen. It provides model execution, and Hermes calls it when the workflow requires inference.
118-
119-
### Qdrant Memory Service
120-
121-
Qdrant provides ***persistent vector memory***.
95+
The chat model summarizes files, answers questions over retrieved memory, and generates workspace-level insights. The embedding model converts text into vectors so Qdrant can store and search semantic memory. Ollama doesn't watch files, manage memory, or decide when work happens. Hermes calls it when the workflow requires inference.
12296

123-
Hermes stores document embeddings in a Qdrant collection named `workspace_memory`. Each stored point includes a vector and payload metadata, such as the document path, generated summary, and source content excerpt.
97+
### Qdrant memory service
12498

125-
Qdrant is responsible for:
99+
Qdrant provides persistent vector memory.
126100

127-
- Vector storage
128-
- Semantic indexing
129-
- Similarity search
130-
- Long-term memory persistence
131-
- Contextual retrieval
132-
133-
Qdrant does not perform LLM inference. It stores vectors and returns semantically similar memories when Hermes performs a retrieval query.
101+
Hermes stores document embeddings in a Qdrant collection named `workspace_memory`. Each stored point includes a vector and payload metadata, such as the document path, generated summary, and source content excerpt. Qdrant stores vectors and returns semantically similar memories when Hermes performs a retrieval query. It doesn't run inference or decide when retrieval happens.
134102

135103
### Open WebUI
136104

137105
Open WebUI provides a local browser interface for interacting with the Ollama runtime.
138106

139-
It is useful for validating that local models are available, testing prompts, and giving users a simple interface to local inference. In this Learning Path, Open WebUI is not the orchestration layer and it is not the memory system.
140-
141-
Open WebUI is responsible for:
142-
143-
- Browser-based access
144-
- Local chat interaction
145-
- Model testing and exploration
146-
147-
The persistent AI runtime is still coordinated by Hermes.
107+
Use it to confirm that local models are available, test prompts, or explore the inference runtime in a browser. The persistent AI runtime is still coordinated by Hermes.
148108

149-
## Shared Workspace
109+
## Shared workspace
150110

151-
The services use a ***shared workspace*** mounted into the containers.
111+
The services use a shared workspace mounted into the containers.
152112

153113
The workspace structure is:
154114

@@ -173,11 +133,11 @@ Each directory has a specific purpose:
173133

174134
The shared workspace is what turns isolated containers into a coordinated local AI runtime. Hermes can observe files created on the host, use Ollama to process them, store memory in Qdrant, and write results back to persistent storage.
175135

176-
## Event-driven AI Workflows
136+
## Event-driven AI workflows
177137

178138
Persistent AI systems are long-running systems. They do not wait for a single prompt and then exit. They monitor runtime state and react when something changes.
179139

180-
In this Learning Path, Hermes starts with a filesystem watcher:
140+
Hermes starts with a filesystem watcher:
181141

182142
```text
183143
[New document] -> [Filesystem event] -> [Hermes orchestration] -> [Document processing]
@@ -196,9 +156,9 @@ As you add capabilities, the workflow grows:
196156

197157
This event-driven design is important because it shows how AI systems become continuous local runtimes. The model is only one part of the system. The surrounding runtime decides when to call the model, what context to provide, where to store results, and how later workflows can reuse those results.
198158

199-
## Semantic Memory and Retrieval
159+
## Semantic memory and retrieval
200160

201-
***Semantic memory*** gives the runtime a way to retain information over time.
161+
Semantic memory gives the runtime a way to retain information over time.
202162

203163
| Flow | Runtime path |
204164
|---|---|
@@ -207,7 +167,7 @@ This event-driven design is important because it shows how AI systems become con
207167

208168
This is different from storing plain text files and searching for keywords. Vector search allows the runtime to retrieve content based on semantic similarity. For example, a question about "CPU scheduling" can retrieve a document that discusses "runtime orchestration" even if the exact words are different.
209169

210-
## Autonomous Workspace Cognition
170+
## Autonomous workspace cognition
211171

212172
The final stage of this Learning Path adds autonomous workspace cognition.
213173

@@ -227,33 +187,14 @@ Runtime behavior is controlled by a configuration file:
227187

228188
This allows the runtime to adjust settings such as supported file extensions, retrieval depth, summary interval, and summary output path without hardcoding every behavior into the agent.
229189

230-
## CPU and GPU Responsibilities
190+
## CPU and GPU responsibilities
231191

232192
This Learning Path highlights heterogeneous AI computing. The CPU and GPU both matter, but they perform different roles.
233193

234-
The Arm Grace CPU coordinates persistent runtime work:
235-
236-
- Filesystem monitoring
237-
- Event scheduling
238-
- Runtime orchestration
239-
- Background service coordination
240-
- Document parsing
241-
- Metadata management
242-
- Vector database coordination
243-
- Runtime policy loading
244-
- Long-running process lifecycle management
245-
246-
The Blackwell GPU accelerates model execution:
247-
248-
- Local LLM inference
249-
- Token generation
250-
- AI summarization
251-
- Embedding generation
252-
- Contextual reasoning
253-
- Workspace summary generation
194+
The Arm Grace CPU handles the persistent runtime side: watching the filesystem, scheduling background loops, coordinating services, parsing documents, managing metadata, and keeping long-running processes alive. The Blackwell GPU handles model execution, including LLM inference, token generation, summarization, embedding generation, and contextual reasoning.
254195

255-
This separation is central to the architecture. The GPU accelerates model-heavy operations, while the CPU keeps the distributed AI runtime organized and continuously operating.
196+
This separation is central to the architecture. The GPU accelerates model-heavy operations, while the CPU keeps the runtime organized and continuously operating.
256197

257-
## Next Step
198+
## Next step
258199

259200
Next, you will build the DGX Spark runtime foundation: Docker, GPU-enabled containers, the shared workspace, and the initial Ollama, Qdrant, and Open WebUI services.

0 commit comments

Comments
 (0)