You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
title: Understand Persistent AI Runtime Architecture
2
+
title: Understand persistent AI runtime architecture
3
3
weight: 2
4
4
5
5
### FIXED, DO NOT MODIFY
6
6
layout: learningpathall
7
7
---
8
8
9
-
## Understand Persistent AI Runtime Architecture
9
+
## Understand persistent AI runtime architecture
10
10
11
-
In this Learning Path, you will build a ***persistent local AI runtime*** on NVIDIA [DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/). The implementation is validated on DGX Spark, but the architecture also applies to other ***Arm Cortex-A platforms*** that can run containerized services and local AI runtimes.
11
+
You will build a persistent local AI runtime on NVIDIA [DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/). The implementation is validated on DGX Spark, but the architecture also applies to other Arm Cortex-A platforms that can run containerized services and local AI runtimes.
12
12
13
13
The final system is not a single chatbot process. It is a set of local services that run continuously, share a workspace, react to file events, generate summaries, create embeddings, store vector memory, retrieve context, and periodically reason about the state of the workspace.
14
14
15
-
The core idea is: ***AI systems are orchestration systems, not just inference systems.***
15
+
The core idea is: **AI systems are orchestration systems, not just inference systems.**
16
16
17
-
DGX Spark is well suited to this type of workload because it combines ***Arm CPU orchestration*** with local GPU acceleration. In the [Grace Blackwell architecture](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction/), the Arm Grace CPU coordinates background services, filesystem events, scheduling, document processing, metadata handling, and service-to-service communication. The Blackwell GPU accelerates ***local LLM inference***, token generation, summarization, and embedding generation.
17
+
DGX Spark is well suited to this type of workload because it combines Arm CPU orchestration with local GPU acceleration. In the [Grace Blackwell architecture](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction/), the Arm Grace CPU coordinates background services, filesystem events, scheduling, document processing, metadata handling, and service-to-service communication. The Blackwell GPU accelerates local LLM inference, token generation, summarization, and embedding generation.
18
18
19
19
By the end of this Learning Path, you will have a local runtime with these capabilities:
20
20
@@ -27,7 +27,7 @@ By the end of this Learning Path, you will have a local runtime with these capab
@@ -61,7 +61,7 @@ These services communicate over a local Docker network and share a persistent wo
61
61
+---------------+ +---------------+
62
62
```
63
63
64
-
The important architectural pattern is ***separation of responsibilities***. Each service has a narrow role, and Hermes coordinates the overall workflow.
64
+
The important architectural pattern is separation of responsibilities. Each service has a narrow role, and Hermes coordinates the overall workflow.
65
65
66
66
| Layer | Service | Purpose |
67
67
|---|---|---|
@@ -70,85 +70,45 @@ The important architectural pattern is ***separation of responsibilities***. Eac
Hermes is the ***orchestration runtime*** you will build in this Learning Path.
77
+
Hermes is the orchestration runtime you will build.
78
78
79
-
It runs as a persistent Python service inside a container. It watches the shared workspace, detects new files, reads documents, sends requests to Ollama, stores memory in Qdrant, performs semantic retrieval, and later generates autonomous workspace summaries.
80
-
81
-
Hermes is responsible for:
82
-
83
-
- Filesystem monitoring
84
-
- Workflow orchestration
85
-
- Runtime scheduling
86
-
- Document parsing
87
-
- Prompt preparation
88
-
- Inference coordination
89
-
- Memory coordination
90
-
- Autonomous cognition
91
-
92
-
Hermes does not run the language model itself. Instead, it coordinates AI workflows across local services.
79
+
It runs as a persistent Python service inside a container. It watches the shared workspace, detects new files, reads documents, sends requests to Ollama, stores memory in Qdrant, performs semantic retrieval, and later generates autonomous workspace summaries. Hermes doesn't run the language model itself. It coordinates AI workflows across local services.
93
80
94
81
This is the main CPU-side workload in the system. The Arm CPU keeps the runtime alive, schedules background loops, tracks file events, moves data between services, and manages runtime state.
95
82
96
-
### Ollama Runtime
83
+
### Ollama runtime
97
84
98
-
Ollama provides the local inference runtime in this Learning Path. It is used because it is a convenient way to run local models and expose a simple API, but the architecture is not limited to Ollama.
85
+
Ollama provides the local inference runtime. It's a convenient way to run local models and expose a simple API, but the architecture isn't limited to Ollama.
99
86
100
-
Conceptually, Ollama is one possible ***inference backend***. Hermes can orchestrate any local or remote inference service that exposes a compatible API, such as llama.cpp server, vLLM, a custom PyTorch service, or another model runtime.
87
+
Conceptually, Ollama is one possible inference backend. Hermes can orchestrate any local or remote inference service that exposes a compatible API, such as llama.cpp server, vLLM, a custom PyTorch service, or another model runtime.
101
88
102
-
In this Learning Path, Hermes uses Ollama for two types of model calls:
89
+
Hermes uses Ollama for two types of model calls:
103
90
104
91
- Chat completion, using [`qwen2.5:7b`](https://huggingface.co/Qwen/Qwen2.5-7B)
105
92
- Embedding generation, using [`nomic-embed-text`](https://ollama.com/library/nomic-embed-text)
106
93
107
94
108
-
The chat model is used to summarize files, answer questions over retrieved memory, and generate workspace-level insights. The embedding model converts text into vectors so Qdrant can store and search semantic memory.
109
-
110
-
Ollama is responsible for:
111
-
112
-
- Local LLM inference
113
-
- Token generation
114
-
- AI summarization
115
-
- Embedding generation
116
-
117
-
Ollama does not watch files, manage memory, or decide when work should happen. It provides model execution, and Hermes calls it when the workflow requires inference.
118
-
119
-
### Qdrant Memory Service
120
-
121
-
Qdrant provides ***persistent vector memory***.
95
+
The chat model summarizes files, answers questions over retrieved memory, and generates workspace-level insights. The embedding model converts text into vectors so Qdrant can store and search semantic memory. Ollama doesn't watch files, manage memory, or decide when work happens. Hermes calls it when the workflow requires inference.
122
96
123
-
Hermes stores document embeddings in a Qdrant collection named `workspace_memory`. Each stored point includes a vector and payload metadata, such as the document path, generated summary, and source content excerpt.
97
+
### Qdrant memory service
124
98
125
-
Qdrant is responsible for:
99
+
Qdrant provides persistent vector memory.
126
100
127
-
- Vector storage
128
-
- Semantic indexing
129
-
- Similarity search
130
-
- Long-term memory persistence
131
-
- Contextual retrieval
132
-
133
-
Qdrant does not perform LLM inference. It stores vectors and returns semantically similar memories when Hermes performs a retrieval query.
101
+
Hermes stores document embeddings in a Qdrant collection named `workspace_memory`. Each stored point includes a vector and payload metadata, such as the document path, generated summary, and source content excerpt. Qdrant stores vectors and returns semantically similar memories when Hermes performs a retrieval query. It doesn't run inference or decide when retrieval happens.
134
102
135
103
### Open WebUI
136
104
137
105
Open WebUI provides a local browser interface for interacting with the Ollama runtime.
138
106
139
-
It is useful for validating that local models are available, testing prompts, and giving users a simple interface to local inference. In this Learning Path, Open WebUI is not the orchestration layer and it is not the memory system.
140
-
141
-
Open WebUI is responsible for:
142
-
143
-
- Browser-based access
144
-
- Local chat interaction
145
-
- Model testing and exploration
146
-
147
-
The persistent AI runtime is still coordinated by Hermes.
107
+
Use it to confirm that local models are available, test prompts, or explore the inference runtime in a browser. The persistent AI runtime is still coordinated by Hermes.
148
108
149
-
## Shared Workspace
109
+
## Shared workspace
150
110
151
-
The services use a ***shared workspace*** mounted into the containers.
111
+
The services use a shared workspace mounted into the containers.
152
112
153
113
The workspace structure is:
154
114
@@ -173,11 +133,11 @@ Each directory has a specific purpose:
173
133
174
134
The shared workspace is what turns isolated containers into a coordinated local AI runtime. Hermes can observe files created on the host, use Ollama to process them, store memory in Qdrant, and write results back to persistent storage.
175
135
176
-
## Event-driven AI Workflows
136
+
## Event-driven AI workflows
177
137
178
138
Persistent AI systems are long-running systems. They do not wait for a single prompt and then exit. They monitor runtime state and react when something changes.
179
139
180
-
In this Learning Path, Hermes starts with a filesystem watcher:
@@ -196,9 +156,9 @@ As you add capabilities, the workflow grows:
196
156
197
157
This event-driven design is important because it shows how AI systems become continuous local runtimes. The model is only one part of the system. The surrounding runtime decides when to call the model, what context to provide, where to store results, and how later workflows can reuse those results.
198
158
199
-
## Semantic Memory and Retrieval
159
+
## Semantic memory and retrieval
200
160
201
-
***Semantic memory*** gives the runtime a way to retain information over time.
161
+
Semantic memory gives the runtime a way to retain information over time.
202
162
203
163
| Flow | Runtime path |
204
164
|---|---|
@@ -207,7 +167,7 @@ This event-driven design is important because it shows how AI systems become con
207
167
208
168
This is different from storing plain text files and searching for keywords. Vector search allows the runtime to retrieve content based on semantic similarity. For example, a question about "CPU scheduling" can retrieve a document that discusses "runtime orchestration" even if the exact words are different.
209
169
210
-
## Autonomous Workspace Cognition
170
+
## Autonomous workspace cognition
211
171
212
172
The final stage of this Learning Path adds autonomous workspace cognition.
213
173
@@ -227,33 +187,14 @@ Runtime behavior is controlled by a configuration file:
227
187
228
188
This allows the runtime to adjust settings such as supported file extensions, retrieval depth, summary interval, and summary output path without hardcoding every behavior into the agent.
229
189
230
-
## CPU and GPU Responsibilities
190
+
## CPU and GPU responsibilities
231
191
232
192
This Learning Path highlights heterogeneous AI computing. The CPU and GPU both matter, but they perform different roles.
233
193
234
-
The Arm Grace CPU coordinates persistent runtime work:
235
-
236
-
- Filesystem monitoring
237
-
- Event scheduling
238
-
- Runtime orchestration
239
-
- Background service coordination
240
-
- Document parsing
241
-
- Metadata management
242
-
- Vector database coordination
243
-
- Runtime policy loading
244
-
- Long-running process lifecycle management
245
-
246
-
The Blackwell GPU accelerates model execution:
247
-
248
-
- Local LLM inference
249
-
- Token generation
250
-
- AI summarization
251
-
- Embedding generation
252
-
- Contextual reasoning
253
-
- Workspace summary generation
194
+
The Arm Grace CPU handles the persistent runtime side: watching the filesystem, scheduling background loops, coordinating services, parsing documents, managing metadata, and keeping long-running processes alive. The Blackwell GPU handles model execution, including LLM inference, token generation, summarization, embedding generation, and contextual reasoning.
254
195
255
-
This separation is central to the architecture. The GPU accelerates model-heavy operations, while the CPU keeps the distributed AI runtime organized and continuously operating.
196
+
This separation is central to the architecture. The GPU accelerates model-heavy operations, while the CPU keeps the runtime organized and continuously operating.
256
197
257
-
## Next Step
198
+
## Next step
258
199
259
200
Next, you will build the DGX Spark runtime foundation: Docker, GPU-enabled containers, the shared workspace, and the initial Ollama, Qdrant, and Open WebUI services.
0 commit comments