Commit a558120

peterj and EItanya authored
openclaw substrate support (#1939)
support running openclaw on substrate.

Signed-off-by: Peter Jausovec <peter.jausovec@solo.io>
Signed-off-by: Eitan Yarmush <eitan.yarmush@solo.io>
Co-authored-by: Eitan Yarmush <eitan.yarmush@solo.io>
1 parent 17cf823 commit a558120

106 files changed

Lines changed: 6722 additions & 677 deletions

.github/workflows/ci.yaml

Lines changed: 1 addition & 1 deletion
@@ -81,7 +81,7 @@ jobs:
       - name: Install agent-sandbox
         run: |
           kubectl apply -f "https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${AGENT_SANDBOX_VERSION}/manifest.yaml"
-          kubectl wait --for=condition=Established crd/sandboxes.agents.x-k8s.io --timeout=90s
+          timeout 90s bash -c 'until [ "$(kubectl get crd sandboxes.agents.x-k8s.io -o jsonpath="{.status.conditions[?(@.type==\"Established\")].status}" 2>/dev/null)" = "True" ]; do sleep 1; done'
           kubectl rollout status deployment/agent-sandbox-controller -n agent-sandbox-system --timeout=120s
           kubectl wait --for=condition=Ready pod -l app=agent-sandbox-controller -n agent-sandbox-system --timeout=120s
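The replacement loop polls the CRD's `Established` condition and discards lookup errors (`2>/dev/null`), so it keeps retrying while the CRD is not yet readable. As a rough illustration of what the JSONPath filter selects, here is a minimal Go sketch that applies the same check to the JSON form of a CRD. The `crd` struct and `isEstablished` helper are hypothetical names introduced for this example; they are not part of the commit.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// crd mirrors only the CustomResourceDefinition fields this check needs.
type crd struct {
	Status struct {
		Conditions []struct {
			Type   string `json:"type"`
			Status string `json:"status"`
		} `json:"conditions"`
	} `json:"status"`
}

// isEstablished reports whether the CRD JSON carries Established=True.
// Empty or malformed input counts as "not established yet", which is
// how the shell loop treats a failed kubectl call.
func isEstablished(raw []byte) bool {
	var c crd
	if err := json.Unmarshal(raw, &c); err != nil {
		return false
	}
	for _, cond := range c.Status.Conditions {
		if cond.Type == "Established" {
			return cond.Status == "True"
		}
	}
	return false
}

func main() {
	sample := []byte(`{"status":{"conditions":[{"type":"NamesAccepted","status":"True"},{"type":"Established","status":"True"}]}}`)
	fmt.Println(isEstablished(sample))
	fmt.Println(isEstablished([]byte(`{}`)))
}
```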
Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
1+
# Substrate AgentHarness Lifecycle
2+
3+
This branch should use a single ownership model for `runtime: substrate` harnesses.
4+
5+
## Ownership
6+
7+
- Platform/Helm owns `WorkerPool` capacity.
8+
- kagent owns the generated per-harness `ActorTemplate`.
9+
- kagent owns the per-harness actor lifecycle through `ate-api`.
10+
- Substrate owns the `WorkerPool` deployment and the `ActorTemplate` golden snapshot process.
11+
12+
kagent should not create or delete `WorkerPool` resources from the `AgentHarness` reconciler. A chart may optionally install a default `WorkerPool`, and the controller may use that default when `spec.substrate.workerPoolRef` is unset.
13+
14+
## Spec Shape
15+
16+
`AgentHarness.spec.substrate` should contain only harness-level inputs:
17+
18+
- `workerPoolRef`, optional; falls back to the configured controller default.
19+
- `snapshotsConfig`, optional; defaults to `gs://ate-snapshots/<namespace>/<name>`.
20+
- `workloadImage`, optional.
21+
- exactly one of `gatewayToken` or `gatewayTokenSecretRef`.
22+
23+
There is no `actorTemplateRef`. kagent always generates the `ActorTemplate`, so adopting an external template is not part of the workflow.
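The spec rules above (a defaulted snapshot location and exactly one gateway token source) can be sketched in Go as follows. This is only an illustration under assumptions: `substrateSpec`, `defaultSnapshotLocation`, and `normalize` are hypothetical names, and the real controller may structure defaulting and validation differently.

```go
package main

import (
	"errors"
	"fmt"
)

// substrateSpec is a hypothetical, trimmed stand-in for
// AgentHarness.spec.substrate.
type substrateSpec struct {
	SnapshotLocation      string // snapshotsConfig.location
	GatewayToken          string // inline token
	GatewayTokenSecretRef string // name of a Secret holding key "token"
}

// defaultSnapshotLocation builds the documented default location.
func defaultSnapshotLocation(namespace, name string) string {
	return fmt.Sprintf("gs://ate-snapshots/%s/%s", namespace, name)
}

// normalize enforces "exactly one of gatewayToken or gatewayTokenSecretRef"
// and fills in the snapshot location when it is unset.
func normalize(namespace, name string, s substrateSpec) (substrateSpec, error) {
	hasToken := s.GatewayToken != ""
	hasRef := s.GatewayTokenSecretRef != ""
	if hasToken == hasRef {
		return s, errors.New("exactly one of gatewayToken or gatewayTokenSecretRef must be specified")
	}
	if s.SnapshotLocation == "" {
		s.SnapshotLocation = defaultSnapshotLocation(namespace, name)
	}
	return s, nil
}

func main() {
	s, err := normalize("kagent", "peterj-claw", substrateSpec{GatewayToken: "test-token"})
	fmt.Println(s.SnapshotLocation, err)
}
```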
24+
25+
## Status
26+
27+
Use top-level Kubernetes conditions for progress:
28+
29+
- `Accepted`
30+
- `ActorTemplateReady`
31+
- `ActorReady`
32+
- `Ready`
33+
34+
`Ready` is the aggregate condition. Specific blockers should be reflected in `reason` and `message`.
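One way to derive the aggregate `Ready` condition is to walk the component conditions in order and surface the first blocker's `reason` and `message`. The sketch below uses a simplified `condition` struct rather than the Kubernetes `metav1.Condition` type, and the reason strings are invented for illustration.

```go
package main

import "fmt"

// condition is a simplified stand-in for a Kubernetes status condition.
type condition struct {
	Type    string
	Status  bool
	Reason  string
	Message string
}

// aggregateReady returns Ready=true only when Accepted, ActorTemplateReady,
// and ActorReady are all true. Otherwise it copies the first blocker's
// reason and message so the cause is visible on Ready itself.
func aggregateReady(conds []condition) condition {
	byType := make(map[string]condition, len(conds))
	for _, c := range conds {
		byType[c.Type] = c
	}
	for _, t := range []string{"Accepted", "ActorTemplateReady", "ActorReady"} {
		c, ok := byType[t]
		if !ok {
			return condition{Type: "Ready", Reason: t + "Unknown", Message: t + " has not been reported yet"}
		}
		if !c.Status {
			return condition{Type: "Ready", Reason: c.Reason, Message: c.Message}
		}
	}
	return condition{Type: "Ready", Status: true, Reason: "AllConditionsTrue", Message: "harness is ready"}
}

func main() {
	ready := aggregateReady([]condition{
		{Type: "Accepted", Status: true},
		{Type: "ActorTemplateReady", Status: false, Reason: "GoldenSnapshotPending", Message: "waiting for ActorTemplate phase Ready"},
	})
	fmt.Printf("%+v\n", ready)
}
```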
35+
36+
Do not store ownership booleans or cleanup markers in annotations or status. Ownership is deterministic:
37+
38+
- `WorkerPool` is external.
39+
- generated `ActorTemplate` is owned by the `AgentHarness` through an owner reference.
40+
41+
## Reconcile
42+
43+
The substrate reconcile path should:
44+
45+
1. Resolve `workerPoolRef` from spec or controller default.
46+
2. Verify the `WorkerPool` exists.
47+
3. Create or update the generated `ActorTemplate` with an owner reference to the `AgentHarness`.
48+
4. Wait for `ActorTemplate.status.phase == Ready`.
49+
5. Create or resume the actor through `ate-api`.
50+
6. Mark `ActorReady` and aggregate `Ready`.
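The reconcile steps above can be sketched as a single function over a small interface. The `cluster` interface, `fakeCluster`, and the returned actor ID format are assumptions made for this example; the actual reconciler talks to the Kubernetes API and `ate-api` and is not shown in this part of the commit.

```go
package main

import (
	"errors"
	"fmt"
)

// cluster is a hypothetical facade over the Kubernetes and ate-api calls
// the reconcile path needs.
type cluster interface {
	WorkerPoolExists(name string) bool
	ApplyActorTemplate(harness, workerPool string) (phase string, err error)
	CreateOrResumeActor(harness string) (actorID string, err error)
}

var errNotReady = errors.New("ActorTemplate not Ready yet; requeue")

// reconcile follows the documented order: resolve the pool, verify it,
// apply the ActorTemplate, wait for phase Ready, then create or resume.
func reconcile(c cluster, harness, specPool, defaultPool string) (string, error) {
	pool := specPool
	if pool == "" {
		pool = defaultPool
	}
	if pool == "" {
		return "", errors.New("no workerPoolRef and no controller default")
	}
	if !c.WorkerPoolExists(pool) {
		return "", fmt.Errorf("WorkerPool %q not found", pool)
	}
	phase, err := c.ApplyActorTemplate(harness, pool)
	if err != nil {
		return "", err
	}
	if phase != "Ready" {
		return "", errNotReady
	}
	return c.CreateOrResumeActor(harness)
}

// fakeCluster is an in-memory implementation used only for the demo.
type fakeCluster struct {
	pools map[string]bool
	phase string
}

func (f *fakeCluster) WorkerPoolExists(name string) bool { return f.pools[name] }
func (f *fakeCluster) ApplyActorTemplate(harness, pool string) (string, error) {
	return f.phase, nil
}
func (f *fakeCluster) CreateOrResumeActor(harness string) (string, error) {
	return "actor-" + harness, nil
}

func main() {
	c := &fakeCluster{pools: map[string]bool{"kagent-default": true}, phase: "Ready"}
	id, err := reconcile(c, "peterj-claw", "", "kagent-default")
	fmt.Println(id, err)
}
```

Returning a sentinel error for the "not Ready yet" case lets a caller distinguish a requeue from a hard failure; a real controller would more likely return a requeue result.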
51+
52+
## Delete
53+
54+
The finalizer should:
55+
56+
1. Delete the harness actor recorded in `status.backendRef.id`.
57+
2. Read the generated `ActorTemplate` and delete `status.goldenActorID`, if present.
58+
3. Remove the finalizer.
59+
60+
Kubernetes garbage collection deletes the generated `ActorTemplate` through the owner reference. kagent does not delete `WorkerPool`.
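A minimal sketch of the finalizer order, assuming a hypothetical `actorDeleter` client and an invented finalizer name. Skipping empty IDs is an assumption of this example; the document only marks the golden actor as optional. The sketch never touches the `ActorTemplate` or `WorkerPool` objects, which matches the ownership rules above.

```go
package main

import "fmt"

// actorDeleter is a hypothetical stand-in for the ate-api client.
type actorDeleter interface {
	DeleteActor(id string) error
}

// harnessState holds the fields the finalizer reads.
type harnessState struct {
	BackendRefID  string // status.backendRef.id on the AgentHarness
	GoldenActorID string // status.goldenActorID on the ActorTemplate, may be empty
	Finalizers    []string
}

// finalizerName is invented for illustration.
const finalizerName = "example.kagent.dev/substrate-cleanup"

// finalize deletes the harness actor, then the golden actor if present,
// and only then removes the finalizer. Any error leaves the finalizer in
// place so the delete is retried.
func finalize(d actorDeleter, h *harnessState) error {
	for _, id := range []string{h.BackendRefID, h.GoldenActorID} {
		if id == "" {
			continue
		}
		if err := d.DeleteActor(id); err != nil {
			return err
		}
	}
	kept := h.Finalizers[:0]
	for _, f := range h.Finalizers {
		if f != finalizerName {
			kept = append(kept, f)
		}
	}
	h.Finalizers = kept
	return nil
}

// recorder remembers which actor IDs were deleted, for the demo.
type recorder struct{ deleted []string }

func (r *recorder) DeleteActor(id string) error {
	r.deleted = append(r.deleted, id)
	return nil
}

func main() {
	r := &recorder{}
	h := &harnessState{BackendRefID: "actor-1", GoldenActorID: "golden-1", Finalizers: []string{finalizerName}}
	fmt.Println(finalize(r, h), r.deleted, h.Finalizers)
}
```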
Lines changed: 144 additions & 0 deletions
@@ -0,0 +1,144 @@
1+
# OpenClaw on Agent Substrate
2+
3+
## 1. Install Substrate on your Kind cluster
4+
5+
Clone the [kagent fork of Substrate](https://github.com/kagent-dev/substrate).
6+
7+
These instructions use a Kind cluster called `kind` (`KIND_CLUSTER_NAME=kind`).
8+
9+
```bash
10+
cd substrate
11+
12+
./hack/create-kind-cluster.sh
13+
./hack/install-ate-kind.sh --deploy-ate-system
14+
```
15+
16+
`--deploy-ate-system` installs the **control plane only** (ate-api, ate-controller, atelet, atenet, …). Your registry catalog will show `ateapi-*`, `atelet-*`, etc., but **not** ateom until you build it.
17+
18+
Build and push **ateom-gvisor** (required for the WorkerPool `ateomImage`):
19+
20+
```bash
21+
# build the ateom-gvisor image from the substrate repo root
22+
export KO_DOCKER_REPO=localhost:5001
23+
export KO_DEFAULTPLATFORMS=linux/$(go env GOARCH)
24+
./hack/run-tool.sh ko build -B ./cmd/ateom-gvisor
25+
```
26+
27+
## 2. kagent AgentHarness with substrate runtime
28+
29+
kagent generates a per-harness `ActorTemplate` and uses an existing `WorkerPool`.
30+
31+
Install kagent (Substrate must already be running in the cluster):
32+
33+
```bash
34+
export KIND_CLUSTER_NAME=kind
35+
make helm-install KAGENT_HELM_EXTRA_ARGS="\
36+
--set controller.substrate.enabled=true \
37+
--set controller.substrate.ateApiEndpoint=dns:///api.ate-system.svc:443 \
38+
--set controller.substrate.ateApiInsecure=true \
39+
--set substrateWorkerPool.create=true \
40+
--set substrateWorkerPool.ateomImage=localhost:5001/ateom-gvisor:latest"
41+
```
42+
43+
The generated `ActorTemplate` uses `controller.substrate.pauseImage`, `controller.substrate.runscAMD64URL`, `controller.substrate.runscAMD64SHA256`, `controller.substrate.runscARM64URL`, and `controller.substrate.runscARM64SHA256` from the Helm values. Override them with `--set` or a values file when you need to pin a different gVisor build.
44+
45+
Create a harness. If `snapshotsConfig` is omitted, kagent defaults it to `gs://ate-snapshots/<namespace>/<agentharnessname>`. Each harness also needs:
46+
47+
- **Worker pool** — reference an existing pool (`workerPoolRef`) or configure a controller default WorkerPool
48+
- **Gateway token** — required per harness with either `gatewayToken` or `gatewayTokenSecretRef`
49+
50+
```yaml
51+
apiVersion: kagent.dev/v1alpha2
52+
kind: AgentHarness
53+
metadata:
54+
name: peterj-claw
55+
namespace: kagent
56+
spec:
57+
runtime: substrate
58+
backend: openclaw
59+
description: OpenClaw on Agent Substrate
60+
modelConfigRef: default-model-config
61+
substrate:
62+
# Optional: defaults to gs://ate-snapshots/kagent/peterj-claw
63+
# snapshotsConfig:
64+
# location: gs://ate-snapshots/kagent/peterj-claw
65+
66+
# Required unless the controller has a default WorkerPool configured.
67+
workerPoolRef:
68+
name: kagent-default
69+
70+
# Required: configure the OpenClaw gateway token for this harness.
71+
# Use either gatewayToken or gatewayTokenSecretRef. The Secret must contain key "token".
72+
gatewayToken: test-token
73+
74+
# gatewayTokenSecretRef:
75+
# name: openclaw-gateway-token
76+
77+
# Optional: override the sandbox image used in the ActorTemplate (must be digest-pinned).
78+
# workloadImage: ghcr.io/kagent-dev/nemoclaw/sandbox-base@sha256:d52bee415dc4c0dba7164f9eabe727574c056d4f211781f20af249707883a3b4
79+
```
80+
81+
kagent creates an `ActorTemplate` that looks roughly like this:
82+
83+
```yaml
84+
apiVersion: ate.dev/v1alpha1
85+
kind: ActorTemplate
86+
metadata:
87+
name: peterj-claw
88+
namespace: kagent
89+
labels:
90+
app.kubernetes.io/managed-by: kagent
91+
kagent.dev/agent-harness: peterj-claw
92+
spec:
93+
pauseImage: gcr.io/gke-release/pause@sha256:bcbd57ba5653580ec647b16d8163cdd1112df3609129b01f912a8032e48265da
94+
runsc:
95+
amd64:
96+
url: gs://gvisor/releases/nightly/2026-05-19/x86_64/runsc
97+
sha256Hash: a397be1abc2420d26bce6c70e6e2ff96c73aaaab929756c56f5e2089ea842b63
98+
arm64:
99+
url: gs://gvisor/releases/nightly/2026-05-19/aarch64/runsc
100+
sha256Hash: 1ba2366ae2efceba166046f51a4104f9261c9cb72c6db8f5b3fe2dc57dea86b9
101+
workerPoolRef:
102+
name: kagent-default
103+
namespace: kagent
104+
snapshotsConfig:
105+
location: gs://ate-snapshots/kagent/peterj-claw
106+
containers:
107+
- name: openclaw
108+
image: ghcr.io/kagent-dev/nemoclaw/sandbox-base@sha256:d52bee415dc4c0dba7164f9eabe727574c056d4f211781f20af249707883a3b4
109+
ports:
110+
- containerPort: 80
111+
command:
112+
- /bin/sh
113+
- -c
114+
- |
115+
# Generated by kagent:
116+
# 1. writes ~/.openclaw/openclaw.json from modelConfigRef/channels/gateway token
117+
# 2. configures gateway.controlUi.basePath for the kagent proxy path
118+
# 3. starts `openclaw gateway run --port 80 --allow-unconfigured`
119+
# 4. waits for the gateway and tails the log
120+
env:
121+
- name: HOME
122+
value: /root
123+
```
124+
125+
The generated `command` contains a base64-encoded `openclaw.json`, so the live object will be more verbose than the abbreviated example above. `pauseImage`, runsc URLs and hashes, and the default workload image come from controller/Helm configuration unless overridden on the `AgentHarness`; the gateway token comes from `spec.substrate.gatewayToken` or `gatewayTokenSecretRef`. kagent also sets `gateway.controlUi.basePath` to `/api/agentharnesses/<namespace>/<name>/gateway` so OpenClaw serves the Control UI under the same path kagent proxies.
126+
127+
When `modelConfigRef` or `spec.channels` are set, credentials are **not** copied into the ActorTemplate or `openclaw.json` as plaintext. kagent writes `valueFrom.secretKeyRef` on the ActorTemplate container env (or an inline `value` for tokens set inline on the harness); Substrate `ate-api` resolves those refs at actor resume. In `openclaw.json`, kagent uses OpenClaw [env SecretRefs](https://docs.openclaw.ai/gateway/secrets) (`{source:"env",provider:"default",id:"<VAR>"}`) for `models.providers.*.apiKey`, `channels.telegram.accounts.*.botToken`, and `channels.slack.accounts.*.botToken` / `appToken`. When keys change, rotate the Secret and recreate the ActorTemplate golden snapshot.
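To make the env SecretRef shape concrete, here is a small Go sketch that renders the `{source, provider, id}` object as JSON. The `envSecretRef` type, the helper names, and the `OPENAI_API_KEY` variable are illustrative assumptions; only the three field names and the `env` / `default` values come from the text above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// envSecretRef models the env SecretRef object described above.
type envSecretRef struct {
	Source   string `json:"source"`
	Provider string `json:"provider"`
	ID       string `json:"id"`
}

// newEnvSecretRef points a config field at an environment variable
// instead of embedding the secret value.
func newEnvSecretRef(envVar string) envSecretRef {
	return envSecretRef{Source: "env", Provider: "default", ID: envVar}
}

// render returns the JSON form that would sit in place of a plaintext key.
func render(ref envSecretRef) string {
	out, err := json.Marshal(ref)
	if err != nil {
		return ""
	}
	return string(out)
}

func main() {
	// OPENAI_API_KEY is a hypothetical variable name.
	fmt.Println(render(newEnvSecretRef("OPENAI_API_KEY")))
}
```

The container env entry with the same variable name would then carry the `secretKeyRef`, so the secret value itself never appears in the rendered config.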
128+
129+
With `controller.substrate.enabled=true`, the kagent Helm chart installs a namespace-scoped Role and RoleBinding so `ate-api-server` (in `ate-system` by default) can `get` Secrets and ConfigMaps referenced by generated ActorTemplates. Harnesses in other namespaces need that namespace listed in `rbac.namespaces` (or a matching RoleBinding applied manually).
130+
131+
Port-forward the UI:
132+
133+
```bash
134+
kubectl port-forward -n kagent svc/kagent-ui 8001:8080
135+
```
136+
137+
Navigate to the deployed agent harness. If the OpenClaw Control UI asks for a gateway connection, use:
138+
139+
- Gateway URL: `http://localhost:8001/api/agentharnesses/kagent/peterj-claw/gateway/`
140+
- Gateway token: `test-token`
141+
142+
The gateway URL must include the trailing slash. The token is the value configured in `spec.substrate.gatewayToken`, or the Secret value referenced by `spec.substrate.gatewayTokenSecretRef`; enter it in the token/credentials field rather than relying on a `token` query parameter.
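Because the trailing slash is easy to drop, a helper that builds the gateway URL can guarantee it. This is a sketch; `gatewayURL` is a hypothetical helper, and only the path layout comes from the text above.

```go
package main

import (
	"fmt"
	"strings"
)

// gatewayURL joins the UI base URL with the per-harness gateway path and
// always ends with a trailing slash.
func gatewayURL(base, namespace, name string) string {
	return fmt.Sprintf("%s/api/agentharnesses/%s/%s/gateway/",
		strings.TrimRight(base, "/"), namespace, name)
}

func main() {
	fmt.Println(gatewayURL("http://localhost:8001", "kagent", "peterj-claw"))
}
```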
143+
144+
kagent proxies UI traffic to the actor OpenClaw gateway through Substrate's **atenet-router** (Envoy) using the actor `Host` header (`<actor-id>.actors.resources.substrate.ate.dev`). The default router URL is `http://atenet-router.ate-system.svc:80`; override with `controller.substrate.atenetRouterURL` when needed.
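The Host-based routing can be illustrated with Go's standard `net/http` package: the request is sent to the router URL, while the `Host` header names the actor. `routeToActor` and the actor ID `abc123` are assumptions for this sketch; the actual proxy code is not shown in this part of the commit.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// actorHostSuffix is the documented suffix of the actor Host header.
const actorHostSuffix = ".actors.resources.substrate.ate.dev"

// actorHost returns the Host header value for an actor ID.
func actorHost(actorID string) string {
	return actorID + actorHostSuffix
}

// routeToActor builds a request whose URL targets the router while the
// Host header identifies the actor the router should forward to.
func routeToActor(routerURL, actorID, path string) (*http.Request, error) {
	u, err := url.Parse(routerURL)
	if err != nil {
		return nil, err
	}
	u.Path = path
	req, err := http.NewRequest(http.MethodGet, u.String(), nil)
	if err != nil {
		return nil, err
	}
	// On client requests, req.Host overrides the Host header sent on the wire.
	req.Host = actorHost(actorID)
	return req, nil
}

func main() {
	req, err := routeToActor("http://atenet-router.ate-system.svc:80", "abc123", "/")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.URL.String(), req.Host)
}
```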

go/api/config/crd/bases/kagent.dev_agentharnesses.yaml

Lines changed: 76 additions & 0 deletions
@@ -19,6 +19,9 @@ spec:
1919
scope: Namespaced
2020
versions:
2121
- additionalPrinterColumns:
22+
- jsonPath: .spec.runtime
23+
name: Runtime
24+
type: string
2225
- jsonPath: .spec.backend
2326
name: Backend
2427
type: string
@@ -511,6 +514,75 @@ spec:
511514
type: string
512515
type: array
513516
type: object
517+
runtime:
518+
default: openshell
519+
description: Runtime selects the harness provisioning stack. Defaults
520+
to openshell when unset.
521+
enum:
522+
- openshell
523+
- substrate
524+
type: string
525+
substrate:
526+
description: Substrate is required when runtime is substrate.
527+
properties:
528+
gatewayToken:
529+
description: |-
530+
GatewayToken is the OpenClaw gateway Bearer token for this harness.
531+
Prefer gatewayTokenSecretRef for production secrets.
532+
minLength: 1
533+
type: string
534+
gatewayTokenSecretRef:
535+
description: |-
536+
GatewayTokenSecretRef references a Secret key holding the OpenClaw gateway Bearer token.
537+
The Secret must contain a "token" key.
538+
properties:
539+
apiGroup:
540+
type: string
541+
kind:
542+
type: string
543+
name:
544+
type: string
545+
required:
546+
- name
547+
type: object
548+
snapshotsConfig:
549+
description: |-
550+
SnapshotsConfig configures actor memory snapshots. Defaults to
551+
gs://ate-snapshots/<namespace>/<agentharnessname> when unset.
552+
properties:
553+
location:
554+
description: |-
555+
Location is the GCS URI prefix for golden and incremental snapshots.
556+
Example: gs://ate-snapshots/kagent/my-namespace/my-harness/
557+
pattern: ^gs://
558+
type: string
559+
required:
560+
- location
561+
type: object
562+
workerPoolRef:
563+
description: |-
564+
WorkerPoolRef references an existing ate.dev WorkerPool in the harness namespace.
565+
When unset, the controller uses its configured default WorkerPool.
566+
properties:
567+
apiGroup:
568+
type: string
569+
kind:
570+
type: string
571+
name:
572+
type: string
573+
required:
574+
- name
575+
type: object
576+
workloadImage:
577+
description: WorkloadImage overrides the default nemoclaw/openclaw
578+
sandbox image in the ActorTemplate.
579+
type: string
580+
type: object
581+
x-kubernetes-validations:
582+
- message: Exactly one of gatewayToken or gatewayTokenSecretRef must
583+
be specified
584+
rule: (has(self.gatewayToken) && !has(self.gatewayTokenSecretRef))
585+
|| (!has(self.gatewayToken) && has(self.gatewayTokenSecretRef))
514586
required:
515587
- backend
516588
type: object
@@ -520,6 +592,10 @@ spec:
520592
|| (has(c.slack) && ((self.backend == ''hermes'' && has(c.slack.hermes)
521593
&& !has(c.slack.openclaw)) || ((self.backend == ''openclaw'' || self.backend
522594
== ''nemoclaw'') && has(c.slack.openclaw) && !has(c.slack.hermes)))))'
595+
- message: spec.substrate may only be set when runtime is substrate
596+
rule: '!has(self.substrate) || self.runtime == ''substrate'''
597+
- message: spec.substrate is required when runtime is substrate
598+
rule: self.runtime != 'substrate' || has(self.substrate)
523599
status:
524600
description: AgentHarnessStatus is the observed state of an AgentHarness.
525601
properties:
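The two new top-level CEL rules tie `spec.substrate` to `spec.runtime`. A Go mirror of that logic, shown only to make the truth table explicit, might look like the following. `validateRuntime` is a hypothetical helper; the API server enforces the real rules, and the empty-runtime branch reflects the CRD default of `openshell`.

```go
package main

import (
	"errors"
	"fmt"
)

// validateRuntime mirrors the two CEL rules: substrate settings are only
// allowed with runtime "substrate", and are required when it is selected.
func validateRuntime(runtime string, hasSubstrate bool) error {
	if runtime == "" {
		runtime = "openshell" // CRD default when unset
	}
	if hasSubstrate && runtime != "substrate" {
		return errors.New("spec.substrate may only be set when runtime is substrate")
	}
	if runtime == "substrate" && !hasSubstrate {
		return errors.New("spec.substrate is required when runtime is substrate")
	}
	return nil
}

func main() {
	fmt.Println(validateRuntime("substrate", true))
	fmt.Println(validateRuntime("openshell", true))
	fmt.Println(validateRuntime("substrate", false))
}
```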

go/api/httpapi/substrate.go

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
1+
package httpapi
2+
3+
// SubstrateStatusResponse aggregates Agent Substrate control-plane and Kubernetes state.
4+
type SubstrateStatusResponse struct {
5+
// Enabled is true when the controller is configured with an ate-api endpoint.
6+
Enabled bool `json:"enabled"`
7+
// AteAPIError is set when ate-api list calls fail (actors/workers may be partial or empty).
8+
AteAPIError string `json:"ateApiError,omitempty"`
9+
10+
WorkerPools []SubstrateWorkerPoolEntry `json:"workerPools"`
11+
ActorTemplates []SubstrateActorTemplateEntry `json:"actorTemplates"`
12+
Actors []SubstrateActorEntry `json:"actors"`
13+
Workers []SubstrateWorkerEntry `json:"workers"`
14+
}
15+
16+
// SubstrateWorkerPoolEntry is an ate.dev WorkerPool CR.
17+
type SubstrateWorkerPoolEntry struct {
18+
Namespace string `json:"namespace"`
19+
Name string `json:"name"`
20+
Replicas int32 `json:"replicas"`
21+
AteomImage string `json:"ateomImage"`
22+
}
23+
24+
// SubstrateActorTemplateEntry is an ate.dev ActorTemplate CR.
25+
type SubstrateActorTemplateEntry struct {
26+
Namespace string `json:"namespace"`
27+
Name string `json:"name"`
28+
Phase string `json:"phase,omitempty"`
29+
GoldenActorID string `json:"goldenActorId,omitempty"`
30+
GoldenSnapshot string `json:"goldenSnapshot,omitempty"`
31+
WorkerPoolRef string `json:"workerPoolRef,omitempty"`
32+
HarnessName string `json:"harnessName,omitempty"`
33+
ManagedByKagent bool `json:"managedByKagent"`
34+
}
35+
36+
// SubstrateActorEntry is runtime state from ate-api (Redis).
37+
type SubstrateActorEntry struct {
38+
ActorID string `json:"actorId"`
39+
Status string `json:"status"`
40+
ActorTemplateNamespace string `json:"actorTemplateNamespace,omitempty"`
41+
ActorTemplateName string `json:"actorTemplateName,omitempty"`
42+
AteomPodNamespace string `json:"ateomPodNamespace,omitempty"`
43+
AteomPodName string `json:"ateomPodName,omitempty"`
44+
AteomPodIP string `json:"ateomPodIp,omitempty"`
45+
LastSnapshot string `json:"lastSnapshot,omitempty"`
46+
InProgressSnapshot string `json:"inProgressSnapshot,omitempty"`
47+
Version int64 `json:"version,omitempty"`
48+
}
49+
50+
// SubstrateWorkerEntry is a worker assignment from ate-api (Redis).
51+
type SubstrateWorkerEntry struct {
52+
WorkerNamespace string `json:"workerNamespace"`
53+
WorkerPool string `json:"workerPool"`
54+
WorkerPod string `json:"workerPod"`
55+
ActorNamespace string `json:"actorNamespace,omitempty"`
56+
ActorTemplate string `json:"actorTemplate,omitempty"`
57+
ActorID string `json:"actorId,omitempty"`
58+
IP string `json:"ip,omitempty"`
59+
Version int64 `json:"version,omitempty"`
60+
}
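A consumer of `SubstrateStatusResponse` has to handle three cases: Substrate disabled, a partial result flagged by `ateApiError`, and a full result. The sketch below decodes a trimmed copy of the response to show that branching. The local `statusResponse` types repeat only a few fields, `summarize` is a hypothetical helper, and the sample payload is invented; the HTTP route that serves this response is not shown in this part of the commit.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// workerPoolEntry repeats the JSON shape of SubstrateWorkerPoolEntry.
type workerPoolEntry struct {
	Namespace  string `json:"namespace"`
	Name       string `json:"name"`
	Replicas   int32  `json:"replicas"`
	AteomImage string `json:"ateomImage"`
}

// statusResponse is a trimmed copy of SubstrateStatusResponse.
type statusResponse struct {
	Enabled     bool              `json:"enabled"`
	AteAPIError string            `json:"ateApiError,omitempty"`
	WorkerPools []workerPoolEntry `json:"workerPools"`
}

// summarize decodes the payload and reports which of the three cases applies.
func summarize(raw []byte) (string, error) {
	var r statusResponse
	if err := json.Unmarshal(raw, &r); err != nil {
		return "", err
	}
	switch {
	case !r.Enabled:
		return "substrate disabled", nil
	case r.AteAPIError != "":
		// Actors and workers may be partial or empty in this case.
		return "partial: " + r.AteAPIError, nil
	}
	return fmt.Sprintf("%d worker pool(s)", len(r.WorkerPools)), nil
}

func main() {
	// Invented sample payload for illustration.
	raw := []byte(`{"enabled":true,"workerPools":[{"namespace":"kagent","name":"kagent-default","replicas":1,"ateomImage":"localhost:5001/ateom-gvisor:latest"}],"actorTemplates":[],"actors":[],"workers":[]}`)
	fmt.Println(summarize(raw))
}
```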

go/api/httpapi/types.go

Lines changed: 12 additions & 0 deletions
@@ -144,6 +144,17 @@ type OpenshellAgentHarnessListEntry struct {
144144
Endpoint string `json:"endpoint,omitempty"`
145145
}
146146

147+
// SubstrateAgentHarnessListEntry is set when runtime is substrate.
148+
type SubstrateAgentHarnessListEntry struct {
149+
Backend v1alpha2.AgentHarnessBackendType `json:"backend"`
150+
Runtime v1alpha2.AgentHarnessRuntime `json:"runtime"`
151+
ActorID string `json:"actorId,omitempty"`
152+
GatewayUIPath string `json:"gatewayUIPath,omitempty"`
153+
ModelConfigRef string `json:"modelConfigRef,omitempty"`
154+
BackendRefID string `json:"backendRefId,omitempty"`
155+
Endpoint string `json:"endpoint,omitempty"`
156+
}
157+
147158
type AgentResponse struct {
148159
ID string `json:"id"`
149160
Agent *AgentResource `json:"agent"`
@@ -157,6 +168,7 @@ type AgentResponse struct {
157168
Accepted bool `json:"accepted"`
158169
WorkloadMode v1alpha2.WorkloadMode `json:"workloadMode,omitempty"`
159170
OpenshellAgentHarness *OpenshellAgentHarnessListEntry `json:"openshellAgentHarness,omitempty"`
171+
SubstrateAgentHarness *SubstrateAgentHarnessListEntry `json:"substrateAgentHarness,omitempty"`
160172
}
161173

162174
// Session types
