docs/design/multi-agent-runtime-proposal.md
Lines changed: 14 additions & 13 deletions
@@ -236,12 +236,12 @@ The environment variables injected into the dependent pod's containers point to
**Injection Scope:**
* The dependency endpoints are injected into the `Env` list of **all containers** (including primary, sidecar, and init-containers) defined in the pod spec. This ensures that any multi-container runtime configuration can reliably resolve the endpoints.
-
For a role with `dependencies: [my-planner]` in namespace `default` (where `my-planner` maps to service name `grp-xyz-my-planner` and exposes a port named `api` at `8080` and `metrics` at `9090`), the dependent pod's containers receive:
+
For a role with `dependencies: [my-planner]` in namespace `default` (where `my-planner` maps to service name `mar-abcdef12-my-planner` and exposes a port named `api` at `8080` and `metrics` at `9090`), the dependent pod's containers receive:
Injection happens in-memory inside `createSandboxGroup()` by mutating the pod template before it is passed to `buildSandboxByAgentRuntime()`. The referenced `AgentRuntime` CRD object in the informer cache is never written.
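A minimal sketch of the in-memory injection described above, assuming simplified stand-ins for the Kubernetes pod types. `withInjectedEnv`, `copyContainers`, the variable name `EXAMPLE_DEP_ENDPOINT`, and the endpoint value format are hypothetical and chosen only for illustration; the proposal's actual variable names are not reproduced here. The point is that the cached template is copied before every container, init containers included, receives the extra entries.

```go
package main

import "fmt"

// Reduced stand-ins for the Kubernetes core/v1 types (assumption: only the
// fields needed for this illustration are modelled).
type EnvVar struct{ Name, Value string }

type Container struct {
	Name string
	Env  []EnvVar
}

type PodSpec struct {
	InitContainers []Container
	Containers     []Container
}

// copyContainers returns new containers whose Env is the original Env followed
// by extra. Fresh slices are allocated so the input is never written to.
func copyContainers(in []Container, extra []EnvVar) []Container {
	out := make([]Container, len(in))
	for i, c := range in {
		env := make([]EnvVar, 0, len(c.Env)+len(extra))
		env = append(env, c.Env...)
		env = append(env, extra...)
		out[i] = Container{Name: c.Name, Env: env}
	}
	return out
}

// withInjectedEnv (hypothetical helper) injects extra into every container,
// init containers included, and leaves the cached spec untouched.
func withInjectedEnv(spec PodSpec, extra []EnvVar) PodSpec {
	return PodSpec{
		InitContainers: copyContainers(spec.InitContainers, extra),
		Containers:     copyContainers(spec.Containers, extra),
	}
}

func main() {
	cached := PodSpec{
		InitContainers: []Container{{Name: "init"}},
		Containers: []Container{
			{Name: "agent", Env: []EnvVar{{Name: "LOG_LEVEL", Value: "info"}}},
			{Name: "sidecar"},
		},
	}
	// Placeholder name and value format; not the proposal's naming scheme.
	extra := []EnvVar{{
		Name:  "EXAMPLE_DEP_ENDPOINT",
		Value: "mar-abcdef12-my-planner.default.svc.cluster.local:8080",
	}}
	injected := withInjectedEnv(cached, extra)
	fmt.Println("injected agent env entries:", len(injected.Containers[0].Env))
	fmt.Println("cached agent env entries:", len(cached.Containers[0].Env))
}
```

Allocating a fresh `Env` slice per container, rather than appending to the original, is what keeps the shared template unmodified even when the original slice has spare capacity.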
@@ -460,7 +460,7 @@ sequenceDiagram
WM->>Store: SaveAgentGroup(manifest)
WM-->>Router: CreateAgentGroupResponse
-
Router-->>Client: 200 OK + groupSessionId
+
Router-->>Client: 200 OK + CreateAgentGroupResponse
```
### Topological Sort and Cycle Detection
@@ -594,7 +594,7 @@ type AgentGroupRole struct {
### Store Interface Additions
-
Four new methods are added to the `Store` interface in `pkg/store/interface.go`. All existing methods are unchanged.
+
Five new methods are added to the `Store` interface in `pkg/store/interface.go`. All existing methods are unchanged.
```go
// SaveAgentGroup persists a group manifest keyed by groupSessionID.
@@ -795,11 +795,12 @@ The existing GC in `pkg/workloadmanager/garbage_collection.go` is extended with
Because the Router only proxies external traffic directly to the coordinator, only the coordinator's `LastActivityAt` timestamp in the store is updated during active sessions. Internal worker sandboxes that receive no direct external traffic would otherwise retain static `LastActivityAt` values, causing the GC to prematurely delete them while the coordinator is still active.
To prevent this, the GC evaluates idle timeouts group-wide:
-
1. When checking if a sandbox is idle, if its `GroupSessionID` is non-empty, the GC retrieves the group manifest once per GC cycle (cached in a `map[string]*AgentGroupManifest` local to the cycle) and looks up the coordinator sandbox from the manifest.
-
2. The idle duration for **all members of the group** is calculated based on the coordinator's `LastActivityAt` timestamp (or the maximum `LastActivityAt` among all group member sandboxes if the coordinator's timestamp is unavailable).
-
3. Individual sandboxes in a group are only deleted for inactivity if the group as a whole is determined to be idle.
+
1. When checking if a sandbox is idle, if its `GroupSessionID` is non-empty, the GC retrieves the group manifest once per GC cycle via `GetAgentGroup()` (result cached in a `map[string]*AgentGroupManifest` local to that cycle). The manifest's `role:*` fields contain the `SessionID` of each member, including the coordinator.
+
2. The coordinator's `SandboxInfo` (including `LastActivityAt`) is fetched with a single `GetSandbox(coordinatorSessionID)` call. This result is also cached per group per cycle, so it is only fetched once regardless of how many worker sandboxes belong to that group. If the coordinator's `SandboxInfo` is unavailable (e.g., already evicted), the GC falls back to the maximum `LastActivityAt` among all group member sandboxes whose `SandboxInfo` can be retrieved.
+
3. The idle duration for **all members of the group** is calculated based on the resolved coordinator (or fallback) `LastActivityAt` timestamp.
+
4. Individual sandboxes in a group are only deleted for inactivity if the group as a whole is determined to be idle.
-
Caching the manifest per group per GC cycle avoids O(N) redundant store lookups where N is the number of worker sandboxes in the group.
+
Caching both the group manifest and the coordinator `SandboxInfo` per group per GC cycle reduces store round trips to O(1) per group when the coordinator's record is available, instead of one lookup for each of the group's N member sandboxes.
### Group Metadata Cleanup
@@ -908,9 +909,9 @@ This feature is fully backward compatible. No existing behavior changes unless t
| `pkg/workloadmanager/server.go` | Add 3 new routes under `/v1/multi-agent-runtime` |
| `pkg/workloadmanager/garbage_collection.go` | Group manifest cleanup when last member sandbox is GC'd |
@@ -933,7 +934,7 @@ Deliverables that satisfy the mentorship expected outcomes on their own.
- Role names must be valid DNS label fragments (lowercase alphanumeric and hyphens, max 63 characters).
- Implement `createSandboxGroup()` with `Atomic` rollback (no `BestEffort` yet).
- Add `GroupSessionID` + `Role` to `SandboxInfo`; propagate through `buildSandboxPlaceHolder()` + `buildSandboxInfo()`.
-
- Implement all 4 store methods in `store_redis.go` + `store_valkey.go` with full unit test coverage.
+
- Implement all 5 store methods in `store_redis.go` + `store_valkey.go` with full unit test coverage.
- Add `MultiAgentRuntimeKind` to Router endpoint switch.
- Extend GC to clean up `agentgroup:` manifest keys when last member sandbox is deleted.
- Unit tests: `createSandboxGroup()` with atomic rollback on partial failure, store CRUD, coordinator validation, cycle detection, admission webhook validation.
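
The role-name rule in the list above (lowercase alphanumeric characters and hyphens, at most 63 characters) could be checked with something like the sketch below. `validRoleName` is a hypothetical helper, not a function named in the proposal. Whether a name may begin or end with a hyphen is not stated in the rule, so this sketch does not enforce it.

```go
package main

import "fmt"

// validRoleName (hypothetical) applies the stated rule literally:
// 1 to 63 bytes, each one of a-z, 0-9, or '-'.
// Assumption: leading and trailing hyphens are not rejected here.
func validRoleName(s string) bool {
	if len(s) == 0 || len(s) > 63 {
		return false
	}
	for i := 0; i < len(s); i++ {
		c := s[i]
		isLower := c >= 'a' && c <= 'z'
		isDigit := c >= '0' && c <= '9'
		if !isLower && !isDigit && c != '-' {
			return false
		}
	}
	return true
}

func main() {
	for _, name := range []string{"my-planner", "My-Planner", "worker_1", ""} {
		fmt.Printf("%q valid: %v\n", name, validRoleName(name))
	}
}
```

Because only ASCII bytes are accepted, the byte length checked by `len` equals the character count for any name that passes.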