@@ -391,11 +411,12 @@ func (s *Server) createSandboxGroup(
 
 	createdMutex.Lock()
 	created = append(created, createdRole{
-		name:      role.Name,
-		resp:      resp,
-		sandbox:   sandbox,
-		sessionID: sandboxEntry.SessionID,
-		failed:    false,
+		name:       role.Name,
+		resp:       resp,
+		sandbox:    sandbox,
+		sessionID:  sandboxEntry.SessionID,
+		serviceDNS: svcDNS,
+		failed:     false,
 	})
 	createdMutex.Unlock()
 	return nil
@@ -421,10 +442,11 @@ func (s *Server) createSandboxGroup(
 
 - **Parallel Sandbox Creation:** To prevent HTTP gateway or client timeouts when launching large groups, roles that do not share mutual dependencies (i.e., reside at the same level of the dependency DAG) are created in parallel. Sandbox creation proceeds in "dependency waves": all sandboxes within a wave are launched concurrently, and the server waits for all to be ready before proceeding to the next dependent wave.
 - **Consistent TTL:** A single `baseTime` is captured before the creation loop begins and used for all `ShutdownTime` calculations. This ensures every sandbox in the group shares a synchronized absolute TTL, regardless of how long it takes to create each wave.
-- The deferred rollback calls the existing `rollbackSandboxCreation()` function, without modification, for every sandbox in `created`.
-- Roles are created in topological order. A dependency's endpoint is guaranteed to be in `created` before the dependent role's sandbox is built.
+- **Headless Service per role:** Before each sandbox is created, `createHeadlessServiceForRole()` provisions a Headless Service whose DNS name (`serviceDNS`) is stored in `createdRole`. The `injectDependencyEndpoints()` function reads `serviceDNS` from `created` to construct stable DNS-based environment variables. On rollback, all created Services are explicitly deleted alongside their sandboxes.
+- The deferred rollback calls `rollbackSandboxCreation()` for every sandbox in `created` **and** deletes all Headless Services tracked in `createdServices`, preventing resource leaks on partial failure.
+- Roles are created in topological order. A dependency's `serviceDNS` is guaranteed to be in `created` before the dependent role's sandbox is built.
 - `buildSandboxByAgentRuntime()`, `buildSandboxByCodeInterpreter()`, `createSandbox()`, `WatchSandboxOnce()`, and `rollbackSandboxCreation()` are all called as-is. The correct builder is selected by the `role.Kind` field.
-- The `needGroupRollback` flag is only cleared after `SaveAgentGroup` succeeds. A store failure after all sandboxes are created will roll back the Kubernetes resources, maintaining consistency between the cluster state and the store.
+- The `needGroupRollback` flag is only cleared after `SaveAgentGroup` succeeds. A store failure after all sandboxes are created will roll back both Kubernetes resources and Headless Services, maintaining consistency.
 
 > [!NOTE]
 > **Future Improvement: Reconciler-Based Orchestration.** The current design executes `createSandboxGroup()` synchronously within the API handler. For very large groups where the cumulative creation time may approach HTTP proxy timeouts, a more resilient approach would be to have the API handler persist the `MultiAgentRuntime` CRD with a `Creating` status and delegate the actual sandbox orchestration to the `MultiAgentRuntimeReconciler`. This ensures the system can recover and resume group creation even if the Workload Manager restarts mid-process. This is left as a future optimization since the wave-based parallelism already significantly reduces total startup latency for practical group sizes.
-The algorithm is Kahn's BFS-based topological sort grouped into level-order waves, O(V+E). Cycle detection is derived from the invariant that Kahn's algorithm only produces a complete ordering when no cycle exists. If `totalSorted < len(roles)`, the roles with remaining in-degree are in a cycle or have missing dependencies. Their names are included in the error message to aid debugging.
+The algorithm is Kahn's BFS-based topological sort grouped into level-order waves, O(V+E). Missing dependencies are validated **upfront** before the sort begins, ensuring clear error messages. Cycle detection is then derived from the invariant that Kahn's algorithm only produces a complete ordering when no cycle exists. If `totalSorted < len(roles)` after the upfront check passes, the remaining roles must form a cycle. Their names are included in the error message to aid debugging.
 
 ---
 
@@ -549,7 +574,7 @@ The algorithm is Kahn's BFS-based topological sort grouped into level-order wave
 | Method | Path | Description |
 |--------|------|-------------|
 | `POST` | `/v1/multi-agent-runtime` | Create a new agent group. Returns group session ID and coordinator entrypoints. |
-| `DELETE` | `/v1/multi-agent-runtime/groups/:groupSessionId` | Delete all sandboxes in the group and remove the group manifest from the store. |
+| `DELETE` | `/v1/multi-agent-runtime/groups/:groupSessionId` | Delete all sandboxes in the group, their associated Headless Services (via OwnerReference cascading), and remove the group manifest from the store. Returns `204 No Content` on success. |
 | `GET` | `/v1/multi-agent-runtime/groups/:groupSessionId/topology` | Return the group manifest including all role endpoints and statuses. Intended for use by the coordinator at startup to discover worker endpoints. |
 
 ### Request and Response Types
@@ -567,33 +592,28 @@ type CreateAgentGroupRequest struct {
 #### Create Group Response
 
 ```go
-type CreateAgentGroupResponse struct {
-	GroupSessionID string                   `json:"groupSessionId"`
-	Roles          []AgentGroupRoleResponse `json:"roles"`
-}
-
-type AgentGroupRoleResponse struct {
+// AgentGroupRoleState is the shared type used across the API response, group manifest,
+// and topology endpoint. A single type prevents structural drift between these surfaces.
+type AgentGroupRoleState struct {
 	Name      string `json:"name"`
 	SessionID string `json:"sessionId"`
 	Endpoint  string `json:"endpoint"`
 	Status    string `json:"status"` // "ready" | "failed"
 }
+
+type CreateAgentGroupResponse struct {
+	GroupSessionID string                `json:"groupSessionId"`
+	Roles          []AgentGroupRoleState `json:"roles"`
+}
 ```
 
 #### Group Manifest (stored in Redis/Valkey)
 
 ```go
 type AgentGroupManifest struct {
-	GroupSessionID string           `json:"groupSessionId"`
-	Roles          []AgentGroupRole `json:"roles"`
-	CreatedAt      time.Time        `json:"createdAt"`
-}
-
-type AgentGroupRole struct {
-	Name      string `json:"name"`
-	SessionID string `json:"sessionId"`
-	Endpoint  string `json:"endpoint"`
-	Status    string `json:"status"` // "ready" | "failed"
+	GroupSessionID string                `json:"groupSessionId"`
+	Roles          []AgentGroupRoleState `json:"roles"`
+	CreatedAt      time.Time             `json:"createdAt"`
 }
 ```
 
@@ -671,6 +691,9 @@ type MultiAgentRuntimeSpec struct {
 
 	// SessionTimeout is the idle timeout applied to all sandboxes in the group.
 	// Defaults to 15m.
+	// NOTE: Although this is a pointer type (*metav1.Duration), kubebuilder applies the
+	// default value at admission time, so the pointer is always non-nil after defaulting.
+	// The nil check in createSandboxGroup() is a defensive guard for programmatic callers.
 | `sdk-python/agentcube/multi_agent.py` | `MultiAgentRuntimeClient` for the Python SDK |
 | `sdk-python/examples/multi_agent_usage.py` | End-to-end usage example |
 | `test/e2e/multi_agent_runtime.yaml` | E2E test fixtures |
@@ -937,7 +979,7 @@ Deliverables that satisfy the mentorship expected outcomes on their own.
 - No two roles may produce the same sanitized environment variable key (naming collision detection).
 - `dependencies[]` references must point to roles defined within the same spec.
 - Role names must be valid DNS label fragments (lowercase alphanumeric and hyphens, max 63 characters).
-- Implement `createSandboxGroup()` with `Atomic` rollback (no `BestEffort` yet).
+- Implement `createSandboxGroup()` with `Atomic` rollback (no `BestEffort` yet), including `topoSort()`, `injectDependencyEndpoints()`, and Headless Service creation per role.
 - Add `GroupSessionID` + `Role` to `SandboxInfo`; propagate through `buildSandboxPlaceHolder()` + `buildSandboxInfo()`.
 - Implement all 5 store methods in `store_redis.go` + `store_valkey.go` with full unit test coverage.
 - Add `MultiAgentRuntimeKind` to Router endpoint switch.
@@ -953,9 +995,8 @@ Deliverables that satisfy the mentorship expected outcomes on their own.
 - Group creation uses `SandboxClaim` for warm roles, cold `Sandbox` creation for others.
 - Add E2E test comparing cold-start vs warm-start group creation latency.
 
-### Phase 3 - DAG Startup and Topology (Weeks 7-8)
+### Phase 3 - Topology Endpoint and SDK (Weeks 7-8)