Skip to content

Commit 06b58b8

Browse files
committed
docs: service creation, DNS consistency, type unification, phase ordering, and CRD status
Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
1 parent fd4ad75 commit 06b58b8

1 file changed

Lines changed: 88 additions & 47 deletions

File tree

docs/design/multi-agent-runtime-proposal.md

Lines changed: 88 additions & 47 deletions
Original file line numberDiff line numberDiff line change
@@ -278,11 +278,12 @@ agentgroup:{grp-xxx} -> AgentGroupManifest{
278278
279279
```go
280280
type createdRole struct {
281-
name string
282-
resp *types.CreateSandboxResponse
283-
sandbox *sandboxv1alpha1.Sandbox
284-
sessionID string
285-
failed bool
281+
name string
282+
resp *types.CreateSandboxResponse
283+
sandbox *sandboxv1alpha1.Sandbox
284+
sessionID string
285+
serviceDNS string // stable DNS from the Headless Service (e.g., mar-abc-planner.ns.svc.cluster.local)
286+
failed bool
286287
}
287288
288289
func (s *Server) createSandboxGroup(
@@ -298,6 +299,7 @@ func (s *Server) createSandboxGroup(
298299
baseTime := time.Now()
299300
300301
needGroupRollback := true
302+
var createdServices []string // tracks Headless Service names for rollback
301303
defer func() {
302304
if !needGroupRollback {
303305
return
@@ -306,9 +308,16 @@ func (s *Server) createSandboxGroup(
306308
if c.failed {
307309
continue
308310
}
309-
// rollbackSandboxCreation is called as-is (no changes to the function).
310311
s.rollbackSandboxCreation(dynamicClient, c.sandbox, nil, c.sessionID)
311312
}
313+
// Clean up any Headless Services created during group setup.
314+
for _, svcName := range createdServices {
315+
if err := s.kubeClient.CoreV1().Services(mar.Namespace).Delete(
316+
ctx, svcName, metav1.DeleteOptions{},
317+
); err != nil && !apierrors.IsNotFound(err) {
318+
klog.Warningf("group %s: rollback service %s: %v", groupSessionID, svcName, err)
319+
}
320+
}
312321
}()
313322
314323
waves, err := topoSort(mar.Spec.Roles)
@@ -360,6 +369,17 @@ func (s *Server) createSandboxGroup(
360369
sandbox.Spec.Lifecycle.ShutdownTime = &shutdownTime
361370
}
362371
372+
// Create a Headless Service for this role to provide a stable DNS endpoint.
373+
svcName, svcDNS, err := s.createHeadlessServiceForRole(
374+
ctx, mar.Namespace, groupSessionID, role.Name, sandbox,
375+
)
376+
if err != nil {
377+
return fmt.Errorf("role %s: create headless service: %w", role.Name, err)
378+
}
379+
createdMutex.Lock()
380+
createdServices = append(createdServices, svcName)
381+
createdMutex.Unlock()
382+
363383
if len(role.Dependencies) > 0 {
364384
createdMutex.Lock()
365385
injectErr := injectDependencyEndpoints(&sandbox.Spec.PodTemplate, groupSessionID, role.Dependencies, created)
@@ -391,11 +411,12 @@ func (s *Server) createSandboxGroup(
391411
392412
createdMutex.Lock()
393413
created = append(created, createdRole{
394-
name: role.Name,
395-
resp: resp,
396-
sandbox: sandbox,
397-
sessionID: sandboxEntry.SessionID,
398-
failed: false,
414+
name: role.Name,
415+
resp: resp,
416+
sandbox: sandbox,
417+
sessionID: sandboxEntry.SessionID,
418+
serviceDNS: svcDNS,
419+
failed: false,
399420
})
400421
createdMutex.Unlock()
401422
return nil
@@ -421,10 +442,11 @@ func (s *Server) createSandboxGroup(
421442

422443
- **Parallel Sandbox Creation:** To prevent HTTP gateway or client timeouts when launching large groups, roles that do not share mutual dependencies (i.e., reside at the same level of the dependency DAG) are created in parallel. Sandbox creation proceeds in "dependency waves": all sandboxes within a wave are launched concurrently, and the server waits for all to be ready before proceeding to the next dependent wave.
423444
- **Consistent TTL:** A single `baseTime` is captured before the creation loop begins and used for all `ShutdownTime` calculations. This ensures every sandbox in the group shares a synchronized absolute TTL, regardless of how long it takes to create each wave.
424-
- The deferred rollback calls the existing `rollbackSandboxCreation()` function, without modification, for every sandbox in `created`.
425-
- Roles are created in topological order. A dependency's endpoint is guaranteed to be in `created` before the dependent role's sandbox is built.
445+
- **Headless Service per role:** Before each sandbox is created, `createHeadlessServiceForRole()` provisions a Headless Service whose DNS name (`serviceDNS`) is stored in `createdRole`. The `injectDependencyEndpoints()` function reads `serviceDNS` from `created` to construct stable DNS-based environment variables. On rollback, all created Services are explicitly deleted alongside their sandboxes.
446+
- The deferred rollback calls `rollbackSandboxCreation()` for every sandbox in `created` **and** deletes all Headless Services tracked in `createdServices`, preventing resource leaks on partial failure.
447+
- Roles are created in topological order. A dependency's `serviceDNS` is guaranteed to be in `created` before the dependent role's sandbox is built.
426448
- `buildSandboxByAgentRuntime()`, `buildSandboxByCodeInterpreter()`, `createSandbox()`, `WatchSandboxOnce()`, and `rollbackSandboxCreation()` are all called as-is. The correct builder is selected by the `role.Kind` field.
427-
- The `needGroupRollback` flag is only cleared after `SaveAgentGroup` succeeds. A store failure after all sandboxes are created will roll back the Kubernetes resources, maintaining consistency between the cluster state and the store.
449+
- The `needGroupRollback` flag is only cleared after `SaveAgentGroup` succeeds. A store failure after all sandboxes are created will roll back both Kubernetes resources and Headless Services, maintaining consistency.
428450

429451
> [!NOTE]
430452
> **Future Improvement: Reconciler-Based Orchestration.** The current design executes `createSandboxGroup()` synchronously within the API handler. For very large groups where the cumulative creation time may approach HTTP proxy timeouts, a more resilient approach would be to have the API handler persist the `MultiAgentRuntime` CRD with a `Creating` status and delegate the actual sandbox orchestration to the `MultiAgentRuntimeReconciler`. This ensures the system can recover and resume group creation even if the Workload Manager restarts mid-process. This is left as a future optimization since the wave-based parallelism already significantly reduces total startup latency for practical group sizes.
@@ -444,20 +466,23 @@ sequenceDiagram
444466
WM->>WM: topoSort(roles) -> [planner, researcher, coder]
445467
446468
Note over WM, K8s: Role 1: planner (coordinator)
469+
WM->>K8s: Create Headless Service [planner]
447470
WM->>Store: StoreSandbox(placeholder)
448471
WM->>K8s: Create Sandbox [planner]
449472
K8s-->>WM: Sandbox Ready
450473
WM->>Store: UpdateSandbox(ready)
451474
452475
Note over WM, K8s: Role 2: researcher (depends on planner)
453-
WM->>WM: injectDependencyEndpoints(planner IP)
476+
WM->>WM: injectDependencyEndpoints(planner Service DNS)
477+
WM->>K8s: Create Headless Service [researcher]
454478
WM->>Store: StoreSandbox(placeholder)
455479
WM->>K8s: Create Sandbox [researcher]
456480
K8s-->>WM: Sandbox Ready
457481
WM->>Store: UpdateSandbox(ready)
458482
459483
Note over WM, K8s: Role 3: coder (depends on planner, researcher)
460-
WM->>WM: injectDependencyEndpoints(planner IP, researcher IP)
484+
WM->>WM: injectDependencyEndpoints(planner Service DNS, researcher Service DNS)
485+
WM->>K8s: Create Headless Service [coder]
461486
WM->>Store: StoreSandbox(placeholder)
462487
WM->>K8s: Create Sandbox [coder]
463488
K8s-->>WM: Sandbox Ready
@@ -481,7 +506,15 @@ func topoSort(roles []RoleSpec) ([][]RoleSpec, error) {
481506
if _, exists := inDegree[r.Name]; !exists {
482507
inDegree[r.Name] = 0
483508
}
509+
}
510+
511+
// Validate all dependency references before running Kahn's algorithm.
512+
// This provides clear "missing dependency" errors instead of confusing cycle errors.
513+
for _, r := range roles {
484514
for _, dep := range r.Dependencies {
515+
if _, exists := roleMap[dep]; !exists {
516+
return nil, fmt.Errorf("role %s depends on non-existent role %s", r.Name, dep)
517+
}
485518
adj[dep] = append(adj[dep], r.Name)
486519
inDegree[r.Name]++
487520
}
@@ -516,15 +549,7 @@ func topoSort(roles []RoleSpec) ([][]RoleSpec, error) {
516549
}
517550

518551
if totalSorted != len(roles) {
519-
// Check for missing dependencies first to provide a better error message.
520-
for _, r := range roles {
521-
for _, dep := range r.Dependencies {
522-
if _, exists := roleMap[dep]; !exists {
523-
return nil, fmt.Errorf("role %s depends on non-existent role %s", r.Name, dep)
524-
}
525-
}
526-
}
527-
// Identify and name the roles involved in the cycle for the error message.
552+
// All dependencies are valid (checked above), so this must be a cycle.
528553
var cycled []string
529554
for name, deg := range inDegree {
530555
if deg > 0 {
@@ -538,7 +563,7 @@ func topoSort(roles []RoleSpec) ([][]RoleSpec, error) {
538563
}
539564
```
540565

541-
The algorithm is Kahn's BFS-based topological sort grouped into level-order waves, O(V+E). Cycle detection is derived from the invariant that Kahn's algorithm only produces a complete ordering when no cycle exists. If `totalSorted < len(roles)`, the roles with remaining in-degree are in a cycle or have missing dependencies. Their names are included in the error message to aid debugging.
566+
The algorithm is Kahn's BFS-based topological sort grouped into level-order waves, O(V+E). Missing dependencies are validated **upfront** before the sort begins, ensuring clear error messages. Cycle detection is then derived from the invariant that Kahn's algorithm only produces a complete ordering when no cycle exists. If `totalSorted < len(roles)` after the upfront check passes, the remaining roles must form a cycle. Their names are included in the error message to aid debugging.
542567

543568
---
544569

@@ -549,7 +574,7 @@ The algorithm is Kahn's BFS-based topological sort grouped into level-order wave
549574
| Method | Path | Description |
550575
|--------|------|-------------|
551576
| `POST` | `/v1/multi-agent-runtime` | Create a new agent group. Returns group session ID and coordinator entrypoints. |
552-
| `DELETE` | `/v1/multi-agent-runtime/groups/:groupSessionId` | Delete all sandboxes in the group and remove the group manifest from the store. |
577+
| `DELETE` | `/v1/multi-agent-runtime/groups/:groupSessionId` | Delete all sandboxes in the group, their associated Headless Services (via OwnerReference cascading), and remove the group manifest from the store. Returns `204 No Content` on success. |
553578
| `GET` | `/v1/multi-agent-runtime/groups/:groupSessionId/topology` | Return the group manifest including all role endpoints and statuses. Intended for use by the coordinator at startup to discover worker endpoints. |
554579

555580
### Request and Response Types
@@ -567,33 +592,28 @@ type CreateAgentGroupRequest struct {
567592
#### Create Group Response
568593

569594
```go
570-
type CreateAgentGroupResponse struct {
571-
GroupSessionID string `json:"groupSessionId"`
572-
Roles []AgentGroupRoleResponse `json:"roles"`
573-
}
574-
575-
type AgentGroupRoleResponse struct {
595+
// AgentGroupRoleState is the shared type used across the API response, group manifest,
596+
// and topology endpoint. A single type prevents structural drift between these surfaces.
597+
type AgentGroupRoleState struct {
576598
Name string `json:"name"`
577599
SessionID string `json:"sessionId"`
578600
Endpoint string `json:"endpoint"`
579601
Status string `json:"status"` // "ready" | "failed"
580602
}
603+
604+
type CreateAgentGroupResponse struct {
605+
GroupSessionID string `json:"groupSessionId"`
606+
Roles []AgentGroupRoleState `json:"roles"`
607+
}
581608
```
582609

583610
#### Group Manifest (stored in Redis/Valkey)
584611

585612
```go
586613
type AgentGroupManifest struct {
587-
GroupSessionID string `json:"groupSessionId"`
588-
Roles []AgentGroupRole `json:"roles"`
589-
CreatedAt time.Time `json:"createdAt"`
590-
}
591-
592-
type AgentGroupRole struct {
593-
Name string `json:"name"`
594-
SessionID string `json:"sessionId"`
595-
Endpoint string `json:"endpoint"`
596-
Status string `json:"status"` // "ready" | "failed"
614+
GroupSessionID string `json:"groupSessionId"`
615+
Roles []AgentGroupRoleState `json:"roles"`
616+
CreatedAt time.Time `json:"createdAt"`
597617
}
598618
```
599619

@@ -671,6 +691,9 @@ type MultiAgentRuntimeSpec struct {
671691
672692
// SessionTimeout is the idle timeout applied to all sandboxes in the group.
673693
// Defaults to 15m.
694+
// NOTE: Although this is a pointer type (*metav1.Duration), kubebuilder applies the
695+
// default value at admission time, so the pointer is always non-nil after defaulting.
696+
// The nil check in createSandboxGroup() is a defensive guard for programmatic callers.
674697
// +kubebuilder:default="15m"
675698
SessionTimeout *metav1.Duration `json:"sessionTimeout,omitempty"`
676699
@@ -734,6 +757,21 @@ type MultiAgentRuntimeStatus struct {
734757
735758
// Ready is true when all required roles are running and healthy.
736759
Ready bool `json:"ready,omitempty"`
760+
761+
// RoleStatuses tracks per-role operational state. Updated by the reconciler
762+
// during self-healing (Phase 4). Complements the store-side group manifest
763+
// by providing Kubernetes-native observability via kubectl.
764+
// +optional
765+
RoleStatuses []RoleStatusEntry `json:"roleStatuses,omitempty"`
766+
}
767+
768+
type RoleStatusEntry struct {
769+
// Name is the role name matching RoleSpec.Name.
770+
Name string `json:"name"`
771+
// Status is the current operational state: "Ready", "Failed", "Replacing".
772+
Status string `json:"status"`
773+
// SessionID is the sandbox session ID for this role, if available.
774+
SessionID string `json:"sessionId,omitempty"`
737775
}
738776
```
739777
@@ -851,7 +889,8 @@ group = client.create_group(
851889
namespace="default",
852890
)
853891
print(f"Group created: {group.group_session_id}")
854-
print(f"Coordinator endpoint: {group.roles[0].endpoint}")
892+
coordinator = next(r for r in group.roles if r.name == "planner")
893+
print(f"Coordinator endpoint: {coordinator.endpoint}")
855894

856895
# Discover worker topology (coordinator calls this at startup)
857896
topology = client.get_topology(group.group_session_id)
@@ -891,6 +930,9 @@ This feature is fully backward compatible. No existing behavior changes unless t
891930
| `pkg/apis/runtime/v1alpha1/multiagentruntime_types.go` | CRD types with kubebuilder markers |
892931
| `pkg/workloadmanager/multiagent_controller.go` | `MultiAgentRuntimeReconciler` |
893932
| `pkg/workloadmanager/multiagent_controller_test.go` | Reconciler unit tests |
933+
| `pkg/workloadmanager/multiagent_webhook.go` | `ValidatingAdmissionWebhook` for `MultiAgentRuntime` configuration validation |
934+
| `pkg/workloadmanager/multiagent_webhook_test.go` | Webhook unit tests (coordinator count, naming collisions, missing deps, DNS label) |
935+
| `manifests/charts/base/templates/multiagentruntime-webhook.yaml` | Webhook `ValidatingWebhookConfiguration` manifest |
894936
| `sdk-python/agentcube/multi_agent.py` | `MultiAgentRuntimeClient` for the Python SDK |
895937
| `sdk-python/examples/multi_agent_usage.py` | End-to-end usage example |
896938
| `test/e2e/multi_agent_runtime.yaml` | E2E test fixtures |
@@ -937,7 +979,7 @@ Deliverables that satisfy the mentorship expected outcomes on their own.
937979
- No two roles may produce the same sanitized environment variable key (naming collision detection).
938980
- `dependencies[]` references must point to roles defined within the same spec.
939981
- Role names must be valid DNS label fragments (lowercase alphanumeric and hyphens, max 63 characters).
940-
- Implement `createSandboxGroup()` with `Atomic` rollback (no `BestEffort` yet).
982+
- Implement `createSandboxGroup()` with `Atomic` rollback (no `BestEffort` yet), including `topoSort()`, `injectDependencyEndpoints()`, and Headless Service creation per role.
941983
- Add `GroupSessionID` + `Role` to `SandboxInfo`; propagate through `buildSandboxPlaceHolder()` + `buildSandboxInfo()`.
942984
- Implement all 5 store methods in `store_redis.go` + `store_valkey.go` with full unit test coverage.
943985
- Add `MultiAgentRuntimeKind` to Router endpoint switch.
@@ -953,9 +995,8 @@ Deliverables that satisfy the mentorship expected outcomes on their own.
953995
- Group creation uses `SandboxClaim` for warm roles, cold `Sandbox` creation for others.
954996
- Add E2E test comparing cold-start vs warm-start group creation latency.
955997

956-
### Phase 3 - DAG Startup and Topology (Weeks 7-8)
998+
### Phase 3 - Topology Endpoint and SDK (Weeks 7-8)
957999

958-
- Implement `dependencies[]` field: `topoSort()` + `injectDependencyEndpoints()`.
9591000
- Add `GET /v1/multi-agent-runtime/groups/:groupSessionId/topology` endpoint.
9601001
- Add `get_topology()` to Python SDK `MultiAgentRuntimeClient`.
9611002
- E2E test: verify dependency endpoint env vars are present in dependent pod environment.

0 commit comments

Comments
 (0)