Commit a13f249

docs: address additional feedback on multi-agent design proposal
Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
1 parent cb5397c commit a13f249

1 file changed

Lines changed: 13 additions & 2 deletions

docs/design/multi-agent-runtime-proposal.md

@@ -219,7 +219,13 @@ AGENTCUBE_DEP_{ROLE_NAME_SANITISED_UPPER}_ENDPOINT = {podIP}:{port}
 * If the dependency's `AgentRuntime` CRD defines a single port, that port is used.
 * If it defines multiple ports, the system first looks for a port named `http` or `default`. If no such port is found, it falls back to the first port in the ports list.
 
-For a role with `dependencies: [my-planner]` (where the planner exposes `8080` as the first port), the dependent pod receives:
+**Validation against Naming Collisions:**
+* Because multiple role names could map to the same sanitized environment variable (e.g., `my-agent` and `my.agent` both sanitize to `AGENTCUBE_DEP_MY_AGENT_ENDPOINT`), the API server validates the group configuration at request admission time. If any two roles within the group result in the same sanitized environment variable key, the request is rejected with a `400 Bad Request` validation error.
+
+**Injection Scope:**
+* The dependency endpoints are injected into the `Env` list of **all containers** (including primary, sidecar, and init-containers) defined in the pod spec. This ensures that any multi-container runtime configuration can reliably resolve the endpoints.
+
+For a role with `dependencies: [my-planner]` (where the planner exposes `8080` as the first port), the dependent pod's containers receive:
 
 ```
 AGENTCUBE_DEP_MY_PLANNER_ENDPOINT = 10.0.0.4:8080
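The port-selection and collision rules in this hunk can be sketched in Go roughly as follows. This is a minimal illustration, not the proposal's implementation: the `Port` type and the helpers `selectPort`, `envKeyForRole`, and `findEnvKeyCollision` are hypothetical names. The exact sanitization rule is also an assumption (upper-case the role name and replace every character outside `A-Z0-9` with `_`), chosen only because it reproduces the `my-agent` / `my.agent` example above. When several ports are defined, the sketch takes the first one in list order named `http` or `default`; the proposal does not say which of the two names takes precedence.

```go
package main

import (
	"fmt"
	"strings"
)

// Port is a hypothetical stand-in for a port entry in the AgentRuntime CRD.
type Port struct {
	Name   string
	Number int32
}

// selectPort applies the documented rule: a single port is used as-is;
// with multiple ports, the first one named "http" or "default" is used,
// falling back to the first port in the list. It reports false when the
// list is empty.
func selectPort(ports []Port) (int32, bool) {
	if len(ports) == 0 {
		return 0, false
	}
	if len(ports) > 1 {
		for _, p := range ports {
			if p.Name == "http" || p.Name == "default" {
				return p.Number, true
			}
		}
	}
	return ports[0].Number, true
}

// envKeyForRole builds the injected variable name for a role.
// ASSUMPTION: sanitization upper-cases the name and maps every character
// outside A-Z and 0-9 to '_'; the proposal excerpt does not spell this out.
func envKeyForRole(role string) string {
	var b strings.Builder
	for _, r := range strings.ToUpper(role) {
		if (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9') {
			b.WriteRune(r)
		} else {
			b.WriteByte('_')
		}
	}
	return "AGENTCUBE_DEP_" + b.String() + "_ENDPOINT"
}

// findEnvKeyCollision returns an error naming the first pair of roles that
// sanitize to the same key, or nil when all keys are distinct. A role name
// listed twice is also reported as a collision.
func findEnvKeyCollision(roles []string) error {
	seen := make(map[string]string, len(roles))
	for _, role := range roles {
		key := envKeyForRole(role)
		if prev, ok := seen[key]; ok {
			return fmt.Errorf("roles %q and %q both map to %s", prev, role, key)
		}
		seen[key] = role
	}
	return nil
}

func main() {
	// No port is named "http" or "default", so the first port is chosen.
	port, _ := selectPort([]Port{{Name: "grpc", Number: 8080}, {Name: "metrics", Number: 9090}})
	fmt.Printf("%s = 10.0.0.4:%d\n", envKeyForRole("my-planner"), port)
	fmt.Println(findEnvKeyCollision([]string{"my-agent", "my.agent"}))
}
```

In an admission handler, a non-nil error from a check like `findEnvKeyCollision` would be the point at which the request is rejected with `400 Bad Request`.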
@@ -305,7 +311,6 @@ func (s *Server) createSandboxGroup(
 	if err != nil {
 		if mar.Spec.StartupPolicy == StartupPolicyBestEffort && !role.IsCoordinator {
 			klog.Warningf("group %s: role %s failed (BestEffort policy): %v", groupSessionID, role.Name, err)
-			recordRoleFailure(groupSessionID, role.Name)
 			continue
 		}
 		return nil, fmt.Errorf("role %s: %w", role.Name, err)
@@ -648,6 +653,12 @@ The reconciler watches for `Sandbox` objects whose `GroupSessionID` matches a kn
 - **`Atomic` policy**: the reconciler calls `handleDeleteAgentGroup()` to tear down all remaining sandboxes and delete the group manifest. It sets a `Failed` condition on the `MultiAgentRuntimeStatus`.
 - **`BestEffort` policy**: the reconciler attempts to create a replacement sandbox for the failed role. On success, it calls `UpdateAgentGroupRoleStatus()` with the new endpoint. On repeated failure, it sets a `Degraded` condition.
 
+> [!WARNING]
+> **Stale Environment Variables in BestEffort Groups:**
+> When a failed worker pod is replaced under the `BestEffort` policy, the new pod receives a new IP address. Because environment variables are immutable once a pod is running, already active dependent pods (such as the coordinator) will retain the stale endpoint in their environment variables.
+>
+> To prevent communication failures, agents deployed in `BestEffort` groups must not rely solely on injected environment variables. Instead, they should utilize the `/topology` endpoint (`GET /v1/multi-agent-runtime/groups/:groupSessionId/topology`) for dynamic service discovery to retrieve current worker endpoints.
+
 ### Status Conditions
 
 | Condition | Meaning |
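The discovery pattern recommended in the warning above could look roughly like the Go sketch below: ask the `/topology` endpoint for the current endpoint of a role, and fall back to the injected environment value only if the lookup fails. The URL path is taken from the proposal. Everything else is an assumption made for illustration: the JSON response shape (`roles[].name`, `roles[].endpoint`), the helper names `lookupEndpoint`, `resolveEndpoint`, and `newStubTopologyServer`, and the use of an in-process stub server in place of the real API.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"os"
)

// topology is an ASSUMED response shape; the proposal excerpt does not
// show the real schema returned by the /topology endpoint.
type topology struct {
	Roles []struct {
		Name     string `json:"name"`
		Endpoint string `json:"endpoint"`
	} `json:"roles"`
}

// lookupEndpoint queries the topology endpoint for a group and returns the
// endpoint currently recorded for the given role.
func lookupEndpoint(client *http.Client, baseURL, groupSessionID, role string) (string, error) {
	u := baseURL + "/v1/multi-agent-runtime/groups/" + url.PathEscape(groupSessionID) + "/topology"
	resp, err := client.Get(u)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("topology lookup: unexpected status %s", resp.Status)
	}
	var t topology
	if err := json.NewDecoder(resp.Body).Decode(&t); err != nil {
		return "", err
	}
	for _, r := range t.Roles {
		if r.Name == role {
			return r.Endpoint, nil
		}
	}
	return "", fmt.Errorf("role %q not found in topology", role)
}

// resolveEndpoint prefers the live topology answer and uses the fallback
// (typically the injected environment value) only when the lookup fails.
func resolveEndpoint(client *http.Client, baseURL, groupSessionID, role, fallback string) string {
	if ep, err := lookupEndpoint(client, baseURL, groupSessionID, role); err == nil && ep != "" {
		return ep
	}
	return fallback
}

// newStubTopologyServer starts a local stand-in for the API server that
// answers every request with the given JSON body (demo only).
func newStubTopologyServer(body string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, body)
	}))
}

func main() {
	srv := newStubTopologyServer(`{"roles":[{"name":"my-planner","endpoint":"10.0.0.9:8080"}]}`)
	defer srv.Close()

	// The injected value may be stale after a BestEffort replacement.
	stale := os.Getenv("AGENTCUBE_DEP_MY_PLANNER_ENDPOINT")
	fmt.Println(resolveEndpoint(srv.Client(), srv.URL, "group-1", "my-planner", stale))
}
```

A long-running coordinator would likely call something like `resolveEndpoint` before each connection attempt, or after a connection error, rather than once at startup, since a replacement can happen at any time.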
