doc: add multi-agent runtime design proposal for LFX mentorship #354
Abhinav-kodes wants to merge 16 commits into
[APPROVALNOTIFIER] This PR is NOT APPROVED. It needs approval from an approver in each of the changed files.
Code Review
This pull request introduces a design proposal for MultiAgentRuntime, a declarative orchestration layer for managing groups of collaborating agents in AgentCube. The design covers CRD specifications, topological startup ordering, atomic/best-effort policies, and integration with the existing store and router. Feedback focused on technical refinements, including sanitizing environment variable names to handle hyphens, avoiding resource leaks from using defer in loops, and ensuring the group manifest accurately reflects failed roles in BestEffort mode. Additionally, improvements were suggested for error reporting regarding missing dependencies and addressing potential race conditions in the store's self-healing interface.
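The `defer`-in-loop concern mentioned in this summary is a general Go pitfall: deferred calls run when the enclosing function returns, not at the end of each iteration, so resources acquired in a loop stay held until the whole loop finishes. The following is a minimal sketch of one common fix, scoping each iteration in a function literal. The names `handle` and `createRoles` are hypothetical and are not taken from the proposal.

```go
package main

import "fmt"

// handle is a hypothetical stand-in for any per-role resource
// (a lock, a client connection, a file) that must be released.
type handle struct {
	role   string
	events *[]string
}

// Close records the release so the ordering can be inspected.
func (h *handle) Close() {
	*h.events = append(*h.events, "close "+h.role)
}

// createRoles releases each handle at the end of its own iteration by
// scoping the defer to a function literal instead of the outer function.
func createRoles(roles []string) []string {
	var events []string
	for _, role := range roles {
		func() {
			h := &handle{role: role, events: &events}
			defer h.Close() // runs when this literal returns, once per iteration
			events = append(events, "create "+role)
		}()
	}
	return events
}

func main() {
	for _, e := range createRoles([]string{"planner", "writer"}) {
		fmt.Println(e)
	}
}
```

With the function literal in place, each `close` event is recorded immediately after its matching `create` event rather than after the loop ends.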
Codecov Report
✅ All modified and coverable lines are covered by tests.

| Metric | main | #354 | +/- |
|---|---|---|---|
| Coverage | 47.57% | 49.53% | +1.96% |
| Files | 30 | 30 | |
| Lines | 2819 | 2881 | +62 |
| Hits | 1341 | 1427 | +86 |
| Misses | 1338 | 1301 | -37 |
| Partials | 140 | 153 | +13 |
force-pushed from 113caaf to f35c7c8
/gemini review
Code Review
This pull request introduces a design proposal for MultiAgentRuntime, a declarative orchestration layer for managing groups of collaborating AgentRuntime roles. The design covers CRD specifications, topological startup ordering, dependency endpoint injection via environment variables, and self-healing policies. The review feedback highlights several critical areas for refinement: potential naming collisions during environment variable sanitization, the need to clarify injection scope for multi-container pods, a logic error in the failure recording sequence within the provided code snippet, and the staleness of environment variables when pods are replaced under the BestEffort policy.
force-pushed from 5ad793b to a13f249
/gemini review
Code Review
This pull request introduces a design proposal for MultiAgentRuntime, a new custom resource for orchestrating groups of collaborating agents with unified lifecycle management. The proposal covers CRD specifications, topological startup ordering, dependency injection, and integration with existing components like the Router and Garbage Collector. Feedback focuses on addressing a flaw in the garbage collection logic where worker sandboxes might be prematurely deleted due to inactivity, ensuring group-level timeout settings are correctly applied, parallelizing sandbox creation to prevent request timeouts, and refining port resolution for dependency injection. Additionally, a type mismatch in the implementation snippet was identified.
/gemini review
Code Review
This pull request proposes the design for MultiAgentRuntime, a declarative orchestration layer for managing groups of collaborating agents as a single unit. The proposal includes details on CRD specifications, dependency injection via environment variables, and a phased implementation plan. Review feedback identifies an inconsistency between the described parallel startup logic and the provided Go code snippet, and suggests expanding the port resolution rules to support multiple named ports for better flexibility.
/gemini review
Code Review
This pull request introduces a design proposal for MultiAgentRuntime, a new custom resource for orchestrating groups of collaborating agents with unified lifecycle management and dependency-aware startup. The feedback identifies several technical issues in the provided code snippets, including field name discrepancies for timeouts and TTLs, as well as type mismatches. Furthermore, the reviewer suggests enhancing the BestEffort startup policy by ensuring failed roles are included in the group manifest and recommending the use of Kubernetes Services for stable DNS names to resolve issues with stale environment variables during pod replacements.
/gemini review
Code Review
This pull request introduces a comprehensive design proposal for MultiAgentRuntime, a new declarative orchestration layer for managing groups of collaborating agents as a single unit. The design includes support for topological startup ordering, atomic rollbacks, per-role warm pools, and group-aware garbage collection. The review feedback highlights several critical implementation details, including potential Kubernetes service name length violations, the necessity of a Validating Admission Webhook for configuration validation, and concerns about HTTP timeouts when creating large groups synchronously. Additionally, improvements were suggested for consistent TTL calculation across group members and clarifying the Redis Hash storage structure for group manifests.
/gemini review
Code Review
This pull request introduces a comprehensive design proposal for the MultiAgentRuntime custom resource, enabling declarative orchestration and lifecycle management for groups of collaborating agents. The proposal details the architecture, CRD specification, dependency injection via stable DNS endpoints, and integration with existing components like the Workload Manager and Router. The review feedback correctly identifies a broken internal link in the table of contents, suggests correcting spelling inconsistencies in environment variable naming conventions, and recommends standardizing API endpoint paths for better consistency.
/gemini review
Code Review
This pull request introduces a comprehensive design proposal for MultiAgentRuntime, a new CRD for orchestrating groups of collaborating agents with unified lifecycle management. The proposal includes details on CRD specifications, topological startup ordering, dependency injection using headless services, and group-aware garbage collection. Feedback focuses on ensuring Kubernetes DNS label compliance for service names, extending the specification to support both AgentRuntime and CodeInterpreter types, and optimizing store operations and garbage collection efficiency to prevent race conditions and redundant lookups.
/gemini review
Code Review
This pull request introduces a comprehensive design proposal for MultiAgentRuntime, a declarative orchestration layer for managing groups of collaborating agents in AgentCube. The proposal covers the CRD specification, dependency-based startup ordering, per-role warm pools, and group-aware garbage collection while maintaining backward compatibility. Feedback focused on correcting documentation inconsistencies regarding service naming, API response payloads in diagrams, and method counts. Additionally, a potential efficiency issue was raised concerning the retrieval of activity timestamps for group-wide idle timeout calculations.
force-pushed from d6746ef to 1637ac8
/gemini review
Code Review
This pull request introduces a comprehensive design proposal for MultiAgentRuntime, a new declarative orchestration layer for AgentCube that enables managing groups of collaborating agents as a single unit. The proposal details the CRD specification, topological startup ordering using Kahn's algorithm, and stable dependency injection via Headless Services. The review feedback highlights several critical edge cases in the proposed implementation, including potential naming collisions due to character truncation, DNS compliance issues with trailing hyphens, and missing error handling for dependencies without defined ports. A correction to the injectDependencyEndpoints function signature in the design document was also suggested to ensure it has access to the necessary group session ID.
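To make the topological startup ordering concrete, here is a minimal sketch of Kahn's algorithm over role dependencies. It is not the proposal's implementation: the `deps` map shape, the `startupOrder` name, and the example role names are assumptions chosen for illustration. It also shows one way to report a dependency on an undefined role and a dependency cycle.

```go
package main

import (
	"fmt"
	"sort"
)

// startupOrder returns role names so that every role appears after the
// roles it depends on. deps maps a role to its dependencies (assumed shape).
func startupOrder(deps map[string][]string) ([]string, error) {
	inDegree := make(map[string]int, len(deps))
	dependents := make(map[string][]string)
	for role, ds := range deps {
		inDegree[role] += 0 // make sure roles without dependencies are present
		for _, d := range ds {
			if _, ok := deps[d]; !ok {
				return nil, fmt.Errorf("role %q depends on undefined role %q", role, d)
			}
			inDegree[role]++
			dependents[d] = append(dependents[d], role)
		}
	}

	// Start with every role that has no unmet dependencies.
	var ready []string
	for role, n := range inDegree {
		if n == 0 {
			ready = append(ready, role)
		}
	}
	sort.Strings(ready) // sorted only to make the output deterministic

	order := make([]string, 0, len(deps))
	for len(ready) > 0 {
		role := ready[0]
		ready = ready[1:]
		order = append(order, role)

		next := dependents[role]
		sort.Strings(next)
		for _, dep := range next {
			inDegree[dep]--
			if inDegree[dep] == 0 {
				ready = append(ready, dep)
			}
		}
	}

	// Any role left unprocessed is part of a cycle.
	if len(order) != len(deps) {
		return nil, fmt.Errorf("dependency cycle detected among roles")
	}
	return order, nil
}

func main() {
	order, err := startupOrder(map[string][]string{
		"planner":    nil,
		"researcher": {"planner"},
		"writer":     {"researcher", "planner"},
	})
	fmt.Println(order, err)
}
```

Roles that become ready at the same step have no ordering constraint between them, which is where a parallel creation strategy could start them concurrently.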
force-pushed from 4e82328 to c7958df
/gemini review
Code Review
This pull request introduces the 'MultiAgentRuntime' design, a new custom resource for orchestrating groups of collaborating agents with unified lifecycle management, dependency-based startup ordering, and stable internal networking. The review identified several critical issues in the proposed implementation of createSandboxGroup(): the createdRole struct fails to track SandboxClaim objects, which prevents proper cleanup of warm-pool resources during rollback, and the current rollback logic incorrectly passes nil instead of the claim object. Additionally, it was suggested to clarify the failure behavior when a dependency lacks exposed ports.
Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
…d use Headless Services for DNS stability Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
… Redis Hash layout feedback Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
…n date Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
…or GC cleanup, and cache GC coordinator lookup Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
…C coordinator lookup Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
…-port error handling, and injectDependencyEndpoints signature Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
…ring, and CRD status Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
…est fields Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
force-pushed from 538105c to cf2dc75
/gemini review
Code Review
This pull request introduces a design proposal for MultiAgentRuntime, a new custom resource for AgentCube that enables declarative orchestration of collaborating agent groups. The design includes support for topological startup ordering, atomic creation and rollback, per-role warm pools, and stable inter-agent communication using Headless Services. The review feedback highlights three key areas for improvement: ensuring Headless Service selectors use stable identifiers (GroupSessionID and Role) to support self-healing, utilizing atomic store operations like Redis Lua scripts for manifest cleanup to prevent race conditions during garbage collection, and optimizing the parallel creation logic to reduce mutex contention when injecting dependency endpoints.
Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
/gemini review
Code Review
This pull request introduces a design proposal for MultiAgentRuntime, a new custom resource for AgentCube that enables declarative orchestration of multiple collaborating agents. The design covers CRD specifications, topological startup ordering, stable DNS-based endpoint injection via Headless Services, and group-aware garbage collection. Feedback was provided regarding the limitations of the ValidatingAdmissionWebhook in predicting runtime-generated service names and the necessity of using Redis Lua scripts for atomic role deletion to prevent race conditions.
**Validation against Naming Collisions:**
* Because multiple role names or port names could map to the same sanitized environment variable (e.g., `my-agent` and `my.agent` both sanitize to `AGENTCUBE_DEP_MY_AGENT_ENDPOINT`), the API server validates the group configuration at request admission time. If any two roles or named ports within the group result in the same sanitized environment variable key, the request is rejected with a `400 Bad Request` validation error.
* The `ValidatingAdmissionWebhook` also explicitly checks for **Service name collisions after truncation**: after computing `mar-{shortHash}-{roleNameSanitized-truncated-stripped}` for each role, if any two roles in the same group produce an identical service name, the request is rejected. This prevents the edge case where two roles whose names are identical in their first 50 characters would silently share a single Headless Service.
The ValidatingAdmissionWebhook cannot compute the final Service name because the groupSessionID (which contains a random UUID) is only generated at runtime during the API request, not at resource admission time. The webhook should instead focus on validating that role names within the same MultiAgentRuntime spec do not collide after sanitization and truncation.
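The check suggested in this comment can be done without the `groupSessionID`, because only the role-derived part of each name needs to be compared. Below is a minimal sketch under stated assumptions: the exact sanitization rules (uppercase with `_` for env vars, lowercase with `-` for the service suffix), the 50-character limit taken from the quoted text, and the function names `envKey`, `serviceSuffix`, and `validateRoleNames` are all hypothetical and not confirmed by the proposal.

```go
package main

import (
	"fmt"
	"strings"
)

// maxRoleSuffixLen is an assumption based on "first 50 characters" in the quoted text.
const maxRoleSuffixLen = 50

// envKey maps a role name to its dependency env var key (assumed rule:
// uppercase, every character outside A-Z and 0-9 becomes '_').
func envKey(role string) string {
	var b strings.Builder
	for _, r := range strings.ToUpper(role) {
		if (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9') {
			b.WriteRune(r)
		} else {
			b.WriteByte('_')
		}
	}
	return "AGENTCUBE_DEP_" + b.String() + "_ENDPOINT"
}

// serviceSuffix computes only the role-derived part of the Service name:
// sanitize, truncate, then strip trailing hyphens (assumed rules).
func serviceSuffix(role string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(role) {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') {
			b.WriteRune(r)
		} else {
			b.WriteByte('-')
		}
	}
	s := b.String() // ASCII only at this point, so byte slicing is safe
	if len(s) > maxRoleSuffixLen {
		s = s[:maxRoleSuffixLen]
	}
	return strings.TrimRight(s, "-")
}

// validateRoleNames rejects a spec in which two roles collide after
// sanitization (env var key) or after truncation (service suffix).
func validateRoleNames(roles []string) error {
	envOwner := make(map[string]string, len(roles))
	svcOwner := make(map[string]string, len(roles))
	for _, role := range roles {
		k := envKey(role)
		if prev, ok := envOwner[k]; ok {
			return fmt.Errorf("roles %q and %q both map to env var %s", prev, role, k)
		}
		envOwner[k] = role

		s := serviceSuffix(role)
		if prev, ok := svcOwner[s]; ok {
			return fmt.Errorf("roles %q and %q both map to service suffix %q", prev, role, s)
		}
		svcOwner[s] = role
	}
	return nil
}

func main() {
	fmt.Println(validateRoleNames([]string{"my-agent", "my.agent"}))
	fmt.Println(validateRoleNames([]string{"planner", "writer"}))
}
```

Because the `mar-{shortHash}-` prefix is identical for every role in a group, comparing suffixes at admission time is equivalent to comparing the final Service names that are computed later at runtime.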
// To prevent race conditions during concurrent GC, the check-then-delete sequence
// (removing the role, and deleting the manifest if it was the last role) MUST be
// implemented using a Redis Lua script or transaction.
DeleteAgentGroupRole(ctx context.Context, groupSessionID, roleName string) error
To ensure atomicity and prevent race conditions where the manifest might be deleted while a concurrent process is adding a role, the DeleteAgentGroupRole implementation should use a Redis Lua script. This script should perform the HDEL and then check HLEN to decide whether to delete the entire key in a single atomic operation.
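A sketch of the script described in this comment, together with an in-memory model of the same semantics, may help when reviewing the store change. Everything here is an assumption for illustration: the Lua text, the `memStore` type, and the helper names are hypothetical, and the model uses a mutex to stand in for the atomicity that Redis provides while a script runs. It deliberately avoids a Redis client so that it stays self-contained.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// deleteRoleScript is an assumed Lua script: KEYS[1] is the group manifest
// hash and ARGV[1] is the role field. It is shown for reference only.
const deleteRoleScript = `
redis.call('HDEL', KEYS[1], ARGV[1])
if redis.call('HLEN', KEYS[1]) == 0 then
  redis.call('DEL', KEYS[1])
  return 1
end
return 0
`

// memStore is a hypothetical in-memory model of the manifest hashes.
type memStore struct {
	mu        sync.Mutex
	manifests map[string]map[string]string // groupSessionID -> role -> sandbox ID
}

func newMemStore() *memStore {
	return &memStore{manifests: make(map[string]map[string]string)}
}

func (s *memStore) putRole(groupSessionID, roleName, sandboxID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.manifests[groupSessionID] == nil {
		s.manifests[groupSessionID] = make(map[string]string)
	}
	s.manifests[groupSessionID][roleName] = sandboxID
}

// DeleteAgentGroupRole removes the role and, if it was the last one, the
// manifest itself, all under one lock (the model's stand-in for the script).
func (s *memStore) DeleteAgentGroupRole(ctx context.Context, groupSessionID, roleName string) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	m := s.manifests[groupSessionID]
	delete(m, roleName) // HDEL
	if len(m) == 0 {    // HLEN == 0
		delete(s.manifests, groupSessionID) // DEL
	}
	return nil
}

// roleCount reports how many roles remain and whether the manifest exists.
func (s *memStore) roleCount(groupSessionID string) (int, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	m, ok := s.manifests[groupSessionID]
	return len(m), ok
}

// deleteConcurrently simulates several GC workers removing roles at once.
func deleteConcurrently(s *memStore, groupSessionID string, roles ...string) {
	var wg sync.WaitGroup
	for _, role := range roles {
		wg.Add(1)
		go func(r string) {
			defer wg.Done()
			_ = s.DeleteAgentGroupRole(context.Background(), groupSessionID, r)
		}(role)
	}
	wg.Wait()
}

func main() {
	s := newMemStore()
	s.putRole("g1", "planner", "sb-1")
	s.putRole("g1", "writer", "sb-2")
	deleteConcurrently(s, "g1", "planner", "writer")
	n, exists := s.roleCount("g1")
	fmt.Println(n, exists)
}
```

The point of the single critical section is that no other writer can add a role between the emptiness check and the manifest deletion, which is the race this comment describes.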
What type of PR is this?
/kind documentation
What this PR does / why we need it:
This PR introduces the design proposal for `MultiAgentRuntime` (`docs/design/multi-agent-runtime-proposal.md`) as part of the LFX Mentorship (June-August 2026) to support multi-agent orchestrations.
The proposal outlines a declarative composition layer to manage the lifecycle of multiple collaborating agents:
* `MultiAgentRuntime` CRD: Declares a group of roles referencing existing `AgentRuntime` CRDs.
* `SandboxWarmPool` and `SandboxClaim` machinery per role to reduce group startup latency.
Which issue(s) this PR fixes:
Fixes #301
Special notes for your reviewer:
This PR adds the design proposal document `docs/design/multi-agent-runtime-proposal.md`. The implementation plan is designed to be fully additive and backward-compatible, wrapping the existing `createSandbox()` transaction framework without modifying its inner mechanics.
Does this PR introduce a user-facing change?: