doc: add multi-agent runtime design proposal for LFX mentorship #354

Draft
Abhinav-kodes wants to merge 16 commits into volcano-sh:main from Abhinav-kodes:lfx-multi-agent-proposal

@Abhinav-kodes
Contributor

What type of PR is this?

/kind documentation

What this PR does / why we need it:
This PR introduces the design proposal for MultiAgentRuntime (docs/design/multi-agent-runtime-proposal.md) as part of the LFX Mentorship (June-August 2026) to support multi-agent orchestrations.

The proposal outlines a declarative composition layer to manage the lifecycle of multiple collaborating agents:

  • MultiAgentRuntime CRD: Declares a group of roles referencing existing AgentRuntime CRDs.
  • Atomic Group-Level Lifecycle: Ensures transactional startup and rollback (failure to create any mandatory role rolls back all others).
  • Topological Ordering & Dependency Injection: Computes a Directed Acyclic Graph (DAG) for role startup and injects worker IP endpoints as environment variables.
  • Warm Pools per Role: Reuses the existing SandboxWarmPool and SandboxClaim machinery per role to reduce group startup latency.
  • Lightweight Group Store Layout: Group metadata is stored alongside individual sandboxes in the Redis/Valkey store.
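The topological ordering step can be illustrated with a minimal Go sketch of Kahn's algorithm. The `Role` struct and its `DependsOn` field are hypothetical stand-ins for whatever the CRD ends up using; this is an illustration under those assumptions, not the proposal's implementation.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Role is a hypothetical, simplified stand-in for one role entry in a
// MultiAgentRuntime spec. The field names are assumptions.
type Role struct {
	Name      string
	DependsOn []string
}

// ErrCycle is returned when the dependency graph is not a DAG.
var ErrCycle = errors.New("dependency cycle detected")

// StartupOrder returns role names ordered so that every role appears after
// all of its dependencies (Kahn's algorithm). Ties are broken alphabetically
// to keep the result deterministic.
func StartupOrder(roles []Role) ([]string, error) {
	inDegree := make(map[string]int, len(roles))
	dependents := make(map[string][]string, len(roles))
	for _, r := range roles {
		if _, dup := inDegree[r.Name]; dup {
			return nil, fmt.Errorf("duplicate role %q", r.Name)
		}
		inDegree[r.Name] = 0
	}
	for _, r := range roles {
		for _, dep := range r.DependsOn {
			if _, ok := inDegree[dep]; !ok {
				return nil, fmt.Errorf("role %q depends on unknown role %q", r.Name, dep)
			}
			inDegree[r.Name]++
			dependents[dep] = append(dependents[dep], r.Name)
		}
	}
	// Seed the queue with roles that have no dependencies.
	var ready []string
	for name, d := range inDegree {
		if d == 0 {
			ready = append(ready, name)
		}
	}
	sort.Strings(ready)
	order := make([]string, 0, len(roles))
	for len(ready) > 0 {
		name := ready[0]
		ready = ready[1:]
		order = append(order, name)
		next := dependents[name]
		sort.Strings(next)
		for _, child := range next {
			inDegree[child]--
			if inDegree[child] == 0 {
				ready = append(ready, child)
			}
		}
	}
	// Any role left unprocessed is part of a cycle.
	if len(order) != len(roles) {
		return nil, ErrCycle
	}
	return order, nil
}

func main() {
	order, err := StartupOrder([]Role{
		{Name: "coordinator", DependsOn: []string{"researcher", "coder"}},
		{Name: "researcher"},
		{Name: "coder"},
	})
	fmt.Println(order, err)
}
```

Rejecting unknown dependencies and cycles up front means a misconfigured group fails before any sandbox is created, which fits the atomic startup goal described above.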

Which issue(s) this PR fixes:
Fixes #301

Special notes for your reviewer:
This PR adds the design proposal document docs/design/multi-agent-runtime-proposal.md. The implementation plan is designed to be fully additive and backward-compatible, wrapping the existing createSandbox() transaction framework without modifying its inner mechanics.

Does this PR introduce a user-facing change?:

NONE

@volcano-sh-bot volcano-sh-bot added the do-not-merge/work-in-progress and kind/documentation labels May 19, 2026
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kevin-wangzefeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a design proposal for MultiAgentRuntime, a declarative orchestration layer for managing groups of collaborating agents in AgentCube. The design covers CRD specifications, topological startup ordering, atomic/best-effort policies, and integration with the existing store and router. Feedback focused on technical refinements, including sanitizing environment variable names to handle hyphens, avoiding resource leaks from using defer in loops, and ensuring the group manifest accurately reflects failed roles in BestEffort mode. Additionally, improvements were suggested for error reporting regarding missing dependencies and addressing potential race conditions in the store's self-healing interface.

Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
@codecov-commenter

codecov-commenter commented May 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 49.53%. Comparing base (524e55e) to head (eb29b26).
⚠️ Report is 72 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #354      +/-   ##
==========================================
+ Coverage   47.57%   49.53%   +1.96%     
==========================================
  Files          30       30              
  Lines        2819     2881      +62     
==========================================
+ Hits         1341     1427      +86     
+ Misses       1338     1301      -37     
- Partials      140      153      +13     
Flag Coverage Δ
unittests 49.53% <ø> (+1.96%) ⬆️

@Abhinav-kodes Abhinav-kodes force-pushed the lfx-multi-agent-proposal branch from 113caaf to f35c7c8 Compare May 19, 2026 22:31
@Abhinav-kodes
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a design proposal for MultiAgentRuntime, a declarative orchestration layer for managing groups of collaborating AgentRuntime roles. The design covers CRD specifications, topological startup ordering, dependency endpoint injection via environment variables, and self-healing policies. The review feedback highlights several critical areas for refinement: potential naming collisions during environment variable sanitization, the need to clarify injection scope for multi-container pods, a logic error in the failure recording sequence within the provided code snippet, and the staleness of environment variables when pods are replaced under the BestEffort policy.

Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md
@Abhinav-kodes Abhinav-kodes force-pushed the lfx-multi-agent-proposal branch from 5ad793b to a13f249 Compare May 19, 2026 22:59
@Abhinav-kodes
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a design proposal for MultiAgentRuntime, a new custom resource for orchestrating groups of collaborating agents with unified lifecycle management. The proposal covers CRD specifications, topological startup ordering, dependency injection, and integration with existing components like the Router and Garbage Collector. Feedback focuses on addressing a flaw in the garbage collection logic where worker sandboxes might be prematurely deleted due to inactivity, ensuring group-level timeout settings are correctly applied, parallelizing sandbox creation to prevent request timeouts, and refining port resolution for dependency injection. Additionally, a type mismatch in the implementation snippet was identified.

Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
@Abhinav-kodes
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request proposes the design for MultiAgentRuntime, a declarative orchestration layer for managing groups of collaborating agents as a single unit. The proposal includes details on CRD specifications, dependency injection via environment variables, and a phased implementation plan. Review feedback identifies an inconsistency between the described parallel startup logic and the provided Go code snippet, and suggests expanding the port resolution rules to support multiple named ports for better flexibility.

Comment thread docs/design/multi-agent-runtime-proposal.md
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
@Abhinav-kodes
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a design proposal for MultiAgentRuntime, a new custom resource for orchestrating groups of collaborating agents with unified lifecycle management and dependency-aware startup. The feedback identifies several technical issues in the provided code snippets, including field name discrepancies for timeouts and TTLs, as well as type mismatches. Furthermore, the reviewer suggests enhancing the BestEffort startup policy by ensuring failed roles are included in the group manifest and recommending the use of Kubernetes Services for stable DNS names to resolve issues with stale environment variables during pod replacements.

Comment thread docs/design/multi-agent-runtime-proposal.md
Comment thread docs/design/multi-agent-runtime-proposal.md
Comment thread docs/design/multi-agent-runtime-proposal.md
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
@Abhinav-kodes
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a comprehensive design proposal for MultiAgentRuntime, a new declarative orchestration layer for managing groups of collaborating agents as a single unit. The design includes support for topological startup ordering, atomic rollbacks, per-role warm pools, and group-aware garbage collection. The review feedback highlights several critical implementation details, including potential Kubernetes service name length violations, the necessity of a Validating Admission Webhook for configuration validation, and concerns about HTTP timeouts when creating large groups synchronously. Additionally, improvements were suggested for consistent TTL calculation across group members and clarifying the Redis Hash storage structure for group manifests.

Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md
Comment thread docs/design/multi-agent-runtime-proposal.md
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
@Abhinav-kodes
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a comprehensive design proposal for the MultiAgentRuntime custom resource, enabling declarative orchestration and lifecycle management for groups of collaborating agents. The proposal details the architecture, CRD specification, dependency injection via stable DNS endpoints, and integration with existing components like the Workload Manager and Router. The review feedback correctly identifies a broken internal link in the table of contents, suggests correcting spelling inconsistencies in environment variable naming conventions, and recommends standardizing API endpoint paths for better consistency.

Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
@Abhinav-kodes
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a comprehensive design proposal for MultiAgentRuntime, a new CRD for orchestrating groups of collaborating agents with unified lifecycle management. The proposal includes details on CRD specifications, topological startup ordering, dependency injection using headless services, and group-aware garbage collection. Feedback focuses on ensuring Kubernetes DNS label compliance for service names, extending the specification to support both AgentRuntime and CodeInterpreter types, and optimizing store operations and garbage collection efficiency to prevent race conditions and redundant lookups.

Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md
@Abhinav-kodes
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a comprehensive design proposal for MultiAgentRuntime, a declarative orchestration layer for managing groups of collaborating agents in AgentCube. The proposal covers the CRD specification, dependency-based startup ordering, per-role warm pools, and group-aware garbage collection while maintaining backward compatibility. Feedback focused on correcting documentation inconsistencies regarding service naming, API response payloads in diagrams, and method counts. Additionally, a potential efficiency issue was raised concerning the retrieval of activity timestamps for group-wide idle timeout calculations.

Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
@Abhinav-kodes Abhinav-kodes force-pushed the lfx-multi-agent-proposal branch from d6746ef to 1637ac8 Compare May 20, 2026 16:30
@Abhinav-kodes
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a comprehensive design proposal for MultiAgentRuntime, a new declarative orchestration layer for AgentCube that enables managing groups of collaborating agents as a single unit. The proposal details the CRD specification, topological startup ordering using Kahn's algorithm, and stable dependency injection via Headless Services. The review feedback highlights several critical edge cases in the proposed implementation, including potential naming collisions due to character truncation, DNS compliance issues with trailing hyphens, and missing error handling for dependencies without defined ports. A correction to the injectDependencyEndpoints function signature in the design document was also suggested to ensure it has access to the necessary group session ID.
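The truncation and trailing-hyphen edge cases can be made concrete with a small Go sketch. It assumes a service name of the form `mar-{shortHash}-{role}`, an 8-character hash prefix derived from SHA-256 of the group session ID, and a 50-character cap on the role part; the hash derivation and the helper names are assumptions for illustration, not the proposal's code.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// maxRoleLen keeps "mar-" (4) + hash (8) + "-" (1) + role within the
// 63-character limit of a DNS label. The 8-character hash is an assumption.
const maxRoleLen = 50

// sanitizeRole lowercases the role name, maps every character outside
// [a-z0-9] to '-', truncates, and strips hyphens left at either end.
func sanitizeRole(role string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(role) {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') {
			b.WriteRune(r)
		} else {
			b.WriteByte('-')
		}
	}
	s := strings.Trim(b.String(), "-")
	if len(s) > maxRoleLen {
		s = s[:maxRoleLen]
	}
	// Truncation can expose a new trailing hyphen, so strip again.
	return strings.TrimRight(s, "-")
}

// ServiceName builds a DNS-label-safe Headless Service name for one role.
func ServiceName(groupSessionID, role string) (string, error) {
	suffix := sanitizeRole(role)
	if suffix == "" {
		return "", errors.New("role name contains no DNS-safe characters")
	}
	sum := sha256.Sum256([]byte(groupSessionID))
	shortHash := hex.EncodeToString(sum[:])[:8]
	return fmt.Sprintf("mar-%s-%s", shortHash, suffix), nil
}

func main() {
	name, err := ServiceName("example-group-session", "Data_Fetcher-")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(name, len(name))
}
```

Stripping hyphens a second time after truncation is the step that addresses the trailing-hyphen case; two roles that share the same first 50 sanitized characters still collide and need a separate uniqueness check.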

Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
@Abhinav-kodes Abhinav-kodes force-pushed the lfx-multi-agent-proposal branch 2 times, most recently from 4e82328 to c7958df Compare May 20, 2026 16:54
@Abhinav-kodes
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the 'MultiAgentRuntime' design, a new custom resource for orchestrating groups of collaborating agents with unified lifecycle management, dependency-based startup ordering, and stable internal networking. The review identified several critical issues in the proposed implementation of createSandboxGroup(): the createdRole struct fails to track SandboxClaim objects, which prevents proper cleanup of warm-pool resources during rollback, and the current rollback logic incorrectly passes nil instead of the claim object. Additionally, it was suggested to clarify the failure behavior when a dependency lacks exposed ports.

Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md
Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
…d use Headless Services for DNS stability

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
… Redis Hash layout feedback

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
…n date

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
…or GC cleanup, and cache GC coordinator lookup

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
…C coordinator lookup

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
…-port error handling, and injectDependencyEndpoints signature

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
…ring, and CRD status

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
…est fields

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
@Abhinav-kodes Abhinav-kodes force-pushed the lfx-multi-agent-proposal branch from 538105c to cf2dc75 Compare May 20, 2026 17:11
@Abhinav-kodes
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a design proposal for MultiAgentRuntime, a new custom resource for AgentCube that enables declarative orchestration of collaborating agent groups. The design includes support for topological startup ordering, atomic creation and rollback, per-role warm pools, and stable inter-agent communication using Headless Services. The review feedback highlights three key areas for improvement: ensuring Headless Service selectors use stable identifiers (GroupSessionID and Role) to support self-healing, utilizing atomic store operations like Redis Lua scripts for manifest cleanup to prevent race conditions during garbage collection, and optimizing the parallel creation logic to reduce mutex contention when injecting dependency endpoints.

Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Comment thread docs/design/multi-agent-runtime-proposal.md Outdated
Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
@Abhinav-kodes
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a design proposal for MultiAgentRuntime, a new custom resource for AgentCube that enables declarative orchestration of multiple collaborating agents. The design covers CRD specifications, topological startup ordering, stable DNS-based endpoint injection via Headless Services, and group-aware garbage collection. Feedback was provided regarding the limitations of the ValidatingAdmissionWebhook in predicting runtime-generated service names and the necessity of using Redis Lua scripts for atomic role deletion to prevent race conditions.


**Validation against Naming Collisions:**
* Because multiple role names or port names could map to the same sanitized environment variable (e.g., `my-agent` and `my.agent` both sanitize to `AGENTCUBE_DEP_MY_AGENT_ENDPOINT`), the API server validates the group configuration at request admission time. If any two roles or named ports within the group result in the same sanitized environment variable key, the request is rejected with a `400 Bad Request` validation error.
* The `ValidatingAdmissionWebhook` also explicitly checks for **Service name collisions after truncation**: after computing `mar-{shortHash}-{roleNameSanitized-truncated-stripped}` for each role, if any two roles in the same group produce an identical service name, the request is rejected. This prevents the edge case where two roles whose names are identical in their first 50 characters would silently share a single Headless Service.
Contributor

medium

The ValidatingAdmissionWebhook cannot compute the final Service name because the groupSessionID (which contains a random UUID) is only generated at runtime during the API request, not at resource admission time. The webhook should instead focus on validating that role names within the same MultiAgentRuntime spec do not collide after sanitization and truncation.
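A spec-only check along these lines might look like the Go sketch below. It needs no `groupSessionID`, so it could run at admission time. The exact sanitization rules are assumptions inferred from the `my-agent` / `my.agent` example, and `ValidateRoleNames` is a hypothetical helper, not an existing function.

```go
package main

import (
	"fmt"
	"strings"
)

// maxSuffixLen is the assumed cap on the role part of a service name.
const maxSuffixLen = 50

// normalize lowercases s and maps every character outside [a-z0-9] to sep.
func normalize(s string, sep byte) string {
	var b strings.Builder
	for _, r := range strings.ToLower(s) {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') {
			b.WriteRune(r)
		} else {
			b.WriteByte(sep)
		}
	}
	return b.String()
}

// envKey derives the dependency environment variable name for a role.
func envKey(role string) string {
	return "AGENTCUBE_DEP_" + strings.ToUpper(normalize(role, '_')) + "_ENDPOINT"
}

// serviceSuffix derives the sanitized, truncated role part of a service name.
func serviceSuffix(role string) string {
	s := normalize(role, '-')
	if len(s) > maxSuffixLen {
		s = s[:maxSuffixLen]
	}
	return strings.TrimRight(s, "-")
}

// ValidateRoleNames rejects a spec in which two roles map to the same
// environment variable key or the same truncated service-name suffix.
func ValidateRoleNames(roles []string) error {
	seenEnv := make(map[string]string, len(roles))
	seenSvc := make(map[string]string, len(roles))
	for _, role := range roles {
		key, suffix := envKey(role), serviceSuffix(role)
		if prev, ok := seenEnv[key]; ok {
			return fmt.Errorf("roles %q and %q both map to %s", prev, role, key)
		}
		if prev, ok := seenSvc[suffix]; ok {
			return fmt.Errorf("roles %q and %q share service suffix %q", prev, role, suffix)
		}
		seenEnv[key] = role
		seenSvc[suffix] = role
	}
	return nil
}

func main() {
	fmt.Println(ValidateRoleNames([]string{"my-agent", "my.agent"}))
	fmt.Println(ValidateRoleNames([]string{"planner", "worker"}))
}
```

Because the hash prefix is identical for every role in a group, comparing only the role suffixes is enough to predict service-name collisions before the session ID exists.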

// To prevent race conditions during concurrent GC, the check-then-delete sequence
// (removing the role, and deleting the manifest if it was the last role) MUST be
// implemented using a Redis Lua script or transaction.
DeleteAgentGroupRole(ctx context.Context, groupSessionID, roleName string) error
Contributor

medium

To ensure atomicity and prevent race conditions where the manifest might be deleted while a concurrent process is adding a role, the DeleteAgentGroupRole implementation should use a Redis Lua script. This script should perform the HDEL and then check HLEN to decide whether to delete the entire key in a single atomic operation.
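A sketch of what that script could look like, embedded as a Go constant, is shown here together with an in-memory model of the same semantics. It assumes one hash per group with one field per role; the key layout, `memStore`, and the simplified method signature (no `context.Context`, a `bool` result) are assumptions for illustration and do not call a real Redis client.

```go
package main

import (
	"fmt"
	"sync"
)

// deleteRoleScript is Lua intended to run inside Redis/Valkey via EVAL.
// KEYS[1] is the group manifest hash (assumed layout), ARGV[1] the role name.
// The whole script runs atomically on the server.
const deleteRoleScript = `
local removed = redis.call('HDEL', KEYS[1], ARGV[1])
if redis.call('HLEN', KEYS[1]) == 0 then
  redis.call('DEL', KEYS[1])
end
return removed
`

// memStore is an in-memory stand-in that mirrors the script's semantics so
// the behaviour can be exercised without a Redis server.
type memStore struct {
	mu     sync.Mutex
	groups map[string]map[string]string // groupSessionID -> role -> sandbox ID
}

func newMemStore() *memStore {
	return &memStore{groups: make(map[string]map[string]string)}
}

// DeleteAgentGroupRole removes one role and drops the manifest when it was
// the last role. The mutex plays the part of the script's atomicity.
func (s *memStore) DeleteAgentGroupRole(groupSessionID, roleName string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	roles, ok := s.groups[groupSessionID]
	if !ok {
		return false
	}
	if _, ok := roles[roleName]; !ok {
		return false
	}
	delete(roles, roleName)
	if len(roles) == 0 {
		delete(s.groups, groupSessionID)
	}
	return true
}

func main() {
	s := newMemStore()
	s.groups["g1"] = map[string]string{"planner": "sb-1", "worker": "sb-2"}
	fmt.Println(s.DeleteAgentGroupRole("g1", "worker"))
	fmt.Println(s.DeleteAgentGroupRole("g1", "planner"))
	_, exists := s.groups["g1"]
	fmt.Println("manifest still present:", exists)
}
```

Redis removes a hash key on its own once the last field is deleted, so the explicit `HLEN` and `DEL` step mainly matters if the manifest hash also carries non-role fields or if a companion key has to be cleaned up in the same step; the thread does not specify which layout applies.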

Labels

do-not-merge/work-in-progress, kind/documentation, size/XXL

Development

Successfully merging this pull request may close these issues.

[lfx-mentorship-2026-June-August] Support multi-AgentCube Capability

3 participants