feat(picod): implement two-stage secure initialization to isolate sandbox sessions#352

Open
Abhinav-kodes wants to merge 14 commits into
volcano-sh:mainfrom
Abhinav-kodes:feat/picod-two-stage-init

Conversation

@Abhinav-kodes
Contributor

What type of PR is this?

/kind feature
/kind security

What this PR does / why we need it:
This PR implements Two-Stage Secure Initialization for PicoD to resolve a critical security vulnerability regarding cross-sandbox token replays.

Previously, under the Plain Authentication design, all PicoD sandboxes verified JWTs using the same shared public key injected via the PICOD_AUTH_PUBLIC_KEY environment variable. This meant a valid JWT issued for one sandbox could theoretically be replayed against another sandbox, breaking tenant isolation.

This PR introduces a cryptographically isolated two-stage handshake:

  1. Bootstrap Stage: Workload Manager generates a Bootstrap Key Pair and injects the PICOD_BOOTSTRAP_PUBLIC_KEY into the PicoD container environment. PicoD loads this at startup.
  2. Session Stage (Initialization): Workload Manager generates a unique Session Key Pair for each sandbox. It calls the new POST /init endpoint on PicoD with an init_jwt (signed by the Bootstrap Private Key) that carries the session public key in its session_public_key claim. PicoD verifies the token against the bootstrap public key, extracts the session key, and stores it; once set, the key cannot be overwritten.

All subsequent user requests to that sandbox must be signed by the unique Session Private Key. This guarantees strict cryptographic isolation between sandboxes while maintaining our fast-startup serverless architecture.
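To illustrate the isolation property claimed here, the following sketch models two sandboxes that each trust a different session public key. The `sandbox` type and the `sign` helper are hypothetical stand-ins, and raw RSA signatures replace full JWTs to keep the example short.

```go
package main

import (
	"crypto"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"fmt"
)

// sandbox is a hypothetical stand-in for one PicoD instance and the key it trusts.
type sandbox struct {
	name    string
	trusted *rsa.PublicKey
}

// accepts reports whether the sandbox would accept this signed request.
func (s sandbox) accepts(payload, sig []byte) bool {
	digest := sha256.Sum256(payload)
	return rsa.VerifyPKCS1v15(s.trusted, crypto.SHA256, digest[:], sig) == nil
}

// sign signs payload with the given private key (RSASSA-PKCS1-v1_5 over SHA-256).
func sign(key *rsa.PrivateKey, payload []byte) []byte {
	digest := sha256.Sum256(payload)
	sig, err := rsa.SignPKCS1v15(rand.Reader, key, crypto.SHA256, digest[:])
	if err != nil {
		panic(err)
	}
	return sig
}

// newKey generates a 2048-bit RSA key or panics; fine for a demo.
func newKey() *rsa.PrivateKey {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		panic(err)
	}
	return key
}

func main() {
	keyA, keyB := newKey(), newKey()
	a := sandbox{name: "sandbox-a", trusted: &keyA.PublicKey}
	b := sandbox{name: "sandbox-b", trusted: &keyB.PublicKey}

	payload := []byte(`{"cmd":"ls"}`)
	sig := sign(keyA, payload) // a request issued for sandbox A

	fmt.Println(a.name, "accepts:", a.accepts(payload, sig))
	fmt.Println(b.name, "accepts:", b.accepts(payload, sig)) // replay against B
}
```

Under the earlier shared-key model both sandboxes would hold the same `trusted` key, so the replayed request would verify on both.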

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:
This implementation realizes the architecture outlined in docs/design/agentcube-proposal.md under section 5.2 (Picod Workflow). Note that the older PicoD-Plain-Authentication-Design.md document is now slightly outdated regarding this specific bootstrap flow.

Does this PR introduce a user-facing change?:

Implemented two-stage secure initialization for PicoD. Sandboxes are now cryptographically isolated using dynamic session keys to prevent cross-sandbox token replay vulnerabilities.

Copilot AI review requested due to automatic review settings May 19, 2026 12:44
@volcano-sh-bot
Contributor

@Abhinav-kodes: The label(s) kind/security cannot be applied, because the repository doesn't have them.


Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements a two-stage authentication mechanism for PicoD sandboxes by introducing an /init endpoint and bootstrap key generation. The reviewer identified a critical security flaw where the router's shared key is incorrectly reused for session initialization instead of a unique session key. Additionally, the review points out that the session key can be overwritten, suggests refactoring duplicated PEM parsing logic, highlights the need for comprehensive unit tests for the new initialization flow, and recommends making the HTTP client timeout configurable.

Comment thread pkg/workloadmanager/sandbox_helper.go Outdated
Comment thread pkg/picod/auth.go
Comment thread pkg/picod/auth.go Outdated
Comment thread pkg/picod/server.go
Comment thread pkg/workloadmanager/sandbox_helper.go Outdated
@codecov-commenter

codecov-commenter commented May 19, 2026

Codecov Report

❌ Patch coverage is 44.47514% with 201 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.36%. Comparing base (524e55e) to head (4bc163c).
⚠️ Report is 82 commits behind head on main.

Files with missing lines                  Patch %   Lines
pkg/workloadmanager/bootstrap_auth.go     24.70%    61 Missing and 3 partials ⚠️
pkg/router/jwt.go                         3.57%     27 Missing ⚠️
pkg/store/store_redis.go                  0.00%     20 Missing ⚠️
pkg/store/store_valkey.go                 14.28%    18 Missing ⚠️
pkg/picod/auth.go                         75.38%    8 Missing and 8 partials ⚠️
pkg/router/handlers.go                    0.00%     14 Missing ⚠️
pkg/workloadmanager/server.go             21.42%    11 Missing ⚠️
pkg/workloadmanager/sandbox_helper.go     77.27%    5 Missing and 5 partials ⚠️
pkg/workloadmanager/handlers.go           75.67%    5 Missing and 4 partials ⚠️
pkg/workloadmanager/workload_builder.go   0.00%     7 Missing ⚠️
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #352      +/-   ##
==========================================
+ Coverage   47.57%   48.36%   +0.79%     
==========================================
  Files          30       31       +1     
  Lines        2819     3186     +367     
==========================================
+ Hits         1341     1541     +200     
- Misses       1338     1475     +137     
- Partials      140      170      +30     
Flag Coverage Δ
unittests 48.36% <44.47%> (+0.79%) ⬆️

Copilot AI review requested due to automatic review settings May 19, 2026 13:28
@Abhinav-kodes
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements a two-stage authentication and initialization workflow for PicoD. It introduces bootstrap keys to verify an initial setup phase that establishes unique session-specific keys for each sandbox. The workload manager has been updated to generate these ephemeral keys and perform remote initialization of PicoD instances, while the router now supports signing JWTs with per-session private keys. Reviewer feedback identifies performance bottlenecks related to frequent RSA key parsing and redundant HTTP client instantiation, and provides corrections for the golang-jwt/jwt/v5 library usage to ensure proper claim enforcement.

Comment thread pkg/router/jwt.go Outdated
Comment thread pkg/workloadmanager/sandbox_helper.go Outdated
Comment thread pkg/picod/auth.go Outdated
Comment thread pkg/picod/auth.go
@Abhinav-kodes Abhinav-kodes force-pushed the feat/picod-two-stage-init branch from d610e39 to 1672739 Compare May 19, 2026 14:16
@Abhinav-kodes Abhinav-kodes requested a review from Copilot May 19, 2026 14:18
@Abhinav-kodes
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements a two-stage authentication and initialization flow for PicoD, introducing a bootstrap public key and an /init endpoint for session-specific key registration. The changes include updates to the Workload Manager to generate and distribute these keys, and modifications to the Router to support session-specific JWT signing. Feedback includes concerns regarding a potential memory leak in the JWTManager due to an unbounded cache, the use of global variables in bootstrap_auth.go which hinders testability and thread safety, and a recommendation to improve security by verifying the issuer claim in the bootstrap JWT.

Comment thread pkg/router/jwt.go Outdated
Comment thread pkg/workloadmanager/bootstrap_auth.go Outdated
Comment thread pkg/picod/auth.go Outdated

@Abhinav-kodes
Contributor Author

/gemini review

@Abhinav-kodes Abhinav-kodes requested a review from Copilot May 19, 2026 15:02

@acsoto
Member

acsoto commented May 23, 2026

Actually this reintroduces the old two-stage PicoD initialization flow from the outdated proposal. The current implementation was intentionally simplified in #115 to use startup-time key injection, and #136 removed the remaining Workload Manager NeedInitialization flow. This /init call brings back the extra post-ready dependency and failure mode that those PRs removed.

@Abhinav-kodes
Contributor Author

Hi @acsoto
After looking back at #115 and #136, this flow is actually pretty different from the old initialization path that was removed. Those earlier changes removed PicoD’s /init + NeedInitialization flow because it was being used to push runtime config like workspace paths and user data, and PicoD could block on it indefinitely.

In this PR, /init is only used for a one-time cryptographic key exchange after waitForSandboxEntryPointsReady succeeds. There’s no polling loop, no persistent initialization flag/state, and the request is bounded with a 5s timeout. If it fails, sandbox creation rolls back through the normal prepareSandbox error path.

The main reason for adding it is to close a security gap in the current shared-key model: today all PicoD pods trust the same PICOD_AUTH_PUBLIC_KEY, which means a JWT issued for one sandbox could be replayed against another. The per-session keypair exchange here prevents that.

Member

@hzxuzhonghu hzxuzhonghu left a comment

Important security improvement - two-stage initialization isolates bootstrap from session auth.

pkg/picod/auth.go

  • Lines 36-38 (ErrAlreadyInitialized): Good sentinel error to prevent re-initialization attacks.
  • Lines 41-42: Renaming PICOD_AUTH_PUBLIC_KEY to PICOD_BOOTSTRAP_PUBLIC_KEY is a breaking change for existing deployments. Add a migration note or support the old env var name as a fallback.
  • Lines 59-76 (parseRSAPublicKeyFromPEM): Extracted helper is cleaner than duplicating PEM parsing. Good.
  • Lines 79-92 (LoadBootstrapPublicKey): Same logic as before, just renamed. Clean.
  • Lines 95-113 (SetSessionPublicKey): The initialized flag with mutex protection prevents race conditions. Once set, the session key cannot be overwritten. This is the core security invariant.
  • Lines 116-145 (VerifyBootstrapJWT): Good - validates issuer (agentcube-workload-manager), requires expiration, and extracts the session_public_key claim. The 1-minute leeway is reasonable for clock skew.
  • Lines 149-162 (AuthMiddleware): Returning 503 when not initialized is correct - it tells the client to retry. Make sure health/liveness probes are not routed through this middleware, otherwise the pod will be killed before init completes.

pkg/common/types/sandbox.go

  • Lines 41-44 (SessionPrivateKey): Using json:"-" to exclude the private key from serialization is critical. The comment explains the transient-only nature well. Good.

pkg/picod/auth_test.go

  • Tests properly updated to use two-stage init flow (LoadBootstrapPublicKey + SetSessionPublicKey). Good coverage.

pkg/picod/execute_test.go

  • Line 66: Test setup now includes SetSessionPublicKey after server creation. Consistent with the new flow.

LGTM. The env var rename needs migration documentation.
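One possible shape for the fallback suggested in the review is sketched below. It assumes the old variable should still be honored with a deprecation warning, which the PR author has not confirmed; `bootstrapKeyPEM` and its injected `lookup` parameter are hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

const (
	envBootstrapKey = "PICOD_BOOTSTRAP_PUBLIC_KEY"
	envLegacyKey    = "PICOD_AUTH_PUBLIC_KEY" // old name, kept only as a fallback
)

// bootstrapKeyPEM returns the key material and the variable it came from.
// lookup has the signature of os.LookupEnv so tests can inject a fake environment.
func bootstrapKeyPEM(lookup func(string) (string, bool)) (value, source string, err error) {
	for _, name := range []string{envBootstrapKey, envLegacyKey} {
		if v, ok := lookup(name); ok && v != "" {
			return v, name, nil
		}
	}
	return "", "", fmt.Errorf("neither %s nor %s is set", envBootstrapKey, envLegacyKey)
}

func main() {
	_, source, err := bootstrapKeyPEM(os.LookupEnv)
	if err != nil {
		fmt.Println("startup error:", err)
		return
	}
	if source == envLegacyKey {
		fmt.Println("warning:", envLegacyKey, "is deprecated; use", envBootstrapKey)
	}
	fmt.Println("loaded bootstrap key from", source)
}
```

Whether a fallback is acceptable at all is a design decision: under the old name the variable carried the shared key that this PR is moving away from.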

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hzxuzhonghu, yaozengzeng for approval. For more information see the Kubernetes Code Review Process.

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
Copilot AI review requested due to automatic review settings May 25, 2026 12:54
@Abhinav-kodes Abhinav-kodes force-pushed the feat/picod-two-stage-init branch from 31004cf to 4bc163c Compare May 25, 2026 12:54

6 participants