Skip to content

feat: add session observability GET endpoints and fix flaky JWT tests#331

Open
Abhinav-kodes wants to merge 5 commits into
volcano-sh:mainfrom
Abhinav-kodes:feat-get-session-endpoints
Open

feat: add session observability GET endpoints and fix flaky JWT tests#331
Abhinav-kodes wants to merge 5 commits into
volcano-sh:mainfrom
Abhinav-kodes:feat-get-session-endpoints

Conversation

@Abhinav-kodes
Copy link
Copy Markdown
Contributor

@Abhinav-kodes Abhinav-kodes commented May 13, 2026

What type of PR is this?

/kind feature
/kind bug

What this PR does / why we need it:
This PR completes the CRUD operations for the Workload Manager API by introducing "Session Observability" endpoints, and fixes a critical test flakiness issue blocking CI pipelines.

Previously, the Workload Manager only supported creating (POST) and deleting (DELETE) sandboxes. If a client disconnected or lost the initial response, there was no native way to retrieve the entrypoints or check the status of a running session.

This PR adds:

  • GET /v1/agent-runtime/sessions/:sessionId (and /code-interpreter) to retrieve a specific session's details.
  • GET /v1/agent-runtime/sessions (and /code-interpreter) to list all active sessions.
  • A ListSandboxesByKind implementation in the store.Store interface, backed by non-blocking SCAN loops for both Valkey and Redis.
  • Authorization filtering to ensure authenticated users can only view and list sessions within their own namespaces. Note: Unauthorized cross-namespace access returns a 404 Not Found rather than a 403 Forbidden to prevent session ID enumeration.
  • Comprehensive unit tests covering the new handlers, data store queries, and authorization boundaries.

Bug Fixes Included:

  • Fixes the flaky TestGetPrivateKeyPEM unit test in pkg/router/jwt_test.go by replacing a direct deep-equal assertion on rsa.PrivateKey with component-wise mathematical comparisons. This prevents random CI failures caused by differences in unexported precomputed byte slice lengths during ASN.1 serialization/deserialization.

Which issue(s) this PR fixes:
Fixes #348

Special notes for your reviewer:
The ListSandboxesByKind implementations for Valkey and Redis utilize the SCAN command rather than KEYS to ensure the new GET /sessions endpoint does not block the storage engine. While this requires filtering by namespace in-memory, this O(total sessions) footprint is documented in the store logic (seenKeys). A proper secondary index scheme (e.g., Redis SET per namespace/kind) will be introduced in a future follow-up PR to provide O(1) namespace filtering, keeping this PR strictly focused on the observability endpoints.

Does this PR introduce a user-facing change?:

Added `GET /sessions` and `GET /sessions/:sessionId` endpoints to the Workload Manager API, allowing clients to list active sandboxes and retrieve connection entrypoints for existing sessions. Fixed a flaky test blocking CI pipelines.

Copilot AI review requested due to automatic review settings May 13, 2026 13:18
@volcano-sh-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yaozengzeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces functionality to retrieve individual sandboxes by session ID and list sandboxes filtered by kind. Changes include updates to the Store interface, Redis and Valkey implementations using SCAN for efficiency, and new HTTP handlers in the workload manager. Feedback highlights a potential failure in the Valkey implementation if keys expire during a scan, and suggests enforcing kind-based validation in the GET handler to prevent unauthorized cross-kind access. Additionally, indentation in the new handlers should be corrected to use tabs for consistency with Go standards.

Comment thread pkg/store/store_valkey.go Outdated
Comment thread pkg/workloadmanager/handlers.go Outdated
Comment thread pkg/workloadmanager/server.go Outdated
Comment thread pkg/workloadmanager/server.go Outdated
Comment thread pkg/workloadmanager/handlers_test.go Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds “session observability” read endpoints to the Workload Manager API so clients can retrieve a session by ID or list active sessions, backed by new store capabilities for enumerating sandboxes by kind.

Changes:

  • Added GET /v1/{agent-runtime|code-interpreter}/sessions and GET /v1/{agent-runtime|code-interpreter}/sessions/:sessionId routes and handlers in Workload Manager.
  • Extended store.Store with ListSandboxesByKind and implemented it for Redis + Valkey using SCAN.
  • Added/updated unit tests for the new handlers and store methods, plus updated fakes to satisfy the new interface.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
pkg/workloadmanager/server.go Registers new GET session routes for agent-runtime and code-interpreter APIs.
pkg/workloadmanager/handlers.go Implements handleGetSandbox and handleListSandboxes with optional auth namespace filtering.
pkg/workloadmanager/handlers_test.go Adds unit tests for the new GET handlers and updates fake store.
pkg/workloadmanager/garbage_collection_test.go Updates nopStore to satisfy the extended store interface.
pkg/store/interface.go Adds ListSandboxesByKind to the Store interface.
pkg/store/store_redis.go Implements Redis ListSandboxesByKind via SCAN + batch load + kind filter.
pkg/store/store_redis_test.go Adds a unit test for Redis ListSandboxesByKind.
pkg/store/store_valkey.go Implements Valkey ListSandboxesByKind via SCAN + batch load + kind filter.
pkg/store/store_valkey_test.go Adds a unit test for Valkey ListSandboxesByKind.
pkg/router/session_manager_test.go Updates router test fake store to satisfy the extended store interface.

Comment thread pkg/store/store_redis.go
Comment thread pkg/store/store_valkey.go
Comment thread pkg/workloadmanager/handlers.go Outdated
Comment thread pkg/store/store_redis_test.go
Comment thread pkg/store/store_valkey_test.go
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 13, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 83.20000% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.86%. Comparing base (524e55e) to head (0b47b4f).
⚠️ Report is 82 commits behind head on main.

Files with missing lines Patch % Lines
pkg/store/store_valkey.go 74.35% 5 Missing and 5 partials ⚠️
pkg/store/store_redis.go 79.31% 3 Missing and 3 partials ⚠️
pkg/workloadmanager/server.go 0.00% 4 Missing ⚠️
pkg/workloadmanager/handlers.go 97.91% 1 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #331      +/-   ##
==========================================
+ Coverage   47.57%   50.86%   +3.29%     
==========================================
  Files          30       30              
  Lines        2819     3002     +183     
==========================================
+ Hits         1341     1527     +186     
+ Misses       1338     1314      -24     
- Partials      140      161      +21     
Flag Coverage Δ
unittests 50.86% <83.20%> (+3.29%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Abhinav-kodes
Copy link
Copy Markdown
Contributor Author

/gemini review

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@Abhinav-kodes Abhinav-kodes requested a review from Copilot May 13, 2026 14:14
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements functionality to retrieve and list sandboxes by kind, adding new API endpoints and supporting logic in the Redis and Valkey storage backends. Feedback focuses on performance optimizations, specifically recommending secondary indices to replace O(N) scans when listing sandboxes, improving slice pre-allocation for memory efficiency, and strengthening result-length validation during batch operations.

Comment thread pkg/store/store_redis.go
Comment thread pkg/store/store_valkey.go Outdated
Comment thread pkg/store/store_valkey.go
Comment thread pkg/workloadmanager/handlers.go Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 15 changed files in this pull request and generated 2 comments.

Comment thread pkg/store/store_redis.go
Comment thread pkg/store/store_valkey.go
Copilot AI review requested due to automatic review settings May 13, 2026 14:35
@Abhinav-kodes Abhinav-kodes force-pushed the feat-get-session-endpoints branch from ea991fe to 9cd1422 Compare May 13, 2026 14:37
@Abhinav-kodes
Copy link
Copy Markdown
Contributor Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the ability to list and retrieve sandboxes by their kind across Redis and Valkey storage backends, adding new API endpoints and namespace-based authorization. Review feedback suggests using non-blocking SSCAN instead of SMembers to prevent performance issues with large datasets and recommends logging errors during secondary index cleanup in DeleteSandboxBySessionID to avoid orphaned session IDs.

Comment thread pkg/store/store_redis.go
Comment thread pkg/store/store_valkey.go
Comment thread pkg/store/store_redis.go Outdated
Comment thread pkg/store/store_valkey.go Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements functionality to retrieve and list sandboxes by kind for both Agent Runtime and Code Interpreter services. It introduces a new ListSandboxesByKind method to the storage interface with implementations for Redis and Valkey, and adds corresponding API endpoints with namespace-based authorization filtering. Feedback focused on potential scalability issues with the current SCAN-based filtering approach in the storage layer and suggested a more efficient in-place filtering implementation for the list handler to reduce memory allocations.

Comment thread pkg/store/store_redis.go
Comment thread pkg/store/store_valkey.go
Comment thread pkg/workloadmanager/handlers.go Outdated
@Abhinav-kodes Abhinav-kodes force-pushed the feat-get-session-endpoints branch from 0fcba68 to b35599d Compare May 13, 2026 15:43
@Abhinav-kodes
Copy link
Copy Markdown
Contributor Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces functionality to list and retrieve sandboxes by kind and session ID across the storage and API layers. The Store interface is expanded with ListSandboxesByKind, implemented for both Redis and Valkey using efficient SCAN operations to prevent blocking. New REST endpoints are added to the workload manager to support these operations, incorporating namespace-based filtering when authentication is enabled. Review feedback suggests removing redundant nil checks in the Redis and Valkey store implementations, as the batch loading logic already ensures non-nil results.

Comment thread pkg/store/store_redis.go Outdated
Comment thread pkg/store/store_valkey.go Outdated
Copilot AI review requested due to automatic review settings May 13, 2026 16:09
@Abhinav-kodes Abhinav-kodes force-pushed the feat-get-session-endpoints branch from b35599d to ae72e13 Compare May 13, 2026 16:09
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@Abhinav-kodes
Copy link
Copy Markdown
Contributor Author

@LiZhenCheng9527 @hzxuzhonghu @YaoZengzeng
I have applied all the ai assistant's suggestions , but SCAN-based approach is O(N) over all session keys and won't scale well with millions of sandboxes as pointed out by @gemini-code-assist

a dedicated Redis SET per kind (session:kind:agent-runtime) is the approach I'd lean towards, as it provides O(K) lookups (where K = sandboxes of that kind) without requiring a key schema migration.

However, implementing a secondary index requires changes to StoreSandbox and DeleteSandboxBySessionID to maintain the index (SADD on create, SREM on delete), which is outside the scope of this PR.
I'll track this as a follow-up optimization.

For the current scale of the project, the SCAN-based approach with per-batch filtering (100 keys at a time) should be sufficient.

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Thank you for the update and for considering the scalability implications. Your plan to track the secondary index implementation as a follow-up optimization is reasonable, especially given the current scale of the project. The use of SCAN with batch filtering is a pragmatic approach to avoid blocking the storage engine while still providing the requested functionality. I look forward to seeing the secondary index implementation in a future PR.

Copy link
Copy Markdown
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good addition — the missing GET endpoints were a real gap. Using SCAN instead of KEYS is the right call for production Redis. The namespace-scoped authorization filtering is also important here.

Comment thread pkg/store/store_redis.go
@Abhinav-kodes
Copy link
Copy Markdown
Contributor Author

@hzxuzhonghu Thanks for pointing that out!
I've added a comment in both the Redis and Valkey store implementations to explicitly document this behavior. The comment notes that seenKeys is used for cross-page deduplication (since SCAN can return the same key multiple times) and that for extremely large datasets, implementing a dedicated secondary index (like a SET per kind) is recommended.
Once this PR is merged, I plan to open a follow-up PR to implement that secondary index optimization so we don't have to rely on O(N) scanning for these endpoints long-term.
Could you please take another look when you have a moment?

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 16 changed files in this pull request and generated 2 comments.

Comment thread pkg/workloadmanager/handlers.go Outdated
Comment thread pkg/workloadmanager/handlers.go
@Abhinav-kodes Abhinav-kodes changed the title feat: add session observability GET endpoints for sandboxes feat: add session observability GET endpoints and fix flaky JWT tests May 19, 2026
@acsoto
Copy link
Copy Markdown
Member

acsoto commented May 23, 2026

What is the intended auth boundary here: namespace-scoped or service-account-scoped?

@Abhinav-kodes
Copy link
Copy Markdown
Contributor Author

@acsoto

The current behavior is namespace-scoped, not per-service-account. handleGetSandbox and handleListSandboxes only check sandboxInfo.SandboxNamespace == userNamespace, so any authenticated service account in the same namespace can read another service account's session entrypoints. The comment block in auth.go (lines 52–57) actually states the intended model should be per-service-account ownership ("Users can only list/access sandboxes they created"), so there's a mismatch.

The reason the GET handlers don't enforce per-SA ownership is that SandboxInfo has no CreatedBy field, the creator's service account name is available in the request context during POST but is never persisted into the store by buildSandboxPlaceHolder or buildSandboxInfo.

The DELETE handler sidesteps this because it delegates enforcement to Kubernetes RBAC via the user's own dynamicClient; the Kubernetes API itself rejects cross-account deletes. For GET, however, we read directly from the store, so we need the creator identity stored in-band.

My proposed fix (can be done in this PR or a follow-up):

  • Add CreatedBy string \json:"createdBy" toSandboxInfo
  • Thread the serviceAccountName from handleSandboxCreate down into buildSandboxPlaceHolder and stamp it on the stored object
  • In handleGetSandbox and handleListSandboxes, add a second check:
    sandboxInfo.CreatedBy == serviceAccountName (from extractUserInfo)

If the consensus is that namespace-scoped access is the correct model for these read endpoints (e.g. in a multi-tenant environment where all service accounts in a namespace are considered equivalent), I'm happy to update auth.go's comment block to reflect that and document the boundary explicitly in the PR description.

But given the existing documented intent, I believe the stricter per-service-account model is the correct one.

I'll wait for confirmation on which direction you'd like before pushing a patch.

@acsoto
Copy link
Copy Markdown
Member

acsoto commented May 23, 2026

Thanks for clarifying.

Copy link
Copy Markdown
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-structured observability endpoints with proper access control.

pkg/workloadmanager/handlers.go

  • Lines 310-335 (handleGetSandbox): Good - returns 404 instead of 403 for namespace mismatch to avoid information leakage. The kind check on line 326 prevents cross-kind session lookups.
  • Lines 338-361 (handleListSandboxes): The in-place filter (filtered := sandboxes[:0]) avoids a heap allocation. Good optimization.

pkg/store/store_redis.go

  • Lines 146-200 (ListSandboxesByKind): Uses SCAN with count=100 to avoid blocking Redis. The seenKeys map for deduplication is necessary since SCAN can return duplicates across cursor pages. The comment about memory impact on large datasets is appreciated - a secondary index per kind (Redis SET) would be the right follow-up optimization for production scale.

pkg/store/store_valkey.go

  • Lines 114-133: The loadSandboxesBySessionIDs fix properly handles nil MGET entries by checking msg.IsNil() before calling ToString(). This fixes a potential panic when keys are evicted between SCAN and MGET. Good.
  • Lines 134-136: Changed the size check from > len(sessionIDKeys) to != len(sessionIDKeys) - this is more correct since MGET always returns exactly one entry per key.
  • Lines 178-230: ValKey ListSandboxesByKind mirrors the Redis implementation correctly.

pkg/store/store_redis_test.go

  • Lines 68-69: The index key prefix change from sandbox: to session: is a consistency fix. Make sure no deployed systems depend on the old prefix, or add a migration note.

pkg/store/store_valkey_test.go

  • Lines 394-437 (TestValkeyStore_LoadSandboxesBySessionIDs_OrphanedZSetEntry): Excellent test - simulates key eviction between SCAN and MGET. This is a real-world scenario in Redis/Valkey with memory policies.

pkg/router/jwt_test.go

  • Lines 186-193: Comparing RSA keys mathematically (N, E, D, Primes) instead of by object equality fixes the flaky test. The root cause was likely non-deterministic precomputed fields in the RSA key struct.

LGTM.

Copilot AI review requested due to automatic review settings May 25, 2026 10:27
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
…espace mismatch

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
@Abhinav-kodes Abhinav-kodes force-pushed the feat-get-session-endpoints branch from 3563620 to 0b47b4f Compare May 25, 2026 16:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky TestGetPrivateKeyPEM in pkg/router due to direct deep equal assertion on rsa.PrivateKey

6 participants