feat: add session observability GET endpoints and fix flaky JWT tests by Abhinav-kodes · Pull Request #331 · volcano-sh/agentcube

Abhinav-kodes · 2026-05-13T13:18:19Z

What type of PR is this?

/kind feature
/kind bug

What this PR does / why we need it:
This PR completes the CRUD operations for the Workload Manager API by introducing "Session Observability" endpoints, and fixes a critical test flakiness issue blocking CI pipelines.

Previously, the Workload Manager only supported creating (POST) and deleting (DELETE) sandboxes. If a client disconnected or lost the initial response, there was no native way to retrieve the entrypoints or check the status of a running session.

This PR adds:

GET /v1/agent-runtime/sessions/:sessionId (and /code-interpreter) to retrieve a specific session's details.
GET /v1/agent-runtime/sessions (and /code-interpreter) to list all active sessions.
A ListSandboxesByKind implementation in the store.Store interface, backed by non-blocking SCAN loops for both Valkey and Redis.
Authorization filtering to ensure authenticated users can only view and list sessions within their own namespaces. Note: Unauthorized cross-namespace access returns a 404 Not Found rather than a 403 Forbidden to prevent session ID enumeration.
Comprehensive unit tests covering the new handlers, data store queries, and authorization boundaries.

Bug Fixes Included:

Fixes the flaky TestGetPrivateKeyPEM unit test in pkg/router/jwt_test.go by replacing a direct deep-equal assertion on rsa.PrivateKey with component-wise mathematical comparisons. This prevents random CI failures caused by differences in unexported precomputed byte slice lengths during ASN.1 serialization/deserialization.

Which issue(s) this PR fixes:
Fixes #348

Special notes for your reviewer:
The ListSandboxesByKind implementations for Valkey and Redis utilize the SCAN command rather than KEYS to ensure the new GET /sessions endpoint does not block the storage engine. While this requires filtering by namespace in-memory, this O(total sessions) footprint is documented in the store logic (seenKeys). A proper secondary index scheme (e.g., Redis SET per namespace/kind) will be introduced in a future follow-up PR to provide O(1) namespace filtering, keeping this PR strictly focused on the observability endpoints.

Does this PR introduce a user-facing change?:

Added `GET /sessions` and `GET /sessions/:sessionId` endpoints to the Workload Manager API, allowing clients to list active sandboxes and retrieve connection entrypoints for existing sessions. Fixed a flaky test blocking CI pipelines.

volcano-sh-bot · 2026-05-13T13:18:27Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yaozengzeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

pkg/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gemini-code-assist

Code Review

This pull request introduces functionality to retrieve individual sandboxes by session ID and list sandboxes filtered by kind. Changes include updates to the Store interface, Redis and Valkey implementations using SCAN for efficiency, and new HTTP handlers in the workload manager. Feedback highlights a potential failure in the Valkey implementation if keys expire during a scan, and suggests enforcing kind-based validation in the GET handler to prevent unauthorized cross-kind access. Additionally, indentation in the new handlers should be corrected to use tabs for consistency with Go standards.

Copilot

Pull request overview

Adds “session observability” read endpoints to the Workload Manager API so clients can retrieve a session by ID or list active sessions, backed by new store capabilities for enumerating sandboxes by kind.

Changes:

Added GET /v1/{agent-runtime|code-interpreter}/sessions and GET /v1/{agent-runtime|code-interpreter}/sessions/:sessionId routes and handlers in Workload Manager.
Extended store.Store with ListSandboxesByKind and implemented it for Redis + Valkey using SCAN.
Added/updated unit tests for the new handlers and store methods, plus updated fakes to satisfy the new interface.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
pkg/workloadmanager/server.go	Registers new GET session routes for agent-runtime and code-interpreter APIs.
pkg/workloadmanager/handlers.go	Implements `handleGetSandbox` and `handleListSandboxes` with optional auth namespace filtering.
pkg/workloadmanager/handlers_test.go	Adds unit tests for the new GET handlers and updates fake store.
pkg/workloadmanager/garbage_collection_test.go	Updates nopStore to satisfy the extended store interface.
pkg/store/interface.go	Adds `ListSandboxesByKind` to the `Store` interface.
pkg/store/store_redis.go	Implements Redis `ListSandboxesByKind` via `SCAN` + batch load + kind filter.
pkg/store/store_redis_test.go	Adds a unit test for Redis `ListSandboxesByKind`.
pkg/store/store_valkey.go	Implements Valkey `ListSandboxesByKind` via `SCAN` + batch load + kind filter.
pkg/store/store_valkey_test.go	Adds a unit test for Valkey `ListSandboxesByKind`.
pkg/router/session_manager_test.go	Updates router test fake store to satisfy the extended store interface.

codecov-commenter · 2026-05-13T13:26:17Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 83.20000% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.86%. Comparing base (524e55e) to head (0b47b4f).
⚠️ Report is 82 commits behind head on main.

Files with missing lines	Patch %	Lines
pkg/store/store_valkey.go	74.35%	5 Missing and 5 partials ⚠️
pkg/store/store_redis.go	79.31%	3 Missing and 3 partials ⚠️
pkg/workloadmanager/server.go	0.00%	4 Missing ⚠️
pkg/workloadmanager/handlers.go	97.91%	1 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #331      +/-   ##
==========================================
+ Coverage   47.57%   50.86%   +3.29%     
==========================================
  Files          30       30              
  Lines        2819     3002     +183     
==========================================
+ Hits         1341     1527     +186     
+ Misses       1338     1314      -24     
- Partials      140      161      +21

Flag	Coverage Δ
unittests	`50.86% <83.20%> (+3.29%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Abhinav-kodes · 2026-05-13T14:11:04Z

/gemini review

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

gemini-code-assist

Code Review

This pull request implements functionality to retrieve and list sandboxes by kind, adding new API endpoints and supporting logic in the Redis and Valkey storage backends. Feedback focuses on performance optimizations, specifically recommending secondary indices to replace O(N) scans when listing sandboxes, improving slice pre-allocation for memory efficiency, and strengthening result-length validation during batch operations.

Copilot

Pull request overview

Copilot reviewed 10 out of 15 changed files in this pull request and generated 2 comments.

Abhinav-kodes · 2026-05-13T14:38:17Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces the ability to list and retrieve sandboxes by their kind across Redis and Valkey storage backends, adding new API endpoints and namespace-based authorization. Review feedback suggests using non-blocking SSCAN instead of SMembers to prevent performance issues with large datasets and recommends logging errors during secondary index cleanup in DeleteSandboxBySessionID to avoid orphaned session IDs.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

gemini-code-assist

Code Review

This pull request implements functionality to retrieve and list sandboxes by kind for both Agent Runtime and Code Interpreter services. It introduces a new ListSandboxesByKind method to the storage interface with implementations for Redis and Valkey, and adds corresponding API endpoints with namespace-based authorization filtering. Feedback focused on potential scalability issues with the current SCAN-based filtering approach in the storage layer and suggested a more efficient in-place filtering implementation for the list handler to reduce memory allocations.

Abhinav-kodes · 2026-05-13T15:56:07Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces functionality to list and retrieve sandboxes by kind and session ID across the storage and API layers. The Store interface is expanded with ListSandboxesByKind, implemented for both Redis and Valkey using efficient SCAN operations to prevent blocking. New REST endpoints are added to the workload manager to support these operations, incorporating namespace-based filtering when authentication is enabled. Review feedback suggests removing redundant nil checks in the Redis and Valkey store implementations, as the batch loading logic already ensures non-nil results.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Abhinav-kodes · 2026-05-13T16:21:19Z

@LiZhenCheng9527 @hzxuzhonghu @YaoZengzeng
I have applied all the ai assistant's suggestions , but SCAN-based approach is O(N) over all session keys and won't scale well with millions of sandboxes as pointed out by @gemini-code-assist

a dedicated Redis SET per kind (session:kind:agent-runtime) is the approach I'd lean towards, as it provides O(K) lookups (where K = sandboxes of that kind) without requiring a key schema migration.

However, implementing a secondary index requires changes to StoreSandbox and DeleteSandboxBySessionID to maintain the index (SADD on create, SREM on delete), which is outside the scope of this PR.
I'll track this as a follow-up optimization.

For the current scale of the project, the SCAN-based approach with per-batch filtering (100 keys at a time) should be sufficient.

gemini-code-assist · 2026-05-13T16:22:34Z

Thank you for the update and for considering the scalability implications. Your plan to track the secondary index implementation as a follow-up optimization is reasonable, especially given the current scale of the project. The use of SCAN with batch filtering is a pragmatic approach to avoid blocking the storage engine while still providing the requested functionality. I look forward to seeing the secondary index implementation in a future PR.

hzxuzhonghu

Good addition — the missing GET endpoints were a real gap. Using SCAN instead of KEYS is the right call for production Redis. The namespace-scoped authorization filtering is also important here.

Abhinav-kodes · 2026-05-19T10:43:24Z

@hzxuzhonghu Thanks for pointing that out!
I've added a comment in both the Redis and Valkey store implementations to explicitly document this behavior. The comment notes that seenKeys is used for cross-page deduplication (since SCAN can return the same key multiple times) and that for extremely large datasets, implementing a dedicated secondary index (like a SET per kind) is recommended.
Once this PR is merged, I plan to open a follow-up PR to implement that secondary index optimization so we don't have to rely on O(N) scanning for these endpoints long-term.
Could you please take another look when you have a moment?

Copilot

Pull request overview

Copilot reviewed 11 out of 16 changed files in this pull request and generated 2 comments.

acsoto · 2026-05-23T09:09:56Z

What is the intended auth boundary here: namespace-scoped or service-account-scoped?

Abhinav-kodes · 2026-05-23T10:39:44Z

@acsoto

The current behavior is namespace-scoped, not per-service-account. handleGetSandbox and handleListSandboxes only check sandboxInfo.SandboxNamespace == userNamespace, so any authenticated service account in the same namespace can read another service account's session entrypoints. The comment block in auth.go (lines 52–57) actually states the intended model should be per-service-account ownership ("Users can only list/access sandboxes they created"), so there's a mismatch.

The reason the GET handlers don't enforce per-SA ownership is that SandboxInfo has no CreatedBy field, the creator's service account name is available in the request context during POST but is never persisted into the store by buildSandboxPlaceHolder or buildSandboxInfo.

The DELETE handler sidesteps this because it delegates enforcement to Kubernetes RBAC via the user's own dynamicClient; the Kubernetes API itself rejects cross-account deletes. For GET, however, we read directly from the store, so we need the creator identity stored in-band.

My proposed fix (can be done in this PR or a follow-up):

Add CreatedBy string \json:"createdBy" toSandboxInfo
Thread the serviceAccountName from handleSandboxCreate down into buildSandboxPlaceHolder and stamp it on the stored object
In handleGetSandbox and handleListSandboxes, add a second check:
sandboxInfo.CreatedBy == serviceAccountName (from extractUserInfo)

If the consensus is that namespace-scoped access is the correct model for these read endpoints (e.g. in a multi-tenant environment where all service accounts in a namespace are considered equivalent), I'm happy to update auth.go's comment block to reflect that and document the boundary explicitly in the PR description.

But given the existing documented intent, I believe the stricter per-service-account model is the correct one.

I'll wait for confirmation on which direction you'd like before pushing a patch.

acsoto · 2026-05-23T10:46:59Z

Thanks for clarifying.

hzxuzhonghu

Well-structured observability endpoints with proper access control.

pkg/workloadmanager/handlers.go

Lines 310-335 (handleGetSandbox): Good - returns 404 instead of 403 for namespace mismatch to avoid information leakage. The kind check on line 326 prevents cross-kind session lookups.
Lines 338-361 (handleListSandboxes): The in-place filter (filtered := sandboxes[:0]) avoids a heap allocation. Good optimization.

pkg/store/store_redis.go

Lines 146-200 (ListSandboxesByKind): Uses SCAN with count=100 to avoid blocking Redis. The seenKeys map for deduplication is necessary since SCAN can return duplicates across cursor pages. The comment about memory impact on large datasets is appreciated - a secondary index per kind (Redis SET) would be the right follow-up optimization for production scale.

pkg/store/store_valkey.go

Lines 114-133: The loadSandboxesBySessionIDs fix properly handles nil MGET entries by checking msg.IsNil() before calling ToString(). This fixes a potential panic when keys are evicted between SCAN and MGET. Good.
Lines 134-136: Changed the size check from > len(sessionIDKeys) to != len(sessionIDKeys) - this is more correct since MGET always returns exactly one entry per key.
Lines 178-230: ValKey ListSandboxesByKind mirrors the Redis implementation correctly.

pkg/store/store_redis_test.go

Lines 68-69: The index key prefix change from sandbox: to session: is a consistency fix. Make sure no deployed systems depend on the old prefix, or add a migration note.

pkg/store/store_valkey_test.go

Lines 394-437 (TestValkeyStore_LoadSandboxesBySessionIDs_OrphanedZSetEntry): Excellent test - simulates key eviction between SCAN and MGET. This is a real-world scenario in Redis/Valkey with memory policies.

pkg/router/jwt_test.go

Lines 186-193: Comparing RSA keys mathematically (N, E, D, Primes) instead of by object equality fixes the flaky test. The root cause was likely non-deterministic precomputed fields in the RSA key struct.

LGTM.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>

…espace mismatch Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>

Copilot AI review requested due to automatic review settings May 13, 2026 13:18

volcano-sh-bot added the kind/feature label May 13, 2026

volcano-sh-bot requested review from LiZhenCheng9527, YaoZengzeng and acsoto May 13, 2026 13:18

volcano-sh-bot added the size/L label May 13, 2026

Copilot started reviewing on behalf of Abhinav-kodes May 13, 2026 13:19 View session

Abhinav-kodes force-pushed the feat-get-session-endpoints branch from f3820f4 to 7f79307 Compare May 13, 2026 13:19

gemini-code-assist Bot reviewed May 13, 2026

View reviewed changes

Comment thread pkg/store/store_valkey.go Outdated

Comment thread pkg/workloadmanager/handlers.go Outdated

Comment thread pkg/workloadmanager/server.go Outdated

Comment thread pkg/workloadmanager/server.go Outdated

Comment thread pkg/workloadmanager/handlers_test.go Outdated

Copilot AI reviewed May 13, 2026

View reviewed changes

Comment thread pkg/store/store_redis.go

Comment thread pkg/store/store_valkey.go

Comment thread pkg/workloadmanager/handlers.go Outdated

Comment thread pkg/store/store_redis_test.go

Comment thread pkg/store/store_valkey_test.go

volcano-sh-bot added size/XL and removed size/L labels May 13, 2026

Abhinav-kodes force-pushed the feat-get-session-endpoints branch from 44d7045 to edbec37 Compare May 13, 2026 13:48

Abhinav-kodes requested a review from Copilot May 13, 2026 13:49

Copilot started reviewing on behalf of Abhinav-kodes May 13, 2026 14:05 View session

Copilot AI reviewed May 13, 2026

View reviewed changes

Abhinav-kodes requested a review from Copilot May 13, 2026 14:14

gemini-code-assist Bot reviewed May 13, 2026

View reviewed changes

Comment thread pkg/store/store_redis.go

Comment thread pkg/store/store_valkey.go Outdated

Comment thread pkg/store/store_valkey.go

Comment thread pkg/workloadmanager/handlers.go Outdated

Copilot AI reviewed May 13, 2026

View reviewed changes

Comment thread pkg/store/store_redis.go

Comment thread pkg/store/store_valkey.go

Copilot started reviewing on behalf of Abhinav-kodes May 13, 2026 14:34 View session

Copilot AI review requested due to automatic review settings May 13, 2026 14:35

Abhinav-kodes force-pushed the feat-get-session-endpoints branch from ea991fe to 9cd1422 Compare May 13, 2026 14:37

gemini-code-assist Bot reviewed May 13, 2026

View reviewed changes

Comment thread pkg/store/store_redis.go

Comment thread pkg/store/store_valkey.go

Comment thread pkg/store/store_redis.go Outdated

Comment thread pkg/store/store_valkey.go Outdated

Copilot AI reviewed May 13, 2026

View reviewed changes

Copilot started reviewing on behalf of Abhinav-kodes May 13, 2026 14:43 View session

Abhinav-kodes force-pushed the feat-get-session-endpoints branch from 9cd1422 to 4e7ab3b Compare May 13, 2026 14:47

gemini-code-assist Bot reviewed May 13, 2026

View reviewed changes

Comment thread pkg/store/store_redis.go

Comment thread pkg/store/store_valkey.go

Comment thread pkg/workloadmanager/handlers.go Outdated

Abhinav-kodes force-pushed the feat-get-session-endpoints branch from 0fcba68 to b35599d Compare May 13, 2026 15:43

gemini-code-assist Bot reviewed May 13, 2026

View reviewed changes

Comment thread pkg/store/store_redis.go Outdated

Comment thread pkg/store/store_valkey.go Outdated

Copilot AI review requested due to automatic review settings May 13, 2026 16:09

Abhinav-kodes force-pushed the feat-get-session-endpoints branch from b35599d to ae72e13 Compare May 13, 2026 16:09

Copilot started reviewing on behalf of Abhinav-kodes May 13, 2026 16:10 View session

Copilot AI reviewed May 13, 2026

View reviewed changes

hzxuzhonghu reviewed May 19, 2026

View reviewed changes

Comment thread pkg/store/store_redis.go

Abhinav-kodes requested review from Copilot and hzxuzhonghu May 19, 2026 10:43

Copilot started reviewing on behalf of Abhinav-kodes May 19, 2026 11:07 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

Comment thread pkg/workloadmanager/handlers.go Outdated

Comment thread pkg/workloadmanager/handlers.go

Abhinav-kodes changed the title ~~feat: add session observability GET endpoints for sandboxes~~ feat: add session observability GET endpoints and fix flaky JWT tests May 19, 2026

hzxuzhonghu reviewed May 25, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings May 25, 2026 10:27

Copilot started reviewing on behalf of Abhinav-kodes May 25, 2026 10:27 View session

Copilot AI reviewed May 25, 2026

View reviewed changes

Abhinav-kodes added 5 commits May 25, 2026 21:50

feat: add session observability GET endpoints for sandboxes

4f18b06

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>

test: add memory footprint note for seenKeys per review

f6bcafc

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>

test: fix flaky RSA key comparison in TestGetPrivateKeyPEM

0eeb2c7

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>

test: update TestHandleGetSandbox to expect 404 instead of 403 on nam…

92f070c

…espace mismatch Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>

auth: enforce per-service-account ownership on GET session endpoints

0b47b4f

Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>

Abhinav-kodes force-pushed the feat-get-session-endpoints branch from 3563620 to 0b47b4f Compare May 25, 2026 16:22

Conversation

Abhinav-kodes commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

volcano-sh-bot commented May 13, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov-commenter commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Abhinav-kodes commented May 13, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Abhinav-kodes commented May 13, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Abhinav-kodes commented May 13, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Abhinav-kodes commented May 13, 2026

Abhinav-kodes commented May 13, 2026 •

edited

Loading

codecov-commenter commented May 13, 2026 •

edited

Loading

acsoto commented May 23, 2026 •

edited

Loading