feat: add session observability GET endpoints and fix flaky JWT tests#331
feat: add session observability GET endpoints and fix flaky JWT tests#331Abhinav-kodes wants to merge 5 commits into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
f3820f4 to
7f79307
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces functionality to retrieve individual sandboxes by session ID and list sandboxes filtered by kind. Changes include updates to the Store interface, Redis and Valkey implementations using SCAN for efficiency, and new HTTP handlers in the workload manager. Feedback highlights a potential failure in the Valkey implementation if keys expire during a scan, and suggests enforcing kind-based validation in the GET handler to prevent unauthorized cross-kind access. Additionally, indentation in the new handlers should be corrected to use tabs for consistency with Go standards.
There was a problem hiding this comment.
Pull request overview
Adds “session observability” read endpoints to the Workload Manager API so clients can retrieve a session by ID or list active sessions, backed by new store capabilities for enumerating sandboxes by kind.
Changes:
- Added
GET /v1/{agent-runtime|code-interpreter}/sessionsandGET /v1/{agent-runtime|code-interpreter}/sessions/:sessionIdroutes and handlers in Workload Manager. - Extended
store.StorewithListSandboxesByKindand implemented it for Redis + Valkey usingSCAN. - Added/updated unit tests for the new handlers and store methods, plus updated fakes to satisfy the new interface.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/workloadmanager/server.go | Registers new GET session routes for agent-runtime and code-interpreter APIs. |
| pkg/workloadmanager/handlers.go | Implements handleGetSandbox and handleListSandboxes with optional auth namespace filtering. |
| pkg/workloadmanager/handlers_test.go | Adds unit tests for the new GET handlers and updates fake store. |
| pkg/workloadmanager/garbage_collection_test.go | Updates nopStore to satisfy the extended store interface. |
| pkg/store/interface.go | Adds ListSandboxesByKind to the Store interface. |
| pkg/store/store_redis.go | Implements Redis ListSandboxesByKind via SCAN + batch load + kind filter. |
| pkg/store/store_redis_test.go | Adds a unit test for Redis ListSandboxesByKind. |
| pkg/store/store_valkey.go | Implements Valkey ListSandboxesByKind via SCAN + batch load + kind filter. |
| pkg/store/store_valkey_test.go | Adds a unit test for Valkey ListSandboxesByKind. |
| pkg/router/session_manager_test.go | Updates router test fake store to satisfy the extended store interface. |
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #331 +/- ##
==========================================
+ Coverage 47.57% 50.86% +3.29%
==========================================
Files 30 30
Lines 2819 3002 +183
==========================================
+ Hits 1341 1527 +186
+ Misses 1338 1314 -24
- Partials 140 161 +21
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
44d7045 to
edbec37
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements functionality to retrieve and list sandboxes by kind, adding new API endpoints and supporting logic in the Redis and Valkey storage backends. Feedback focuses on performance optimizations, specifically recommending secondary indices to replace O(N) scans when listing sandboxes, improving slice pre-allocation for memory efficiency, and strengthening result-length validation during batch operations.
ea991fe to
9cd1422
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces the ability to list and retrieve sandboxes by their kind across Redis and Valkey storage backends, adding new API endpoints and namespace-based authorization. Review feedback suggests using non-blocking SSCAN instead of SMembers to prevent performance issues with large datasets and recommends logging errors during secondary index cleanup in DeleteSandboxBySessionID to avoid orphaned session IDs.
9cd1422 to
4e7ab3b
Compare
There was a problem hiding this comment.
Code Review
This pull request implements functionality to retrieve and list sandboxes by kind for both Agent Runtime and Code Interpreter services. It introduces a new ListSandboxesByKind method to the storage interface with implementations for Redis and Valkey, and adds corresponding API endpoints with namespace-based authorization filtering. Feedback focused on potential scalability issues with the current SCAN-based filtering approach in the storage layer and suggested a more efficient in-place filtering implementation for the list handler to reduce memory allocations.
0fcba68 to
b35599d
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces functionality to list and retrieve sandboxes by kind and session ID across the storage and API layers. The Store interface is expanded with ListSandboxesByKind, implemented for both Redis and Valkey using efficient SCAN operations to prevent blocking. New REST endpoints are added to the workload manager to support these operations, incorporating namespace-based filtering when authentication is enabled. Review feedback suggests removing redundant nil checks in the Redis and Valkey store implementations, as the batch loading logic already ensures non-nil results.
b35599d to
ae72e13
Compare
|
@LiZhenCheng9527 @hzxuzhonghu @YaoZengzeng a dedicated Redis SET per kind ( However, implementing a secondary index requires changes to For the current scale of the project, the SCAN-based approach with per-batch filtering (100 keys at a time) should be sufficient. |
|
Thank you for the update and for considering the scalability implications. Your plan to track the secondary index implementation as a follow-up optimization is reasonable, especially given the current scale of the project. The use of |
hzxuzhonghu
left a comment
There was a problem hiding this comment.
Good addition — the missing GET endpoints were a real gap. Using SCAN instead of KEYS is the right call for production Redis. The namespace-scoped authorization filtering is also important here.
|
@hzxuzhonghu Thanks for pointing that out! |
|
What is the intended auth boundary here: namespace-scoped or service-account-scoped? |
|
The current behavior is namespace-scoped, not per-service-account. The reason the GET handlers don't enforce per-SA ownership is that The My proposed fix (can be done in this PR or a follow-up):
If the consensus is that namespace-scoped access is the correct model for these read endpoints (e.g. in a multi-tenant environment where all service accounts in a namespace are considered equivalent), I'm happy to update But given the existing documented intent, I believe the stricter per-service-account model is the correct one. I'll wait for confirmation on which direction you'd like before pushing a patch. |
|
Thanks for clarifying. |
hzxuzhonghu
left a comment
There was a problem hiding this comment.
Well-structured observability endpoints with proper access control.
pkg/workloadmanager/handlers.go
- Lines 310-335 (
handleGetSandbox): Good - returns 404 instead of 403 for namespace mismatch to avoid information leakage. The kind check on line 326 prevents cross-kind session lookups. - Lines 338-361 (
handleListSandboxes): The in-place filter (filtered := sandboxes[:0]) avoids a heap allocation. Good optimization.
pkg/store/store_redis.go
- Lines 146-200 (
ListSandboxesByKind): Uses SCAN with count=100 to avoid blocking Redis. TheseenKeysmap for deduplication is necessary since SCAN can return duplicates across cursor pages. The comment about memory impact on large datasets is appreciated - a secondary index per kind (Redis SET) would be the right follow-up optimization for production scale.
pkg/store/store_valkey.go
- Lines 114-133: The
loadSandboxesBySessionIDsfix properly handles nil MGET entries by checkingmsg.IsNil()before callingToString(). This fixes a potential panic when keys are evicted between SCAN and MGET. Good. - Lines 134-136: Changed the size check from
> len(sessionIDKeys)to!= len(sessionIDKeys)- this is more correct since MGET always returns exactly one entry per key. - Lines 178-230: ValKey
ListSandboxesByKindmirrors the Redis implementation correctly.
pkg/store/store_redis_test.go
- Lines 68-69: The index key prefix change from
sandbox:tosession:is a consistency fix. Make sure no deployed systems depend on the old prefix, or add a migration note.
pkg/store/store_valkey_test.go
- Lines 394-437 (
TestValkeyStore_LoadSandboxesBySessionIDs_OrphanedZSetEntry): Excellent test - simulates key eviction between SCAN and MGET. This is a real-world scenario in Redis/Valkey with memory policies.
pkg/router/jwt_test.go
- Lines 186-193: Comparing RSA keys mathematically (N, E, D, Primes) instead of by object equality fixes the flaky test. The root cause was likely non-deterministic precomputed fields in the RSA key struct.
LGTM.
Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
…espace mismatch Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
Signed-off-by: Abhinav Singh <abhinavsingh717073@gmail.com>
3563620 to
0b47b4f
Compare
What type of PR is this?
/kind feature
/kind bug
What this PR does / why we need it:
This PR completes the CRUD operations for the Workload Manager API by introducing "Session Observability" endpoints, and fixes a critical test flakiness issue blocking CI pipelines.
Previously, the Workload Manager only supported creating (POST) and deleting (DELETE) sandboxes. If a client disconnected or lost the initial response, there was no native way to retrieve the entrypoints or check the status of a running session.
This PR adds:
GET /v1/agent-runtime/sessions/:sessionId(and/code-interpreter) to retrieve a specific session's details.GET /v1/agent-runtime/sessions(and/code-interpreter) to list all active sessions.ListSandboxesByKindimplementation in thestore.Storeinterface, backed by non-blockingSCANloops for both Valkey and Redis.404 Not Foundrather than a403 Forbiddento prevent session ID enumeration.Bug Fixes Included:
TestGetPrivateKeyPEMunit test inpkg/router/jwt_test.goby replacing a direct deep-equal assertion onrsa.PrivateKeywith component-wise mathematical comparisons. This prevents random CI failures caused by differences in unexported precomputed byte slice lengths during ASN.1 serialization/deserialization.Which issue(s) this PR fixes:
Fixes #348
Special notes for your reviewer:
The
ListSandboxesByKindimplementations for Valkey and Redis utilize theSCANcommand rather thanKEYSto ensure the new GET/sessionsendpoint does not block the storage engine. While this requires filtering by namespace in-memory, this O(total sessions) footprint is documented in the store logic (seenKeys). A proper secondary index scheme (e.g., Redis SET per namespace/kind) will be introduced in a future follow-up PR to provide O(1) namespace filtering, keeping this PR strictly focused on the observability endpoints.Does this PR introduce a user-facing change?: