fix: backport stack-auth token-refresh fix; release 2.2.3#407
Conversation
Access-key auth on 2.2.2 saw ZeroKMS "Request not authorized" begin ~15 min (the access-token lifetime) after startup. A get_token() future cancelled in the post-HTTP, pre-install window of stack-auth's AutoRefresh::refresh_non_blocking defused its CancelGuard too early, stranding refresh_in_progress = true. Every later refresh then wedged; once the cached token crossed real expiry, callers hung in wait_for_in_flight_refresh forever. The upstream fix (move defuse() after the token install + flag clear) first ships in stack-auth >= 0.36.0, but bumping cipherstash-client to that line drags in an unrelated EQL API redesign. Instead, vendor the published stack-auth 0.34.1-alpha.4 source, apply the CancelGuard reordering, and override the transitive dependency via [patch.crates-io] (same version satisfies cipherstash-client's exact pin, so no republish). Includes a regression test that fails on the pre-fix ordering. Remove the patch once Proxy moves to cipherstash-client built against stack-auth >= 0.36.0.
Patch release for the stack-auth token-refresh hotfix (CIP-3159): bump workspace version 2.2.2 -> 2.2.3 and promote the Unreleased CHANGELOG entry to [2.2.3].
|
Warning Review limit reached
More reviews will be available in 44 minutes and 47 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. 📝 WalkthroughWalkthroughVersion bumped to 2.2.3 with a Changesstack-auth vendor hotfix and auth strategy library
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
vendor/stack-auth/CHANGELOG.md (1)
1-168:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftNormalize the vendored changelog structure before merge.
This file is currently malformed for changelog consumption: repeated top-level sections, heading-order break (e.g., Line 101 starts
# Changelogmid-file), and no[Unreleased]section at the top. It also matches the reported MD001/MD024 violations.As per coding guidelines, “Maintain CHANGELOG.md following Keep a Changelog format with [Unreleased] section at the top and entries organized by type (Added, Changed, Deprecated, Removed, Fixed, Security)” and “Write changelog entries from the user's perspective, not implementation details.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@vendor/stack-auth/CHANGELOG.md` around lines 1 - 168, Restructure the CHANGELOG.md file to follow Keep a Changelog format properly. Move the `# Changelog` heading to the very top of the file before all content. Add an `[Unreleased]` section immediately after the main heading and before any version-dated sections. Consolidate the duplicate sections (Documentation, Features, Fixes, Miscellaneous, Refactoring, Testing, Style) that appear multiple times throughout the file, merging duplicate entries under each category type. Ensure each version section follows the standard format with a version number and date, and organize all entries by category type within each version to comply with Keep a Changelog standards and resolve the MD001/MD024 violations.Sources: Coding guidelines, Linters/SAST tools
🧹 Nitpick comments (3)
vendor/stack-auth/src/token.rs (1)
489-489: 💤 Low valueReplace
expect()withunwrap()per coding guidelines.As per coding guidelines, use
unwrap()instead ofexpect()in test code unless providing meaningful context.Suggested fix
- .expect("failed to encode JWT"); + .unwrap();🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@vendor/stack-auth/src/token.rs` at line 489, Replace the `.expect("failed to encode JWT")` call with `.unwrap()` to comply with coding guidelines that recommend using `unwrap()` instead of `expect()` in test code. Locate the line containing the expect call on the JWT encoding operation and remove the message string, replacing it with a simple unwrap() method call.Source: Coding guidelines
vendor/stack-auth/tasks.toml (1)
7-7: 💤 Low valueAdd Windows platform support for native binary copy.
The binary copy command handles macOS (
.dylib) and Linux (.so) but lacks a Windows (.dll) fallback. If Windows CI/local development is expected, this will silently fail to copy the native library.📦 Proposed fix to include Windows `.dll` fallback
- "cp ../../../target/debug/libstack_auth_node.dylib stack-auth-node.node 2>/dev/null || cp ../../../target/debug/libstack_auth_node.so stack-auth-node.node", + "cp ../../../target/debug/libstack_auth_node.dylib stack-auth-node.node 2>/dev/null || cp ../../../target/debug/libstack_auth_node.so stack-auth-node.node 2>/dev/null || cp ../../../target/debug/stack_auth_node.dll stack-auth-node.node",🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@vendor/stack-auth/tasks.toml` at line 7, The binary copy command in the build task only provides fallbacks for macOS (.dylib) and Linux (.so) platforms, missing support for Windows (.dll). Extend the existing command chain by adding another fallback condition after the Linux fallback that attempts to copy the Windows .dll file (libstack_auth_node.dll) to stack-auth-node.node, ensuring the command handles all three platforms gracefully.vendor/stack-auth/src/auto_strategy.rs (1)
343-377: 💤 Low valueEnvironment variable tests may be flaky under parallel execution.
Tests that modify
CS_WORKSPACE_CRNcan interfere with each other when run in parallel. Consider using a test mutex orserial_testcrate to ensure these tests run sequentially, or restructure to avoid relying on real env var manipulation.However, since this is vendored code for a hotfix and the tests appear to restore env vars, this is acceptable for now.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@vendor/stack-auth/src/auto_strategy.rs` around lines 343 - 377, The tests explicit_access_key_without_crn_and_no_env_returns_missing_workspace_crn and invalid_crn_env_var_returns_invalid_crn modify the shared CS_WORKSPACE_CRN environment variable and may interfere with each other under parallel test execution. While the tests do restore the environment variable, to prevent flakiness in parallel execution, add the #[serial] attribute from the serial_test crate to these tests to ensure they run sequentially, or alternatively wrap the environment variable manipulation in a test mutex to provide thread-safe isolation across parallel test runs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@CHANGELOG.md`:
- Line 13: The changelog entry for version 2.2.3 on line 13 contains
implementation details like the `stack-auth` backport and `CIP-3159` reference
that should not be exposed to end users. Reword the entry to focus only on the
user-visible behavior and impact, describing what the user experiences (ZeroKMS
authentication failures ~15 minutes after startup have been fixed) without
mentioning internal technical implementation details like token refresh
mechanisms or upstream backports.
In `@vendor/stack-auth/src/device_client.rs`:
- Around line 114-118: The code currently returns Ok(()) when receiving a 409
CONFLICT status in the device client provisioning flow, but this is incorrect
because at that point in the function the local secretkey.json has already been
confirmed as missing. Instead of treating the conflict as success, return an
error result that clearly indicates the device provisioning failed because
another client is already provisioned server-side while the local key material
is absent. This ensures the provisioning contract is upheld by signaling failure
when local key material cannot be obtained, regardless of the server-side state.
In `@vendor/stack-auth/tasks.toml`:
- Line 3: The task configuration in the dir field references a non-existent
working directory path {{config_root}}/packages/stack-auth/node and a
non-existent package stack-auth-node. Verify the actual directory structure in
the repository and update the dir value to point to the correct working
directory location where the stack-auth Node.js package actually exists, or if
this package and task are not yet implemented in the repository, remove this
task configuration from the tasks.toml file until it is ready to be integrated.
---
Outside diff comments:
In `@vendor/stack-auth/CHANGELOG.md`:
- Around line 1-168: Restructure the CHANGELOG.md file to follow Keep a
Changelog format properly. Move the `# Changelog` heading to the very top of the
file before all content. Add an `[Unreleased]` section immediately after the
main heading and before any version-dated sections. Consolidate the duplicate
sections (Documentation, Features, Fixes, Miscellaneous, Refactoring, Testing,
Style) that appear multiple times throughout the file, merging duplicate entries
under each category type. Ensure each version section follows the standard
format with a version number and date, and organize all entries by category type
within each version to comply with Keep a Changelog standards and resolve the
MD001/MD024 violations.
---
Nitpick comments:
In `@vendor/stack-auth/src/auto_strategy.rs`:
- Around line 343-377: The tests
explicit_access_key_without_crn_and_no_env_returns_missing_workspace_crn and
invalid_crn_env_var_returns_invalid_crn modify the shared CS_WORKSPACE_CRN
environment variable and may interfere with each other under parallel test
execution. While the tests do restore the environment variable, to prevent
flakiness in parallel execution, add the #[serial] attribute from the
serial_test crate to these tests to ensure they run sequentially, or
alternatively wrap the environment variable manipulation in a test mutex to
provide thread-safe isolation across parallel test runs.
In `@vendor/stack-auth/src/token.rs`:
- Line 489: Replace the `.expect("failed to encode JWT")` call with `.unwrap()`
to comply with coding guidelines that recommend using `unwrap()` instead of
`expect()` in test code. Locate the line containing the expect call on the JWT
encoding operation and remove the message string, replacing it with a simple
unwrap() method call.
In `@vendor/stack-auth/tasks.toml`:
- Line 7: The binary copy command in the build task only provides fallbacks for
macOS (.dylib) and Linux (.so) platforms, missing support for Windows (.dll).
Extend the existing command chain by adding another fallback condition after the
Linux fallback that attempts to copy the Windows .dll file
(libstack_auth_node.dll) to stack-auth-node.node, ensuring the command handles
all three platforms gracefully.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2eccdc1c-c2be-48af-b9bf-81bfadee5304
⛔ Files ignored due to path filters (2)
Cargo.lockis excluded by!**/*.lockvendor/stack-auth/Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (27)
CHANGELOG.mdCargo.tomlvendor/stack-auth/.gitignorevendor/stack-auth/CHANGELOG.mdvendor/stack-auth/Cargo.tomlvendor/stack-auth/Cargo.toml.origvendor/stack-auth/LICENSEvendor/stack-auth/README.mdvendor/stack-auth/examples/auto_strategy.rsvendor/stack-auth/examples/device_code.rsvendor/stack-auth/src/access_key.rsvendor/stack-auth/src/access_key_refresher.rsvendor/stack-auth/src/access_key_strategy.rsvendor/stack-auth/src/auto_refresh.rsvendor/stack-auth/src/auto_strategy.rsvendor/stack-auth/src/device_client.rsvendor/stack-auth/src/device_code/mod.rsvendor/stack-auth/src/device_code/protocol.rsvendor/stack-auth/src/device_code/tests.rsvendor/stack-auth/src/lib.rsvendor/stack-auth/src/oauth_refresher.rsvendor/stack-auth/src/oauth_strategy.rsvendor/stack-auth/src/refresher.rsvendor/stack-auth/src/service_token.rsvendor/stack-auth/src/static_token_strategy.rsvendor/stack-auth/src/token.rsvendor/stack-auth/tasks.toml
- CHANGELOG: reword the 2.2.3 entry to be user-facing (drop stack-auth/CIP-3159 implementation detail), per the project changelog convention. Traceability stays in the commit/PR. - Trim the vendored crate to library essentials: remove tasks.toml (node build tooling unused by Proxy), the upstream CHANGELOG.md, and Cargo.toml.orig. src/ is kept verbatim from the published stack-auth 0.34.1-alpha.4 plus the single CIP-3159 fix. Intentionally NOT changed: lint/structure findings inside vendored upstream library code (src/token.rs expect(), src/auto_strategy.rs env-var test isolation, src/device_client.rs 409 handling). These are verbatim third-party code in paths Proxy doesn't exercise; they should be fixed upstream, not diverged in a surgical hotfix vendor.
|
Thanks for the review. Triaged the findings — none touched the actual fix ( Fixed
Intentionally not changed (vendored upstream code, verbatim)
This whole vendor + |
Summary
Customers on 2.2.2 using access-key authentication reported ZeroKMS
Request not authorizederrors that began exactly ~15 minutes (the access-token lifetime) after connecting: connections worked on startup, then failed in lockstep on every encrypt/decrypt.Root cause
The bug is in
stack-auth'sAutoRefresh::refresh_non_blocking. When aget_token()future is cancelled in the post-HTTP, pre-install window (routine in a proxy — e.g. a client disconnecting mid-query drops the task), theCancelGuardwasdefuse()d before the state re-lock + token install. A cancellation landing on thatstate.lock().awaitthen leftrefresh_in_progress = truewith nonotify_waiters(). Every later refresh wedged, and once the cached token crossed real expiry, callers hung forever inwait_for_in_flight_refresh— surfacing as the ZeroKMS auth errors.This is why older Proxy was immune: the pre-
stack-authcredentials/module refreshed via a background task, not lazily insideget_token().Fix
The upstream fix (CIP-3159, suite commits
3cacf1faf+2ee370561) movesdefuse()to after the token install + flag clear. It first ships instack-auth >= 0.36.0— but bumpingcipherstash-clientto that line drags in an unrelated EQL API redesign (encrypt_eqlnow returnsEqlOutput,identifier()becomes a method) that ripples through Proxy's encrypt path.Instead, this PR vendors the published
stack-auth 0.34.1-alpha.4source, applies the CancelGuard reordering, and overrides the transitive dependency via[patch.crates-io]. The patch keeps version0.34.1-alpha.4, so it satisfiescipherstash-client's exact pin with no republish and no Proxy code changes. The published source pins its own deps to crates.io versions, so there's no duplicate-crate / trait-bound mismatch.Note
This is a temporary patch. Remove
vendor/stack-auth+ the[patch.crates-io]block once Proxy moves to acipherstash-clientbuilt againststack-auth >= 0.36.0.Testing
regression_cip_3159::cancellation_in_relock_window_does_not_strand_refreshreproduces the cancel-on-re-lock scenario and asserts the flag is cleared + a later caller doesn't hang.cargo check -p cipherstash-proxyis clean (no Proxy code changes).Release
Second commit bumps to 2.2.3 and promotes the CHANGELOG entry. The release version/changelog drift guard's checks pass locally.
Summary by CodeRabbit
Bug Fixes
Chores