Skip to content

fix: backport stack-auth token-refresh fix; release 2.2.3#407

Merged
freshtonic merged 3 commits into
mainfrom
james/cip-3159-token-refresh
Jun 17, 2026
Merged

fix: backport stack-auth token-refresh fix; release 2.2.3#407
freshtonic merged 3 commits into
mainfrom
james/cip-3159-token-refresh

Conversation

@freshtonic

@freshtonic freshtonic commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Summary

Customers on 2.2.2 using access-key authentication reported ZeroKMS Request not authorized errors that began exactly ~15 minutes (the access-token lifetime) after connecting: connections worked on startup, then failed in lockstep on every encrypt/decrypt.

Root cause

The bug is in stack-auth's AutoRefresh::refresh_non_blocking. When a get_token() future is cancelled in the post-HTTP, pre-install window (routine in a proxy — e.g. a client disconnecting mid-query drops the task), the CancelGuard was defuse()d before the state re-lock + token install. A cancellation landing on that state.lock().await then left refresh_in_progress = true with no notify_waiters(). Every later refresh wedged, and once the cached token crossed real expiry, callers hung forever in wait_for_in_flight_refresh — surfacing as the ZeroKMS auth errors.

This is why older Proxy was immune: the pre-stack-auth credentials/ module refreshed via a background task, not lazily inside get_token().

Fix

The upstream fix (CIP-3159, suite commits 3cacf1faf + 2ee370561) moves defuse() to after the token install + flag clear. It first ships in stack-auth >= 0.36.0 — but bumping cipherstash-client to that line drags in an unrelated EQL API redesign (encrypt_eql now returns EqlOutput, identifier() becomes a method) that ripples through Proxy's encrypt path.

Instead, this PR vendors the published stack-auth 0.34.1-alpha.4 source, applies the CancelGuard reordering, and overrides the transitive dependency via [patch.crates-io]. The patch keeps version 0.34.1-alpha.4, so it satisfies cipherstash-client's exact pin with no republish and no Proxy code changes. The published source pins its own deps to crates.io versions, so there's no duplicate-crate / trait-bound mismatch.

Note

This is a temporary patch. Remove vendor/stack-auth + the [patch.crates-io] block once Proxy moves to a cipherstash-client built against stack-auth >= 0.36.0.

Testing

  • New regression test regression_cip_3159::cancellation_in_relock_window_does_not_strand_refresh reproduces the cancel-on-re-lock scenario and asserts the flag is cleared + a later caller doesn't hang.
  • Verified it fails on the pre-fix ordering and passes on the fix.
  • Full vendored suite: 106 passed, 0 failed.
  • cargo check -p cipherstash-proxy is clean (no Proxy code changes).

Release

Second commit bumps to 2.2.3 and promotes the CHANGELOG entry. The release version/changelog drift guard's checks pass locally.

Summary by CodeRabbit

  • Bug Fixes

    • Fixed ZeroKMS authentication failures that could stall token refresh and cause "Request not authorized" errors
  • Chores

    • Version bump to 2.2.3

Access-key auth on 2.2.2 saw ZeroKMS "Request not authorized" begin ~15 min (the access-token lifetime) after startup. A get_token() future cancelled in the post-HTTP, pre-install window of stack-auth's AutoRefresh::refresh_non_blocking defused its CancelGuard too early, stranding refresh_in_progress = true. Every later refresh then wedged; once the cached token crossed real expiry, callers hung in wait_for_in_flight_refresh forever.

The upstream fix (move defuse() after the token install + flag clear) first ships in stack-auth >= 0.36.0, but bumping cipherstash-client to that line drags in an unrelated EQL API redesign. Instead, vendor the published stack-auth 0.34.1-alpha.4 source, apply the CancelGuard reordering, and override the transitive dependency via [patch.crates-io] (same version satisfies cipherstash-client's exact pin, so no republish). Includes a regression test that fails on the pre-fix ordering.

Remove the patch once Proxy moves to cipherstash-client built against stack-auth >= 0.36.0.
Patch release for the stack-auth token-refresh hotfix (CIP-3159): bump workspace version 2.2.2 -> 2.2.3 and promote the Unreleased CHANGELOG entry to [2.2.3].
@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@freshtonic, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 44 minutes and 47 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 21618035-1571-4d47-bd24-62cbd3edf1b6

📥 Commits

Reviewing files that changed from the base of the PR and between 9b4f9b6 and 9411667.

📒 Files selected for processing (1)
  • CHANGELOG.md
📝 Walkthrough

Walkthrough

Version bumped to 2.2.3 with a [patch.crates-io] override redirecting stack-auth to a vendored copy at vendor/stack-auth. The vendored crate is a complete authentication library backporting a CancelGuard fix (CIP-3159) that prevented token-refresh cancellation from stranding refresh_in_progress and causing subsequent ZeroKMS authorization failures.

Changes

stack-auth vendor hotfix and auth strategy library

Layer / File(s) Summary
Project version bump and patch override
CHANGELOG.md, Cargo.toml
Bumps to 2.2.3, adds [patch.crates-io] HOTFIX override pointing stack-auth to vendor/stack-auth, excludes the vendor dir from workspace members, and documents the ZeroKMS fix in the project changelog.
Vendored crate scaffolding and metadata
vendor/stack-auth/Cargo.toml, vendor/stack-auth/Cargo.toml.orig, vendor/stack-auth/.gitignore, vendor/stack-auth/LICENSE, vendor/stack-auth/README.md, vendor/stack-auth/CHANGELOG.md, vendor/stack-auth/tasks.toml
Adds all non-source crate metadata: generated and original Cargo manifests, .gitignore, PolyForm license, README documenting strategies and security notes, full upstream CHANGELOG, and a Node integration-test task definition.
Core public contracts: AuthStrategy, SecretToken, AuthError, Refresher
vendor/stack-auth/src/lib.rs, vendor/stack-auth/src/refresher.rs
Defines the AuthStrategy trait and async get_token contract, the zeroizing/redacted SecretToken wrapper, the AuthError enum, URL normalization and HTTP client helpers, and the Refresher trait with credential lifecycle hooks used by all concrete strategies.
Token model, JWT claims, and ServiceToken
vendor/stack-auth/src/token.rs, vendor/stack-auth/src/service_token.rs
Defines Token (expiry windows, refresh token lifecycle, JWT claim decoding, async OAuth refresh call) and ServiceToken (claim-caching bearer wrapper exposing subject, workspace_id, zerokms_url with no-signature-verify JWT decode). Both include comprehensive unit tests.
AutoRefresh concurrency engine with CancelGuard (CIP-3159 hotfix)
vendor/stack-auth/src/auto_refresh.rs
Implements the in-memory concurrency-safe token cache using a mutex-protected State, AtomicBool refresh_in_progress, and Notify. Introduces CancelGuard to clear refresh_in_progress and notify waiters when a refresh future is cancelled in the post-HTTP pre-install window. Includes unit tests, stress tests with an Axum server, and a targeted CIP-3159 regression test.
Access key type, refresher, and strategy
vendor/stack-auth/src/access_key.rs, vendor/stack-auth/src/access_key_refresher.rs, vendor/stack-auth/src/access_key_strategy.rs
Defines AccessKey with CSAK-format parsing and InvalidAccessKey errors, AccessKeyRefresher (no-op persistence, POST to api/authorise), and AccessKeyStrategy builder with environment/service-discovery URL resolution wrapped in AutoRefresh. Includes concurrency and stress tests.
OAuth strategy, AutoStrategy, and StaticTokenStrategy
vendor/stack-auth/src/oauth_refresher.rs, vendor/stack-auth/src/oauth_strategy.rs, vendor/stack-auth/src/auto_strategy.rs, vendor/stack-auth/src/static_token_strategy.rs
Implements OAuthRefresher (optional profile-store persistence), OAuthStrategy builder for in-memory token and ProfileStore sources, AutoStrategy detection enum (access-key > OAuth > NotAuthenticated) with env-var fallback builder, and StaticTokenStrategy.
Device code flow, device client provisioning, and examples
vendor/stack-auth/src/device_code/..., vendor/stack-auth/src/device_client.rs, vendor/stack-auth/examples/*
Implements RFC 8628 device code flow (DeviceCodeStrategy, PendingDeviceCode polling with slow_down/expiry handling, profile persistence), bind_client_device for ZeroKMS CreateClient provisioning (409 no-op, secretkey.json save), protocol wire types, full test suites, and auto_strategy/device_code runnable examples.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested reviewers

  • tobyhede

Poem

🐇 A token once stalled in the night,
Its CancelGuard wavering in flight.
We patched and we vendored with care,
Now ZeroKMS no longer despairs.
The refresh guard wakes — all is right! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title clearly and concisely summarizes the primary changes: backporting a critical stack-auth token-refresh fix and releasing version 2.2.3 to address the ZeroKMS authentication deadlock issue.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch james/cip-3159-token-refresh

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@freshtonic freshtonic changed the title fix: backport stack-auth token-refresh fix (CIP-3159); release 2.2.3 fix: backport stack-auth token-refresh fix; release 2.2.3 Jun 17, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
vendor/stack-auth/CHANGELOG.md (1)

1-168: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Normalize the vendored changelog structure before merge.

This file is currently malformed for changelog consumption: repeated top-level sections, heading-order break (e.g., Line 101 starts # Changelog mid-file), and no [Unreleased] section at the top. It also matches the reported MD001/MD024 violations.

As per coding guidelines, “Maintain CHANGELOG.md following Keep a Changelog format with [Unreleased] section at the top and entries organized by type (Added, Changed, Deprecated, Removed, Fixed, Security)” and “Write changelog entries from the user's perspective, not implementation details.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@vendor/stack-auth/CHANGELOG.md` around lines 1 - 168, Restructure the
CHANGELOG.md file to follow Keep a Changelog format properly. Move the `#
Changelog` heading to the very top of the file before all content. Add an
`[Unreleased]` section immediately after the main heading and before any
version-dated sections. Consolidate the duplicate sections (Documentation,
Features, Fixes, Miscellaneous, Refactoring, Testing, Style) that appear
multiple times throughout the file, merging duplicate entries under each
category type. Ensure each version section follows the standard format with a
version number and date, and organize all entries by category type within each
version to comply with Keep a Changelog standards and resolve the MD001/MD024
violations.

Sources: Coding guidelines, Linters/SAST tools

🧹 Nitpick comments (3)
vendor/stack-auth/src/token.rs (1)

489-489: 💤 Low value

Replace expect() with unwrap() per coding guidelines.

As per coding guidelines, use unwrap() instead of expect() in test code unless providing meaningful context.

Suggested fix
-        .expect("failed to encode JWT");
+        .unwrap();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@vendor/stack-auth/src/token.rs` at line 489, Replace the `.expect("failed to
encode JWT")` call with `.unwrap()` to comply with coding guidelines that
recommend using `unwrap()` instead of `expect()` in test code. Locate the line
containing the expect call on the JWT encoding operation and remove the message
string, replacing it with a simple unwrap() method call.

Source: Coding guidelines

vendor/stack-auth/tasks.toml (1)

7-7: 💤 Low value

Add Windows platform support for native binary copy.

The binary copy command handles macOS (.dylib) and Linux (.so) but lacks a Windows (.dll) fallback. If Windows CI/local development is expected, this will silently fail to copy the native library.

📦 Proposed fix to include Windows `.dll` fallback
- "cp ../../../target/debug/libstack_auth_node.dylib stack-auth-node.node 2>/dev/null || cp ../../../target/debug/libstack_auth_node.so stack-auth-node.node",
+ "cp ../../../target/debug/libstack_auth_node.dylib stack-auth-node.node 2>/dev/null || cp ../../../target/debug/libstack_auth_node.so stack-auth-node.node 2>/dev/null || cp ../../../target/debug/stack_auth_node.dll stack-auth-node.node",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@vendor/stack-auth/tasks.toml` at line 7, The binary copy command in the build
task only provides fallbacks for macOS (.dylib) and Linux (.so) platforms,
missing support for Windows (.dll). Extend the existing command chain by adding
another fallback condition after the Linux fallback that attempts to copy the
Windows .dll file (libstack_auth_node.dll) to stack-auth-node.node, ensuring the
command handles all three platforms gracefully.
vendor/stack-auth/src/auto_strategy.rs (1)

343-377: 💤 Low value

Environment variable tests may be flaky under parallel execution.

Tests that modify CS_WORKSPACE_CRN can interfere with each other when run in parallel. Consider using a test mutex or serial_test crate to ensure these tests run sequentially, or restructure to avoid relying on real env var manipulation.

However, since this is vendored code for a hotfix and the tests appear to restore env vars, this is acceptable for now.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@vendor/stack-auth/src/auto_strategy.rs` around lines 343 - 377, The tests
explicit_access_key_without_crn_and_no_env_returns_missing_workspace_crn and
invalid_crn_env_var_returns_invalid_crn modify the shared CS_WORKSPACE_CRN
environment variable and may interfere with each other under parallel test
execution. While the tests do restore the environment variable, to prevent
flakiness in parallel execution, add the #[serial] attribute from the
serial_test crate to these tests to ensure they run sequentially, or
alternatively wrap the environment variable manipulation in a test mutex to
provide thread-safe isolation across parallel test runs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@CHANGELOG.md`:
- Line 13: The changelog entry for version 2.2.3 on line 13 contains
implementation details like the `stack-auth` backport and `CIP-3159` reference
that should not be exposed to end users. Reword the entry to focus only on the
user-visible behavior and impact, describing what the user experiences (ZeroKMS
authentication failures ~15 minutes after startup have been fixed) without
mentioning internal technical implementation details like token refresh
mechanisms or upstream backports.

In `@vendor/stack-auth/src/device_client.rs`:
- Around line 114-118: The code currently returns Ok(()) when receiving a 409
CONFLICT status in the device client provisioning flow, but this is incorrect
because at that point in the function the local secretkey.json has already been
confirmed as missing. Instead of treating the conflict as success, return an
error result that clearly indicates the device provisioning failed because
another client is already provisioned server-side while the local key material
is absent. This ensures the provisioning contract is upheld by signaling failure
when local key material cannot be obtained, regardless of the server-side state.

In `@vendor/stack-auth/tasks.toml`:
- Line 3: The task configuration in the dir field references a non-existent
working directory path {{config_root}}/packages/stack-auth/node and a
non-existent package stack-auth-node. Verify the actual directory structure in
the repository and update the dir value to point to the correct working
directory location where the stack-auth Node.js package actually exists, or if
this package and task are not yet implemented in the repository, remove this
task configuration from the tasks.toml file until it is ready to be integrated.

---

Outside diff comments:
In `@vendor/stack-auth/CHANGELOG.md`:
- Around line 1-168: Restructure the CHANGELOG.md file to follow Keep a
Changelog format properly. Move the `# Changelog` heading to the very top of the
file before all content. Add an `[Unreleased]` section immediately after the
main heading and before any version-dated sections. Consolidate the duplicate
sections (Documentation, Features, Fixes, Miscellaneous, Refactoring, Testing,
Style) that appear multiple times throughout the file, merging duplicate entries
under each category type. Ensure each version section follows the standard
format with a version number and date, and organize all entries by category type
within each version to comply with Keep a Changelog standards and resolve the
MD001/MD024 violations.

---

Nitpick comments:
In `@vendor/stack-auth/src/auto_strategy.rs`:
- Around line 343-377: The tests
explicit_access_key_without_crn_and_no_env_returns_missing_workspace_crn and
invalid_crn_env_var_returns_invalid_crn modify the shared CS_WORKSPACE_CRN
environment variable and may interfere with each other under parallel test
execution. While the tests do restore the environment variable, to prevent
flakiness in parallel execution, add the #[serial] attribute from the
serial_test crate to these tests to ensure they run sequentially, or
alternatively wrap the environment variable manipulation in a test mutex to
provide thread-safe isolation across parallel test runs.

In `@vendor/stack-auth/src/token.rs`:
- Line 489: Replace the `.expect("failed to encode JWT")` call with `.unwrap()`
to comply with coding guidelines that recommend using `unwrap()` instead of
`expect()` in test code. Locate the line containing the expect call on the JWT
encoding operation and remove the message string, replacing it with a simple
unwrap() method call.

In `@vendor/stack-auth/tasks.toml`:
- Line 7: The binary copy command in the build task only provides fallbacks for
macOS (.dylib) and Linux (.so) platforms, missing support for Windows (.dll).
Extend the existing command chain by adding another fallback condition after the
Linux fallback that attempts to copy the Windows .dll file
(libstack_auth_node.dll) to stack-auth-node.node, ensuring the command handles
all three platforms gracefully.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2eccdc1c-c2be-48af-b9bf-81bfadee5304

📥 Commits

Reviewing files that changed from the base of the PR and between bdeead3 and 9b4f9b6.

⛔ Files ignored due to path filters (2)
  • Cargo.lock is excluded by !**/*.lock
  • vendor/stack-auth/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (27)
  • CHANGELOG.md
  • Cargo.toml
  • vendor/stack-auth/.gitignore
  • vendor/stack-auth/CHANGELOG.md
  • vendor/stack-auth/Cargo.toml
  • vendor/stack-auth/Cargo.toml.orig
  • vendor/stack-auth/LICENSE
  • vendor/stack-auth/README.md
  • vendor/stack-auth/examples/auto_strategy.rs
  • vendor/stack-auth/examples/device_code.rs
  • vendor/stack-auth/src/access_key.rs
  • vendor/stack-auth/src/access_key_refresher.rs
  • vendor/stack-auth/src/access_key_strategy.rs
  • vendor/stack-auth/src/auto_refresh.rs
  • vendor/stack-auth/src/auto_strategy.rs
  • vendor/stack-auth/src/device_client.rs
  • vendor/stack-auth/src/device_code/mod.rs
  • vendor/stack-auth/src/device_code/protocol.rs
  • vendor/stack-auth/src/device_code/tests.rs
  • vendor/stack-auth/src/lib.rs
  • vendor/stack-auth/src/oauth_refresher.rs
  • vendor/stack-auth/src/oauth_strategy.rs
  • vendor/stack-auth/src/refresher.rs
  • vendor/stack-auth/src/service_token.rs
  • vendor/stack-auth/src/static_token_strategy.rs
  • vendor/stack-auth/src/token.rs
  • vendor/stack-auth/tasks.toml

Comment thread CHANGELOG.md Outdated
Comment thread vendor/stack-auth/src/device_client.rs
Comment thread vendor/stack-auth/tasks.toml Outdated
- CHANGELOG: reword the 2.2.3 entry to be user-facing (drop stack-auth/CIP-3159 implementation detail), per the project changelog convention. Traceability stays in the commit/PR.
- Trim the vendored crate to library essentials: remove tasks.toml (node build tooling unused by Proxy), the upstream CHANGELOG.md, and Cargo.toml.orig. src/ is kept verbatim from the published stack-auth 0.34.1-alpha.4 plus the single CIP-3159 fix.

Intentionally NOT changed: lint/structure findings inside vendored upstream library code (src/token.rs expect(), src/auto_strategy.rs env-var test isolation, src/device_client.rs 409 handling). These are verbatim third-party code in paths Proxy doesn't exercise; they should be fixed upstream, not diverged in a surgical hotfix vendor.
@freshtonic

Copy link
Copy Markdown
Contributor Author

Thanks for the review. Triaged the findings — none touched the actual fix (auto_refresh.rs) or the regression test. Addressed in 9411667:

Fixed

  • CHANGELOG.md (2.2.3 entry) leaked implementation detail — reworded to be user-facing per our changelog convention; dropped the stack-auth/CIP-3159 sentence (traceability stays in the commit + this PR).
  • vendor/stack-auth/tasks.toml (critical) and the vendored CHANGELOG.md (major) — removed. These are upstream packaging/Node-tooling files, not part of the library build. Also dropped Cargo.toml.orig. The vendor is now trimmed to library essentials.

Intentionally not changed (vendored upstream code, verbatim)
vendor/stack-auth/src/** is a verbatim copy of the published stack-auth 0.34.1-alpha.4 crate with a single surgical change — the CIP-3159 CancelGuard reordering in auto_refresh.rs. I'm deliberately not linting or altering third-party library logic/tests in this hotfix vendor, so the following are skipped:

  • src/token.rs:489 expect()unwrap() — vendored test style.
  • src/auto_strategy.rs env-var test isolation — vendored tests; restore their env and pass.
  • src/device_client.rs 409 handling — vendored OAuth device-provisioning logic that Proxy never exercises (Proxy uses access keys). Changing upstream semantics here would diverge our copy from the real crate; if it's a real bug it should be fixed upstream.

This whole vendor + [patch.crates-io] block is temporary and gets removed once Proxy moves to a cipherstash-client built against stack-auth >= 0.36.0.

@freshtonic freshtonic merged commit 1e2807c into main Jun 17, 2026
6 checks passed
@freshtonic freshtonic deleted the james/cip-3159-token-refresh branch June 17, 2026 05:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants