fix(hooks): release lease when beforeLease hook fails with onFailure=endLease#823
Conversation
…endLease When a beforeLease hook fails with on_failure='endLease', the exporter reported BEFORE_LEASE_HOOK_FAILED status but never actually released the lease. This left the exporter stuck in BeforeLeaseHookFailed state and the lease active indefinitely. The root cause was that run_before_lease_hook()'s endLease error handler only logged the failure and reported status, but never called request_lease_release() (which sends release_lease=True to the controller and sets the lease_ended event to unblock the connection loop). Fix: - Set skip_after_lease_hook=True to prevent afterLease hook from running - Add 1s sleep to give the client time to poll the final status - Call request_lease_release() to actually end the lease on the controller This mirrors the pattern already used by run_after_lease_hook()'s finally block. Fixes: #639
|
Warning Review limit reached
More reviews will be available in 10 minutes and 47 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate. For paid Pro and Pro+ PR reviews, CodeRabbit uses rolling per-developer review limits. Reviews become available again as older review attempts age out of the rolling limit window. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughWhen a ChangesbeforeLease endLease failure: active lease release and afterLease skip
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Extract the lease release try/except into a helper method to bring run_before_lease_hook complexity from 11 to within the ruff C901 limit of 10.
Add test that verifies run_before_lease_hook handles errors from request_lease_release gracefully (e.g. controller unavailable) without raising.
| lease_scope.skip_after_lease_hook = True | ||
| await report_status( | ||
| ExporterStatus.BEFORE_LEASE_HOOK_FAILED, | ||
| f"beforeLease hook failed (on_failure=endLease): {e}", | ||
| ) | ||
| # No request_lease_release - client will discover failure and release | ||
| # Delay to give client time to poll the final status before releasing | ||
| await anyio.sleep(1.0) | ||
| await self._safe_release_lease(request_lease_release) | ||
|
|
||
| except Exception as e: | ||
| logger.error("beforeLease hook failed with unexpected error: %s", e, exc_info=True) |
There was a problem hiding this comment.
run_after_lease_hook protects its release with a finally block inside CancelScope(shield=True). Consider wrapping the sleep-and-release sequence here in a shielded cancel scope, or moving the release into the finally block guarded by a flag. I think both blocks are somewhat identical and can be refactored.
kirkbrauer
left a comment
There was a problem hiding this comment.
I think this looks ok for now, might want to look into refactoring this into the finally bloack as @raballew mentioned, also having the FSM implemented would help a lot here.
…it semantics Address review feedback: - Move the sleep-and-release sequence from the except block into the finally block, guarded by a should_release flag and wrapped in CancelScope(shield=True). This mirrors run_after_lease_hook's pattern and ensures the release completes even if the task group is cancelled. - Fix on_failure=exit semantics: the exit branch now calls shutdown(wait_for_lease_exit=False) for immediate exit without releasing the lease. The lease stays active so the client can observe the failure status; it expires naturally on the controller. This also fixes #640 where the exporter hung forever because shutdown(wait_for_lease_exit=True) waited for a lease end that never came.
| logger.info("beforeLease hook completed successfully") | ||
|
|
||
| except HookExecutionError as e: | ||
| lease_scope.skip_after_lease_hook = True |
There was a problem hiding this comment.
why did we change the scope of this? it's unrelated to this bugfix.
- Restore wait_for_lease_exit=True for the exit branch (unchanged from original behavior) - Move skip_after_lease_hook back into each branch explicitly rather than hoisting it above the if/else, preserving the original code structure
yep, we should eventually invest in the FSM, ... probably will revisit this comment in 2030 🤣 |
…PORTER_LABELS) Port of main #53 (d074b99) to the Rust client + thin Python facade. `jmp shell`, when holding a remote controller lease, now exports JMP_EXPORTER, JMP_LEASE, and JMP_EXPORTER_LABELS to the child process; the exporter's labels are fetched from the controller (`get_exporter`) at acquisition (non-fatal on failure), sorted and comma-joined as `k=v`. Direct mode (`--tls-grpc`) sets none. Read side (Python): `jumpstarter.utils.env.ExporterMetadata` + `env_with_metadata()` context managers expose the connected exporter's name/labels/lease from within a shell session. Also correct a stale comment in the Rust hooks (#823, already satisfied): on a beforeLease endLease failure the client never reaches LEASE_READY so it never connects (has_client stays false) → afterLease is skipped and the lease is released. Verified e2e against the live controller: the three vars are set with the exporter's real labels (board=stress,rack=lab-a) and env_with_metadata() reads them back. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Brings the 41 commits that landed on main since the branch point (Jun 12): CI/renovate/container configs, the new jumpstarter-driver-obd, sdwire/flashers/ble driver updates, JEP-0014 docs, and Go-controller deploy changes — all of which are orthogonal to the Rust rewrite and merged cleanly. Conflict resolution (our Rust rewrite supersedes the Python these patches touched): - exporter/hooks.py, hooks_test.py, client/lease.py, common/utils_test.py: kept deleted (hooks, lease lifecycle, and shell launching are owned by the Rust core). - common/utils.py, utils/env.py: kept ours (thin client; utils/env.py retains the #53 ExporterMetadata/env_with_metadata helper ported in the previous commit). - config/env.py: union of both (JMP_EXPORTER/JMP_EXPORTER_LABELS + JMP_LEASE). - python/Makefile: kept both new .PHONY targets (sync-native + test-report). - uv.lock: regenerated from the merged pyprojects. The two main patches that DID touch rewritten logic were hand-ported first: #53 (exporter-context shell env vars) and #823 (hooks endLease release, already satisfied). Verified: 128 Python tests pass across core + the merged drivers; no dangling references to the deleted modules; uv sync reconciles the env (obd installed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
When a
beforeLeasehook fails withon_failure=endLease, the exporter reportsBEFORE_LEASE_HOOK_FAILEDstatus but never actually releases the lease. This leaves the exporter stuck inBeforeLeaseHookFailedstate and the lease active indefinitely. Clients that attempt to connect getConnection to exporter lost.Similarly, when a
beforeLeasehook fails withon_failure=exit, the exporter calledshutdown(wait_for_lease_exit=True)which deferred exit until the lease ends — but nobody released the lease, so the exporter hung forever in a zombie state.Fixes #639
Fixes #640
Root Cause
In
hooks.py,run_before_lease_hook()'s error handlers for bothendLeaseandexithad issues:endLease: Only reported status but never calledrequest_lease_release()— the callback that frees the lease on the controller and unblocks the connection loop.exit: Calledshutdown(wait_for_lease_exit=True)which deferred exit until the lease ends, but the lease was never released, creating a deadlock.Fix
endLeasebranchskip_after_lease_hook = Trueto prevent the after-lease hook from runningrequest_lease_release()in a shieldedfinallyblockexitbranchshutdown(wait_for_lease_exit=False)for immediate exitskip_after_lease_hook = TrueStructural improvements
_safe_release_lease()helper for safe error-handling around lease releasefinallyblock with ashould_releaseflag, wrapped inCancelScope(shield=True)to protect from task group cancellation — matching the pattern used byrun_after_lease_hookChanges
hooks.pyendLeaseandexitbranches; add_safe_release_lease()helper; shieldedfinallyblockhooks_test.pyhooks_test.goexporter-hooks-before-fail-endLease-with-after.yamlTesting
endLeaseregressionexitscenario"