Skip to content

fix(hooks): release lease when beforeLease hook fails with onFailure=endLease#823

Merged
mangelajo merged 5 commits into
mainfrom
fix/hooks-endlease-release
Jun 23, 2026
Merged

fix(hooks): release lease when beforeLease hook fails with onFailure=endLease#823
mangelajo merged 5 commits into
mainfrom
fix/hooks-endlease-release

Conversation

@mangelajo

@mangelajo mangelajo commented Jun 23, 2026

Copy link
Copy Markdown
Member

Problem

When a beforeLease hook fails with on_failure=endLease, the exporter reports BEFORE_LEASE_HOOK_FAILED status but never actually releases the lease. This leaves the exporter stuck in BeforeLeaseHookFailed state and the lease active indefinitely. Clients that attempt to connect get Connection to exporter lost.

Similarly, when a beforeLease hook fails with on_failure=exit, the exporter called shutdown(wait_for_lease_exit=True) which deferred exit until the lease ends — but nobody released the lease, so the exporter hung forever in a zombie state.

Fixes #639
Fixes #640

Root Cause

In hooks.py, run_before_lease_hook()'s error handlers for both endLease and exit had issues:

  • endLease: Only reported status but never called request_lease_release() — the callback that frees the lease on the controller and unblocks the connection loop.
  • exit: Called shutdown(wait_for_lease_exit=True) which deferred exit until the lease ends, but the lease was never released, creating a deadlock.

Fix

endLease branch

  • Set skip_after_lease_hook = True to prevent the after-lease hook from running
  • Release the lease via request_lease_release() in a shielded finally block

exit branch

  • Changed to shutdown(wait_for_lease_exit=False) for immediate exit
  • The lease is not released — it stays active so the client can observe the failure status and expires naturally on the controller
  • Set skip_after_lease_hook = True

Structural improvements

  • Extracted _safe_release_lease() helper for safe error-handling around lease release
  • Moved lease release into the finally block with a should_release flag, wrapped in CancelScope(shield=True) to protect from task group cancellation — matching the pattern used by run_after_lease_hook

Changes

File Change
hooks.py Fix both endLease and exit branches; add _safe_release_lease() helper; shielded finally block
hooks_test.py Update/add tests for both failure modes, including error handling paths
hooks_test.go Add e2e tests B3 (lease release + recovery) and B4 (afterLease skipped)
exporter-hooks-before-fail-endLease-with-after.yaml New e2e exporter config for B4 test

Testing

  • 53 hooks unit tests pass
  • 126 total exporter tests pass
  • ruff lint clean
  • New e2e tests cover the endLease regression
  • Existing e2e tests B3/B5 and H2 cover the exit scenario"

…endLease

When a beforeLease hook fails with on_failure='endLease', the exporter
reported BEFORE_LEASE_HOOK_FAILED status but never actually released the
lease. This left the exporter stuck in BeforeLeaseHookFailed state and
the lease active indefinitely.

The root cause was that run_before_lease_hook()'s endLease error handler
only logged the failure and reported status, but never called
request_lease_release() (which sends release_lease=True to the controller
and sets the lease_ended event to unblock the connection loop).

Fix:
- Set skip_after_lease_hook=True to prevent afterLease hook from running
- Add 1s sleep to give the client time to poll the final status
- Call request_lease_release() to actually end the lease on the controller

This mirrors the pattern already used by run_after_lease_hook()'s finally
block.

Fixes: #639
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Warning

Review limit reached

@mangelajo, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 10 minutes and 47 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses rolling per-developer review limits. Reviews become available again as older review attempts age out of the rolling limit window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0505c78e-8dc1-499c-afa9-b7ca991b355c

📥 Commits

Reviewing files that changed from the base of the PR and between 9c2071e and be92f9f.

📒 Files selected for processing (2)
  • python/packages/jumpstarter/jumpstarter/exporter/hooks.py
  • python/packages/jumpstarter/jumpstarter/exporter/hooks_test.py
📝 Walkthrough

Walkthrough

When a beforeLease hook fails with onFailure=endLease, the exporter now actively calls a request_lease_release callback (after a 1-second delay) and sets lease_scope.skip_after_lease_hook=True. Unit tests are updated and two new E2E test cases (B3, B4) with a new YAML fixture validate lease recovery and afterLease suppression.

Changes

beforeLease endLease failure: active lease release and afterLease skip

Layer / File(s) Summary
hooks.py endLease path + unit test
python/packages/jumpstarter/jumpstarter/exporter/hooks.py, python/packages/jumpstarter/jumpstarter/exporter/hooks_test.py
run_before_lease_hook's endLease handler now sleeps 1.0s, calls request_lease_release in a try/except, and sets lease_scope.skip_after_lease_hook=True. Unit test updated to assert the release call is made, skip flag is set, and shutdown is not called.
E2E fixture and tests
e2e/exporters/exporter-hooks-before-fail-endLease-with-after.yaml, e2e/test/hooks_test.go
New YAML fixture defines a failing beforeLease with onFailure: endLease and an afterLease with onFailure: warn. New E2E cases B3 (lease recovery after failure) and B4 (afterLease not executed) are added; prior exit-mode test renumbered from B3 to B5.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • jumpstarter-dev/jumpstarter#349: Also modifies run_before_lease_hook and _cleanup_after_lease hook teardown coordination and exit-mode status handling in hooks.py.
  • jumpstarter-dev/jumpstarter#603: Changes whether afterLease runs based on lease state, directly related to the skip_after_lease_hook control flow added here.
  • jumpstarter-dev/jumpstarter#655: Modifies run_before_lease_hook in the same hooks.py and hooks_test.py files, adjusting before-lease hook completion guards.

Suggested reviewers

  • evakhoni
  • kirkbrauer

Poem

🐇 A hook said "endLease!" and meant it for real,
But the exporter just shrugged — what a bad deal!
Now a callback appears, waits a tick, then lets go,
The afterLease sleeps while the lease exits slow.
No more stuck exporters, the rabbit hops free! 🌿

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately captures the main fix: releasing the lease when beforeLease hook fails with onFailure=endLease.
Linked Issues check ✅ Passed The PR fully addresses issue #639 by implementing the missing lease release call in the endLease error handler, preventing the stuck state.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the lease release issue: hook implementation, unit tests, e2e tests, and a required config file.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description check ✅ Passed The pull request description is directly related to the changeset and provides comprehensive context about the problem, root cause, fix, and testing performed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/hooks-endlease-release

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

Extract the lease release try/except into a helper method to bring
run_before_lease_hook complexity from 11 to within the ruff C901
limit of 10.
Add test that verifies run_before_lease_hook handles errors from
request_lease_release gracefully (e.g. controller unavailable)
without raising.
@mangelajo mangelajo added this to the 0.9.0 milestone Jun 23, 2026
@mangelajo mangelajo requested review from evakhoni and kirkbrauer June 23, 2026 11:58
Comment on lines +686 to 696
lease_scope.skip_after_lease_hook = True
await report_status(
ExporterStatus.BEFORE_LEASE_HOOK_FAILED,
f"beforeLease hook failed (on_failure=endLease): {e}",
)
# No request_lease_release - client will discover failure and release
# Delay to give client time to poll the final status before releasing
await anyio.sleep(1.0)
await self._safe_release_lease(request_lease_release)

except Exception as e:
logger.error("beforeLease hook failed with unexpected error: %s", e, exc_info=True)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run_after_lease_hook protects its release with a finally block inside CancelScope(shield=True). Consider wrapping the sleep-and-release sequence here in a shielded cancel scope, or moving the release into the finally block guarded by a flag. I think both blocks are somewhat identical and can be refactored.

Comment thread python/packages/jumpstarter/jumpstarter/exporter/hooks.py
Comment thread e2e/test/hooks_test.go

@kirkbrauer kirkbrauer left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks ok for now, might want to look into refactoring this into the finally bloack as @raballew mentioned, also having the FSM implemented would help a lot here.

…it semantics

Address review feedback:
- Move the sleep-and-release sequence from the except block into the
  finally block, guarded by a should_release flag and wrapped in
  CancelScope(shield=True). This mirrors run_after_lease_hook's pattern
  and ensures the release completes even if the task group is cancelled.
- Fix on_failure=exit semantics: the exit branch now calls
  shutdown(wait_for_lease_exit=False) for immediate exit without
  releasing the lease. The lease stays active so the client can observe
  the failure status; it expires naturally on the controller.
  This also fixes #640 where the exporter hung forever because
  shutdown(wait_for_lease_exit=True) waited for a lease end that
  never came.
Comment thread python/packages/jumpstarter/jumpstarter/exporter/hooks.py Outdated
logger.info("beforeLease hook completed successfully")

except HookExecutionError as e:
lease_scope.skip_after_lease_hook = True

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did we change the scope of this? it's unrelated to this bugfix.

- Restore wait_for_lease_exit=True for the exit branch (unchanged
  from original behavior)
- Move skip_after_lease_hook back into each branch explicitly rather
  than hoisting it above the if/else, preserving the original code
  structure
@mangelajo mangelajo requested a review from raballew June 23, 2026 15:18
@mangelajo

Copy link
Copy Markdown
Member Author

I think this looks ok for now, might want to look into refactoring this into the finally bloack as @raballew mentioned, also having the FSM implemented would help a lot here.

yep, we should eventually invest in the FSM, ... probably will revisit this comment in 2030 🤣

@mangelajo mangelajo enabled auto-merge (squash) June 23, 2026 15:20
@mangelajo mangelajo merged commit 4a5d98e into main Jun 23, 2026
37 of 39 checks passed
@mangelajo mangelajo deleted the fix/hooks-endlease-release branch June 23, 2026 15:58
kirkbrauer added a commit that referenced this pull request Jun 23, 2026
…PORTER_LABELS)

Port of main #53 (d074b99) to the Rust client + thin Python facade.

`jmp shell`, when holding a remote controller lease, now exports JMP_EXPORTER,
JMP_LEASE, and JMP_EXPORTER_LABELS to the child process; the exporter's labels are
fetched from the controller (`get_exporter`) at acquisition (non-fatal on failure),
sorted and comma-joined as `k=v`. Direct mode (`--tls-grpc`) sets none.

Read side (Python): `jumpstarter.utils.env.ExporterMetadata` + `env_with_metadata()`
context managers expose the connected exporter's name/labels/lease from within a
shell session.

Also correct a stale comment in the Rust hooks (#823, already satisfied): on a
beforeLease endLease failure the client never reaches LEASE_READY so it never
connects (has_client stays false) → afterLease is skipped and the lease is released.

Verified e2e against the live controller: the three vars are set with the exporter's
real labels (board=stress,rack=lab-a) and env_with_metadata() reads them back.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
kirkbrauer added a commit that referenced this pull request Jun 23, 2026
Brings the 41 commits that landed on main since the branch point (Jun 12):
CI/renovate/container configs, the new jumpstarter-driver-obd, sdwire/flashers/ble
driver updates, JEP-0014 docs, and Go-controller deploy changes — all of which are
orthogonal to the Rust rewrite and merged cleanly.

Conflict resolution (our Rust rewrite supersedes the Python these patches touched):
- exporter/hooks.py, hooks_test.py, client/lease.py, common/utils_test.py: kept
  deleted (hooks, lease lifecycle, and shell launching are owned by the Rust core).
- common/utils.py, utils/env.py: kept ours (thin client; utils/env.py retains the
  #53 ExporterMetadata/env_with_metadata helper ported in the previous commit).
- config/env.py: union of both (JMP_EXPORTER/JMP_EXPORTER_LABELS + JMP_LEASE).
- python/Makefile: kept both new .PHONY targets (sync-native + test-report).
- uv.lock: regenerated from the merged pyprojects.

The two main patches that DID touch rewritten logic were hand-ported first: #53
(exporter-context shell env vars) and #823 (hooks endLease release, already satisfied).

Verified: 128 Python tests pass across core + the merged drivers; no dangling
references to the deleted modules; uv sync reconciles the env (obd installed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

3 participants