Skip to content

[Hooks] PR2 — VM preemption detection (poller daemon)#9712

Open
zpoint wants to merge 5 commits into
skypilot-org:masterfrom
zpoint:feature/vm-preemption-detection
Open

[Hooks] PR2 — VM preemption detection (poller daemon)#9712
zpoint wants to merge 5 commits into
skypilot-org:masterfrom
zpoint:feature/vm-preemption-detection

Conversation

@zpoint
Copy link
Copy Markdown
Collaborator

@zpoint zpoint commented May 25, 2026

Summary

Builds on the merged PR1 lifecycle-hooks framework (#9064) by adding the VM-side preemption detector that PR1 explicitly deferred. On AWS / GCP / Azure VMs, a daemon thread inside skylet polls the cloud's metadata endpoint and SIGTERMs skylet on a detected spot reclaim — flowing into the same hook_executor.try_claim_teardown('preemption') path that K8s preStop already uses.

What changes

  • New module sky/skylet/preemption_poller.py (~180 lines).

    • AWS: IMDSv2 /latest/meta-data/spot/instance-action (404 = clear, 200 = scheduled).
    • GCP: long-poll /computeMetadata/v1/instance/preempted?wait_for_change=true.
    • Azure: IMDS Scheduled Events, filters for Preempt (ignores Freeze/Reboot/Redeploy).
    • All HTTP via stdlib urllib.request — no new dependencies.
    • Public API: start(cloud) -> threading.Event (returned event .set() stops the daemon).
  • Skylet boot wiring sky/skylet/skylet.py:

    • _detect_cloud_for_preemption_poller() probes each cloud's metadata endpoint with a 1s timeout. K8s pods short-circuit via KUBERNETES_SERVICE_HOST.
    • _should_install_preemption_sigterm_handler(detected_cloud) returns True for both K8s (kubelet preStop) and VM clouds (poller SIGTERM). Handler installation routes both signal sources through the same hook-executor path.
  • Smoke tests (S14/S15/S16) for GCP simulate-maintenance, Azure simulate-eviction, and a no-false-positive test on on-demand VMs. AWS is unit-only (no public reclaim-simulation API).

Why this is in PR2, not PR1

PR1 shipped the framework + K8s preemption + VM autostop/down. The VM preemption-detection daemon was deliberately deferred to keep PR1's review surface small. Until PR2 lands, a preemption hook on a VM cluster is accepted by the schema and stored on the skylet, but the hook never fires because no signal source delivers preemption notices to the skylet on VM clouds.

Test plan

Unit tests (16 tests, all green locally):

pytest tests/unit_tests/test_preemption_poller.py -v

Covers:

  • AWS IMDSv2 token + spot/instance-action polling (positive detect, 404 clear, retry on 5xx)
  • GCP wait_for_change long-poll (positive detect, idle, error backoff)
  • Azure IMDS scheduled events (Preempt fires; Freeze/Reboot/Redeploy ignored)
  • start() dispatches to the correct cloud loop, raises on unknown cloud
  • SIGTERM-handler-installation contract: must install on K8s OR any VM cloud (the bug I caught in the cherry-picked commits)

Smoke tests (real-cloud):

  • pytest tests/smoke_tests/test_lifecycle_hooks.py::test_preemption_gcp_simulate_maintenance --generic-cloud gcp
  • pytest tests/smoke_tests/test_lifecycle_hooks.py::test_preemption_azure_simulate_eviction --generic-cloud azure
  • pytest tests/smoke_tests/test_lifecycle_hooks.py::test_preemption_no_false_positive --generic-cloud aws

Plus all existing PR1 hook tests still pass (75 unit tests + the K8s + VM smoke suites).

Stacked work that comes after

PR3 (worker fan-out) was previously stacked on this branch; the existing PR97 was closed and will be re-opened against this PR2 once it lands.


Open in Devin Review

zpoint and others added 4 commits May 25, 2026 19:40
Adds a per-cloud VM preemption detector that runs alongside skylet
on AWS/GCP/Azure VMs. On detecting a pending spot reclaim, the
poller sends SIGTERM to skylet, which then runs preemption hooks
through the same `hook_executor.try_claim_teardown` CAS path used
by Kubernetes preStop (PR1).

New module: `sky/skylet/preemption_poller.py`
- AWS IMDSv2: polls `spot/instance-action` (404 = clear).
- GCP: long-poll `/instance/preempted?wait_for_change=true`.
- Azure: polls IMDS Scheduled Events, fires only on `Preempt` event
  type (ignores Freeze/Reboot/Redeploy).
- All HTTP via stdlib `urllib.request` — no new dependencies.
- Public API: `start(cloud) -> threading.Event` (daemon thread,
  call `.set()` on the returned event to stop).

Skylet boot wiring (`sky/skylet/skylet.py`):
- `_detect_cloud_for_preemption_poller()` probes each cloud's
  metadata endpoint with a 1s timeout. Returns `None` for K8s /
  bare-metal / Slurm (where the SIGTERM path alone suffices).
- `main()` starts the poller iff cloud is detected.

Unit tests (`tests/unit_tests/test_preemption_poller.py`): 11 tests
with mocked `urllib.request.urlopen` covering positive detection,
idle polling, cloud dispatch, and unsupported-cloud rejection.
EKS pods can reach the underlying EC2 IMDS, so the previous
probe-everything strategy mis-detected EKS pods as 'aws' and started
a poller that races the K8s preStop bridge.

Short-circuit: return None when KUBERNETES_SERVICE_HOST is set in
the env (kubelet sets this automatically on every K8s pod). Cheaper
and unambiguous compared to relying on IMDS reachability.

This commit lives on PR2 because that's where
_detect_cloud_for_preemption_poller is defined.
Appended at the end of test_lifecycle_hooks.py (post-PR1 layout) since
the cherry-pick from the original PR2 branch (commit 6529b15) hit
structural conflicts against the merged PR1 baseline. Tests are the
same shape as the original commit — only the file position changed.

Tests added:

  - S14 test_preemption_gcp_simulate_maintenance (@pytest.mark.gcp):
    Launches GCP spot VM, runs ``gcloud compute instances
    simulate-maintenance-event`` to flip the metadata
    ``/instance/preempted`` flag, expects the poller's long-poll to
    wake up, send SIGTERM to skylet, and fire the preemption hook.

  - S15 test_preemption_azure_simulate_eviction (@pytest.mark.azure):
    Launches Azure spot VM, runs ``az vm simulate-eviction`` which
    triggers the IMDS scheduled-events ``Preempt`` event; the poller
    catches it and SIGTERMs the skylet.

  - S16 test_preemption_no_false_positive (generic_cloud, @no_kubernetes):
    On-demand VM with a preemption hook; verifies the poller runs but
    never fires (no spurious SIGTERM, no preemption.log written).

AWS is unit-only (no public simulate-spot-reclaim API) — covered by
``tests/unit_tests/test_preemption_poller.py`` with mocked IMDSv2
responses. K8s is excluded for S16 because K8s clusters don't run the
VM-side poller (preStop SIGTERM is the K8s path, shipped in PR1).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The cherry-picked PR2 commits installed the preemption SIGTERM
handler only when ``KUBERNETES_SERVICE_HOST`` was set. The poller
signals preemption by sending SIGTERM to skylet's own PID — with no
handler installed, that SIGTERM killed skylet immediately without
running the preemption hook, defeating the entire purpose of the
PR2 VM-side preemption-detection work.

Bug: ``_should_install_preemption_sigterm_handler()`` gated only on
K8s, with a docstring claiming "On VM clouds, preemption is detected
via the metadata poller ... so installing a SIGTERM handler here
would be dead code." That reasoning was inverted: the poller's
*output signal* is SIGTERM, so the handler is exactly what's needed
to catch it.

Fix:

1. ``_should_install_preemption_sigterm_handler(detected_cloud)``
   takes the cloud identity from the boot probe; returns True when
   EITHER K8s env-var is set OR a VM cloud was detected (the poller
   will run and SIGTERM us).

2. In ``main()``, run cloud detection FIRST so the handler decision
   can use it. Install handler if any preemption signal source
   exists; start poller iff cloud is non-None.

Regression coverage (red → green TDD):

  * ``test_should_install_sigterm_handler_on_kubernetes``: K8s env
    var → True (original path preserved)
  * ``test_should_install_sigterm_handler_on_vm_cloud`` × (aws/gcp/azure)
    → True (the bug we just fixed)
  * ``test_should_not_install_sigterm_handler_when_no_signal_source``
    → False (bare-metal / Slurm)

16/16 preemption-poller unit tests green; 75/75 lifecycle-hooks
unit tests still green.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements background preemption pollers for AWS, GCP, and Azure to detect spot instance reclaims via metadata endpoints. The skylet is updated to detect the cloud environment at startup, install a SIGTERM handler, and launch the relevant poller. New smoke and unit tests verify the detection logic across providers. Feedback focuses on improving the AWS poller's efficiency by caching the IMDSv2 token and refining the GCP long-polling logic to prevent race conditions and unnecessary backoffs during timeouts.

Comment on lines +67 to +71
while not stop_event.is_set():
token = _get_aws_imds_token()
if token is None:
stop_event.wait(_AWS_POLL_INTERVAL_SECONDS)
continue
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The AWS IMDSv2 token is currently fetched in every iteration of the polling loop (every 5 seconds). Since the token is requested with a 6-hour TTL (21600 seconds), fetching it so frequently is inefficient and may lead to unnecessary load or rate-limiting on the IMDS endpoint. It is recommended to cache the token and reuse it across iterations.

    token = None
    while not stop_event.is_set():
        if token is None:
            token = _get_aws_imds_token()
        if token is None:
            stop_event.wait(_AWS_POLL_INTERVAL_SECONDS)
            continue

Comment on lines +84 to +87
except urllib.error.HTTPError as e:
# 404 means no action scheduled — the happy path.
if e.code != 404:
logger.debug(f'preemption_poller[aws]: HTTPError {e.code}: {e}')
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When caching the IMDSv2 token, ensure that a 401 Unauthorized response (which indicates an expired or invalid token) triggers a token refresh in the next iteration by clearing the cached value.

Suggested change
except urllib.error.HTTPError as e:
# 404 means no action scheduled — the happy path.
if e.code != 404:
logger.debug(f'preemption_poller[aws]: HTTPError {e.code}: {e}')
except urllib.error.HTTPError as e:
if e.code == 401:
token = None
# 404 means no action scheduled — the happy path.
elif e.code != 404:
logger.debug(f'preemption_poller[aws]: HTTPError {e.code}: {e}')

Comment on lines +109 to +118
with urllib.request.urlopen(req, timeout=60) as resp:
body = resp.read().decode().strip().upper()
if body == 'TRUE':
logger.info(
'preemption_poller[gcp]: preempted flag flipped TRUE')
_signal_preemption()
return
except Exception as e: # pylint: disable=broad-except
logger.debug(f'preemption_poller[gcp]: poll error: {e}')
stop_event.wait(_GCP_POLL_INTERVAL_SECONDS)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In GCP long polling, the metadata server typically waits for 60 seconds before returning a response if no change occurs. Setting the client-side timeout to exactly 60 seconds can lead to frequent race-condition timeouts. It is recommended to set the client timeout slightly higher (e.g., 70 seconds). Additionally, a timeout during long polling is an expected event and should not trigger the 5-second backoff (_GCP_POLL_INTERVAL_SECONDS), as this creates unnecessary gaps in preemption detection coverage.

…ontract

``test_sigterm_handler_installed_only_on_kubernetes`` in
test_lifecycle_hooks.py was added in PR1 and pinned the
K8s-only handler-install contract. PR2's poller daemon sends SIGTERM
to the skylet's own PID on VM clouds, so the handler must install
there too. The PR1 test contradicts the corrected behavior; the
new tests in tests/unit_tests/test_preemption_poller.py
(``test_should_install_sigterm_handler_on_kubernetes`` +
``test_should_install_sigterm_handler_on_vm_cloud`` +
``test_should_not_install_sigterm_handler_when_no_signal_source``)
fully cover the new contract.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 potential issue.

View 5 additional findings in Devin Review.

Open in Devin Review

Comment on lines +116 to +118
except Exception as e: # pylint: disable=broad-except
logger.debug(f'preemption_poller[gcp]: poll error: {e}')
stop_event.wait(_GCP_POLL_INTERVAL_SECONDS)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 GCP poller busy-loops because sleep is only in the except branch, not after successful FALSE responses

The _poll_gcp poller places stop_event.wait(_GCP_POLL_INTERVAL_SECONDS) inside the except block (line 118), so it only sleeps on error. When the GCP metadata server returns FALSE (the normal non-preempted state), the code falls through the if body == 'TRUE' check and immediately loops back to make another request — with no sleep.

This is compounded by the fact that wait_for_change=true without a last_etag parameter causes the GCP metadata server to return the current value immediately (per GCP docs: "If you don't provide the last_etag parameter, the metadata server returns the current value immediately"). The result is a tight busy-loop hammering the metadata server continuously for the entire lifetime of the VM.

Both the AWS poller (sky/skylet/preemption_poller.py:90) and Azure poller (sky/skylet/preemption_poller.py:144) correctly place stop_event.wait() outside the try/except so it always executes. The GCP poller is inconsistent with both.

Control flow comparison across pollers

AWS (correct — sleep always runs):

try: ... except: ...
stop_event.wait(...)   # outside try/except

Azure (correct — sleep always runs):

try: ... except: ...
stop_event.wait(...)   # outside try/except

GCP (broken — sleep only on error):

try:
    ...  # if FALSE: falls through
except:
    stop_event.wait(...)  # inside except!
# no sleep here
Suggested change
except Exception as e: # pylint: disable=broad-except
logger.debug(f'preemption_poller[gcp]: poll error: {e}')
stop_event.wait(_GCP_POLL_INTERVAL_SECONDS)
except Exception as e: # pylint: disable=broad-except
logger.debug(f'preemption_poller[gcp]: poll error: {e}')
stop_event.wait(_GCP_POLL_INTERVAL_SECONDS)
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

@zpoint
Copy link
Copy Markdown
Collaborator Author

zpoint commented May 25, 2026

/quicktest-core
/smoke-test -k test_lifecycle_hooks --kubernetes
/smoke-test -k test_lifecycle_hooks --aws

@zpoint
Copy link
Copy Markdown
Collaborator Author

zpoint commented May 25, 2026

All CI green on commit `be5b84bce`.

CI matrix

Layer Build Result
GH Actions run 26393497232 21/21 green
BK quicktest-core #3278 passed
BK smoke --aws (`-k test_lifecycle_hooks`) #10915 passed — incl. `test_preemption_no_false_positive` (new PR2 smoke that pins: on-demand VM with a preemption hook never spuriously fires it)
BK smoke --kubernetes (`-k test_lifecycle_hooks`) #10914 passed — all 8 existing K8s hook tests still green; PR2 is additive on the VM side so K8s behavior is unchanged

Note

`test_preemption_gcp_simulate_maintenance` (`@pytest.mark.gcp`) and `test_preemption_azure_simulate_eviction` (`@pytest.mark.azure`) are scoped to the GCP and Azure smoke pipelines respectively — they'd run on `/smoke-test -k test_lifecycle_hooks --gcp` and `--azure`, which I can trigger if you'd like additional real-cloud sign-off. Default-cloud AWS verification + the no-false-positive test cover the core poller invariant.

What this PR adds

The user-visible YAML/CLI surface is unchanged — PR1 already accepted `events: [preemption]` on VM clusters. PR2 makes that hook actually fire on a real spot reclaim.

Cloud × Event PR1 (merged) PR2 (this PR)
K8s × stop / preemption / down ✅ shipped (no change)
VM × stop / down ✅ shipped (no change)
VM × preemption ❌ accepted-but-doesn't-fire fires via metadata poller → SIGTERM → existing hook_executor path

Bug caught during prep

The original PR2 commits gated the SIGTERM-handler installation only on K8s (`KUBERNETES_SERVICE_HOST` env var). But the new `preemption_poller` daemon signals preemption by sending SIGTERM to skylet's own PID — with no handler installed on VM clouds, that SIGTERM would have killed skylet without firing any hook, defeating the entire point of PR2. Fixed in commit `ac1f82d4d` with 5 new contract-pinning unit tests covering the K8s / each VM cloud / no-source cases.

PR ready for review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant