[Hooks] PR2 — VM preemption detection (poller daemon) by zpoint · Pull Request #9712 · skypilot-org/skypilot

zpoint · 2026-05-25T11:47:03Z

Summary

Builds on the merged PR1 lifecycle-hooks framework (#9064) by adding the VM-side preemption detector that PR1 explicitly deferred. On AWS / GCP / Azure VMs, a daemon thread inside skylet polls the cloud's metadata endpoint and SIGTERMs skylet on a detected spot reclaim — flowing into the same hook_executor.try_claim_teardown('preemption') path that K8s preStop already uses.

What changes

New module sky/skylet/preemption_poller.py (~180 lines).
- AWS: IMDSv2 /latest/meta-data/spot/instance-action (404 = clear, 200 = scheduled).
- GCP: long-poll /computeMetadata/v1/instance/preempted?wait_for_change=true.
- Azure: IMDS Scheduled Events, filters for Preempt (ignores Freeze/Reboot/Redeploy).
- All HTTP via stdlib urllib.request — no new dependencies.
- Public API: start(cloud) -> threading.Event (returned event .set() stops the daemon).
Skylet boot wiring sky/skylet/skylet.py:
- _detect_cloud_for_preemption_poller() probes each cloud's metadata endpoint with a 1s timeout. K8s pods short-circuit via KUBERNETES_SERVICE_HOST.
- _should_install_preemption_sigterm_handler(detected_cloud) returns True for both K8s (kubelet preStop) and VM clouds (poller SIGTERM). Handler installation routes both signal sources through the same hook-executor path.
Smoke tests (S14/S15/S16) for GCP simulate-maintenance, Azure simulate-eviction, and a no-false-positive test on on-demand VMs. AWS is unit-only (no public reclaim-simulation API).

Why this is in PR2, not PR1

PR1 shipped the framework + K8s preemption + VM autostop/down. The VM preemption-detection daemon was deliberately deferred to keep PR1's review surface small. Until PR2 lands, a preemption hook on a VM cluster is accepted by the schema and stored on the skylet, but the hook never fires because no signal source delivers preemption notices to the skylet on VM clouds.

Test plan

Unit tests (16 tests, all green locally):

pytest tests/unit_tests/test_preemption_poller.py -v

Covers:

AWS IMDSv2 token + spot/instance-action polling (positive detect, 404 clear, retry on 5xx)
GCP wait_for_change long-poll (positive detect, idle, error backoff)
Azure IMDS scheduled events (Preempt fires; Freeze/Reboot/Redeploy ignored)
start() dispatches to the correct cloud loop, raises on unknown cloud
SIGTERM-handler-installation contract: must install on K8s OR any VM cloud (the bug I caught in the cherry-picked commits)

Smoke tests (real-cloud):

pytest tests/smoke_tests/test_lifecycle_hooks.py::test_preemption_gcp_simulate_maintenance --generic-cloud gcp
pytest tests/smoke_tests/test_lifecycle_hooks.py::test_preemption_azure_simulate_eviction --generic-cloud azure
pytest tests/smoke_tests/test_lifecycle_hooks.py::test_preemption_no_false_positive --generic-cloud aws

Plus all existing PR1 hook tests still pass (75 unit tests + the K8s + VM smoke suites).

Stacked work that comes after

PR3 (worker fan-out) was previously stacked on this branch; the existing PR97 was closed and will be re-opened against this PR2 once it lands.

Adds a per-cloud VM preemption detector that runs alongside skylet on AWS/GCP/Azure VMs. On detecting a pending spot reclaim, the poller sends SIGTERM to skylet, which then runs preemption hooks through the same `hook_executor.try_claim_teardown` CAS path used by Kubernetes preStop (PR1). New module: `sky/skylet/preemption_poller.py` - AWS IMDSv2: polls `spot/instance-action` (404 = clear). - GCP: long-poll `/instance/preempted?wait_for_change=true`. - Azure: polls IMDS Scheduled Events, fires only on `Preempt` event type (ignores Freeze/Reboot/Redeploy). - All HTTP via stdlib `urllib.request` — no new dependencies. - Public API: `start(cloud) -> threading.Event` (daemon thread, call `.set()` on the returned event to stop). Skylet boot wiring (`sky/skylet/skylet.py`): - `_detect_cloud_for_preemption_poller()` probes each cloud's metadata endpoint with a 1s timeout. Returns `None` for K8s / bare-metal / Slurm (where the SIGTERM path alone suffices). - `main()` starts the poller iff cloud is detected. Unit tests (`tests/unit_tests/test_preemption_poller.py`): 11 tests with mocked `urllib.request.urlopen` covering positive detection, idle polling, cloud dispatch, and unsupported-cloud rejection.

EKS pods can reach the underlying EC2 IMDS, so the previous probe-everything strategy mis-detected EKS pods as 'aws' and started a poller that races the K8s preStop bridge. Short-circuit: return None when KUBERNETES_SERVICE_HOST is set in the env (kubelet sets this automatically on every K8s pod). Cheaper and unambiguous compared to relying on IMDS reachability. This commit lives on PR2 because that's where _detect_cloud_for_preemption_poller is defined.

Appended at the end of test_lifecycle_hooks.py (post-PR1 layout) since the cherry-pick from the original PR2 branch (commit 6529b15) hit structural conflicts against the merged PR1 baseline. Tests are the same shape as the original commit — only the file position changed. Tests added: - S14 test_preemption_gcp_simulate_maintenance (@pytest.mark.gcp): Launches GCP spot VM, runs ``gcloud compute instances simulate-maintenance-event`` to flip the metadata ``/instance/preempted`` flag, expects the poller's long-poll to wake up, send SIGTERM to skylet, and fire the preemption hook. - S15 test_preemption_azure_simulate_eviction (@pytest.mark.azure): Launches Azure spot VM, runs ``az vm simulate-eviction`` which triggers the IMDS scheduled-events ``Preempt`` event; the poller catches it and SIGTERMs the skylet. - S16 test_preemption_no_false_positive (generic_cloud, @no_kubernetes): On-demand VM with a preemption hook; verifies the poller runs but never fires (no spurious SIGTERM, no preemption.log written). AWS is unit-only (no public simulate-spot-reclaim API) — covered by ``tests/unit_tests/test_preemption_poller.py`` with mocked IMDSv2 responses. K8s is excluded for S16 because K8s clusters don't run the VM-side poller (preStop SIGTERM is the K8s path, shipped in PR1). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The cherry-picked PR2 commits installed the preemption SIGTERM handler only when ``KUBERNETES_SERVICE_HOST`` was set. The poller signals preemption by sending SIGTERM to skylet's own PID — with no handler installed, that SIGTERM killed skylet immediately without running the preemption hook, defeating the entire purpose of the PR2 VM-side preemption-detection work. Bug: ``_should_install_preemption_sigterm_handler()`` gated only on K8s, with a docstring claiming "On VM clouds, preemption is detected via the metadata poller ... so installing a SIGTERM handler here would be dead code." That reasoning was inverted: the poller's *output signal* is SIGTERM, so the handler is exactly what's needed to catch it. Fix: 1. ``_should_install_preemption_sigterm_handler(detected_cloud)`` takes the cloud identity from the boot probe; returns True when EITHER K8s env-var is set OR a VM cloud was detected (the poller will run and SIGTERM us). 2. In ``main()``, run cloud detection FIRST so the handler decision can use it. Install handler if any preemption signal source exists; start poller iff cloud is non-None. Regression coverage (red → green TDD): * ``test_should_install_sigterm_handler_on_kubernetes``: K8s env var → True (original path preserved) * ``test_should_install_sigterm_handler_on_vm_cloud`` × (aws/gcp/azure) → True (the bug we just fixed) * ``test_should_not_install_sigterm_handler_when_no_signal_source`` → False (bare-metal / Slurm) 16/16 preemption-poller unit tests green; 75/75 lifecycle-hooks unit tests still green. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request implements background preemption pollers for AWS, GCP, and Azure to detect spot instance reclaims via metadata endpoints. The skylet is updated to detect the cloud environment at startup, install a SIGTERM handler, and launch the relevant poller. New smoke and unit tests verify the detection logic across providers. Feedback focuses on improving the AWS poller's efficiency by caching the IMDSv2 token and refining the GCP long-polling logic to prevent race conditions and unnecessary backoffs during timeouts.

gemini-code-assist · 2026-05-25T11:48:30Z

+    while not stop_event.is_set():
+        token = _get_aws_imds_token()
+        if token is None:
+            stop_event.wait(_AWS_POLL_INTERVAL_SECONDS)
+            continue


The AWS IMDSv2 token is currently fetched in every iteration of the polling loop (every 5 seconds). Since the token is requested with a 6-hour TTL (21600 seconds), fetching it so frequently is inefficient and may lead to unnecessary load or rate-limiting on the IMDS endpoint. It is recommended to cache the token and reuse it across iterations.

token = None while not stop_event.is_set(): if token is None: token = _get_aws_imds_token() if token is None: stop_event.wait(_AWS_POLL_INTERVAL_SECONDS) continue

gemini-code-assist · 2026-05-25T11:48:30Z

+        except urllib.error.HTTPError as e:
+            # 404 means no action scheduled — the happy path.
+            if e.code != 404:
+                logger.debug(f'preemption_poller[aws]: HTTPError {e.code}: {e}')


When caching the IMDSv2 token, ensure that a 401 Unauthorized response (which indicates an expired or invalid token) triggers a token refresh in the next iteration by clearing the cached value.

Suggested change

except urllib.error.HTTPError as e:

# 404 means no action scheduled — the happy path.

if e.code != 404:

logger.debug(f'preemption_poller[aws]: HTTPError {e.code}: {e}')

except urllib.error.HTTPError as e:

if e.code == 401:

token = None

# 404 means no action scheduled — the happy path.

elif e.code != 404:

logger.debug(f'preemption_poller[aws]: HTTPError {e.code}: {e}')

gemini-code-assist · 2026-05-25T11:48:30Z

+            with urllib.request.urlopen(req, timeout=60) as resp:
+                body = resp.read().decode().strip().upper()
+                if body == 'TRUE':
+                    logger.info(
+                        'preemption_poller[gcp]: preempted flag flipped TRUE')
+                    _signal_preemption()
+                    return
+        except Exception as e:  # pylint: disable=broad-except
+            logger.debug(f'preemption_poller[gcp]: poll error: {e}')
+            stop_event.wait(_GCP_POLL_INTERVAL_SECONDS)


In GCP long polling, the metadata server typically waits for 60 seconds before returning a response if no change occurs. Setting the client-side timeout to exactly 60 seconds can lead to frequent race-condition timeouts. It is recommended to set the client timeout slightly higher (e.g., 70 seconds). Additionally, a timeout during long polling is an expected event and should not trigger the 5-second backoff (_GCP_POLL_INTERVAL_SECONDS), as this creates unnecessary gaps in preemption detection coverage.

…ontract ``test_sigterm_handler_installed_only_on_kubernetes`` in test_lifecycle_hooks.py was added in PR1 and pinned the K8s-only handler-install contract. PR2's poller daemon sends SIGTERM to the skylet's own PID on VM clouds, so the handler must install there too. The PR1 test contradicts the corrected behavior; the new tests in tests/unit_tests/test_preemption_poller.py (``test_should_install_sigterm_handler_on_kubernetes`` + ``test_should_install_sigterm_handler_on_vm_cloud`` + ``test_should_not_install_sigterm_handler_when_no_signal_source``) fully cover the new contract. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

devin-ai-integration

Devin Review found 1 potential issue.

View 5 additional findings in Devin Review.

devin-ai-integration · 2026-05-25T11:53:50Z

+        except Exception as e:  # pylint: disable=broad-except
+            logger.debug(f'preemption_poller[gcp]: poll error: {e}')
+            stop_event.wait(_GCP_POLL_INTERVAL_SECONDS)


🔴 GCP poller busy-loops because sleep is only in the except branch, not after successful FALSE responses

The _poll_gcp poller places stop_event.wait(_GCP_POLL_INTERVAL_SECONDS) inside the except block (line 118), so it only sleeps on error. When the GCP metadata server returns FALSE (the normal non-preempted state), the code falls through the if body == 'TRUE' check and immediately loops back to make another request — with no sleep.

This is compounded by the fact that wait_for_change=true without a last_etag parameter causes the GCP metadata server to return the current value immediately (per GCP docs: "If you don't provide the last_etag parameter, the metadata server returns the current value immediately"). The result is a tight busy-loop hammering the metadata server continuously for the entire lifetime of the VM.

Both the AWS poller (sky/skylet/preemption_poller.py:90) and Azure poller (sky/skylet/preemption_poller.py:144) correctly place stop_event.wait() outside the try/except so it always executes. The GCP poller is inconsistent with both.

Control flow comparison across pollers

AWS (correct — sleep always runs):

try: ... except: ... stop_event.wait(...) # outside try/except

Azure (correct — sleep always runs):

try: ... except: ... stop_event.wait(...) # outside try/except

GCP (broken — sleep only on error):

try: ... # if FALSE: falls through except: stop_event.wait(...) # inside except! # no sleep here

Suggested change

except Exception as e: # pylint: disable=broad-except

logger.debug(f'preemption_poller[gcp]: poll error: {e}')

stop_event.wait(_GCP_POLL_INTERVAL_SECONDS)

except Exception as e: # pylint: disable=broad-except

logger.debug(f'preemption_poller[gcp]: poll error: {e}')

stop_event.wait(_GCP_POLL_INTERVAL_SECONDS)

Was this helpful? React with 👍 or 👎 to provide feedback.

zpoint · 2026-05-25T12:04:28Z

/quicktest-core
/smoke-test -k test_lifecycle_hooks --kubernetes
/smoke-test -k test_lifecycle_hooks --aws

zpoint · 2026-05-25T12:31:31Z

All CI green on commit `be5b84bce`.

CI matrix

Layer	Build	Result
GH Actions	run 26393497232	✅ 21/21 green
BK quicktest-core	#3278	✅ passed
BK smoke --aws (`-k test_lifecycle_hooks`)	#10915	✅ passed — incl. `test_preemption_no_false_positive` (new PR2 smoke that pins: on-demand VM with a preemption hook never spuriously fires it)
BK smoke --kubernetes (`-k test_lifecycle_hooks`)	#10914	✅ passed — all 8 existing K8s hook tests still green; PR2 is additive on the VM side so K8s behavior is unchanged

Note

`test_preemption_gcp_simulate_maintenance` (`@pytest.mark.gcp`) and `test_preemption_azure_simulate_eviction` (`@pytest.mark.azure`) are scoped to the GCP and Azure smoke pipelines respectively — they'd run on `/smoke-test -k test_lifecycle_hooks --gcp` and `--azure`, which I can trigger if you'd like additional real-cloud sign-off. Default-cloud AWS verification + the no-false-positive test cover the core poller invariant.

What this PR adds

The user-visible YAML/CLI surface is unchanged — PR1 already accepted `events: [preemption]` on VM clusters. PR2 makes that hook actually fire on a real spot reclaim.

Cloud × Event	PR1 (merged)	PR2 (this PR)
K8s × stop / preemption / down	✅ shipped	(no change)
VM × stop / down	✅ shipped	(no change)
VM × preemption	❌ accepted-but-doesn't-fire	✅ fires via metadata poller → SIGTERM → existing hook_executor path

Bug caught during prep

The original PR2 commits gated the SIGTERM-handler installation only on K8s (`KUBERNETES_SERVICE_HOST` env var). But the new `preemption_poller` daemon signals preemption by sending SIGTERM to skylet's own PID — with no handler installed on VM clouds, that SIGTERM would have killed skylet without firing any hook, defeating the entire point of PR2. Fixed in commit `ac1f82d4d` with 5 new contract-pinning unit tests covering the K8s / each VM cloud / no-source cases.

PR ready for review.

zpoint and others added 4 commits May 25, 2026 19:40

gemini-code-assist Bot reviewed May 25, 2026

View reviewed changes

devin-ai-integration Bot reviewed May 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Hooks] PR2 — VM preemption detection (poller daemon)#9712

[Hooks] PR2 — VM preemption detection (poller daemon)#9712
zpoint wants to merge 5 commits into
skypilot-org:masterfrom
zpoint:feature/vm-preemption-detection

zpoint commented May 25, 2026 •

edited by devin-ai-integration Bot

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 25, 2026

Uh oh!

gemini-code-assist Bot May 25, 2026

Uh oh!

gemini-code-assist Bot May 25, 2026

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

devin-ai-integration Bot May 25, 2026

Uh oh!

zpoint commented May 25, 2026

Uh oh!

zpoint commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

zpoint commented May 25, 2026 • edited by devin-ai-integration Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changes

Why this is in PR2, not PR1

Test plan

Stacked work that comes after

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

zpoint commented May 25, 2026

Uh oh!

zpoint commented May 25, 2026

CI matrix

Note

What this PR adds

Bug caught during prep

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

zpoint commented May 25, 2026 •

edited by devin-ai-integration Bot

Loading