[Hooks] PR2 — VM preemption detection (poller daemon)#9712
Conversation
Adds a per-cloud VM preemption detector that runs alongside skylet on AWS/GCP/Azure VMs. On detecting a pending spot reclaim, the poller sends SIGTERM to skylet, which then runs preemption hooks through the same `hook_executor.try_claim_teardown` CAS path used by Kubernetes preStop (PR1). New module: `sky/skylet/preemption_poller.py` - AWS IMDSv2: polls `spot/instance-action` (404 = clear). - GCP: long-poll `/instance/preempted?wait_for_change=true`. - Azure: polls IMDS Scheduled Events, fires only on `Preempt` event type (ignores Freeze/Reboot/Redeploy). - All HTTP via stdlib `urllib.request` — no new dependencies. - Public API: `start(cloud) -> threading.Event` (daemon thread, call `.set()` on the returned event to stop). Skylet boot wiring (`sky/skylet/skylet.py`): - `_detect_cloud_for_preemption_poller()` probes each cloud's metadata endpoint with a 1s timeout. Returns `None` for K8s / bare-metal / Slurm (where the SIGTERM path alone suffices). - `main()` starts the poller iff cloud is detected. Unit tests (`tests/unit_tests/test_preemption_poller.py`): 11 tests with mocked `urllib.request.urlopen` covering positive detection, idle polling, cloud dispatch, and unsupported-cloud rejection.
EKS pods can reach the underlying EC2 IMDS, so the previous probe-everything strategy mis-detected EKS pods as 'aws' and started a poller that races the K8s preStop bridge. Short-circuit: return None when KUBERNETES_SERVICE_HOST is set in the env (kubelet sets this automatically on every K8s pod). Cheaper and unambiguous compared to relying on IMDS reachability. This commit lives on PR2 because that's where _detect_cloud_for_preemption_poller is defined.
Appended at the end of test_lifecycle_hooks.py (post-PR1 layout) since the cherry-pick from the original PR2 branch (commit 6529b15) hit structural conflicts against the merged PR1 baseline. Tests are the same shape as the original commit — only the file position changed. Tests added: - S14 test_preemption_gcp_simulate_maintenance (@pytest.mark.gcp): Launches GCP spot VM, runs ``gcloud compute instances simulate-maintenance-event`` to flip the metadata ``/instance/preempted`` flag, expects the poller's long-poll to wake up, send SIGTERM to skylet, and fire the preemption hook. - S15 test_preemption_azure_simulate_eviction (@pytest.mark.azure): Launches Azure spot VM, runs ``az vm simulate-eviction`` which triggers the IMDS scheduled-events ``Preempt`` event; the poller catches it and SIGTERMs the skylet. - S16 test_preemption_no_false_positive (generic_cloud, @no_kubernetes): On-demand VM with a preemption hook; verifies the poller runs but never fires (no spurious SIGTERM, no preemption.log written). AWS is unit-only (no public simulate-spot-reclaim API) — covered by ``tests/unit_tests/test_preemption_poller.py`` with mocked IMDSv2 responses. K8s is excluded for S16 because K8s clusters don't run the VM-side poller (preStop SIGTERM is the K8s path, shipped in PR1). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The cherry-picked PR2 commits installed the preemption SIGTERM
handler only when ``KUBERNETES_SERVICE_HOST`` was set. The poller
signals preemption by sending SIGTERM to skylet's own PID — with no
handler installed, that SIGTERM killed skylet immediately without
running the preemption hook, defeating the entire purpose of the
PR2 VM-side preemption-detection work.
Bug: ``_should_install_preemption_sigterm_handler()`` gated only on
K8s, with a docstring claiming "On VM clouds, preemption is detected
via the metadata poller ... so installing a SIGTERM handler here
would be dead code." That reasoning was inverted: the poller's
*output signal* is SIGTERM, so the handler is exactly what's needed
to catch it.
Fix:
1. ``_should_install_preemption_sigterm_handler(detected_cloud)``
takes the cloud identity from the boot probe; returns True when
EITHER K8s env-var is set OR a VM cloud was detected (the poller
will run and SIGTERM us).
2. In ``main()``, run cloud detection FIRST so the handler decision
can use it. Install handler if any preemption signal source
exists; start poller iff cloud is non-None.
Regression coverage (red → green TDD):
* ``test_should_install_sigterm_handler_on_kubernetes``: K8s env
var → True (original path preserved)
* ``test_should_install_sigterm_handler_on_vm_cloud`` × (aws/gcp/azure)
→ True (the bug we just fixed)
* ``test_should_not_install_sigterm_handler_when_no_signal_source``
→ False (bare-metal / Slurm)
16/16 preemption-poller unit tests green; 75/75 lifecycle-hooks
unit tests still green.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request implements background preemption pollers for AWS, GCP, and Azure to detect spot instance reclaims via metadata endpoints. The skylet is updated to detect the cloud environment at startup, install a SIGTERM handler, and launch the relevant poller. New smoke and unit tests verify the detection logic across providers. Feedback focuses on improving the AWS poller's efficiency by caching the IMDSv2 token and refining the GCP long-polling logic to prevent race conditions and unnecessary backoffs during timeouts.
| while not stop_event.is_set(): | ||
| token = _get_aws_imds_token() | ||
| if token is None: | ||
| stop_event.wait(_AWS_POLL_INTERVAL_SECONDS) | ||
| continue |
There was a problem hiding this comment.
The AWS IMDSv2 token is currently fetched in every iteration of the polling loop (every 5 seconds). Since the token is requested with a 6-hour TTL (21600 seconds), fetching it so frequently is inefficient and may lead to unnecessary load or rate-limiting on the IMDS endpoint. It is recommended to cache the token and reuse it across iterations.
token = None
while not stop_event.is_set():
if token is None:
token = _get_aws_imds_token()
if token is None:
stop_event.wait(_AWS_POLL_INTERVAL_SECONDS)
continue| except urllib.error.HTTPError as e: | ||
| # 404 means no action scheduled — the happy path. | ||
| if e.code != 404: | ||
| logger.debug(f'preemption_poller[aws]: HTTPError {e.code}: {e}') |
There was a problem hiding this comment.
When caching the IMDSv2 token, ensure that a 401 Unauthorized response (which indicates an expired or invalid token) triggers a token refresh in the next iteration by clearing the cached value.
| except urllib.error.HTTPError as e: | |
| # 404 means no action scheduled — the happy path. | |
| if e.code != 404: | |
| logger.debug(f'preemption_poller[aws]: HTTPError {e.code}: {e}') | |
| except urllib.error.HTTPError as e: | |
| if e.code == 401: | |
| token = None | |
| # 404 means no action scheduled — the happy path. | |
| elif e.code != 404: | |
| logger.debug(f'preemption_poller[aws]: HTTPError {e.code}: {e}') |
| with urllib.request.urlopen(req, timeout=60) as resp: | ||
| body = resp.read().decode().strip().upper() | ||
| if body == 'TRUE': | ||
| logger.info( | ||
| 'preemption_poller[gcp]: preempted flag flipped TRUE') | ||
| _signal_preemption() | ||
| return | ||
| except Exception as e: # pylint: disable=broad-except | ||
| logger.debug(f'preemption_poller[gcp]: poll error: {e}') | ||
| stop_event.wait(_GCP_POLL_INTERVAL_SECONDS) |
There was a problem hiding this comment.
In GCP long polling, the metadata server typically waits for 60 seconds before returning a response if no change occurs. Setting the client-side timeout to exactly 60 seconds can lead to frequent race-condition timeouts. It is recommended to set the client timeout slightly higher (e.g., 70 seconds). Additionally, a timeout during long polling is an expected event and should not trigger the 5-second backoff (_GCP_POLL_INTERVAL_SECONDS), as this creates unnecessary gaps in preemption detection coverage.
…ontract ``test_sigterm_handler_installed_only_on_kubernetes`` in test_lifecycle_hooks.py was added in PR1 and pinned the K8s-only handler-install contract. PR2's poller daemon sends SIGTERM to the skylet's own PID on VM clouds, so the handler must install there too. The PR1 test contradicts the corrected behavior; the new tests in tests/unit_tests/test_preemption_poller.py (``test_should_install_sigterm_handler_on_kubernetes`` + ``test_should_install_sigterm_handler_on_vm_cloud`` + ``test_should_not_install_sigterm_handler_when_no_signal_source``) fully cover the new contract. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
| except Exception as e: # pylint: disable=broad-except | ||
| logger.debug(f'preemption_poller[gcp]: poll error: {e}') | ||
| stop_event.wait(_GCP_POLL_INTERVAL_SECONDS) |
There was a problem hiding this comment.
🔴 GCP poller busy-loops because sleep is only in the except branch, not after successful FALSE responses
The _poll_gcp poller places stop_event.wait(_GCP_POLL_INTERVAL_SECONDS) inside the except block (line 118), so it only sleeps on error. When the GCP metadata server returns FALSE (the normal non-preempted state), the code falls through the if body == 'TRUE' check and immediately loops back to make another request — with no sleep.
This is compounded by the fact that wait_for_change=true without a last_etag parameter causes the GCP metadata server to return the current value immediately (per GCP docs: "If you don't provide the last_etag parameter, the metadata server returns the current value immediately"). The result is a tight busy-loop hammering the metadata server continuously for the entire lifetime of the VM.
Both the AWS poller (sky/skylet/preemption_poller.py:90) and Azure poller (sky/skylet/preemption_poller.py:144) correctly place stop_event.wait() outside the try/except so it always executes. The GCP poller is inconsistent with both.
Control flow comparison across pollers
AWS (correct — sleep always runs):
try: ... except: ...
stop_event.wait(...) # outside try/except
Azure (correct — sleep always runs):
try: ... except: ...
stop_event.wait(...) # outside try/except
GCP (broken — sleep only on error):
try:
... # if FALSE: falls through
except:
stop_event.wait(...) # inside except!
# no sleep here
| except Exception as e: # pylint: disable=broad-except | |
| logger.debug(f'preemption_poller[gcp]: poll error: {e}') | |
| stop_event.wait(_GCP_POLL_INTERVAL_SECONDS) | |
| except Exception as e: # pylint: disable=broad-except | |
| logger.debug(f'preemption_poller[gcp]: poll error: {e}') | |
| stop_event.wait(_GCP_POLL_INTERVAL_SECONDS) |
Was this helpful? React with 👍 or 👎 to provide feedback.
|
/quicktest-core |
|
All CI green on commit `be5b84bce`. CI matrix
Note`test_preemption_gcp_simulate_maintenance` (`@pytest.mark.gcp`) and `test_preemption_azure_simulate_eviction` (`@pytest.mark.azure`) are scoped to the GCP and Azure smoke pipelines respectively — they'd run on `/smoke-test -k test_lifecycle_hooks --gcp` and `--azure`, which I can trigger if you'd like additional real-cloud sign-off. Default-cloud AWS verification + the no-false-positive test cover the core poller invariant. What this PR addsThe user-visible YAML/CLI surface is unchanged — PR1 already accepted `events: [preemption]` on VM clusters. PR2 makes that hook actually fire on a real spot reclaim.
Bug caught during prepThe original PR2 commits gated the SIGTERM-handler installation only on K8s (`KUBERNETES_SERVICE_HOST` env var). But the new `preemption_poller` daemon signals preemption by sending SIGTERM to skylet's own PID — with no handler installed on VM clouds, that SIGTERM would have killed skylet without firing any hook, defeating the entire point of PR2. Fixed in commit `ac1f82d4d` with 5 new contract-pinning unit tests covering the K8s / each VM cloud / no-source cases. PR ready for review. |
Summary
Builds on the merged PR1 lifecycle-hooks framework (#9064) by adding the VM-side preemption detector that PR1 explicitly deferred. On AWS / GCP / Azure VMs, a daemon thread inside skylet polls the cloud's metadata endpoint and SIGTERMs skylet on a detected spot reclaim — flowing into the same
hook_executor.try_claim_teardown('preemption')path that K8s preStop already uses.What changes
New module
sky/skylet/preemption_poller.py(~180 lines)./latest/meta-data/spot/instance-action(404 = clear, 200 = scheduled)./computeMetadata/v1/instance/preempted?wait_for_change=true.Preempt(ignores Freeze/Reboot/Redeploy).urllib.request— no new dependencies.start(cloud) -> threading.Event(returned event.set()stops the daemon).Skylet boot wiring
sky/skylet/skylet.py:_detect_cloud_for_preemption_poller()probes each cloud's metadata endpoint with a 1s timeout. K8s pods short-circuit viaKUBERNETES_SERVICE_HOST._should_install_preemption_sigterm_handler(detected_cloud)returns True for both K8s (kubelet preStop) and VM clouds (poller SIGTERM). Handler installation routes both signal sources through the same hook-executor path.Smoke tests (S14/S15/S16) for GCP simulate-maintenance, Azure simulate-eviction, and a no-false-positive test on on-demand VMs. AWS is unit-only (no public reclaim-simulation API).
Why this is in PR2, not PR1
PR1 shipped the framework + K8s preemption + VM autostop/down. The VM preemption-detection daemon was deliberately deferred to keep PR1's review surface small. Until PR2 lands, a
preemptionhook on a VM cluster is accepted by the schema and stored on the skylet, but the hook never fires because no signal source delivers preemption notices to the skylet on VM clouds.Test plan
Unit tests (16 tests, all green locally):
Covers:
start()dispatches to the correct cloud loop, raises on unknown cloudORany VM cloud (the bug I caught in the cherry-picked commits)Smoke tests (real-cloud):
pytest tests/smoke_tests/test_lifecycle_hooks.py::test_preemption_gcp_simulate_maintenance --generic-cloud gcppytest tests/smoke_tests/test_lifecycle_hooks.py::test_preemption_azure_simulate_eviction --generic-cloud azurepytest tests/smoke_tests/test_lifecycle_hooks.py::test_preemption_no_false_positive --generic-cloud awsPlus all existing PR1 hook tests still pass (75 unit tests + the K8s + VM smoke suites).
Stacked work that comes after
PR3 (worker fan-out) was previously stacked on this branch; the existing PR97 was closed and will be re-opened against this PR2 once it lands.