Verda: make startup script and SSH key lifecycle per-instance with reliable cleanup#3718

Open
peterschmidt85 wants to merge 6 commits into master from fix/verda-per-instance-resource-cleanup

@peterschmidt85
Contributor

Summary

This PR makes Verda startup scripts and SSH keys fully per-instance and ensures they are cleaned up reliably.

What changed

  • Switched Verda provisioning from shared/reused resources to per-instance resources:
    • create a dedicated startup script per instance
    • create dedicated SSH key(s) per instance
  • Persisted both resource IDs in instance backend data:
    • startup_script_id
    • ssh_key_ids
  • Added cleanup on provisioning failure:
    • if instance provisioning fails after the script or keys have been created, the created resources are deleted on a best-effort basis
  • Added cleanup on termination:
    • after the instance termination call, the startup script and SSH key(s) are deleted
  • Made cleanup idempotent and safer:
    • not-found handling now prefers the API error code (not_found) and matches only the expected invalid-ID messages returned with invalid_request
    • unrelated API errors are not swallowed
  • Removed old SSH key fingerprint reuse path (no more shared key lookup/reuse).
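
The lifecycle described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `FakeClient`, `APIError`, `provision`, and `cleanup` are hypothetical names, and the in-memory client only stands in for a provider API that reports missing resources with a `not_found` code.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

logger = logging.getLogger(__name__)


class APIError(Exception):
    """Hypothetical provider error carrying a machine-readable code."""

    def __init__(self, code: str, message: str = "") -> None:
        super().__init__(f"{code}: {message}")
        self.code = code


@dataclass
class FakeClient:
    """In-memory stand-in for a provider client (not the real Verda SDK)."""

    scripts: Dict[str, str] = field(default_factory=dict)
    keys: Dict[str, str] = field(default_factory=dict)
    fail_deploy: bool = False
    counter: int = 0

    def _new_id(self, prefix: str) -> str:
        self.counter += 1
        return f"{prefix}-{self.counter}"

    def create_script(self, name: str, body: str) -> str:
        script_id = self._new_id("script")
        self.scripts[script_id] = body
        return script_id

    def create_key(self, name: str, public_key: str) -> str:
        key_id = self._new_id("key")
        self.keys[key_id] = public_key
        return key_id

    def delete_script(self, script_id: str) -> None:
        if script_id not in self.scripts:
            raise APIError("not_found")
        del self.scripts[script_id]

    def delete_key(self, key_id: str) -> None:
        if key_id not in self.keys:
            raise APIError("not_found")
        del self.keys[key_id]

    def deploy(self, name: str, script_id: str, key_ids: List[str]) -> str:
        if self.fail_deploy:
            raise APIError("insufficient_capacity")
        return self._new_id("instance")


def delete_ignoring_not_found(delete: Callable[[str], None], resource_id: str) -> None:
    """Idempotent delete: a missing resource is fine, anything else propagates."""
    try:
        delete(resource_id)
    except APIError as e:
        if e.code != "not_found":
            raise


def cleanup(client: FakeClient, script_id: Optional[str], key_ids: Optional[List[str]]) -> None:
    """Delete per-instance resources; a no-op when no IDs were recorded."""
    if script_id is not None:
        delete_ignoring_not_found(client.delete_script, script_id)
    for key_id in key_ids or []:
        delete_ignoring_not_found(client.delete_key, key_id)


def provision(client: FakeClient, name: str, script_body: str, public_keys: List[str]) -> dict:
    """Create a dedicated script and keys, then deploy; clean up if any step fails."""
    script_id: Optional[str] = None
    key_ids: List[str] = []
    try:
        script_id = client.create_script(name, script_body)
        for i, public_key in enumerate(public_keys):
            key_ids.append(client.create_key(f"{name}-{i}", public_key))
        instance_id = client.deploy(name, script_id, key_ids)
    except Exception:
        try:
            cleanup(client, script_id, key_ids)  # best effort only
        except Exception:
            logger.exception("Cleanup after failed provisioning did not complete")
        raise  # the original provisioning error is what the caller sees
    return {"instance_id": instance_id, "startup_script_id": script_id, "ssh_key_ids": key_ids}
```

In this sketch the IDs returned by `provision` are what would be persisted as backend data, so termination can later call `cleanup` with them. Calling `cleanup` twice is safe because only the `not_found` code is ignored.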

Backward compatibility

Backward compatible:

  • VerdaInstanceBackendData fields are optional.
  • Existing instances with missing/legacy backend_data still terminate safely (no-op cleanup when IDs are absent).
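
To illustrate why optional fields keep legacy instances safe, here is a small sketch. It uses a plain dataclass and the standard `json` module rather than the project's actual model classes; `BackendData`, `load`, and `resources_to_delete` are hypothetical names chosen for this example.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BackendData:
    """Hypothetical stand-in for per-instance backend data; both fields are optional."""

    startup_script_id: Optional[str] = None
    ssh_key_ids: Optional[List[str]] = None

    @classmethod
    def load(cls, raw: Optional[str]) -> "BackendData":
        # Legacy instances may have no backend data at all.
        if not raw:
            return cls()
        payload = json.loads(raw)
        if not isinstance(payload, dict):
            return cls()
        return cls(
            startup_script_id=payload.get("startup_script_id"),
            ssh_key_ids=payload.get("ssh_key_ids"),
        )


def resources_to_delete(data: BackendData) -> List[str]:
    """Return the IDs that termination should clean up; empty for legacy data."""
    ids: List[str] = []
    if data.startup_script_id:
        ids.append(data.startup_script_id)
    ids.extend(data.ssh_key_ids or [])
    return ids
```

With missing or empty backend data, `resources_to_delete` returns an empty list, so termination proceeds without attempting any script or key deletion.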

Validation

Automated

  • pre-commit run --all-files
  • uv run pytest -q src/tests/_internal/core/backends/verda/test_compute.py ✅ (27 passed)
  • uv run pytest -q ✅ (2420 passed, 1147 skipped)

Manual end-to-end checks

  • Created a 1-node Verda fleet and verified that the DB/provider metadata contained the per-instance startup_script_id and ssh_key_ids.
  • Deleted fleet and verified:
    • the dstack fleet/instance moved to the terminated/deleted state
    • startup script became not_found in Verda
    • SSH key became not_found in Verda
  • Ran a detached hello-world task on Verda and verified that it completed successfully and that the logs contained hello-world.

Notes

  • The provider instance may remain visible as discontinued in the Verda API for some time; this is provider-side lifecycle behavior and does not affect the correctness of dstack cleanup.

AI assistance

This PR was prepared with AI assistance.

@peterschmidt85 peterschmidt85 requested a review from jvstme March 31, 2026 12:51
or message == "not found"
or ("startup script id" in message and "invalid" in message)
or ("script id" in message and "invalid" in message)
)
Collaborator

(nit) In my testing, deleting a nonexistent key or script does not actually raise any exceptions. So I assume both _is_ssh_key_not_found_error and _is_startup_script_not_found_error are hallucinations and can be removed.

peterschmidt85 and others added 4 commits April 2, 2026 10:14
Co-authored-by: jvstme <36324149+jvstme@users.noreply.github.com>
Co-authored-by: jvstme <36324149+jvstme@users.noreply.github.com>
Co-authored-by: jvstme <36324149+jvstme@users.noreply.github.com>
Co-authored-by: jvstme <36324149+jvstme@users.noreply.github.com>
2 participants