CON-1514: summarize self-test startup failure evidence by jjziets · Pull Request #421 · vast-ai/vast-cli

jjziets · 2026-06-15T09:54:14Z

Summary

Improves self-test startup failure diagnostics by using the daemon log evidence already collected for the support bundle.

When an instance fails before the self-test runtime starts, the CLI can now:

classify Vast startup wrapper build/status lines like apt-get update || echo 'V220614a...' as daemon startup failures instead of generic status errors
distinguish the Dockerfile command line from the actual V220614a failure marker output
detect when the wrapper apt-get update step completed successfully
detect CDI GPU device injection failures in instance/daemon.log
print short startup evidence in the terminal output before the diagnostic bundle summary

This addresses the observed case where the visible status message made it look like apt/networking failed, while the bundle showed apt succeeded and the real failure was later Docker/NVIDIA CDI device injection.
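The classification described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `StartupEvidence` fields, the `summarize_daemon_log` and `likely_cause` helpers, the CDI error wording, and the rule "command seen without marker output means apt succeeded" are all assumptions made for the example. The marker is matched only by its visible prefix, since the full string is truncated in the description.

```python
import re
from dataclasses import dataclass

# Assumed prefix: the PR text only shows the truncated form 'V220614a...'.
FAILURE_MARKER_PREFIX = "V220614a"
# Assumed wording for a Docker/NVIDIA CDI error; the real daemon text may differ.
CDI_FAILURE_RE = re.compile(r"CDI device injection failed", re.IGNORECASE)


@dataclass
class StartupEvidence:
    apt_command_seen: bool = False      # wrapper build command mentions apt-get update
    apt_marker_emitted: bool = False    # failure marker was printed as real output
    cdi_injection_failed: bool = False  # CDI GPU device injection error was seen

    @property
    def apt_succeeded(self) -> bool:
        # Assumption: the command ran and its `|| echo` fallback never fired.
        return self.apt_command_seen and not self.apt_marker_emitted


def is_command_echo(line: str) -> bool:
    """True when the line only quotes the wrapper command, marker included."""
    return "apt-get update" in line and "echo" in line


def summarize_daemon_log(lines: list[str]) -> StartupEvidence:
    evidence = StartupEvidence()
    for line in lines:
        if is_command_echo(line):
            evidence.apt_command_seen = True
        elif line.strip().startswith(FAILURE_MARKER_PREFIX):
            evidence.apt_marker_emitted = True
        if CDI_FAILURE_RE.search(line):
            evidence.cdi_injection_failed = True
    return evidence


def likely_cause(evidence: StartupEvidence) -> str:
    """Prefer the later, more specific failure over the apt marker."""
    if evidence.cdi_injection_failed:
        return "cdi_gpu_injection"
    if evidence.apt_marker_emitted:
        return "apt_update"
    return "unknown"
```

The key design point is that a line quoting the command is never counted as a failure; only the marker appearing on its own as output is.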

Impact

The support bundle remains the full evidence pack, but hosts now get a clearer first-pass diagnosis in CLI output:

whether apt-get update actually failed or only appeared in the build command
whether the self-test runtime ever started
whether the likely cause is host daemon/NVIDIA runtime state rather than the self-test image

No self-test image change is required for this PR because the affected failure happens before /verification/remote.sh and the runtime test scripts run.

Validation

uv run --with pytest --with requests --with pycryptodome pytest tests/cli/test_runtime_diagnostics.py tests/cli/test_self_test_support_bundle.py -> 28 passed
uv run --with pytest --with requests --with pycryptodome pytest tests/cli -> 306 passed
git diff --check -> clean

jjziets · 2026-06-15T10:04:49Z

Dogfood follow-up on machine 14629 hit the intended path:

selected vastai/test:self-test-v2-cuda-11.8
instance failed before self-test runtime start with CDI GPU injection failure
CLI classified it as daemon_startup_failed
instance cleanup succeeded on attempt 1
diagnostic bundle was saved with instance/daemon.log, instance/container.log, and instance/show-instance.json

After the dogfood run, I tightened the terminal formatting so the output is easier to scan:

grouped into Result, What happened, Underlying error, Remediation, Suggested steps, and Where to read next
moved the long daemon error into an indented block instead of inline with the summary
shortened the final Test failed: line to the human summary while preserving the full underlying error above and in the bundle
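As an illustration of the grouped layout, here is a small sketch of a formatter that emits the six sections in a fixed order and puts long text in an indented block. The `format_report` function, its input shape, and the indent and wrap widths are hypothetical choices for this example, not the formatter added in the PR.

```python
from __future__ import annotations

import textwrap

# Section titles and order as listed in the comment above.
SECTION_ORDER = [
    "Result",
    "What happened",
    "Underlying error",
    "Remediation",
    "Suggested steps",
    "Where to read next",
]


def format_report(sections: dict[str, str | list[str]]) -> str:
    """Render known sections in order; empty or missing sections are skipped."""
    out: list[str] = []
    for title in SECTION_ORDER:
        body = sections.get(title)
        if not body:
            continue
        out.append(f"{title}:")
        if isinstance(body, list):
            # Lists (for example suggested steps) become short bullet lines.
            out.extend(f"  - {item}" for item in body)
        else:
            # Long text such as the daemon error is wrapped and indented.
            wrapped = textwrap.fill(body, width=76)
            out.append(textwrap.indent(wrapped, "    "))
        out.append("")
    return "\n".join(out).rstrip() + "\n"
```

Skipping empty sections keeps the output short when a piece of evidence is unavailable, at the cost of the section silently disappearing.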

Validation after formatter update:

focused diagnostics/support/machine tests: 76 passed
full CLI suite: 306 passed
git diff --check: clean

jjziets · 2026-06-15T10:25:16Z

Final dogfood/demo refresh after formatting update:

The first rerun on machine 14629 confirmed that the grouped formatter works, but exposed a gap: in that run the CDI failure appeared in the instance status/underlying error, while the final CDI line had not yet reached instance/daemon.log when the log was collected. The output was cleaner, but the What happened evidence section was missing.
I updated the PR to refine startup evidence from both instance/daemon.log and the instance status/underlying error text, then added a regression test for status-only CDI evidence.
Final rerun on 14629 now prints the intended sections: Result, What happened, Underlying error, Remediation, Suggested steps, and Where to read next.
Final green pass rerun on machine 8649 passed with vastai/test:self-test-v2-cuda-12.8 and cleaned up successfully.
Final active-instance audit returned 0 active instances.
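The status-message fallback mentioned above might look something like the sketch below, together with a regression test for the status-only case. The `find_cdi_evidence` helper, the matched phrase, and the sample strings are assumptions for illustration; the real helper and the real error text in the PR may differ.

```python
from typing import Optional

# Assumed phrase; the actual CDI error wording is not quoted in this thread.
CDI_HINT = "cdi device injection failed"


def find_cdi_evidence(daemon_log: str, status_msg: str) -> Optional[tuple[str, str]]:
    """Return (source, line) for the first CDI failure line.

    instance/daemon.log is checked first; the instance status text is the
    fallback for runs where the log was collected before the line arrived.
    """
    sources = (("instance/daemon.log", daemon_log), ("status", status_msg))
    for source, text in sources:
        for line in text.splitlines():
            if CDI_HINT in line.lower():
                return source, line.strip()
    return None


def test_status_only_cdi_evidence() -> None:
    # Hypothetical inputs: the log has no CDI line, the status text does.
    log = "pulling image\nstarting container\n"
    status = "Error response from daemon: CDI device injection failed"
    assert find_cdi_evidence(log, status) == ("status", status)
```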

Validation after the status-message fallback fix:

focused diagnostics/support/machine tests: 77 passed
full CLI suite: 307 passed
git diff --check: clean

Local demo artifacts for the meeting:

startup failure comparison: self-test-demo/captures/startup-14629-before-after-pr421.html
green pass comparison: self-test-demo/captures/green-8649-before-after-pr421.html

jjziets · 2026-06-16T10:45:30Z

Small follow-up: added self-test log level vocabulary matching the existing VAST_LOG_LEVEL names:

critical
error
warning
info
debug

Scope is intentionally self-test only for this PR. Usage:

vastai self-test machine <machine_id> --log-level debug
# or
VAST_LOG_LEVEL=debug vastai self-test machine <machine_id>

debug enables the existing --debugging output path. --debugging remains supported and resolves to log level debug internally. The resolved level/source is included in raw diagnostics/support-bundle metadata so support can see how the command was run.
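A minimal sketch of how the level and its source could be resolved is shown below. The `resolve_log_level` function is hypothetical, and the precedence used here (explicit `--log-level`, then `--debugging`, then `VAST_LOG_LEVEL`, then the default) is an assumption; the thread does not state the actual order.

```python
import os
from typing import Mapping, Optional

VALID_LEVELS = ("critical", "error", "warning", "info", "debug")


def resolve_log_level(
    cli_level: Optional[str],
    debugging: bool,
    env: Mapping[str, str] = os.environ,
) -> tuple[str, str]:
    """Return (level, source) so the source can be recorded in bundle metadata."""
    if cli_level:
        candidate, source = cli_level, "--log-level"
    elif debugging:
        # --debugging stays supported and maps to the debug level.
        candidate, source = "debug", "--debugging"
    elif env.get("VAST_LOG_LEVEL"):
        candidate, source = env["VAST_LOG_LEVEL"], "VAST_LOG_LEVEL"
    else:
        candidate, source = "info", "default"

    level = candidate.strip().lower()
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown log level {candidate!r} from {source}")
    return level, source
```

Returning the source alongside the level is what lets support see later whether a run was verbose because of a flag or because of the environment.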

Validation after this change:

focused self-test diagnostics/support tests: 82 passed
full CLI suite: 312 passed
git diff --check: clean

jjziets · 2026-06-16T11:03:30Z

Follow-up on self-test verbosity: default info is now compact instead of printing the full diagnostic stream.

Behavior after this update:

debug: preserves the previous verbose behavior, including live progress lines, preflight actual/required/purpose/remediation details, endpoint diagnostics, and full support-bundle contents.
info: default compact lifecycle output plus a meaningful final summary.
warning/error/critical: suppress lifecycle chatter but still print the final summary; warnings/errors still surface when relevant.
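The per-level behavior above can be expressed as a small gating function. This is a sketch under assumptions: the message kinds (`summary`, `lifecycle`, `detail`) and the `should_print` helper are invented for the example, and warnings and errors are filtered with ordinary threshold semantics, which may not match how the PR decides what is "relevant".

```python
import logging

LEVELS = {
    "critical": logging.CRITICAL,
    "error": logging.ERROR,
    "warning": logging.WARNING,
    "info": logging.INFO,
    "debug": logging.DEBUG,
}


def should_print(kind: str, level: str) -> bool:
    """Decide whether a message of the given kind is shown at this log level."""
    threshold = LEVELS[level]
    if kind == "summary":
        return True  # the final summary is always printed
    if kind == "detail":
        return threshold <= logging.DEBUG  # full diagnostic stream: debug only
    if kind == "lifecycle":
        return threshold <= logging.INFO  # compact progress: info and debug
    if kind in LEVELS:
        return LEVELS[kind] >= threshold  # warnings/errors: normal threshold rule
    raise ValueError(f"unknown message kind: {kind!r}")
```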

The final summary now keeps the important user-facing signal: pass/fail, machine id, failed stage, failed check names, reason, support bundle path if created, and next action. Raw output and support bundles still retain the detailed diagnostic data.

Validation after this change:

focused self-test machine/support tests: 63 passed
full CLI + SDK sweep: 369 passed
git diff --check: clean

Commit: cfcf198

jjziets · 2026-06-16T11:05:39Z

Cross-platform CI on the updated head cfcf198 is green:

install-smoke: macOS, Ubuntu, Windows passed
unit-and-integration: macOS, Ubuntu, Windows passed
detect-changes: passed
serverless-testing: skipped by workflow

Run: https://github.com/vast-ai/vast-cli/actions/runs/27612989435

jjziets marked this pull request as ready for review June 15, 2026 09:57

jjziets force-pushed the CON-1514-startup-evidence-summary branch from 31935db to 92a8526 on June 15, 2026 10:04

jjziets force-pushed the CON-1514-startup-evidence-summary branch from 92a8526 to d1b2b95 on June 15, 2026 10:19

jjziets force-pushed the CON-1514-startup-evidence-summary branch from d1b2b95 to 2247070 on June 16, 2026 10:45

Commit cfcf198: CON-1514 summarize startup failure evidence

jjziets force-pushed the CON-1514-startup-evidence-summary branch from 2247070 to cfcf198 on June 16, 2026 11:03

jjziets wants to merge 1 commit into vast-ai:master from jjziets:CON-1514-startup-evidence-summary
