CON-1514: summarize self-test startup failure evidence#421

Open
jjziets wants to merge 1 commit into vast-ai:master from jjziets:CON-1514-startup-evidence-summary

@jjziets jjziets commented Jun 15, 2026

Contributor

Summary

Improves self-test startup failure diagnostics by using the daemon log evidence already collected for the support bundle.

When an instance fails before the self-test runtime starts, the CLI can now:

  • classify Vast startup wrapper build/status lines like apt-get update || echo 'V220614a...' as daemon startup failures instead of generic status errors
  • distinguish the Dockerfile command line from the actual V220614a failure marker output
  • detect when the wrapper apt-get update step completed successfully
  • detect CDI GPU device injection failures in instance/daemon.log
  • print short startup evidence in the terminal output before the diagnostic bundle summary

This addresses the observed case where the visible status message made it look like apt/networking failed, while the bundle showed apt succeeded and the real failure was later Docker/NVIDIA CDI device injection.
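The classification described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `summarize_startup_evidence`, the returned keys, the sample log lines, and the heuristic that a marker line still containing `echo` is the build command rather than its output are all assumptions. Only the marker prefix `V220614a`, the `apt-get update` step, and the CDI failure class come from the description.

```python
import re

# Prefix of the failure marker quoted in the PR description (the full marker is elided there).
MARKER_PREFIX = "V220614a"

# Assumed log shapes, for illustration only.
APT_DONE = "Reading package lists... Done"  # line apt normally prints on success
CDI_PATTERN = re.compile(r"\bCDI\b.*(inject|unresolvable)", re.IGNORECASE)


def summarize_startup_evidence(daemon_log: str) -> dict:
    """Scan daemon log text and report what the startup wrapper actually did."""
    evidence = {
        "marker_in_command": False,  # marker appears only inside the build command text
        "marker_emitted": False,     # marker was printed as output, so apt-get update failed
        "apt_update_ok": False,
        "cdi_failure": None,         # first matching CDI line, if any
    }
    for line in daemon_log.splitlines():
        if MARKER_PREFIX in line:
            # Assumption: a line that still contains "echo" is the command, not its output.
            if "echo" in line:
                evidence["marker_in_command"] = True
            else:
                evidence["marker_emitted"] = True
        if APT_DONE in line:
            evidence["apt_update_ok"] = True
        if evidence["cdi_failure"] is None and CDI_PATTERN.search(line):
            evidence["cdi_failure"] = line.strip()
    return evidence


# Hypothetical log excerpt shaped like the case described above.
SAMPLE_LOG = """\
Step 4/9 : RUN apt-get update || echo 'V220614a...'
Reading package lists... Done
Error response from daemon: CDI device injection failed: unresolvable CDI devices
"""

evidence = summarize_startup_evidence(SAMPLE_LOG)
print(evidence)
```

With a log like the sample, the marker is seen only in the command line, apt is reported as successful, and the CDI line is surfaced as the likely cause, which is the distinction the status message alone could not make.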

Impact

The support bundle remains the full evidence pack, but hosts now get a clearer first-pass diagnosis in CLI output:

  • whether apt-get update actually failed or only appeared in the build command
  • whether the self-test runtime ever started
  • whether the likely cause is host daemon/NVIDIA runtime state rather than the self-test image

No self-test image change is required for this PR because the affected failure happens before /verification/remote.sh and the runtime test scripts run.

Validation

  • uv run --with pytest --with requests --with pycryptodome pytest tests/cli/test_runtime_diagnostics.py tests/cli/test_self_test_support_bundle.py -> 28 passed
  • uv run --with pytest --with requests --with pycryptodome pytest tests/cli -> 306 passed
  • git diff --check -> clean

@jjziets jjziets marked this pull request as ready for review June 15, 2026 09:57
@jjziets jjziets force-pushed the CON-1514-startup-evidence-summary branch from 31935db to 92a8526 Compare June 15, 2026 10:04

jjziets commented Jun 15, 2026

Contributor Author

Dogfood follow-up on machine 14629 hit the intended path:

  • selected vastai/test:self-test-v2-cuda-11.8
  • instance failed before self-test runtime start with CDI GPU injection failure
  • CLI classified it as daemon_startup_failed
  • instance cleanup succeeded on attempt 1
  • diagnostic bundle was saved with instance/daemon.log, instance/container.log, and instance/show-instance.json

After the dogfood run, I tightened the terminal formatting so the output is easier to scan:

  • grouped into Result, What happened, Underlying error, Remediation, Suggested steps, and Where to read next
  • moved the long daemon error into an indented block instead of inline with the summary
  • shortened the final Test failed: line to the human summary while preserving the full underlying error above and in the bundle
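For reviewers who want a picture of the grouped layout, here is a rough sketch. The section names are the ones listed above; the function name `format_failure_report`, the four-space indent, and the choice to skip empty sections are assumptions made for illustration and may not match the formatter in this PR.

```python
import textwrap

# Section titles as listed in this comment, in display order.
SECTIONS = [
    "Result",
    "What happened",
    "Underlying error",
    "Remediation",
    "Suggested steps",
    "Where to read next",
]


def format_failure_report(sections: dict) -> str:
    """Render known sections in a fixed order, each body as an indented block."""
    lines = []
    for title in SECTIONS:
        body = sections.get(title)
        if not body:
            continue  # assumption: sections with no content are omitted
        lines.append(f"{title}:")
        # Indent every line so a long multi-line daemon error reads as one block.
        lines.append(textwrap.indent(body.strip(), "    "))
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


report = format_failure_report({
    "Result": "FAILED (daemon_startup_failed)",
    "Underlying error": "first line of daemon error\nsecond line of daemon error",
})
print(report)
```

Keeping the long error in its own indented block is what lets the final `Test failed:` line stay short without losing the full text.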

Validation after formatter update:

  • focused diagnostics/support/machine tests: 76 passed
  • full CLI suite: 306 passed
  • git diff --check: clean

@jjziets jjziets force-pushed the CON-1514-startup-evidence-summary branch from 92a8526 to d1b2b95 Compare June 15, 2026 10:19

jjziets commented Jun 15, 2026

Contributor Author

Final dogfood/demo refresh after formatting update:

  • First rerun on machine 14629 proved the grouped formatter, but exposed a gap: in that run the CDI failure was present in the instance status/underlying error, while the collected instance/daemon.log did not yet contain the final CDI line. The output was cleaner, but the What happened evidence section was missing.
  • I updated the PR to refine startup evidence from both instance/daemon.log and the instance status/underlying error text, then added a regression test for status-only CDI evidence.
  • Final rerun on 14629 now prints the intended sections: Result, What happened, Underlying error, Remediation, Suggested steps, and Where to read next.
  • Final green-path rerun on machine 8649 passed with vastai/test:self-test-v2-cuda-12.8 and cleaned up successfully.
  • Final active-instance audit returned 0 active instances.
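The status-message fallback described in this list can be pictured with a small sketch. It is only an illustration under stated assumptions: `find_cdi_evidence`, the regex, and the source labels are hypothetical, and the real change may combine the two sources differently.

```python
import re

# Assumed pattern for a CDI GPU injection failure line.
CDI_PATTERN = re.compile(r"\bCDI\b.*(inject|unresolvable)", re.IGNORECASE)


def find_cdi_evidence(daemon_log, status_msg):
    """Return (line, source) for the first CDI failure line, or None.

    instance/daemon.log is checked first; the instance status text is the
    fallback for runs where the collected log lacks the final CDI line.
    """
    candidates = (
        ("instance/daemon.log", daemon_log),
        ("instance status", status_msg),
    )
    for source, text in candidates:
        for line in (text or "").splitlines():
            if CDI_PATTERN.search(line):
                return line.strip(), source
    return None


# Status-only case, like the first rerun on 14629 (sample strings are made up).
status_only = find_cdi_evidence(
    "Reading package lists... Done",
    "Error response from daemon: CDI device injection failed",
)
print(status_only)
```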

Validation after the status-message fallback fix:

  • focused diagnostics/support/machine tests: 77 passed
  • full CLI suite: 307 passed
  • git diff --check: clean

Local demo artifacts for the meeting:

  • startup failure comparison: self-test-demo/captures/startup-14629-before-after-pr421.html
  • green pass comparison: self-test-demo/captures/green-8649-before-after-pr421.html

@jjziets jjziets force-pushed the CON-1514-startup-evidence-summary branch from d1b2b95 to 2247070 Compare June 16, 2026 10:45

jjziets commented Jun 16, 2026

Contributor Author

Small follow-up: added self-test log level vocabulary matching the existing VAST_LOG_LEVEL names:

  • critical
  • error
  • warning
  • info
  • debug

Scope is intentionally self-test only for this PR. Usage:

vastai self-test machine <machine_id> --log-level debug
# or
VAST_LOG_LEVEL=debug vastai self-test machine <machine_id>

debug enables the existing --debugging output path. --debugging remains supported and resolves to log level debug internally. The resolved level/source is included in raw diagnostics/support-bundle metadata so support can see how the command was run.
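A minimal sketch of how the level and its source might be resolved, for illustration only. The precedence shown here (`--log-level`, then `--debugging`, then `VAST_LOG_LEVEL`, then a default of `info`) and the name `resolve_log_level` are assumptions; this comment does not state how conflicts between the flag and the environment variable are settled.

```python
import os

# Level names listed in this comment.
LEVELS = ("critical", "error", "warning", "info", "debug")


def resolve_log_level(cli_level=None, debugging=False, env=None):
    """Return (level, source) so diagnostics metadata can record how the run was configured."""
    env = os.environ if env is None else env
    if cli_level:
        level, source = cli_level.lower(), "--log-level"
    elif debugging:
        # --debugging stays supported and maps to debug.
        level, source = "debug", "--debugging"
    elif env.get("VAST_LOG_LEVEL"):
        level, source = env["VAST_LOG_LEVEL"].lower(), "VAST_LOG_LEVEL"
    else:
        level, source = "info", "default"
    if level not in LEVELS:
        raise ValueError(f"unknown log level {level!r}; expected one of {', '.join(LEVELS)}")
    return level, source


print(resolve_log_level(cli_level="debug", env={}))
print(resolve_log_level(env={"VAST_LOG_LEVEL": "warning"}))
```

Returning the source alongside the level is what would let the support bundle show whether a run was configured by flag, environment variable, or default.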

Validation after this change:

  • focused self-test diagnostics/support tests: 82 passed
  • full CLI suite: 312 passed
  • git diff --check: clean

@jjziets jjziets force-pushed the CON-1514-startup-evidence-summary branch from 2247070 to cfcf198 Compare June 16, 2026 11:03

jjziets commented Jun 16, 2026

Contributor Author

Follow-up on self-test verbosity: the default info level now prints compact output instead of the full diagnostic stream.

Behavior after this update:

  • debug: preserves the previous verbose behavior, including live progress lines, preflight actual/required/purpose/remediation details, endpoint diagnostics, and full support-bundle contents.
  • info: default compact lifecycle output plus a meaningful final summary.
  • warning/error/critical: suppress lifecycle chatter but still print the final summary; warnings/errors still surface when relevant.

The final summary now keeps the important user-facing signal: pass/fail, machine id, failed stage, failed check names, reason, support bundle path if created, and next action. Raw output and support bundles still retain the detailed diagnostic data.
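The gating described above could be modelled along these lines. This is a sketch under assumptions: the output kinds (`final_summary`, `lifecycle`, `detail`) and the function `should_print` are hypothetical names, and the handling of individual warning/error messages is left out because the comment does not spell out exactly when they surface.

```python
# Higher rank means more verbose.
LEVEL_RANK = {"critical": 0, "error": 1, "warning": 2, "info": 3, "debug": 4}


def should_print(kind: str, level: str) -> bool:
    """Decide whether a category of self-test output is shown at a given log level."""
    rank = LEVEL_RANK[level]
    if kind == "final_summary":
        return True  # printed at every level
    if kind == "lifecycle":
        return rank >= LEVEL_RANK["info"]  # compact lifecycle lines at info and debug
    if kind == "detail":
        # live progress, preflight details, endpoint diagnostics, full bundle contents
        return rank >= LEVEL_RANK["debug"]
    raise ValueError(f"unknown output kind {kind!r}")


for level in ("warning", "info", "debug"):
    shown = [k for k in ("final_summary", "lifecycle", "detail") if should_print(k, level)]
    print(level, shown)
```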

Validation after this change:

  • focused self-test machine/support tests: 63 passed
  • full CLI + SDK sweep: 369 passed
  • git diff --check: clean

Commit: cfcf198

jjziets commented Jun 16, 2026

Contributor Author

Cross-platform CI on the updated head cfcf198 is green:

  • install-smoke: macOS, Ubuntu, Windows passed
  • unit-and-integration: macOS, Ubuntu, Windows passed
  • detect-changes: passed
  • serverless-testing: skipped by workflow

Run: https://github.com/vast-ai/vast-cli/actions/runs/27612989435
