CON-1514: summarize self-test startup failure evidence#421
Conversation
31935db to
92a8526
Compare
|
Dogfood follow-up on machine
After the dogfood run, I tightened the terminal formatting so the output is easier to scan:
Validation after formatter update:
|
92a8526 to
d1b2b95
Compare
|
Final dogfood/demo refresh after formatting update:
Validation after the status-message fallback fix:
Local demo artifacts for the meeting:
|
d1b2b95 to
2247070
Compare
|
Small follow-up: added self-test log level vocabulary matching the existing
Scope is intentionally self-test only for this PR. Usage: vastai self-test machine <machine_id> --log-level debug
# or
VAST_LOG_LEVEL=debug vastai self-test machine <machine_id>
Validation after this change:
|
2247070 to
cfcf198
Compare
|
Follow-up on self-test verbosity: default Behavior after this update:
The final summary now keeps the important user-facing signal: pass/fail, machine id, failed stage, failed check names, reason, support bundle path if created, and next action. Raw output and support bundles still retain the detailed diagnostic data. Validation after this change:
Commit: |
|
Cross-platform CI on the updated head
Run: https://github.com/vast-ai/vast-cli/actions/runs/27612989435 |
Summary
Improves self-test startup failure diagnostics by using the daemon log evidence already collected for the support bundle.
When an instance fails before the self-test runtime starts, the CLI can now:
apt-get update || echo 'V220614a...'as daemon startup failures instead of generic status errorsV220614afailure marker outputapt-get updatestep completed successfullyinstance/daemon.logThis addresses the observed case where the visible status message made it look like apt/networking failed, while the bundle showed apt succeeded and the real failure was later Docker/NVIDIA CDI device injection.
Impact
The support bundle remains the full evidence pack, but hosts now get a clearer first-pass diagnosis in CLI output:
apt-get updateactually failed or only appeared in the build commandNo self-test image change is required for this PR because the affected failure happens before
/verification/remote.shand the runtime test scripts run.Validation
uv run --with pytest --with requests --with pycryptodome pytest tests/cli/test_runtime_diagnostics.py tests/cli/test_self_test_support_bundle.py-> 28 passeduv run --with pytest --with requests --with pycryptodome pytest tests/cli-> 306 passedgit diff --check-> clean