Commit 0cb28ff

Author: Hannes Zietsman
docs(host): human-review common host questions
1 parent c849fd7 commit 0cb28ff

33 files changed

Lines changed: 1397 additions & 5366 deletions

docs.json

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -159,7 +159,7 @@
159159
"group": "Teams",
160160
"icon": "users",
161161
"pages": [
162-
"guides/teams/overview",
162+
"guides/teams/teams-overview",
163163
"guides/teams/managing-teams",
164164
"guides/teams/teams-roles",
165165
"guides/teams/legacy-teams"

host/account-hosting-agreement.mdx

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ Yes. Use a dedicated account for hosting. Do not use the same account for both c
2121

2222
## How to accept the hosting agreement
2323

24-
Once your host account is created, open the [host setup guide](https://cloud.vast.ai/host/setup/). There is a link in the first paragraph to the hosting agreement. Read through the agreement. Once you accept, your account is converted to a hosting account, and a Machines link appears in the navigation. Your account can now list machines that are running the daemon software.
24+
Once your host account is created, open the [host setup page](https://cloud.vast.ai/host/setup/). There is a link in the first paragraph to the hosting agreement. Read through the agreement. Once you accept, your account is converted to a hosting account, and a Machines link appears in the navigation. Your account can now list machines that are running the daemon software.
2525

2626
<a id="host-features-tab" />
2727
## What must happen before I can see host features or the Machines tab?

host/common-errors-diagnostics.mdx

Lines changed: 104 additions & 32 deletions
@@ -10,81 +10,153 @@ personas:
1010

1111
<div className="persona-chips"><span className="persona-chip">Pro Operator</span><span className="persona-chip">Headless / DC</span></div>
1212

13-
Use this page for host errors that are not specific to install, networking, or self-test, and for collecting diagnostics before asking for help.
13+
Use this page for host errors that are not covered by install, networking, or self-test pages.
1414

1515
<a id="logs" />
16-
## Where are installer, daemon, and self-test logs?
16+
## Logs
1717

18-
For self-test failures, the CLI can create a diagnostic bundle. Default self-test bundles include:
18+
Installer logs are written to `vast_host_install.log` in the directory where you launched the installer, not under `/var/lib/vastai_kaalia`.
19+
20+
```bash
21+
cat vast_host_install.log
22+
```
23+
24+
If the installer created a compressed log archive:
25+
26+
```bash
27+
tar -xzvf vastai_install_logs.tar.gz
28+
cat vast_host_install.log
29+
```
30+
31+
The host daemon log is:
32+
33+
```bash
34+
sudo tail -n 100 /var/lib/vastai_kaalia/kaalia.log
35+
```
36+
37+
For self-test failures, the CLI can create a diagnostic bundle. The normal command is:
38+
39+
```bash
40+
vastai self-test machine <machine_id>
41+
```
42+
43+
Failure bundles are saved by default under:
44+
45+
```text
46+
/tmp/vast_selftest_<machine_id>_<timestamp>.tar.gz
47+
```
48+
49+
You can override the output directory:
50+
51+
```bash
52+
vastai self-test machine <machine_id> \
53+
--support-bundle-dir /path/to/output
54+
```
55+
56+
Bundles can include:
1957

2058
- `self-test-output.log`
2159
- `self-test-result.json`
2260
- `manifest.json`
2361
- `collection-errors.json`
24-
25-
Runtime failures can also include:
26-
2762
- `instance/show-instance.json`
2863
- `instance/container.log`
2964
- `instance/daemon.log`
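
To check which of these files a particular bundle contains, one option is to locate the newest archive and list it without extracting. This is a minimal sketch, assuming bundles sit in the default `/tmp` location or in a directory you pass in. `latest_bundle` is a hypothetical helper name, not part of the Vast CLI, and because the timestamp format in the filename is not specified here, it sorts by modification time rather than by name:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the most recently modified self-test bundle
# for a machine, or nothing if none exists.
# Usage: latest_bundle <machine_id> [bundle_dir]
latest_bundle() {
  local machine_id="$1"
  local dir="${2:-/tmp}"   # default bundle location
  # The glob follows the vast_selftest_<machine_id>_<timestamp>.tar.gz pattern.
  # ls -t sorts newest first; errors from an unmatched glob are discarded.
  ls -t "$dir"/vast_selftest_"${machine_id}"_*.tar.gz 2>/dev/null | head -n 1
}
```

For example, `tar -tzf "$(latest_bundle <machine_id>)"` prints the archive's file list, which makes it easier to review a bundle before sharing it.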
3065

31-
You can also create a manual diagnostic bundle with the Vast CLI. Use the current CLI reference for the exact command and run it from the host-enabled account that owns the machine.
66+
For a quick SSH check:
67+
68+
```bash
69+
systemctl is-active vastai.service vast_metrics.service docker nvidia-persistenced.service
70+
sudo journalctl -u vastai.service -n 80 --no-pager
71+
sudo journalctl -u vast_metrics.service -n 80 --no-pager
72+
sudo tail -n 100 /var/lib/vastai_kaalia/kaalia.log
73+
sudo cat /var/lib/vastai_kaalia/host_port_range
74+
```
75+
76+
Services should be `active`, logs should not show restart loops or repeated fatal errors, and the configured port range should match the forwarded ports.
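
The service check can also be scripted so that only problem units are printed. This is a rough sketch, not an official tool: `report_inactive` and `check_host_services` are hypothetical names, and the sketch assumes `systemctl is-active` prints one state per line in the same order as the units it was given:

```bash
#!/usr/bin/env bash
# Hypothetical helper: report any unit whose state is not "active".
# Unit names are arguments; states are read from stdin, one per line,
# in the same order as the arguments.
report_inactive() {
  local unit state rc=0
  for unit in "$@"; do
    IFS= read -r state || state="unknown"
    if [ "$state" != "active" ]; then
      echo "$unit: $state"
      rc=1
    fi
  done
  return "$rc"
}

# Wire the helper to systemctl for the units named in the quick check.
check_host_services() {
  local units=(vastai.service vast_metrics.service docker nvidia-persistenced.service)
  systemctl is-active "${units[@]}" | report_inactive "${units[@]}"
}
```

Running `check_host_services` prints nothing and returns 0 when every unit is active. Otherwise it prints one `unit: state` line per problem unit and returns 1.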
77+
78+
<a id="gpu-kernel-logs" />
79+
## GPU, PCIe, And AER Kernel Logs
3280

33-
If you run diagnostics on the actual host, the bundle can include host-local artifacts such as kaalia logs, `dmesg`, `journalctl`, Docker daemon config, and mount information.
81+
Use kernel logs to distinguish a normal container/runtime error from a machine-health problem:
82+
83+
- **NVRM**: NVIDIA kernel driver messages.
84+
- **Xid**: NVIDIA GPU fault, reset, or error codes.
85+
- **PCIe**: the bus/link between the GPU, motherboard, and CPU.
86+
- **AER**: PCIe Advanced Error Reporting messages. Repeated AER messages can point to risers, slots, power, BIOS lane settings, cabling, motherboard, or GPU hardware issues.
87+
88+
Check the current boot:
89+
90+
```bash
91+
sudo journalctl -k -b --no-pager | grep -Ei 'NVRM|Xid|AER|PCIe|fallen|GPU has fallen'
92+
sudo dmesg -T | grep -Ei 'NVRM|Xid|AER|PCIe|fallen|GPU has fallen'
93+
```
94+
95+
Treat repeated Xid, NVRM, PCIe, or AER errors as host hardware/driver health signals, not as isolated self-test messages.
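
To judge whether errors are repeated rather than isolated, it can help to count matching lines per signal type. This is a small sketch under stated assumptions: `count_kernel_signals` is a hypothetical helper, it uses the same case-insensitive matching as the commands above, and a line that mentions both NVRM and Xid is counted under both:

```bash
#!/usr/bin/env bash
# Hypothetical helper: count kernel-log lines per signal type.
# Reads log text on stdin and prints NAME=COUNT lines.
count_kernel_signals() {
  local log pattern
  log=$(cat)
  for pattern in NVRM Xid AER; do
    # grep -c prints 0 when nothing matches, so every signal gets a line.
    printf '%s=%s\n' "$pattern" "$(printf '%s\n' "$log" | grep -ci -- "$pattern")"
  done
}
```

For example, `sudo journalctl -k -b --no-pager | count_kernel_signals` summarizes the current boot. Comparing the counts a few hours apart shows whether a signal keeps recurring.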
3496

3597
<a id="red-error" />
36-
## What is this red error message on my machine?
98+
## Red Machine Error
3799

38-
If the hosting software detects an error, that error message is listed on your machine in the machines page. Once the cause of the error has been resolved, most error messages are automatically cleared after 1-2 hours. The quickest way to learn more about resolving specific error messages is the hosting channels in [our Discord](https://discord.gg/hSuEbSQ4X8).
100+
The Machines page shows red errors detected by the host software. Fix the cause, then allow the platform to refresh. Many resolved errors clear automatically after 1-2 hours.
39101

40102
<a id="docker-cache" />
41-
## My storage for clients is full of old stopped jobs — can I free up space?
103+
## Full Client Storage
42104

43-
Try cleaning up the Docker build cache, as it sometimes frees up far more space than it claims. You can also clean up old unused images. For expired or deleted rental contracts that did not release their storage, run [vastai cleanup machine](/host/cli/cleanup-machine).
105+
Try cleaning Docker build cache and old unused images. For expired or deleted rentals that did not release storage, run [vastai cleanup machine](/host/cli/cleanup-machine).
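
One cautious way to approach the Docker cleanup is to measure disk usage first, prune, and measure again. The sketch below uses standard Docker CLI commands. The wrapper `docker_cleanup` is a hypothetical name, and the choice to prune only dangling images is an assumption made here to stay conservative:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: print the cleanup commands by default,
# or execute them when called with --run.
docker_cleanup() {
  local run=""
  [ "${1:-}" = "--run" ] && run=1
  local -a steps=(
    "docker system df"          # show what is using space first
    "docker builder prune -f"   # remove build cache
    "docker image prune -f"     # remove dangling (untagged) images only
    "docker system df"          # compare against the first report
  )
  local step
  for step in "${steps[@]}"; do
    if [ -n "$run" ]; then
      echo "+ $step"
      $step || return 1
    else
      echo "$step"
    fi
  done
}
```

Calling `docker_cleanup` with no arguments only prints the plan, and `docker_cleanup --run` executes it. `docker image prune -a` would also remove every image not used by a container, which frees more space but is a broader change.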
44106

45107
<a id="nvidia-smi-fails" />
46-
## What should I do when nvidia-smi fails?
108+
## `nvidia-smi` Fails
47109

48-
Treat `nvidia-smi` failure as a host GPU or driver health problem. Check whether the NVIDIA driver is loaded, whether the GPU is visible on the PCIe bus, whether a reboot is needed after a driver or kernel update, whether there is an NVML driver/library mismatch, and whether system logs show NVRM, Xid, or PCIe errors. Do not keep a machine listed if it cannot reliably report GPU state.
110+
Treat this as a GPU or driver health problem. Check driver load, PCIe visibility, reboot state after updates, NVML mismatch, and NVRM/Xid/PCIe logs. Do not list a machine that cannot reliably report GPU state.
49111

50112
<a id="nvml-mismatch" />
51-
## What should I do for NVML driver/library mismatch?
113+
## NVML Driver/Library Mismatch
52114

53-
This usually means the NVIDIA userspace library and loaded kernel module are from different driver versions, often after an update. Plan downtime and reboot first. If the mismatch persists, clean up or reinstall the NVIDIA driver stack, then verify `nvidia-smi` before listing again.
115+
This usually means the NVIDIA userspace library and loaded kernel module are different versions. Plan downtime, reboot first, then reinstall or repair the NVIDIA driver stack if needed.
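
A quick heuristic for spotting a pending reboot is to compare the version of the loaded kernel module with the module version installed on disk. This is a sketch rather than a definitive check: `versions_match` and `check_nvidia_module` are hypothetical names, and treating the on-disk module version as a proxy for the userspace library version is an assumption that holds when both were updated by the same driver package:

```bash
#!/usr/bin/env bash
# Hypothetical helper: compare two driver version strings.
# Returns 0 on match, 1 on mismatch, 2 if either value is missing.
versions_match() {
  local loaded="$1" installed="$2"
  if [ -z "$loaded" ] || [ -z "$installed" ]; then
    echo "unknown (driver not loaded or not installed?)"
    return 2
  fi
  if [ "$loaded" = "$installed" ]; then
    echo "match: $loaded"
  else
    echo "mismatch: loaded=$loaded installed=$installed"
    return 1
  fi
}

# Assumption: the loaded module exposes its version in sysfs and the
# on-disk module is visible to modinfo.
check_nvidia_module() {
  versions_match \
    "$(cat /sys/module/nvidia/version 2>/dev/null)" \
    "$(modinfo -F version nvidia 2>/dev/null)"
}
```

A `mismatch` result from `check_nvidia_module` suggests the driver was updated without a reboot, which is consistent with the reboot-first advice above.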
54116

55117
<a id="gpu-falls-off-bus" />
56-
## What should I do when a GPU falls off the PCIe bus?
118+
## GPU Falls Off The PCIe Bus
57119

58-
Treat this as a hardware, power, thermal, PCIe, or driver stability issue. Check logs for NVRM, Xid, and PCIe/AER errors. Power-cycle the machine, then inspect PSU capacity, power cables, risers, slots, PCIe lane settings, thermals, and any overclock or undervolt. Repeated bus drops should be fixed before the machine is listed again.
120+
Check NVRM, Xid, and PCIe/AER logs. Power-cycle, then inspect PSU capacity, cables, risers, slots, PCIe lane settings, thermals, and any overclock/undervolt. Fix repeated bus drops before listing again.
59121

60122
<a id="bad-bandwidthtest2" />
61-
## What should I do for bad bandwidthtest2?
123+
## `bad bandwidthtest2`
62124

63-
`bad bandwidthtest2` is best treated as a machine health or verification error. It usually points at a GPU, PCIe, driver/runtime, or data-transfer problem. Check PCIe bandwidth and lane configuration, GPU seating, thermals, power, driver and NVML health, Xid errors, Docker GPU runtime behavior, and self-test diagnostic output. Rerun self-test after fixing the hardware or driver issue.
125+
Treat this as a machine health or verification error. Check PCIe bandwidth, lane configuration, GPU seating, thermals, power, driver/NVML health, Xid errors, Docker GPU runtime behavior, and self-test output. Rerun self-test after fixes.
64126

65127
<a id="failed-cdi" />
66-
## What should I do for failed to inject CDI devices?
128+
## Failed To Inject CDI Devices
67129

68-
If logs or machine status mention failed CDI device injection, treat it as a container GPU device injection or runtime problem. Current self-test output may surface the same underlying issue as a startup or runtime failure instead of using the literal CDI text. Gather a diagnostic bundle, check Docker and NVIDIA container runtime/CDI configuration, confirm GPUs are visible on the host, inspect daemon and container logs, and check for related machine errors. If the configuration looks correct but containers still cannot receive GPU devices, escalate with logs and the machine details requested by support.
130+
This points to GPU device injection or container runtime setup. Gather diagnostics, confirm GPUs are visible on the host, check Docker/NVIDIA runtime/CDI configuration, and inspect daemon/container logs. Do not assume regenerating CDI configuration fixes an unhealthy GPU, driver, or PCIe link. Escalate with logs if containers still cannot receive GPUs.
69131

70132
<a id="collect-logs" />
71-
## What logs should I collect before asking for help?
133+
## Before Asking For Help
134+
135+
Include:
72136

73-
Include the account context, exact command, timestamps, screenshots or CLI output, tested external IP:port for network failures, and the self-test support bundle when available. If support asks for machine-specific details, share them in the support channel rather than a public community thread.
137+
- Account context and exact command.
138+
- Machine ID, offer ID if relevant, and whether the account is the host account.
139+
- Exact timestamps with timezone.
140+
- Exact error strings, screenshots, or CLI output.
141+
- What changed recently: reboot, driver, kernel, Docker, BIOS, storage, router, or ISP.
142+
- Host OS, GPU model/count, NVIDIA driver version, and Docker storage mount output.
143+
- Tested external IP and port for networking issues.
144+
- Self-test support bundle when available.
145+
- Installer log, daemon log, and relevant kernel log excerpts.
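
Several of the items above can be gathered in one pass. The following is a minimal sketch, not an official collection tool: `collect` and `collect_host_info` are hypothetical helpers, the output directory name is arbitrary, and the Docker mount check assumes the default data root `/var/lib/docker`:

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command and save its combined output under
# <outdir>/<name>.txt. A failing or missing command is noted in the file
# instead of aborting the whole collection.
collect() {
  local outdir="$1" name="$2"
  shift 2
  mkdir -p "$outdir"
  if ! "$@" >"$outdir/$name.txt" 2>&1; then
    echo "[command failed: $*]" >>"$outdir/$name.txt"
  fi
}

# Gather basic host facts into one directory and print its path.
collect_host_info() {
  local out="${1:-./vast-support-$(date -u +%Y%m%dT%H%M%SZ)}"
  collect "$out" timestamp  date -u +%Y-%m-%dT%H:%M:%SZ
  collect "$out" os         cat /etc/os-release
  collect "$out" kernel     uname -r
  collect "$out" gpus       nvidia-smi --query-gpu=name,driver_version --format=csv
  collect "$out" docker     docker info
  collect "$out" docker_mnt findmnt -T /var/lib/docker   # assumes default data root
  collect "$out" services   systemctl is-active vastai.service vast_metrics.service docker
  collect "$out" daemon_log sudo tail -n 200 /var/lib/vastai_kaalia/kaalia.log
  echo "$out"
}
```

Running `collect_host_info` writes one text file per item. The files are plain text, so they can be reviewed and redacted before anything is shared.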
74146

75-
Review bundles before sharing them.
147+
Review bundles before sharing them. Share sensitive machine/account details only in the appropriate support channel.
76148

77149
<a id="escalate-support" />
78-
## What should I escalate to Vast support?
150+
## Escalate To Vast Support
79151

80-
Escalate account conversion or hosting agreement issues, payment/tax/payout provider issues that the docs cannot resolve, backend machine records or ghost machines that normal UI/CLI actions cannot remove, API permission or account-visibility problems, suspected backend bugs, and cases where self-test or diagnostics show a platform state mismatch after following documented steps.
152+
Escalate account conversion, hosting agreement, payout, API permission, backend machine record, ghost machine, suspected backend bug, or platform-state mismatch issues.
81153

82-
Local Ubuntu, Docker, GPU driver, hardware, power, thermals, and consumer network setup are primarily host responsibilities. Bring logs and diagnostics when asking for help.
154+
Local Ubuntu, Docker, GPU driver, hardware, power, thermals, and consumer networking are primarily host responsibilities. Vast hosting requires direct public inbound TCP/UDP reachability; CGNAT or double NAT without a real public forwarding path is not a supported hosting network setup.
83155

84-
## Related pages
156+
## Related Pages
85157

86158
| Topic | Read next |
87159
| --- | --- |
88-
| Install-time failures | [Installing Host Software](/host/installing-host-software#install-failures) |
89-
| Self-test error strings | [Self-Test Reference](/host/self-test-reference) |
90-
| Connectivity problems | [Network & Ports](/host/network-ports) |
160+
| Install failures | [Installing Host Software](/host/installing-host-software#install-failures) |
161+
| Self-test errors | [Self-Test Reference](/host/self-test-reference) |
162+
| Connectivity | [Network & Ports](/host/network-ports) |
