host/account-hosting-agreement.mdx
Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ Yes. Use a dedicated account for hosting. Do not use the same account for both c
## How to accept the hosting agreement

-Once your host account is created, open the [host setup guide](https://cloud.vast.ai/host/setup/). There is a link in the first paragraph to the hosting agreement. Read through the agreement. Once you accept, your account is converted to a hosting account, and a Machines link appears in the navigation. Your account can now list machines that are running the daemon software.
+Once your host account is created, open the [host setup page](https://cloud.vast.ai/host/setup/). There is a link in the first paragraph to the hosting agreement. Read through the agreement. Once you accept, your account is converted to a hosting account, and a Machines link appears in the navigation. Your account can now list machines that are running the daemon software.

<a id="host-features-tab" />
## What must happen before I can see host features or the Machines tab?
You can also create a manual diagnostic bundle with the Vast CLI. Use the current CLI reference for the exact command and run it from the host-enabled account that owns the machine.
Services should be `active`, logs should not show restart loops or repeated fatal errors, and the configured port range should match the forwarded ports.
+
+<a id="gpu-kernel-logs" />
+## GPU, PCIe, And AER Kernel Logs

-If you run diagnostics on the actual host, the bundle can include host-local artifacts such as kaalia logs, `dmesg`, `journalctl`, Docker daemon config, and mount information.
+Use kernel logs to distinguish a normal container/runtime error from a machine-health problem:
+
+- **NVRM**: NVIDIA kernel driver messages.
+- **Xid**: NVIDIA GPU fault, reset, or error codes.
+- **PCIe**: the bus/link between the GPU, motherboard, and CPU.
+- **AER**: PCIe Advanced Error Reporting messages. Repeated AER messages can point to risers, slots, power, BIOS lane settings, cabling, motherboard, or GPU hardware issues.
+```
+sudo dmesg -T | grep -Ei 'NVRM|Xid|AER|PCIe|fallen|GPU has fallen'
+```
+
+Treat repeated Xid, NVRM, PCIe, or AER errors as host hardware/driver health signals, not as isolated self-test messages.
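
To judge whether errors are "repeated", it can help to count matching lines per category rather than read the raw log. The following is a minimal sketch, not part of the changed docs: `count_kernel_events` is a hypothetical helper name, and the four patterns are simply the categories listed above, matched case-insensitively like the `grep -Ei` command.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count kernel log lines per category from text on stdin.
# The category names mirror the NVRM / Xid / AER / PCIe list in the docs above.
count_kernel_events() {
  local log pattern
  log="$(cat)"
  for pattern in NVRM Xid AER PCIe; do
    # grep -c prints the number of matching lines; "|| true" keeps a zero
    # count (grep exit status 1) from being treated as a failure.
    printf '%s=%s\n' "$pattern" \
      "$(printf '%s\n' "$log" | grep -ci -- "$pattern" || true)"
  done
}

# Usage on a host (reading the kernel ring buffer usually needs root):
#   sudo dmesg -T | count_kernel_events
```

A count that keeps growing between two runs is a stronger health signal than a single old entry. One line can count toward several categories, for example an `NVRM: Xid` message.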

<a id="red-error" />
-## What is this red error message on my machine?
+## Red Machine Error

-If the hosting software detects an error, that error message is listed on your machine in the machines page. Once the cause of the error has been resolved, most error messages are automatically cleared after 1-2 hours. The quickest way to learn more about resolving specific error messages is the hosting channels in [our Discord](https://discord.gg/hSuEbSQ4X8).
+The Machines page shows red errors detected by the host software. Fix the cause, then allow the platform to refresh. Many resolved errors clear automatically after 1-2 hours.

<a id="docker-cache" />
-## My storage for clients is full of old stopped jobs — can I free up space?
+## Full Client Storage

-Try cleaning up the Docker build cache, as it sometimes frees up far more space than it claims. You can also clean up old unused images. For expired or deleted rental contracts that did not release their storage, run [vastai cleanup machine](/host/cli/cleanup-machine).
+Try cleaning Docker build cache and old unused images. For expired or deleted rentals that did not release storage, run [vastai cleanup machine](/host/cli/cleanup-machine).
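
The Docker cleanup described above could look roughly like this. It is a sketch under assumptions: `docker_cleanup` and the `DRY_RUN` variable are hypothetical names, and the function only prints the commands unless `DRY_RUN=0` is set. The Docker subcommands themselves are standard Docker CLI commands.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: print the cleanup commands by default (DRY_RUN=1),
# or run them when DRY_RUN=0 is set.
docker_cleanup() {
  local run="echo"
  [ "${DRY_RUN:-1}" = "0" ] && run=""
  $run docker system df          # space used by images, containers, volumes, build cache
  $run docker builder prune -f   # remove unused build cache
  $run docker image prune -f     # remove dangling images only
  $run docker system df          # compare usage after cleanup
}

# Usage:
#   docker_cleanup              # show what would run
#   DRY_RUN=0 docker_cleanup    # actually run the commands
```

`docker image prune -f` removes only dangling images. Adding `-a` also removes every image not used by a container, which is more aggressive, so check what is still needed before widening the cleanup.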

<a id="nvidia-smi-fails" />
-## What should I do when nvidia-smi fails?
+## `nvidia-smi` Fails

-Treat `nvidia-smi` failure as a host GPU or driver health problem. Check whether the NVIDIA driver is loaded, whether the GPU is visible on the PCIe bus, whether a reboot is needed after a driver or kernel update, whether there is an NVML driver/library mismatch, and whether system logs show NVRM, Xid, or PCIe errors. Do not keep a machine listed if it cannot reliably report GPU state.
+Treat this as a GPU or driver health problem. Check driver load, PCIe visibility, reboot state after updates, NVML mismatch, and NVRM/Xid/PCIe logs. Do not list a machine that cannot reliably report GPU state.

<a id="nvml-mismatch" />
-## What should I do for NVML driver/library mismatch?
+## NVML Driver/Library Mismatch

-This usually means the NVIDIA userspace library and loaded kernel module are from different driver versions, often after an update. Plan downtime and reboot first. If the mismatch persists, clean up or reinstall the NVIDIA driver stack, then verify `nvidia-smi` before listing again.
+This usually means the NVIDIA userspace library and loaded kernel module are different versions. Plan downtime, reboot first, then reinstall or repair the NVIDIA driver stack if needed.
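
One way to check whether a reboot is likely to help is to compare the version of the kernel module that is currently loaded with the version of the module installed on disk. This is a rough sketch, not the documented procedure: `nvidia_version_state` is a hypothetical helper, and the on-disk module version is only a proxy for the userspace library version, assuming both were installed by the same driver package.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify the relationship between the loaded NVIDIA
# kernel module version ($1) and the on-disk module version ($2).
nvidia_version_state() {
  local loaded="$1" on_disk="$2"
  if [ -z "$loaded" ]; then
    echo "not-loaded"   # no nvidia module is loaded at all
  elif [ "$loaded" = "$on_disk" ]; then
    echo "match"        # running module equals the installed one
  else
    echo "mismatch"     # an update is installed but not yet loaded
  fi
}

# Usage on a host (paths assume the proprietary "nvidia" kernel module):
#   loaded="$(cat /sys/module/nvidia/version 2>/dev/null)"
#   on_disk="$(modinfo -F version nvidia 2>/dev/null)"
#   nvidia_version_state "$loaded" "$on_disk"
```

A `mismatch` result suggests the driver was updated without a reboot. A `match` result while `nvidia-smi` still reports a mismatch points toward a mixed or partial driver install.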

<a id="gpu-falls-off-bus" />
-## What should I do when a GPU falls off the PCIe bus?
+## GPU Falls Off The PCIe Bus

-Treat this as a hardware, power, thermal, PCIe, or driver stability issue. Check logs for NVRM, Xid, and PCIe/AER errors. Power-cycle the machine, then inspect PSU capacity, power cables, risers, slots, PCIe lane settings, thermals, and any overclock or undervolt. Repeated bus drops should be fixed before the machine is listed again.
+Check NVRM, Xid, and PCIe/AER logs. Power-cycle, then inspect PSU capacity, cables, risers, slots, PCIe lane settings, thermals, and any overclock/undervolt. Fix repeated bus drops before listing again.

<a id="bad-bandwidthtest2" />
-## What should I do for bad bandwidthtest2?
+## `bad bandwidthtest2`

-`bad bandwidthtest2` is best treated as a machine health or verification error. It usually points at a GPU, PCIe, driver/runtime, or data-transfer problem. Check PCIe bandwidth and lane configuration, GPU seating, thermals, power, driver and NVML health, Xid errors, Docker GPU runtime behavior, and self-test diagnostic output. Rerun self-test after fixing the hardware or driver issue.
+Treat this as a machine health or verification error. Check PCIe bandwidth, lane configuration, GPU seating, thermals, power, driver/NVML health, Xid errors, Docker GPU runtime behavior, and self-test output. Rerun self-test after fixes.

<a id="failed-cdi" />
-## What should I do for failed to inject CDI devices?
+## Failed To Inject CDI Devices

-If logs or machine status mention failed CDI device injection, treat it as a container GPU device injection or runtime problem. Current self-test output may surface the same underlying issue as a startup or runtime failure instead of using the literal CDI text. Gather a diagnostic bundle, check Docker and NVIDIA container runtime/CDI configuration, confirm GPUs are visible on the host, inspect daemon and container logs, and check for related machine errors. If the configuration looks correct but containers still cannot receive GPU devices, escalate with logs and the machine details requested by support.
+This points to GPU device injection or container runtime setup. Gather diagnostics, confirm GPUs are visible on the host, check Docker/NVIDIA runtime/CDI configuration, and inspect daemon/container logs. Do not assume regenerating CDI configuration fixes an unhealthy GPU, driver, or PCIe link. Escalate with logs if containers still cannot receive GPUs.

<a id="collect-logs" />
-## What logs should I collect before asking for help?
+## Before Asking For Help
+
+Include:

-Include the account context, exact command, timestamps, screenshots or CLI output, tested external IP:port for network failures, and the self-test support bundle when available. If support asks for machine-specific details, share them in the support channel rather than a public community thread.
+- Account context and exact command.
+- Machine ID, offer ID if relevant, and whether the account is the host account.
+- Exact timestamps with timezone.
+- Exact error strings, screenshots, or CLI output.
+- What changed recently: reboot, driver, kernel, Docker, BIOS, storage, router, or ISP.
+- Host OS, GPU model/count, NVIDIA driver version, and Docker storage mount output.
+- Tested external IP and port for networking issues.
+- Self-test support bundle when available.
+- Installer log, daemon log, and relevant kernel log excerpts.

-Review bundles before sharing them.
+Review bundles before sharing them. Share sensitive machine/account details only in the appropriate support channel.
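
A few items from the checklist above (timestamp with timezone, kernel, GPU model, and driver version) can be gathered in one step. This is a minimal sketch under assumptions: `support_timestamp` and `support_summary` are hypothetical helper names, and the output is not a replacement for the self-test support bundle.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for a short, pasteable summary in a support request.

# ISO-8601-style local timestamp with a numeric timezone offset.
support_timestamp() {
  date '+%Y-%m-%dT%H:%M:%S%z'
}

# Print the timestamp, kernel, and (when nvidia-smi is available) one line
# per GPU with its model name and driver version.
support_summary() {
  echo "timestamp: $(support_timestamp)"
  echo "kernel: $(uname -sr)"
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,driver_version --format=csv,noheader 2>&1 \
      | sed 's/^/gpu: /'
  else
    echo "gpu: nvidia-smi not found"
  fi
}

# Usage:
#   support_summary > support-summary.txt
```

If `nvidia-smi` fails, its error text is captured on the `gpu:` line, and that is useful information for support as well.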

<a id="escalate-support" />
-## What should I escalate to Vast support?
+## Escalate To Vast Support

-Escalate account conversion or hosting agreement issues, payment/tax/payout provider issues that the docs cannot resolve, backend machine records or ghost machines that normal UI/CLI actions cannot remove, API permission or account-visibility problems, suspected backend bugs, and cases where self-test or diagnostics show a platform state mismatch after following documented steps.
+Escalate account conversion, hosting agreement, payout, API permission, backend machine record, ghost machine, suspected backend bug, or platform-state mismatch issues.

-Local Ubuntu, Docker, GPU driver, hardware, power, thermals, and consumer network setup are primarily host responsibilities. Bring logs and diagnostics when asking for help.
+Local Ubuntu, Docker, GPU driver, hardware, power, thermals, and consumer networking are primarily host responsibilities. Vast hosting requires direct public inbound TCP/UDP reachability; CGNAT or double NAT without a real public forwarding path is not a supported hosting network setup.