Commit 964dc1d
docs: add distributed XGBoost on Kubernetes tutorial (#12080)
Add comprehensive documentation for distributed XGBoost training on
Kubernetes via Kubeflow Trainer. The tutorial covers:
- Overview of the distributed architecture and Collective protocol
- Environment variables (DMLC_*) injected by the XGBoost runtime plugin
- Worker count calculation for CPU and GPU training
- Prerequisites and installation verification
- ClusterTrainingRuntime configuration
- Example training code using Python SDK and kubectl YAML
- Best practices: QuantileDMatrix, early stopping, checkpointing,
  logging, data partitioning, and rank-specific logic
- Common issues and edge cases
Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>
1 parent 03c196f commit 964dc1d
1 file changed
Lines changed: 891 additions & 16 deletions