Skip to content

Latest commit

 

History

History
229 lines (171 loc) · 8.36 KB

File metadata and controls

229 lines (171 loc) · 8.36 KB

Ray Architecture in TorchSpec

TorchSpec uses Ray as its distributed orchestration layer. All GPU workloads (inference, training, mooncake master) run as Ray actors, coordinated by a central controller.

Package Layout & Actor Hierarchy

RayActor is the base class for all GPU-bound actors. It provides GPU setup, IP discovery, and port allocation so each actor doesn't reinvent them.

torchspec/ray/
├── ray_actor.py                    RayActor base class
├── train_group.py                  RayTrainGroup (training actor group manager)
└── placement_group.py              Placement group creation & GPU resource management

torchspec/inference/engine/
├── hf_engine.py                    HFEngine(InferenceEngine, RayActor)
└── sgl_engine.py                   SglEngine(InferenceEngine, RayActor)

torchspec/training/
├── trainer.py                      Trainer (ABC base)
├── trainer_actor.py                TrainerActor(RayActor) — wraps Eagle3Trainer
└── eagle3_trainer.py               Eagle3Trainer(Trainer) — FSDP2 training logic

torchspec/transfer/mooncake/
└── utils.py                        MooncakeMaster(RayActor)

torchspec/controller/
├── training_controller.py          AsyncTrainingController (standalone Ray actor)
└── inference_manager.py            AsyncInferenceManager (standalone Ray actor)

Placement Groups

Placement groups reserve GPUs for training and inference as a unit and place them on the correct nodes. create_placement_groups(args) is the single entry point.

Mode Training GPUs Inference GPUs Use case
Default Sliced from unified PG Sliced from unified PG Production: deterministic node-to-role assignment
custom Sliced from custom unified PG Sliced from custom unified PG Production: explicit node choice with the same unified reservation semantics
colocate Shared PG Shared PG Dev: share GPUs between train & inference
debug_train_only Dedicated PG Empty Debug training without inference
debug_inference_only Empty Dedicated PG Debug inference without training

Each placement group probes bundles with a temporary InfoActor to discover the actual (node IP, GPU ID) mapping, then sorts by (node, GPU ID) for deterministic ordering. In custom mode, TorchSpec sorts by the configured node order first and by physical GPU ID within each selected node.

Ray Cluster Setup

TorchSpec connects to Ray via _ensure_ray_initialized() in placement_group.py. It reads the RAY_ADDRESS environment variable (defaulting to "auto") and calls ray.init(address=...). If no cluster is found, it falls back to starting a local instance.

Single-node (local)

No setup needed — TorchSpec starts a local Ray instance automatically. To pin specific GPUs on a shared machine, use CUDA_VISIBLE_DEVICES:

CUDA_VISIBLE_DEVICES=4,5,6,7 ./examples/qwen3-8b-single-node/run.sh

Note: On shared machines, auto may attach to another user's cluster. Use RAY_ADDRESS=local to force a fresh local instance:

RAY_ADDRESS=local CUDA_VISIBLE_DEVICES=4,5,6,7 ./examples/qwen3-8b-single-node/run.sh

Multi-node (local cluster)

Start Ray manually before launching TorchSpec. Run these commands on each node:

1. Head node (run first):

ray start --head \
  --port 6379 \
  --node-ip-address <HEAD_IP> \
  --num-gpus <N> \
  --temp-dir /tmp/ray_$(id -u) \
  --disable-usage-stats

2. Worker nodes (run after head is up):

ray start \
  --address <HEAD_IP>:6379 \
  --num-gpus <N> \
  --temp-dir /tmp/ray_$(id -u) \
  --disable-usage-stats

3. Run TorchSpec on the head node:

./examples/kimi-k25-3node-h100/run.sh

TorchSpec auto-detects the cluster via address="auto". Worker nodes don't need to be up before the script starts — _wait_for_gpu_resources() will block for up to 300 seconds until all expected GPUs are visible in the cluster.

See examples/kimi-k25-3node-h100/setup_ray_cluster.sh for a concrete 3-node example.

Kubernetes

On Kubernetes, we recommend using the KubeRay operator to manage the Ray cluster lifecycle. KubeRay handles head/worker pod scheduling, autoscaling, and fault recovery. Once the RayCluster resource is running, point TorchSpec at it with:

RAY_ADDRESS=ray://<kuberay-head-svc>:10001 ./examples/kimi-k25-3node-h100/run.sh

NCCL / Gloo networking

On multi-NIC machines, set the network interface explicitly:

export NCCL_SOCKET_IFNAME=<iface>   # e.g. eth0
export GLOO_SOCKET_IFNAME=<iface>
export TP_SOCKET_IFNAME=<iface>

Find your interface with ip -o addr show | grep <your_node_ip>.

Multi-Node Training & Inference Config

Training across nodes

RayTrainGroup creates training_num_nodes × training_num_gpus_per_node actors. The PACK placement strategy spreads them across nodes automatically.

Key Default Description
training.training_num_nodes 1 Number of training nodes
training.training_num_gpus_per_node 1 GPUs per training node

Custom node placement

By default, TorchSpec creates a unified placement group with Ray's PACK strategy, probes the resulting bundles, and assigns the ordered bundles to training or inference according to training.placement_strategy (training_first or inference_first). Set training.placement_strategy: custom to explicitly choose the nodes for each role while still reserving the non-colocated training and inference bundles in a single unified placement group.

IP-based placement uses Ray's per-node resource labels (node:<ip>) and does not require custom Ray labels:

training:
  placement_strategy: custom
  training_num_nodes: 2
  training_num_gpus_per_node: 8
  training_node_ips:
    - 10.0.0.1
    - 10.0.0.3

inference:
  inference_num_gpus: 16
  inference_num_gpus_per_node: 8
  inference_node_ips:
    - 10.0.0.2
    - 10.0.0.4

Ray label selectors are also supported when the installed Ray version supports placement group bundle_label_selector. Start Ray nodes with labels, then use matching selectors in the config:

training:
  placement_strategy: custom
  training_num_nodes: 2
  training_num_gpus_per_node: 8
  training_node_selectors:
    - {"torchspec/node": "trainer-0"}
    - {"torchspec/node": "trainer-1"}

inference:
  inference_node_selectors:
    - {"torchspec/node": "infer-0"}
    - {"torchspec/node": "infer-1"}

The configured node order is preserved. For multi-node inference, this order determines the order of inference engine actors and therefore the node_rank passed to SGLang or vLLM. Within each selected node, bundles are ordered by the actual GPU ID discovered by InfoActor.

The number of configured training nodes must equal training.training_num_nodes. The number of configured inference nodes must match ceil(inference.inference_num_gpus / inference.inference_num_gpus_per_node). For each role, set only one of *_node_ips or *_node_selectors.

Inference across nodes (SglEngine multi-node TP)

When a single model is too large for one node, SglEngine supports multi-node tensor parallelism via inference.sglang.nnodes.

Example: 16-GPU TP across 2 nodes, 8 GPUs each

  inference.inference_num_gpus=16, inference.sglang.nnodes=2, inference.inference_num_gpus_per_node=8

  Factory creates 2 SglEngine actors (one per node):
    engine 0: node_rank=0 (head)   — accepts generate() calls
    engine 1: node_rank=1 (worker) — participates in NCCL TP only
Key Default Description
inference.sglang.nnodes 1 Nodes per inference replica
inference.inference_num_gpus 1 Total inference GPUs across all nodes
inference.inference_num_gpus_per_node 8 GPUs per inference node
inference.sglang.dist_init_addr auto Override dist init address (auto-negotiated if unset)
inference.sglang.dist_timeout 60 Dist init timeout in seconds

Example: 3-node layout

Node 0 (head):   Ray head + 4 training GPUs
Node 1 (worker): 8 inference GPUs (TP node_rank=0, head)
Node 2 (worker): 8 inference GPUs (TP node_rank=1, worker)

training.training_num_nodes=1, training.training_num_gpus_per_node=4
inference.inference_num_gpus=16, inference.sglang.nnodes=2, inference.inference_num_gpus_per_node=8