Ray Architecture in TorchSpec

TorchSpec uses Ray as its distributed orchestration layer. All GPU workloads (inference, training, mooncake master) run as Ray actors, coordinated by a central controller.

Package Layout & Actor Hierarchy

RayActor is the base class for all GPU-bound actors. It provides GPU setup, IP discovery, and port allocation so each actor doesn't reinvent them.

torchspec/ray/
├── ray_actor.py                    RayActor base class
├── train_group.py                  RayTrainGroup (training actor group manager)
└── placement_group.py              Placement group creation & GPU resource management

torchspec/inference/engine/
├── hf_engine.py                    HFEngine(InferenceEngine, RayActor)
└── sgl_engine.py                   SglEngine(InferenceEngine, RayActor)

torchspec/training/
├── trainer.py                      Trainer (ABC base)
├── trainer_actor.py                TrainerActor(RayActor) — wraps Eagle3Trainer
└── eagle3_trainer.py               Eagle3Trainer(Trainer) — FSDP2 training logic

torchspec/transfer/mooncake/
└── utils.py                        MooncakeMaster(RayActor)

torchspec/controller/
├── training_controller.py          AsyncTrainingController (standalone Ray actor)
└── inference_manager.py            AsyncInferenceManager (standalone Ray actor)

Placement Groups

Placement groups reserve GPUs for training and inference as a unit and place them on the correct nodes. create_placement_groups(args) is the single entry point.

Mode	Training GPUs	Inference GPUs	Use case
Default	Sliced from unified PG	Sliced from unified PG	Production: deterministic node-to-role assignment
`custom`	Sliced from custom unified PG	Sliced from custom unified PG	Production: explicit node choice with the same unified reservation semantics
`colocate`	Shared PG	Shared PG	Dev: share GPUs between train & inference
`debug_train_only`	Dedicated PG	Empty	Debug training without inference
`debug_inference_only`	Empty	Dedicated PG	Debug inference without training

Each placement group probes bundles with a temporary InfoActor to discover the actual (node IP, GPU ID) mapping, then sorts by (node, GPU ID) for deterministic ordering. In custom mode, TorchSpec sorts by the configured node order first and by physical GPU ID within each selected node.

Ray Cluster Setup

TorchSpec connects to Ray via _ensure_ray_initialized() in placement_group.py. It reads the RAY_ADDRESS environment variable (defaulting to "auto") and calls ray.init(address=...). If no cluster is found, it falls back to starting a local instance.

Single-node (local)

No setup needed — TorchSpec starts a local Ray instance automatically. To pin specific GPUs on a shared machine, use CUDA_VISIBLE_DEVICES:

CUDA_VISIBLE_DEVICES=4,5,6,7 ./examples/qwen3-8b-single-node/run.sh

Note: On shared machines, auto may attach to another user's cluster. Use RAY_ADDRESS=local to force a fresh local instance:
RAY_ADDRESS=local CUDA_VISIBLE_DEVICES=4,5,6,7 ./examples/qwen3-8b-single-node/run.sh

Multi-node (local cluster)

Start Ray manually before launching TorchSpec. Run these commands on each node:

1. Head node (run first):

ray start --head \
  --port 6379 \
  --node-ip-address <HEAD_IP> \
  --num-gpus <N> \
  --temp-dir /tmp/ray_$(id -u) \
  --disable-usage-stats

2. Worker nodes (run after head is up):

ray start \
  --address <HEAD_IP>:6379 \
  --num-gpus <N> \
  --temp-dir /tmp/ray_$(id -u) \
  --disable-usage-stats

3. Run TorchSpec on the head node:

./examples/kimi-k25-3node-h100/run.sh

TorchSpec auto-detects the cluster via address="auto". Worker nodes don't need to be up before the script starts — _wait_for_gpu_resources() will block for up to 300 seconds until all expected GPUs are visible in the cluster.

See examples/kimi-k25-3node-h100/setup_ray_cluster.sh for a concrete 3-node example.

Kubernetes

On Kubernetes, we recommend using the KubeRay operator to manage the Ray cluster lifecycle. KubeRay handles head/worker pod scheduling, autoscaling, and fault recovery. Once the RayCluster resource is running, point TorchSpec at it with:

RAY_ADDRESS=ray://<kuberay-head-svc>:10001 ./examples/kimi-k25-3node-h100/run.sh

NCCL / Gloo networking

On multi-NIC machines, set the network interface explicitly:

export NCCL_SOCKET_IFNAME=<iface>   # e.g. eth0
export GLOO_SOCKET_IFNAME=<iface>
export TP_SOCKET_IFNAME=<iface>

Find your interface with ip -o addr show | grep <your_node_ip>.

Multi-Node Training & Inference Config

Training across nodes

RayTrainGroup creates training_num_nodes × training_num_gpus_per_node actors. The PACK placement strategy spreads them across nodes automatically.

Key	Default	Description
`training.training_num_nodes`	1	Number of training nodes
`training.training_num_gpus_per_node`	1	GPUs per training node

Custom node placement

By default, TorchSpec creates a unified placement group with Ray's PACK strategy, probes the resulting bundles, and assigns the ordered bundles to training or inference according to training.placement_strategy (training_first or inference_first). Set training.placement_strategy: custom to explicitly choose the nodes for each role while still reserving the non-colocated training and inference bundles in a single unified placement group.

IP-based placement uses Ray's per-node resource labels (node:<ip>) and does not require custom Ray labels:

training:
  placement_strategy: custom
  training_num_nodes: 2
  training_num_gpus_per_node: 8
  training_node_ips:
    - 10.0.0.1
    - 10.0.0.3

inference:
  inference_num_gpus: 16
  inference_num_gpus_per_node: 8
  inference_node_ips:
    - 10.0.0.2
    - 10.0.0.4

Ray label selectors are also supported when the installed Ray version supports placement group bundle_label_selector. Start Ray nodes with labels, then use matching selectors in the config:

training:
  placement_strategy: custom
  training_num_nodes: 2
  training_num_gpus_per_node: 8
  training_node_selectors:
    - {"torchspec/node": "trainer-0"}
    - {"torchspec/node": "trainer-1"}

inference:
  inference_node_selectors:
    - {"torchspec/node": "infer-0"}
    - {"torchspec/node": "infer-1"}

The configured node order is preserved. For multi-node inference, this order determines the order of inference engine actors and therefore the node_rank passed to SGLang or vLLM. Within each selected node, bundles are ordered by the actual GPU ID discovered by InfoActor.

The number of configured training nodes must equal training.training_num_nodes. The number of configured inference nodes must match ceil(inference.inference_num_gpus / inference.inference_num_gpus_per_node). For each role, set only one of *_node_ips or *_node_selectors.

Inference across nodes (SglEngine multi-node TP)

When a single model is too large for one node, SglEngine supports multi-node tensor parallelism via inference.sglang.nnodes.

Example: 16-GPU TP across 2 nodes, 8 GPUs each

  inference.inference_num_gpus=16, inference.sglang.nnodes=2, inference.inference_num_gpus_per_node=8

  Factory creates 2 SglEngine actors (one per node):
    engine 0: node_rank=0 (head)   — accepts generate() calls
    engine 1: node_rank=1 (worker) — participates in NCCL TP only

Key	Default	Description
`inference.sglang.nnodes`	1	Nodes per inference replica
`inference.inference_num_gpus`	1	Total inference GPUs across all nodes
`inference.inference_num_gpus_per_node`	8	GPUs per inference node
`inference.sglang.dist_init_addr`	auto	Override dist init address (auto-negotiated if unset)
`inference.sglang.dist_timeout`	60	Dist init timeout in seconds

Example: 3-node layout

Node 0 (head):   Ray head + 4 training GPUs
Node 1 (worker): 8 inference GPUs (TP node_rank=0, head)
Node 2 (worker): 8 inference GPUs (TP node_rank=1, worker)

training.training_num_nodes=1, training.training_num_gpus_per_node=4
inference.inference_num_gpus=16, inference.sglang.nnodes=2, inference.inference_num_gpus_per_node=8

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ray Architecture in TorchSpec

Package Layout & Actor Hierarchy

Placement Groups

Ray Cluster Setup

Single-node (local)

Multi-node (local cluster)

Kubernetes

NCCL / Gloo networking

Multi-Node Training & Inference Config

Training across nodes

Custom node placement

Inference across nodes (SglEngine multi-node TP)

Example: 3-node layout

FilesExpand file tree

ray.md

Latest commit

History

ray.md

File metadata and controls

Ray Architecture in TorchSpec

Package Layout & Actor Hierarchy

Placement Groups

Ray Cluster Setup

Single-node (local)

Multi-node (local cluster)

Kubernetes

NCCL / Gloo networking

Multi-Node Training & Inference Config

Training across nodes

Custom node placement

Inference across nodes (SglEngine multi-node TP)

Example: 3-node layout