Skip to content
This repository was archived by the owner on Dec 20, 2025. It is now read-only.

Latest commit

 

History

History
94 lines (72 loc) · 2.82 KB

File metadata and controls

94 lines (72 loc) · 2.82 KB

HPC + Slurm vs Kubernetes + Ray

Comparison of two primary approaches for distributed ML workflows.

Side-by-Side Workflows

HPC + Slurm Workflow

Scenario: Large LLM pretraining on multi-node GPU cluster

# Step 1: Submit Job
$ sbatch train_llm.slurm
# Requests N nodes, M GPUs per node, time limit

# Step 2: Job Allocation
# Slurm queues → allocates nodes exclusively
# Wait time depends on cluster load

# Step 3: Launch Training
$ srun python train.py
# Uses MPI/NCCL for multi-node communication
# All nodes must start together

# Step 4: Monitor
# Slurm provides job stats (GPU, memory, time)
# Logs written to shared storage

# Step 5: Checkpoint & Resume
# Manual checkpoints
# Job resubmission if preempted/fails

Characteristics:

  • Throughput: Maximum
  • Flexibility: Low (queue times, static resources)
  • Fault tolerance: Minimal
  • 🎯 Best for: Long-running, tightly-coupled jobs

2️⃣ Kubernetes + Ray Workflow

Scenario: LLM fine-tuning or multi-task experiments on cloud GPUs

# Step 1: Launch Cluster
# K8s provisions GPU nodes dynamically
# Ray head + workers started as pods
# Auto-scaling enabled

# Step 2: Submit Training Job
from ray.train.torch import TorchTrainer
trainer = TorchTrainer(...)
result = trainer.fit()

# Step 3: Dynamic Resource Management
# Nodes/pods added or removed based on workload
# Multiple jobs run concurrently
# Hyperparameter tuning jobs start/stop independently

# Step 4: Monitor
# Ray dashboard: http://localhost:8265
# K8s dashboard, Prometheus metrics

# Step 5: Checkpoint & Resume
# Built-in checkpointing
# Elastic rescheduling on node failures

Characteristics:

  • Throughput: Slightly lower (network overhead)
  • Flexibility: High (elastic scaling, many workloads)
  • Fault tolerance: Built-in
  • 🎯 Best for: Agile experimentation, production ML pipelines

Feature Comparison

Feature HPC + Slurm Kubernetes + Ray
Resource allocation Fixed, queued Dynamic, elastic
Job start All nodes together Tasks scheduled as pods
Fault tolerance Minimal Automatic recovery
Scaling Hard, manual Easy, auto-scaling
Workload type Single large job Many simultaneous jobs
Monitoring Slurm logs Dashboards, metrics
Flexibility Low High
Peak throughput Max Slightly lower
Network InfiniBand Ethernet (+ RDMA in cloud)

Resources

TL;DR: HPC = max performance for long jobs; K8s+Ray = flexible multi-workload environment.