# Running Iris on SLURM

This guide covers a practical Iris workflow on SLURM-managed GPU clusters. It is meant to stay generic across clusters while matching the provided Iris scripts, and it works best on clusters where:

- GPU nodes are scheduled with SLURM
- Docker is available on compute nodes, but not necessarily on login nodes
- Fast local storage such as `/scratch` is preferred for builds and test output

## What the provided SLURM script assumes

The repository includes `scripts/run_core_tests_slurm.sh`, a batch wrapper for running `scripts/run_core_tests.sh`.

It **assumes the container image already exists**. It does **not** build `iris-dev` for you.

By default, the script:

- requests 1 node with 4 GPUs
- expects a Docker image named `iris-dev`
- stages the repository into node-local storage when available
- installs Iris in editable mode inside the container
- runs `scripts/run_core_tests.sh`
- copies the per-test logs back to `$HOME/slurm-logs/iris-core-tests-<jobid>/`

If the image is missing, the job fails fast with an explicit error.
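
A minimal sketch of the kind of guard this implies, in case you want the same behavior in your own wrappers (the exact variable name and message in the provided script may differ):

```bash
# Abort early if the expected image is not present on this node.
IMAGE_NAME="${IMAGE_NAME:-iris-dev}"
if ! docker image inspect "$IMAGE_NAME" >/dev/null 2>&1; then
  echo "ERROR: Docker image '$IMAGE_NAME' not found on $(hostname); build it with ./docker/build.sh" >&2
  exit 1
fi
```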

## Fresh-clone workflow

### 1. Clone the repository on shared storage

Clone Iris somewhere visible from both the login node and the compute nodes.

```bash
git clone https://github.com/ROCm/iris.git
cd iris
```

If your cluster provides both shared storage and node-local scratch, keep the source tree on shared storage and let jobs copy into scratch for execution.
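
If you ever need to stage a copy by hand, a rough sketch looks like this; the `/scratch/$USER` layout and the source path are assumptions to adapt to your site:

```bash
# Copy the shared-storage checkout into node-local scratch for faster I/O.
WORK_DIR="/scratch/$USER/iris-work"   # assumed scratch layout; site-specific
mkdir -p "$WORK_DIR"
rsync -a --exclude .git /path/to/iris/ "$WORK_DIR/iris/"
cd "$WORK_DIR/iris"
```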

### 2. Request an interactive GPU allocation

If Docker is only available on worker nodes, first allocate a node and enter it.

```bash
salloc --nodes=1 --gres=gpu:4 --time=02:00:00
srun --pty $SHELL
```

Adjust GPUs, walltime, partition, account, memory, and CPU count to match your site policy.
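
For example, on a site that requires an explicit partition and account (the placeholder values below are assumptions, not anything Iris itself needs):

```bash
salloc --nodes=1 --gres=gpu:4 \
  --partition=<your_partition> --account=<your_account> \
  --cpus-per-task=16 --mem=64G --time=02:00:00
srun --pty $SHELL
```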

### 3. Build the Iris Docker image on the allocated node

```bash
cd /path/to/iris
./docker/build.sh
```

This builds the image with the default name, `iris-dev`.

If you want a custom image name:

```bash
./docker/build.sh my-iris-image
```

You can verify that the image exists with:

```bash
docker image inspect iris-dev
```

### 4. Submit the batch job

From the repository root:

```bash
sbatch scripts/run_core_tests_slurm.sh
```

If you built a custom image:

```bash
sbatch --export=ALL,IMAGE_NAME=my-iris-image scripts/run_core_tests_slurm.sh
```

## Important note about node-local images

Some clusters store Docker images per node rather than in a shared registry-backed cache. In that setup, building `iris-dev` on one node does not guarantee that another node can see it.

If your cluster behaves this way, either:

1. build and submit on the same node,
2. pin the batch job to the node where the image was built, or
3. rebuild the image on the target node.

For example, after building the image on a worker node:

```bash
NODE_NAME=$(hostname)
sbatch -w "$NODE_NAME" scripts/run_core_tests_slurm.sh
```

If your cluster has shared container storage, you can usually omit `-w`.
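
If you are unsure whether a particular node still has the image, you can probe it with a short job step; whether this works directly depends on your site's interactive-job policy:

```bash
# Exit status 0 means the node's local Docker store already has iris-dev.
srun -w "$NODE_NAME" --nodes=1 docker image inspect iris-dev >/dev/null && echo "image present on $NODE_NAME"
```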

## Monitoring the job

Use normal SLURM tools:

```bash
squeue -j <jobid>
sacct -j <jobid>
```

By default, the batch script writes SLURM stdout/stderr to:

```bash
iris_core_tests_<jobid>.out
```

in the directory where `sbatch` was invoked.
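
To follow progress while the job runs:

```bash
tail -f iris_core_tests_<jobid>.out
```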

The per-test logs are copied to:

```bash
$HOME/slurm-logs/iris-core-tests-<jobid>/
```

## Running interactively inside the container

For development on an allocated node, you can also start the container manually:

```bash
./docker/run.sh iris-dev "$(pwd)"
```

Then install Iris in editable mode:

```bash
pip install -e ".[dev]"
```

This is useful when you want to debug failures before switching back to `sbatch`.
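
From inside the container you can then run the same entrypoint the batch wrapper uses:

```bash
./scripts/run_core_tests.sh
```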

## Running example programs under SLURM

Many examples under `examples/` can be run directly with `python ... --num_ranks <N>` after Iris is installed in the container.
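
For instance, using one of the examples referenced later in this guide:

```bash
python examples/00_load/load_bench.py --num_ranks 4
```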

The repository includes a generic example wrapper:

```bash
scripts/run_example_slurm.sh
```

It stages the repository into node-local storage, installs Iris in the container, runs a chosen example script, and copies any `logs/` or `results/` directories back to:

```bash
$HOME/slurm-logs/iris-example-<jobid>/
```

### Generic usage

Submit any repo-relative example script and pass the example arguments after it:

```bash
sbatch scripts/run_example_slurm.sh <example_script> [example args...]
```

For example:

```bash
sbatch scripts/run_example_slurm.sh examples/00_load/load_bench.py --num_ranks 4
sbatch scripts/run_example_slurm.sh examples/13_flash_decode/example_run.py --num_ranks 4
```

### Example: `examples/14_all_gather_gemm`

This example directory provides both a pull-model and a push-model entrypoint.

Pull model:

```bash
sbatch scripts/run_example_slurm.sh \
  examples/14_all_gather_gemm/example_run_pull.py \
  --num_ranks 4
```

Push model:

```bash
sbatch scripts/run_example_slurm.sh \
  examples/14_all_gather_gemm/example_run_push.py \
  --num_ranks 4
```

If your image is node-local, build on a worker node first and optionally pin the submission to that node:

```bash
NODE_NAME=$(hostname)
sbatch -w "$NODE_NAME" scripts/run_example_slurm.sh \
  examples/14_all_gather_gemm/example_run_pull.py \
  --num_ranks 4
```

Use a rank count that matches the number of GPUs allocated to the job.
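
For example, to run the pull-model entrypoint across 8 ranks, request 8 GPUs as well (assuming your nodes have that many):

```bash
sbatch --gres=gpu:8 scripts/run_example_slurm.sh \
  examples/14_all_gather_gemm/example_run_pull.py \
  --num_ranks 8
```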

### Custom image or install method

To use a custom image name:

```bash
sbatch --export=ALL,IMAGE_NAME=my-iris-image scripts/run_example_slurm.sh \
  examples/14_all_gather_gemm/example_run_pull.py \
  --num_ranks 4
```

To change the install method used inside the container:

```bash
sbatch --export=ALL,INSTALL_METHOD=install scripts/run_example_slurm.sh \
  examples/14_all_gather_gemm/example_run_pull.py \
  --num_ranks 4
```

## Customizing the provided batch wrapper

The provided script is intentionally conservative and is meant for a 4-GPU core-test workflow.

Common customizations:

### Use a different image name

```bash
sbatch --export=ALL,IMAGE_NAME=my-iris-image scripts/run_core_tests_slurm.sh
```

### Store copied logs elsewhere

```bash
sbatch --export=ALL,PERSIST_LOG_ROOT=$HOME/my-iris-logs scripts/run_core_tests_slurm.sh
```

### Use a different scratch location

If your cluster does not use `/scratch`, point the job at another fast workspace:

```bash
sbatch --export=ALL,WORK_ROOT=/path/to/local/workdir scripts/run_core_tests_slurm.sh
```

### Change SLURM resources

Either edit the `#SBATCH` lines in `scripts/run_core_tests_slurm.sh`, or override them at submission time:

```bash
sbatch --gres=gpu:4 --time=04:00:00 --cpus-per-task=32 scripts/run_core_tests_slurm.sh
```

The current wrapper is designed around 4 GPUs. `scripts/run_core_tests.sh` includes 1-, 2-, 4-, and 8-rank configurations, and the wrapper automatically skips the 8-rank cases when only 4 GPUs are visible.
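
If your nodes expose 8 GPUs and you want the 8-rank configurations to run as well, request them at submission time (assuming the wrapper's other defaults still fit your site):

```bash
sbatch --gres=gpu:8 scripts/run_core_tests_slurm.sh
```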

## Troubleshooting

### `Docker image iris-dev not found`

Build the image first:

```bash
./docker/build.sh
```

If the image was built on another worker node, submit to that same node or rebuild locally.

### `docker` is not available on the login node

Request an interactive allocation and build the image on the allocated node:

```bash
salloc --nodes=1 --gres=gpu:4 --time=02:00:00
srun --pty $SHELL
./docker/build.sh
```

### I want the job to run from fast local storage

The provided wrapper already stages the repository into node-local storage when possible. If your cluster uses a path other than `/scratch`, set `WORK_ROOT` when submitting.

### I need an Apptainer-based workflow instead

Iris also includes Apptainer support:

```bash
./apptainer/build.sh
./apptainer/run.sh
```

The provided `scripts/run_core_tests_slurm.sh` wrapper is Docker-based, so use the Apptainer scripts directly or create a cluster-specific batch wrapper around them.
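
As a starting point, a minimal and heavily site-specific sketch of such a wrapper is shown below; the SIF filename, the `--rocm` flag usage, and the in-container commands are assumptions to adapt, not part of the provided scripts:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:4
#SBATCH --time=02:00:00

# Assumes ./apptainer/build.sh produced an image file named iris-dev.sif (name is an assumption).
cd /path/to/iris
apptainer exec --rocm iris-dev.sif bash -c \
  'pip install -e ".[dev]" && ./scripts/run_core_tests.sh'
```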