Commit e466814

neoblizz, github-actions[bot], ryanswann-amd, Ryan Swann, and claude authored
Fused all-gather+GEMM HBM-buffer kernel for iris.ops (#346)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ryan Swann <ryan.swann@amd.com>
Co-authored-by: Ryan Swann <ryaswann@ctr2-alola-ctrl-01.amd.com>
Co-authored-by: Ryan Swann <109695074+ryanswann-amd@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Ryan Swann <ryaswann@amd.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
1 parent 9459a5e commit e466814

33 files changed

Lines changed: 3235 additions & 154 deletions

.github/copilot-instructions.md

Lines changed: 25 additions & 38 deletions
@@ -12,61 +12,48 @@ Iris is a Triton-based framework for Remote Memory Access (RMA) operations on AM
 
 **FOLLOW THESE INSTRUCTIONS EXACTLY. Reference these instructions first before using search or bash commands.**
 
-## Prerequisites
+## GPU Environment
 
-- **GPU**: AMD GPUs with ROCm compatibility (tested on MI300X, MI350X & MI355X)
-  > **Note**: See below for instructions on development without AMD GPU access
-- **ROCm/HIP Toolkit**: Required for building C++/HIP components
-- **Docker/Apptainer**: Recommended for containerized development
+**You have AMD GPU access.** Your runner has 8x MI325X GPUs (gfx942) with ROCm installed. Do NOT skip GPU operations or say you lack GPU access. You can and should:
 
-## Build
+- Run benchmarks and post results (charts, tables) in PR comments
+- Run tests with `torchrun --nproc_per_node=N` for multi-GPU tests
+- Use `rocm-smi` to verify GPU status
+- Run `python -c "import torch; print(torch.cuda.device_count())"` to confirm GPU count
 
-### Docker Development Environment (Recommended)
+When asked to run a benchmark, **run it and post the output**. Do not say you cannot.
+
+### Running multi-GPU tests and benchmarks
+
+Multi-GPU tests require `torch.distributed` initialization before pytest:
 ```bash
-# Build and start development container (takes 45-60 minutes - NEVER CANCEL)
-docker compose up --build -d
+# Single GPU
+pytest tests/unittests/ -v --tb=short
 
-# Attach to running container
-docker attach iris-dev
+# Multi-GPU (N = number of GPUs)
+torchrun --nproc_per_node=N -m pytest tests/ -v --tb=short
 
-# Install Iris in development mode
-cd iris && pip install -e ".[dev]"
+# Benchmarks use iris.bench framework
+torchrun --nproc_per_node=8 benchmark/ops/bench_<name>.py
 ```
 
-### Alternative Docker Setup
-```bash
-# Build Docker image manually
-./docker/build.sh <image-name>  # Takes 45-60 minutes
+### iris.bench framework
 
-# Run container
-./docker/run.sh <image-name>
+Benchmarks use the declarative `iris.bench` framework. See existing `benchmark/ops/bench_*.py` files for examples. Output includes latency, throughput, and bandwidth tables. When posting benchmark results in PR comments, format as markdown tables.
 
-# Install Iris
-cd iris && pip install -e ".[dev]"
-```
+## Prerequisites
 
-### Apptainer Setup
-```bash
-# Build and run Apptainer image
-./apptainer/build.sh
-./apptainer/run.sh
+- **GPU**: AMD GPUs with ROCm compatibility (tested on MI300X, MI325X, MI350X & MI355X)
+- **ROCm/HIP Toolkit**: Required for building C++/HIP components
+- **Docker/Apptainer**: Recommended for containerized development
 
-# Install Iris
-pip install -e ".[dev]"
-```
+## Build
 
-### Local Development (Not Recommended)
+iris is already installed in your environment via `pip install -e .` in the setup steps. You do not need to build or install anything. If you need to reinstall after modifying `setup.py` or C extensions:
 ```bash
-# Requires ROCm/HIP toolkit installation
 pip install -e ".[dev]"
 ```
 
-### Development Without AMD GPU
-If you don't have access to AMD GPUs, you can still contribute to the project:
-- **Code Editing**: Start editing code directly in your local environment
-- **CI Testing**: The project has comprehensive CI pipelines that will test your changes automatically. You can check the CI logs if your changes fail to understand what went wrong.
-- **Local Validation**: Run linting and formatting locally: `ruff check . --fix && ruff format .`
-
 ## Run
 
 ### Testing
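
Note on the new test instructions above: the single-GPU and multi-GPU invocations differ only in whether `pytest` is launched directly or through `torchrun -m`. The following is a minimal sketch of that choice, not part of this commit; `build_test_command` is a hypothetical helper name, and the flags are taken verbatim from the added lines in the diff.

```python
import shlex


def build_test_command(nproc: int, target: str = "tests/") -> list[str]:
    """Return the argv for running pytest on `target` across `nproc` GPUs.

    Hypothetical helper for illustration only; it mirrors the commands
    listed in the updated copilot-instructions.md.
    """
    if nproc < 1:
        raise ValueError("nproc must be >= 1")
    pytest_args = ["-v", "--tb=short"]
    if nproc == 1:
        # Single GPU: plain pytest, no launcher needed.
        return ["pytest", target, *pytest_args]
    # Multi-GPU: torchrun starts one pytest process per GPU.
    return ["torchrun", f"--nproc_per_node={nproc}", "-m", "pytest", target, *pytest_args]


if __name__ == "__main__":
    print(shlex.join(build_test_command(1, "tests/unittests/")))
    print(shlex.join(build_test_command(8)))
```

Returning an argv list rather than a shell string keeps the result safe to pass to `subprocess.run` without quoting concerns; `shlex.join` is used only for display.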

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -29,6 +29,8 @@ omni*.pdf
 slurm*.out
 
 *.egg-info
+*.backup
+*.with_chunked
 
 examples/gemm/results/*
 asm/
@@ -66,3 +68,4 @@ docker/rocm-systems/
 # IntelliKit / Copilot agent artifacts
 .intellikit
 .github/agents/skills/
+docs/benchmark-results/*.png

benchmark/ops/all_gather_matmul/__init__.py

Whitespace-only changes.
