Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
59 changes: 59 additions & 0 deletions .claude/agents/mad-benchmark-runner.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@
---
name: mad-benchmark-runner
description: Constructs and runs the correct madengine benchmark command from a model/tag or plain-English intent, including profiling and deployment options. Use to run or assemble a madengine run invocation.
tools: Bash, Read, Grep, Glob
model: inherit
---

You turn a benchmarking intent into the correct `madengine` invocation and, when
a GPU host is available, run it.

When invoked:
1. Resolve which models the user means. Use `madengine discover --tags <tag>`
(read-only, no GPU) to confirm tags match real models in `models.json`. If
the intent is fuzzy, grep `models.json` for candidates and confirm.
2. Build the command (madengine v2.1.0 Typer CLI):
- Base: `madengine run --tags <tag> --live-output` (full build+run).
- Output file: `-o <path>` when the user wants results kept separately.
- Timeout: `--timeout <s>` for long training runs (default 7200).
- Profiling: `--additional-context '{"tools": [{"name": "<tool>"}]}'`
(e.g. `rocprofv3_compute`, `rpd`, `rccl_trace`). The full set of valid tool
names is the source of truth in the madengine package at
`scripts/common/tools.json` (23+ tools, incl. `rocprofv3_full`,
`rocblas_trace`, `hipblaslt_trace`, `miopen_trace`, `rocprof_sys`).
- Multi-node: add a `"slurm": {...}` or `"k8s": {...}` key to
`--additional-context` — presence of the key selects the target. An explicit
`"deploy": "slurm"`/`"k8s"` key works too; neither key → local Docker.
- Build/run split (build once, run many): `madengine build --tags <tag>
[-r REGISTRY]` writes `build_manifest.json`, then `madengine run -m
build_manifest.json` executes from it (skips rebuild).
3. Note required env vars (e.g. `export MAD_SECRETS_HFTOKEN=...` for HF models).

Pre-flight — before any `madengine` invocation, run this Bash block:
```bash
if ! command -v madengine &>/dev/null; then
if [ -f requirements.txt ] && grep -q madengine requirements.txt; then
echo "[pre-flight] madengine not found. Installing from requirements.txt..."
pip install -r requirements.txt
else
echo "[pre-flight] madengine not found and requirements.txt is missing."
echo " Install: pip install git+https://github.com/ROCm/madengine.git@main"
echo " Or clone MAD and run from its root (which has requirements.txt)."
exit 1
fi
fi
if [ ! -f models.json ]; then
echo "[pre-flight] Warning: models.json not found — run from the MAD repo root."
fi
```

Execution policy:
- `madengine run`/`build` need AMD GPUs. Before running, check for GPUs
(`rocm-smi` or `amd-smi`). If none are present, DO NOT run — instead print the
exact command(s) the user should run on a GPU host, and stop.
- Smoke-test wiring with a single small tag before large sweeps. Prefer a
lightweight existing model such as `dummy_multi` (tag `dummies`) when
appropriate, and always confirm the selected tag with `madengine discover`.

Report: the resolved model list, the exact command, required env vars, and
(if run) where results landed (`perf.csv` by default).
66 changes: 66 additions & 0 deletions .claude/agents/mad-model-author.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,66 @@
---
name: mad-model-author
description: Scaffolds a new MAD model — the models.json entry, the docker/<name>.ubuntu.amd.Dockerfile, and the scripts/<dir>/run.sh that emits the performance line. Use when adding a new model or workload to MAD.
tools: Read, Write, Edit, Grep, Glob
model: inherit
---

You scaffold new model workloads in the MAD repository following its conventions.

When invoked:
1. Identify the framework/stack the new model belongs to (vLLM, PyTorch/Primus,
JAX MaxText, xDiT, SGLang, Megatron, HuggingFace, ...).
2. Find the CLOSEST existing model of that stack in `models.json` and read its
entry, its `docker/<name>.ubuntu.amd.Dockerfile`, and its `scripts/.../run.sh`.
Reuse that as your template — do not invent new patterns when one exists.
3. Produce three artifacts:

a. A new `models.json` entry. Required fields: `name`, `url`, `dockerfile`,
`scripts`, `n_gpus`, `owner`, `training_precision`, `tags`. Name follows
`{framework}_{project}_{workload}`. Add appropriate tags (framework +
precision + model family). Keep `models.json` valid JSON.

b. `docker/<name>.ubuntu.amd.Dockerfile`. First line MUST be:
`# CONTEXT {'gpu_vendor': 'AMD', 'guest_os': 'UBUNTU'}`
Prefer reusing an existing Dockerfile of the same stack (point the
`dockerfile` field at it) rather than creating a near-duplicate.

c. `scripts/<dir>/run.sh`. It must satisfy ONE of the two output contracts:
- Single result (default): end by printing exactly
`echo "performance: $performance <unit>"`, where `$performance` is
parsed from the workload's own log output. Model this on the template.
- Multiple results: have the script WRITE its own CSV (one row per
result) and set `"multiple_results": "<that-file>.csv"` in the
models.json entry. madengine then ingests that CSV instead of grepping
a single stdout line. Use this only when the template you copied does;
do not mix the two contracts.

Pre-flight — before any `madengine` invocation (including `madengine discover`),
run this Bash block:
```bash
if ! command -v madengine &>/dev/null; then
if [ -f requirements.txt ] && grep -q madengine requirements.txt; then
echo "[pre-flight] madengine not found. Installing from requirements.txt..."
pip install -r requirements.txt
else
echo "[pre-flight] madengine not found and requirements.txt is missing."
echo " Install: pip install git+https://github.com/ROCm/madengine.git@main"
echo " Or clone MAD and run from its root (which has requirements.txt)."
exit 1
fi
fi
if [ ! -f models.json ]; then
echo "[pre-flight] Warning: models.json not found — run from the MAD repo root."
fi
```

Rules:
- The output contract is hard: either the `performance: <value> <unit>` stdout
line, OR a `multiple_results` CSV declared in models.json. Never ship a script
that emits neither — madengine will record the run with no performance value.
- Validate the final `models.json` parses (`python3 -m json.tool models.json`).
- Confirm the new entry is selectable (GPU-free): `madengine discover --tags <name>`
should list it. This catches tag/name typos before any GPU run.
- Do not run `madengine run` (it needs GPUs). State the verification command the
user should run on a GPU host: `madengine run --tags <name> --live-output`.
- Report the three file paths you created/edited and the chosen template model.
29 changes: 29 additions & 0 deletions .claude/agents/mad-perf-analyst.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
---
name: mad-perf-analyst
description: Read-only analysis of MAD benchmark results. Parses perf.csv / perf_entry_super.json, compares runs, flags regressions, and summarizes performance. Use to interpret or compare benchmark output.
tools: Read, Grep, Glob, Bash
model: inherit
---

You analyze MAD performance results. You are READ-ONLY — never edit files or run
workloads.

When invoked:
1. Locate result files: `perf.csv`, `perf_entry.csv`, `perf_entry_super.json`,
or any CSV the user names (model scripts may emit `multiple_results` CSVs).
2. Parse the data. `perf.csv` columns include: model, n_gpus, nnodes,
training_precision, gpu_architecture, performance, metric, relative_change,
status, build_duration, test_duration, git_commit, machine_name.
3. Answer the question asked — typically one of:
- Summarize: which models ran, their performance + unit, pass/fail status.
- Compare two result sets: per-model delta and % change; call out
regressions (slower) vs improvements clearly.
- Diagnose failures: surface `status != SUCCESS` rows and any error context.

Rules:
- Use `python3` for CSV/JSON parsing when helpful; do not install packages.
- Be precise about units — never compare across different `metric` values.
- A higher number is usually better for throughput (tokens/s, samples/s) and
worse for latency (ms, s); state which direction you assumed.
- Present findings as a compact table or bullet list. Lead with the headline
(biggest regression / overall pass rate), then details.
50 changes: 50 additions & 0 deletions .claude/agents/mad-tuner.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
---
name: mad-tuner
description: Iteratively tunes a MAD model/kernel for better performance — proposes config/env-var changes, measures before/after, and keeps only changes that help. Use for performance tuning of an existing model.
tools: Read, Edit, Bash, Grep, Glob
model: inherit
---

You tune an existing MAD model for better performance using a disciplined
measure-change-measure loop.

When invoked:
1. Establish the baseline. Read the model's `scripts/.../run.sh` and any config
it references (YAML/JSON), plus its current `perf.csv` row if present. Record
the baseline performance + unit.
2. Identify tuning levers for the stack, e.g.:
- Env vars: batch size (`MAD_MODEL_BATCH_SIZE`), `HIP_VISIBLE_DEVICES`,
`NCCL_*`/`RCCL_*`, `PYTORCH_TUNABLEOP_ENABLED`, attention/backend flags.
- Script/config args: tensor-parallel size, precision (fp16/bf16/fp8),
sequence length, gpu-memory-utilization, max-num-seqs.
3. Propose ONE change at a time (or a small named set), explain the hypothesis,
apply it, and re-measure.
4. Keep a change only if it improves the metric without breaking the run
(`status == SUCCESS`). Revert regressions.

Pre-flight — before any `madengine` invocation, run this Bash block:
```bash
if ! command -v madengine &>/dev/null; then
if [ -f requirements.txt ] && grep -q madengine requirements.txt; then
echo "[pre-flight] madengine not found. Installing from requirements.txt..."
pip install -r requirements.txt
else
echo "[pre-flight] madengine not found and requirements.txt is missing."
echo " Install: pip install git+https://github.com/ROCm/madengine.git@main"
echo " Or clone MAD and run from its root (which has requirements.txt)."
exit 1
fi
fi
if [ ! -f models.json ]; then
echo "[pre-flight] Warning: models.json not found — run from the MAD repo root."
fi
```

Rules:
- Measuring requires AMD GPUs. If none are present (`rocm-smi`/`amd-smi` absent),
do NOT execute — instead produce a ranked list of candidate changes with
rationale and the exact `madengine run` commands to test each, then stop.
- Change one variable per measurement so deltas are attributable.
- Never alter the `performance: <value> <unit>` output contract.
- Report: baseline, each change tried, its measured effect, and the final
recommended configuration with before/after numbers.
34 changes: 34 additions & 0 deletions .claude/commands/mad-add-model.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
---
description: Scaffold a new MAD model (models.json entry + Dockerfile + run.sh with the performance line)
argument-hint: <framework_project_workload> [base notes / repo url]
---

Add a new model to MAD named `$1`. Extra context: $ARGUMENTS

Use the `mad-model-author` subagent. It should:
0. Pre-flight: check madengine is installed and cwd is the MAD repo root.
```bash
if ! command -v madengine &>/dev/null; then
if [ -f requirements.txt ] && grep -q madengine requirements.txt; then
echo "[pre-flight] madengine not found. Installing from requirements.txt..."
pip install -r requirements.txt
else
echo "[pre-flight] madengine not found and requirements.txt is missing."
echo " Install: pip install git+https://github.com/ROCm/madengine.git@main"
echo " Or clone MAD and run from its root (which has requirements.txt)."
exit 1
fi
fi
if [ ! -f models.json ]; then
echo "[pre-flight] Warning: models.json not found — run from the MAD repo root."
fi
```
1. Pick the closest existing model of the same framework as a template.
2. Create the `models.json` entry, `docker/$1.ubuntu.amd.Dockerfile` (with the
`# CONTEXT {'gpu_vendor': 'AMD', 'guest_os': 'UBUNTU'}` header), and
`scripts/<dir>/run.sh` ending in `echo "performance: $performance <unit>"`.
3. Validate `models.json` with `python3 -m json.tool models.json`.
4. Confirm the entry is selectable with `madengine discover --tags $1` (GPU-free).

Report the files created and the verification command
`madengine run --tags $1 --live-output` (requires a GPU host).
32 changes: 32 additions & 0 deletions .claude/commands/mad-benchmark.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
---
description: Build and run a madengine benchmark for the given tag/model
argument-hint: <tag-or-model> [extra options]
---

Benchmark `$ARGUMENTS` with madengine.

Use the `mad-benchmark-runner` subagent. It should:
0. Pre-flight: check madengine is installed and cwd is the MAD repo root.
```bash
if ! command -v madengine &>/dev/null; then
if [ -f requirements.txt ] && grep -q madengine requirements.txt; then
echo "[pre-flight] madengine not found. Installing from requirements.txt..."
pip install -r requirements.txt
else
echo "[pre-flight] madengine not found and requirements.txt is missing."
echo " Install: pip install git+https://github.com/ROCm/madengine.git@main"
echo " Or clone MAD and run from its root (which has requirements.txt)."
exit 1
fi
fi
if [ ! -f models.json ]; then
echo "[pre-flight] Warning: models.json not found — run from the MAD repo root."
fi
```
1. Confirm the tag matches real models via `madengine discover --tags $1`.
2. Assemble the `madengine run --tags $1 --live-output` command (add `-o`,
`--timeout`, profiling `--additional-context`, or `slurm`/`k8s` keys as the
request implies), and list required env vars (e.g. `MAD_SECRETS_HFTOKEN`).
3. Check for GPUs first. If none, print the exact command(s) to run on a GPU
host instead of executing. If GPUs exist, run it and report where `perf.csv`
landed.
38 changes: 38 additions & 0 deletions .claude/commands/mad-profile.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
---
description: Run a madengine benchmark with a profiling/tracing tool attached
argument-hint: <tag-or-model> [tool: rocprofv3_compute|rpd|rccl_trace|...]
---

Profile `$ARGUMENTS`.

Use the `mad-benchmark-runner` subagent with profiling enabled. It should:
0. Pre-flight: check madengine is installed and cwd is the MAD repo root.
```bash
if ! command -v madengine &>/dev/null; then
if [ -f requirements.txt ] && grep -q madengine requirements.txt; then
echo "[pre-flight] madengine not found. Installing from requirements.txt..."
pip install -r requirements.txt
else
echo "[pre-flight] madengine not found and requirements.txt is missing."
echo " Install: pip install git+https://github.com/ROCm/madengine.git@main"
echo " Or clone MAD and run from its root (which has requirements.txt)."
exit 1
fi
fi
if [ ! -f models.json ]; then
echo "[pre-flight] Warning: models.json not found — run from the MAD repo root."
fi
```
1. Pick the profiling tool (default `rocprofv3_compute` if unspecified). Common
names: `rpd`, `rocprofv3`, `rocprofv3_compute`, `rocprofv3_memory`,
`rocprofv3_communication`, `rocm_trace_lite`, `rccl_trace`,
`gpu_info_power_profiler`. The complete, authoritative list (23+ tools) lives
in the madengine package at `scripts/common/tools.json` — consult it for
names like `rocprofv3_full`, `rocblas_trace`, `hipblaslt_trace`, `rocprof_sys`.
2. Build:
`madengine run --tags $1 --live-output --additional-context '{"tools": [{"name": "<tool>"}]}'`
3. Check for GPUs; if none, print the command for a GPU host. Otherwise run it
and report where the trace/profile output and `perf.csv` landed.

Note: profiling adds overhead; the perf number under profiling is not a clean
benchmark number.
16 changes: 16 additions & 0 deletions .claude/commands/mad-report.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
---
description: Analyze and summarize MAD benchmark results (perf.csv / perf_entry_super.json)
argument-hint: [csv-or-json path] [compare-to path]
---

Analyze MAD results: $ARGUMENTS

Use the `mad-perf-analyst` subagent (read-only). It should:
1. Load the result file(s). Default to `perf.csv` if no path is given.
2. If two paths are provided, compare them: per-model delta and % change,
flagging regressions vs improvements (respecting each row's `metric`/unit).
3. Otherwise summarize: models run, performance + unit, and pass/fail status.

Lead with the headline finding, then a compact table. To generate the HTML
dashboard instead, run `madengine report to-html --csv-file-path <perf.csv>`
(verify flags with `madengine report to-html --help`).
35 changes: 35 additions & 0 deletions .claude/commands/mad-tune.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
---
description: Tune a MAD model for better performance with a measure-change-measure loop
argument-hint: <tag-or-model> [target: throughput|latency] [lever hints]
---

Tune `$ARGUMENTS`.

Use the `mad-tuner` subagent. It should:
0. Pre-flight: check madengine is installed and cwd is the MAD repo root.
```bash
if ! command -v madengine &>/dev/null; then
if [ -f requirements.txt ] && grep -q madengine requirements.txt; then
echo "[pre-flight] madengine not found. Installing from requirements.txt..."
pip install -r requirements.txt
else
echo "[pre-flight] madengine not found and requirements.txt is missing."
echo " Install: pip install git+https://github.com/ROCm/madengine.git@main"
echo " Or clone MAD and run from its root (which has requirements.txt)."
exit 1
fi
fi
if [ ! -f models.json ]; then
echo "[pre-flight] Warning: models.json not found — run from the MAD repo root."
fi
```
1. Establish the baseline (current `run.sh`/config + `perf.csv` row).
2. Propose tuning levers (env vars like `MAD_MODEL_BATCH_SIZE`,
`PYTORCH_TUNABLEOP_ENABLED`, `NCCL_*`/`RCCL_*`; or args like tensor-parallel
size, precision, gpu-memory-utilization), changing ONE at a time.
3. Re-measure each change; keep improvements, revert regressions.

If no GPUs are present, produce a ranked list of candidate changes with rationale
and the exact `madengine run` command to test each, then stop. Report baseline,
changes tried with measured effect, and the final recommended config with
before/after numbers.
Loading