diff --git a/.claude/agents/mad-benchmark-runner.md b/.claude/agents/mad-benchmark-runner.md new file mode 100644 index 0000000..240c172 --- /dev/null +++ b/.claude/agents/mad-benchmark-runner.md @@ -0,0 +1,59 @@ +--- +name: mad-benchmark-runner +description: Constructs and runs the correct madengine benchmark command from a model/tag or plain-English intent, including profiling and deployment options. Use to run or assemble a madengine run invocation. +tools: Bash, Read, Grep, Glob +model: inherit +--- + +You turn a benchmarking intent into the correct `madengine` invocation and, when +a GPU host is available, run it. + +When invoked: +1. Resolve which models the user means. Use `madengine discover --tags ` + (read-only, no GPU) to confirm tags match real models in `models.json`. If + the intent is fuzzy, grep `models.json` for candidates and confirm. +2. Build the command (madengine v2.1.0 Typer CLI): + - Base: `madengine run --tags --live-output` (full build+run). + - Output file: `-o ` when the user wants results kept separately. + - Timeout: `--timeout ` for long training runs (default 7200). + - Profiling: `--additional-context '{"tools": [{"name": ""}]}'` + (e.g. `rocprofv3_compute`, `rpd`, `rccl_trace`). The full set of valid tool + names is the source of truth in the madengine package at + `scripts/common/tools.json` (23+ tools, incl. `rocprofv3_full`, + `rocblas_trace`, `hipblaslt_trace`, `miopen_trace`, `rocprof_sys`). + - Multi-node: add a `"slurm": {...}` or `"k8s": {...}` key to + `--additional-context` — presence of the key selects the target. An explicit + `"deploy": "slurm"`/`"k8s"` key works too; neither key → local Docker. + - Build/run split (build once, run many): `madengine build --tags + [-r REGISTRY]` writes `build_manifest.json`, then `madengine run -m + build_manifest.json` executes from it (skips rebuild). +3. Note required env vars (e.g. `export MAD_SECRETS_HFTOKEN=...` for HF models). + +Pre-flight — before any `madengine` invocation, run this Bash block: +```bash +if ! command -v madengine &>/dev/null; then + if [ -f requirements.txt ] && grep -q madengine requirements.txt; then + echo "[pre-flight] madengine not found. Installing from requirements.txt..." + pip install -r requirements.txt + else + echo "[pre-flight] madengine not found and requirements.txt is missing." + echo " Install: pip install git+https://github.com/ROCm/madengine.git@main" + echo " Or clone MAD and run from its root (which has requirements.txt)." + exit 1 + fi +fi +if [ ! -f models.json ]; then + echo "[pre-flight] Warning: models.json not found — run from the MAD repo root." +fi +``` + +Execution policy: +- `madengine run`/`build` need AMD GPUs. Before running, check for GPUs + (`rocm-smi` or `amd-smi`). If none are present, DO NOT run — instead print the + exact command(s) the user should run on a GPU host, and stop. +- Smoke-test wiring with a single small tag before large sweeps. Prefer a + lightweight existing model such as `dummy_multi` (tag `dummies`) when + appropriate, and always confirm the selected tag with `madengine discover`. + +Report: the resolved model list, the exact command, required env vars, and +(if run) where results landed (`perf.csv` by default). diff --git a/.claude/agents/mad-model-author.md b/.claude/agents/mad-model-author.md new file mode 100644 index 0000000..410d8a5 --- /dev/null +++ b/.claude/agents/mad-model-author.md @@ -0,0 +1,66 @@ +--- +name: mad-model-author +description: Scaffolds a new MAD model — the models.json entry, the docker/.ubuntu.amd.Dockerfile, and the scripts//run.sh that emits the performance line. Use when adding a new model or workload to MAD. +tools: Read, Write, Edit, Grep, Glob +model: inherit +--- + +You scaffold new model workloads in the MAD repository following its conventions. + +When invoked: +1. Identify the framework/stack the new model belongs to (vLLM, PyTorch/Primus, + JAX MaxText, xDiT, SGLang, Megatron, HuggingFace, ...). +2. Find the CLOSEST existing model of that stack in `models.json` and read its + entry, its `docker/.ubuntu.amd.Dockerfile`, and its `scripts/.../run.sh`. + Reuse that as your template — do not invent new patterns when one exists. +3. Produce three artifacts: + + a. A new `models.json` entry. Required fields: `name`, `url`, `dockerfile`, + `scripts`, `n_gpus`, `owner`, `training_precision`, `tags`. Name follows + `{framework}_{project}_{workload}`. Add appropriate tags (framework + + precision + model family). Keep `models.json` valid JSON. + + b. `docker/.ubuntu.amd.Dockerfile`. First line MUST be: + `# CONTEXT {'gpu_vendor': 'AMD', 'guest_os': 'UBUNTU'}` + Prefer reusing an existing Dockerfile of the same stack (point the + `dockerfile` field at it) rather than creating a near-duplicate. + + c. `scripts//run.sh`. It must satisfy ONE of the two output contracts: + - Single result (default): end by printing exactly + `echo "performance: $performance "`, where `$performance` is + parsed from the workload's own log output. Model this on the template. + - Multiple results: have the script WRITE its own CSV (one row per + result) and set `"multiple_results": ".csv"` in the + models.json entry. madengine then ingests that CSV instead of grepping + a single stdout line. Use this only when the template you copied does; + do not mix the two contracts. + +Pre-flight — before any `madengine` invocation (including `madengine discover`), +run this Bash block: +```bash +if ! command -v madengine &>/dev/null; then + if [ -f requirements.txt ] && grep -q madengine requirements.txt; then + echo "[pre-flight] madengine not found. Installing from requirements.txt..." + pip install -r requirements.txt + else + echo "[pre-flight] madengine not found and requirements.txt is missing." + echo " Install: pip install git+https://github.com/ROCm/madengine.git@main" + echo " Or clone MAD and run from its root (which has requirements.txt)." + exit 1 + fi +fi +if [ ! -f models.json ]; then + echo "[pre-flight] Warning: models.json not found — run from the MAD repo root." +fi +``` + +Rules: +- The output contract is hard: either the `performance: ` stdout + line, OR a `multiple_results` CSV declared in models.json. Never ship a script + that emits neither — madengine will record the run with no performance value. +- Validate the final `models.json` parses (`python3 -m json.tool models.json`). +- Confirm the new entry is selectable (GPU-free): `madengine discover --tags ` + should list it. This catches tag/name typos before any GPU run. +- Do not run `madengine run` (it needs GPUs). State the verification command the + user should run on a GPU host: `madengine run --tags --live-output`. +- Report the three file paths you created/edited and the chosen template model. diff --git a/.claude/agents/mad-perf-analyst.md b/.claude/agents/mad-perf-analyst.md new file mode 100644 index 0000000..c01932a --- /dev/null +++ b/.claude/agents/mad-perf-analyst.md @@ -0,0 +1,29 @@ +--- +name: mad-perf-analyst +description: Read-only analysis of MAD benchmark results. Parses perf.csv / perf_entry_super.json, compares runs, flags regressions, and summarizes performance. Use to interpret or compare benchmark output. +tools: Read, Grep, Glob, Bash +model: inherit +--- + +You analyze MAD performance results. You are READ-ONLY — never edit files or run +workloads. + +When invoked: +1. Locate result files: `perf.csv`, `perf_entry.csv`, `perf_entry_super.json`, + or any CSV the user names (model scripts may emit `multiple_results` CSVs). +2. Parse the data. `perf.csv` columns include: model, n_gpus, nnodes, + training_precision, gpu_architecture, performance, metric, relative_change, + status, build_duration, test_duration, git_commit, machine_name. +3. Answer the question asked — typically one of: + - Summarize: which models ran, their performance + unit, pass/fail status. + - Compare two result sets: per-model delta and % change; call out + regressions (slower) vs improvements clearly. + - Diagnose failures: surface `status != SUCCESS` rows and any error context. + +Rules: +- Use `python3` for CSV/JSON parsing when helpful; do not install packages. +- Be precise about units — never compare across different `metric` values. +- A higher number is usually better for throughput (tokens/s, samples/s) and + worse for latency (ms, s); state which direction you assumed. +- Present findings as a compact table or bullet list. Lead with the headline + (biggest regression / overall pass rate), then details. diff --git a/.claude/agents/mad-tuner.md b/.claude/agents/mad-tuner.md new file mode 100644 index 0000000..32d0d18 --- /dev/null +++ b/.claude/agents/mad-tuner.md @@ -0,0 +1,50 @@ +--- +name: mad-tuner +description: Iteratively tunes a MAD model/kernel for better performance — proposes config/env-var changes, measures before/after, and keeps only changes that help. Use for performance tuning of an existing model. +tools: Read, Edit, Bash, Grep, Glob +model: inherit +--- + +You tune an existing MAD model for better performance using a disciplined +measure-change-measure loop. + +When invoked: +1. Establish the baseline. Read the model's `scripts/.../run.sh` and any config + it references (YAML/JSON), plus its current `perf.csv` row if present. Record + the baseline performance + unit. +2. Identify tuning levers for the stack, e.g.: + - Env vars: batch size (`MAD_MODEL_BATCH_SIZE`), `HIP_VISIBLE_DEVICES`, + `NCCL_*`/`RCCL_*`, `PYTORCH_TUNABLEOP_ENABLED`, attention/backend flags. + - Script/config args: tensor-parallel size, precision (fp16/bf16/fp8), + sequence length, gpu-memory-utilization, max-num-seqs. +3. Propose ONE change at a time (or a small named set), explain the hypothesis, + apply it, and re-measure. +4. Keep a change only if it improves the metric without breaking the run + (`status == SUCCESS`). Revert regressions. + +Pre-flight — before any `madengine` invocation, run this Bash block: +```bash +if ! command -v madengine &>/dev/null; then + if [ -f requirements.txt ] && grep -q madengine requirements.txt; then + echo "[pre-flight] madengine not found. Installing from requirements.txt..." + pip install -r requirements.txt + else + echo "[pre-flight] madengine not found and requirements.txt is missing." + echo " Install: pip install git+https://github.com/ROCm/madengine.git@main" + echo " Or clone MAD and run from its root (which has requirements.txt)." + exit 1 + fi +fi +if [ ! -f models.json ]; then + echo "[pre-flight] Warning: models.json not found — run from the MAD repo root." +fi +``` + +Rules: +- Measuring requires AMD GPUs. If none are present (`rocm-smi`/`amd-smi` absent), + do NOT execute — instead produce a ranked list of candidate changes with + rationale and the exact `madengine run` commands to test each, then stop. +- Change one variable per measurement so deltas are attributable. +- Never alter the `performance: ` output contract. +- Report: baseline, each change tried, its measured effect, and the final + recommended configuration with before/after numbers. diff --git a/.claude/commands/mad-add-model.md b/.claude/commands/mad-add-model.md new file mode 100644 index 0000000..f5acb64 --- /dev/null +++ b/.claude/commands/mad-add-model.md @@ -0,0 +1,34 @@ +--- +description: Scaffold a new MAD model (models.json entry + Dockerfile + run.sh with the performance line) +argument-hint: [base notes / repo url] +--- + +Add a new model to MAD named `$1`. Extra context: $ARGUMENTS + +Use the `mad-model-author` subagent. It should: +0. Pre-flight: check madengine is installed and cwd is the MAD repo root. + ```bash + if ! command -v madengine &>/dev/null; then + if [ -f requirements.txt ] && grep -q madengine requirements.txt; then + echo "[pre-flight] madengine not found. Installing from requirements.txt..." + pip install -r requirements.txt + else + echo "[pre-flight] madengine not found and requirements.txt is missing." + echo " Install: pip install git+https://github.com/ROCm/madengine.git@main" + echo " Or clone MAD and run from its root (which has requirements.txt)." + exit 1 + fi + fi + if [ ! -f models.json ]; then + echo "[pre-flight] Warning: models.json not found — run from the MAD repo root." + fi + ``` +1. Pick the closest existing model of the same framework as a template. +2. Create the `models.json` entry, `docker/$1.ubuntu.amd.Dockerfile` (with the + `# CONTEXT {'gpu_vendor': 'AMD', 'guest_os': 'UBUNTU'}` header), and + `scripts//run.sh` ending in `echo "performance: $performance "`. +3. Validate `models.json` with `python3 -m json.tool models.json`. +4. Confirm the entry is selectable with `madengine discover --tags $1` (GPU-free). + +Report the files created and the verification command +`madengine run --tags $1 --live-output` (requires a GPU host). diff --git a/.claude/commands/mad-benchmark.md b/.claude/commands/mad-benchmark.md new file mode 100644 index 0000000..39f5354 --- /dev/null +++ b/.claude/commands/mad-benchmark.md @@ -0,0 +1,32 @@ +--- +description: Build and run a madengine benchmark for the given tag/model +argument-hint: [extra options] +--- + +Benchmark `$ARGUMENTS` with madengine. + +Use the `mad-benchmark-runner` subagent. It should: +0. Pre-flight: check madengine is installed and cwd is the MAD repo root. + ```bash + if ! command -v madengine &>/dev/null; then + if [ -f requirements.txt ] && grep -q madengine requirements.txt; then + echo "[pre-flight] madengine not found. Installing from requirements.txt..." + pip install -r requirements.txt + else + echo "[pre-flight] madengine not found and requirements.txt is missing." + echo " Install: pip install git+https://github.com/ROCm/madengine.git@main" + echo " Or clone MAD and run from its root (which has requirements.txt)." + exit 1 + fi + fi + if [ ! -f models.json ]; then + echo "[pre-flight] Warning: models.json not found — run from the MAD repo root." + fi + ``` +1. Confirm the tag matches real models via `madengine discover --tags $1`. +2. Assemble the `madengine run --tags $1 --live-output` command (add `-o`, + `--timeout`, profiling `--additional-context`, or `slurm`/`k8s` keys as the + request implies), and list required env vars (e.g. `MAD_SECRETS_HFTOKEN`). +3. Check for GPUs first. If none, print the exact command(s) to run on a GPU + host instead of executing. If GPUs exist, run it and report where `perf.csv` + landed. diff --git a/.claude/commands/mad-profile.md b/.claude/commands/mad-profile.md new file mode 100644 index 0000000..9a45b1d --- /dev/null +++ b/.claude/commands/mad-profile.md @@ -0,0 +1,38 @@ +--- +description: Run a madengine benchmark with a profiling/tracing tool attached +argument-hint: [tool: rocprofv3_compute|rpd|rccl_trace|...] +--- + +Profile `$ARGUMENTS`. + +Use the `mad-benchmark-runner` subagent with profiling enabled. It should: +0. Pre-flight: check madengine is installed and cwd is the MAD repo root. + ```bash + if ! command -v madengine &>/dev/null; then + if [ -f requirements.txt ] && grep -q madengine requirements.txt; then + echo "[pre-flight] madengine not found. Installing from requirements.txt..." + pip install -r requirements.txt + else + echo "[pre-flight] madengine not found and requirements.txt is missing." + echo " Install: pip install git+https://github.com/ROCm/madengine.git@main" + echo " Or clone MAD and run from its root (which has requirements.txt)." + exit 1 + fi + fi + if [ ! -f models.json ]; then + echo "[pre-flight] Warning: models.json not found — run from the MAD repo root." + fi + ``` +1. Pick the profiling tool (default `rocprofv3_compute` if unspecified). Common + names: `rpd`, `rocprofv3`, `rocprofv3_compute`, `rocprofv3_memory`, + `rocprofv3_communication`, `rocm_trace_lite`, `rccl_trace`, + `gpu_info_power_profiler`. The complete, authoritative list (23+ tools) lives + in the madengine package at `scripts/common/tools.json` — consult it for + names like `rocprofv3_full`, `rocblas_trace`, `hipblaslt_trace`, `rocprof_sys`. +2. Build: + `madengine run --tags $1 --live-output --additional-context '{"tools": [{"name": ""}]}'` +3. Check for GPUs; if none, print the command for a GPU host. Otherwise run it + and report where the trace/profile output and `perf.csv` landed. + +Note: profiling adds overhead; the perf number under profiling is not a clean +benchmark number. diff --git a/.claude/commands/mad-report.md b/.claude/commands/mad-report.md new file mode 100644 index 0000000..5bbb74b --- /dev/null +++ b/.claude/commands/mad-report.md @@ -0,0 +1,16 @@ +--- +description: Analyze and summarize MAD benchmark results (perf.csv / perf_entry_super.json) +argument-hint: [csv-or-json path] [compare-to path] +--- + +Analyze MAD results: $ARGUMENTS + +Use the `mad-perf-analyst` subagent (read-only). It should: +1. Load the result file(s). Default to `perf.csv` if no path is given. +2. If two paths are provided, compare them: per-model delta and % change, + flagging regressions vs improvements (respecting each row's `metric`/unit). +3. Otherwise summarize: models run, performance + unit, and pass/fail status. + +Lead with the headline finding, then a compact table. To generate the HTML +dashboard instead, run `madengine report to-html --csv-file-path ` +(verify flags with `madengine report to-html --help`). diff --git a/.claude/commands/mad-tune.md b/.claude/commands/mad-tune.md new file mode 100644 index 0000000..11ed8aa --- /dev/null +++ b/.claude/commands/mad-tune.md @@ -0,0 +1,35 @@ +--- +description: Tune a MAD model for better performance with a measure-change-measure loop +argument-hint: [target: throughput|latency] [lever hints] +--- + +Tune `$ARGUMENTS`. + +Use the `mad-tuner` subagent. It should: +0. Pre-flight: check madengine is installed and cwd is the MAD repo root. + ```bash + if ! command -v madengine &>/dev/null; then + if [ -f requirements.txt ] && grep -q madengine requirements.txt; then + echo "[pre-flight] madengine not found. Installing from requirements.txt..." + pip install -r requirements.txt + else + echo "[pre-flight] madengine not found and requirements.txt is missing." + echo " Install: pip install git+https://github.com/ROCm/madengine.git@main" + echo " Or clone MAD and run from its root (which has requirements.txt)." + exit 1 + fi + fi + if [ ! -f models.json ]; then + echo "[pre-flight] Warning: models.json not found — run from the MAD repo root." + fi + ``` +1. Establish the baseline (current `run.sh`/config + `perf.csv` row). +2. Propose tuning levers (env vars like `MAD_MODEL_BATCH_SIZE`, + `PYTORCH_TUNABLEOP_ENABLED`, `NCCL_*`/`RCCL_*`; or args like tensor-parallel + size, precision, gpu-memory-utilization), changing ONE at a time. +3. Re-measure each change; keep improvements, revert regressions. + +If no GPUs are present, produce a ranked list of candidate changes with rationale +and the exact `madengine run` command to test each, then stop. Report baseline, +changes tried with measured effect, and the final recommended config with +before/after numbers. diff --git a/.claude/commands/mad-validate.md b/.claude/commands/mad-validate.md new file mode 100644 index 0000000..7b6fc52 --- /dev/null +++ b/.claude/commands/mad-validate.md @@ -0,0 +1,92 @@ +--- +description: Statically validate MAD model entries (no GPU) — JSON, paths, Dockerfile header, output contract +argument-hint: [tag-or-model | all] +--- + +Statically validate MAD model definitions for `${ARGUMENTS:-all}`. This is a +GPU-free lint of `models.json` and the files it points at — it does NOT build or +run anything. + +Run the checks below (use `python3`; do not install packages). Scope to the +named tag/model if one is given, otherwise check every entry. + +Two severities. **Errors** break a run; **warnings** are missing MAD-convention +metadata (madengine itself defaults these — only `name` is structurally +required, per its `Model` dataclass — so don't fail the build on them). + +Errors: +1. **JSON parses**: `python3 -m json.tool models.json` succeeds. +2. **Has `name`** and **names are unique** (no two entries share one). +3. **Dockerfile exists**: `.ubuntu.amd.Dockerfile` is a real file, + and its first line is the context header + `# CONTEXT {'gpu_vendor': 'AMD', 'guest_os': 'UBUNTU'}`. +4. **Script exists**: the `scripts` path exists (a `run.sh` file or a dir). +5. **Output contract** — exactly one of: + - the run script contains a line that echoes `performance: ...`, OR + - the entry sets `multiple_results` to a CSV filename. + Flag any entry that satisfies neither (madengine would record no perf value). + +Warnings (convention metadata, non-fatal): +6. Missing any of `url`, `owner`, `training_precision`, `tags`, `n_gpus`. + +Suggested one-shot checker: + +```bash +python3 - "${ARGUMENTS:-all}" <<'PY' +import json, os, sys, glob +sel = sys.argv[1] if len(sys.argv) > 1 else "all" +models = json.load(open("models.json")) +def selected(m): + return sel in ("all","") or sel == m.get("name") or sel in (m.get("tags") or []) +seen, errors, warns = {}, [], [] +for m in models: + n = m.get("name","") + if not selected(m): + continue + # --- errors (break a run) --- + if "name" not in m: errors.append(f"{n}: missing 'name'") + if n in seen: errors.append(f"{n}: duplicate name") + seen[n] = True + df = (m.get("dockerfile","") or "") + ".ubuntu.amd.Dockerfile" + if not m.get("dockerfile"): + errors.append(f"{n}: no dockerfile field") + elif not os.path.isfile(df): + errors.append(f"{n}: dockerfile not found: {df}") + else: + first = open(df).readline().strip() + if not first.startswith("# CONTEXT") or "AMD" not in first: + errors.append(f"{n}: dockerfile missing CONTEXT header: {df}") + sp = m.get("scripts","") + if not sp: + errors.append(f"{n}: no scripts field") + elif not os.path.exists(sp): + errors.append(f"{n}: scripts path not found: {sp}") + has_mr = bool(m.get("multiple_results")) + emits = False + if sp: + sh_files = [sp] if sp.endswith(".sh") else glob.glob(os.path.join(sp,"**","*.sh"), recursive=True) + for sh in sh_files: + if os.path.isfile(sh) and "performance:" in open(sh, errors="ignore").read(): + emits = True; break + if not (has_mr or emits): + errors.append(f"{n}: no output contract (no 'performance:' line and no multiple_results)") + # --- warnings (convention metadata) --- + for f in ("url","owner","training_precision","tags","n_gpus"): + if not m.get(f): + warns.append(f"{n}: missing convention field '{f}'") +checked = sum(1 for m in models if selected(m)) +print(f"Checked {checked} model(s). {len(errors)} error(s), {len(warns)} warning(s).") +if errors: + print("\nERRORS:") + for e in errors: print(" -", e) +if warns: + print("\nwarnings:") + for w in warns[:40]: print(" -", w) + if len(warns) > 40: print(f" ... and {len(warns)-40} more") +sys.exit(1 if errors else 0) +PY +``` + +Report a compact pass/fail summary. If anything fails, list each problem with the +model name and how to fix it. This is the right check to run after `/mad-add-model` +and before any GPU run. diff --git a/.claude/settings.json b/.claude/settings.json new file mode 100644 index 0000000..b4bbaa2 --- /dev/null +++ b/.claude/settings.json @@ -0,0 +1,18 @@ +{ + "permissions": { + "allow": [ + "Bash(madengine discover *)", + "Bash(madengine run *)", + "Bash(madengine build *)", + "Bash(python3 -m json.tool *)", + "Bash(git status)", + "Bash(git diff *)", + "Bash(git log *)", + "Bash(rocm-smi *)", + "Bash(amd-smi *)", + "Read", + "Grep", + "Glob" + ] + } +} diff --git a/.claude/workflows/mad-benchmark-sweep.js b/.claude/workflows/mad-benchmark-sweep.js new file mode 100644 index 0000000..9bb47bf --- /dev/null +++ b/.claude/workflows/mad-benchmark-sweep.js @@ -0,0 +1,188 @@ +export const meta = { + name: 'mad-benchmark-sweep', + description: 'Fan out a benchmark matrix across MAD models/tags, run (or plan) each with madengine, parse results, and synthesize a comparison table', + phases: [ + { title: 'Resolve', detail: 'expand the sweep matrix into concrete madengine invocations' }, + { title: 'Run', detail: 'execute or plan each benchmark in parallel' }, + { title: 'Synthesize', detail: 'parse results into a comparison table' }, + ], +} + +// ARGS — accepts EITHER: +// (a) a structured object: { tags: string[], nGpus?: string[], execute?: boolean, +// additionalContext?: string, timeout?: number } +// (b) a CLI-style string mirroring `/mad-benchmark`, e.g. +// --tags pyt_vllm_qwen3-8b,bert --additional-context '{"gpu_vendor":"AMD",...}' --live-output +// +// Multiple tags: pass several after --tags and/or comma-separate them. Each tag +// becomes one sweep cell. --additional-context is threaded verbatim into EVERY +// cell command (so HF tokens / docker_env_vars reach each run). +// +// EXECUTE semantics: runs for real by default. Pass --no-execute (or --plan, or +// execute:false) to only validate commands + resolve tags without touching GPUs. +// Even when executing, each cell first checks for AMD GPUs and self-skips if none. +// +// NOTE: there is intentionally no precision axis. In MAD, precision is NOT a +// runtime `madengine run` flag: for training it is fixed per-model via +// `training_precision` in models.json, and for inference it is baked into the +// pre-trained model/image. To compare precisions, list the distinct model tags. + +// --- tokenizer: split a CLI string respecting single/double quotes --- +function tokenize(s) { + const out = [] + let cur = '', q = null + for (let i = 0; i < s.length; i++) { + const c = s[i] + if (q) { + if (c === q) q = null + else cur += c + } else if (c === '"' || c === "'") { + q = c + } else if (c === ' ' || c === '\t' || c === '\n') { + if (cur) { out.push(cur); cur = '' } + } else { + cur += c + } + } + if (cur) out.push(cur) + return out +} + +// --- parse a CLI string into the structured cfg the sweep uses --- +function parseCli(s) { + const toks = tokenize(s) + const cfg = { tags: [], nGpus: [], execute: true } + for (let i = 0; i < toks.length; i++) { + const t = toks[i] + const next = () => toks[++i] + if (t === '--tags' || t === '-t') { + // collect following non-flag tokens, comma-splitting each + while (i + 1 < toks.length && !toks[i + 1].startsWith('-')) { + next().split(',').forEach(x => { if (x) cfg.tags.push(x) }) + } + } else if (t === '--additional-context' || t === '-c') { + cfg.additionalContext = next() + } else if (t === '--ngpus' || t === '--n-gpus') { + next().split(',').forEach(x => { if (x) cfg.nGpus.push(x) }) + } else if (t === '--timeout') { + cfg.timeout = next() + } else if (t === '-o' || t === '--output') { + cfg.output = next() + } else if (t === '--no-execute' || t === '--plan' || t === '--dry-run') { + cfg.execute = false + } else if (t === '--execute') { + cfg.execute = true + } + // --live-output and unknown flags are ignored (live-output is always added below) + } + return cfg +} + +// --- normalize args (string | object | undefined) into one cfg --- +let cfg +if (typeof args === 'string') { + cfg = parseCli(args) +} else if (args && typeof args === 'object') { + cfg = { execute: true, ...args } + if (typeof cfg.tags === 'string') cfg.tags = [cfg.tags] +} else { + cfg = { tags: [], nGpus: [], execute: true } +} + +const tags = (cfg.tags && cfg.tags.length) ? cfg.tags : ['bert'] +const nGpus = (cfg.nGpus && cfg.nGpus.length) ? cfg.nGpus : [null] +const execute = cfg.execute !== false +const addlCtx = cfg.additionalContext || null +const timeout = cfg.timeout || null + +// Build the matrix of {tag, nGpus} cells. +const matrix = [] +for (const tag of tags) + for (const g of nGpus) + matrix.push({ tag, nGpus: g }) + +log(`Sweep matrix: ${matrix.length} cell(s) across ${tags.length} tag(s) [${tags.join(', ')}]. execute=${execute}${addlCtx ? ', additional-context threaded' : ''}`) + +const CELL_SCHEMA = { + type: 'object', + required: ['cell', 'command', 'status'], + properties: { + cell: { type: 'string' }, + command: { type: 'string' }, + status: { type: 'string', enum: ['planned', 'ran', 'skipped', 'error'] }, + performance: { type: ['number', 'null'] }, + metric: { type: ['string', 'null'] }, + notes: { type: 'string' }, + }, +} + +const results = await parallel(matrix.map((cell, i) => () => { + const label = `${cell.tag}${cell.nGpus ? '/' + cell.nGpus + 'gpu' : ''}` + // Each cell writes its OWN perf file so parallel runs never clobber a shared + // perf.csv. A safe filename is derived from the label. + const safe = label.replace(/[^A-Za-z0-9._-]/g, '_') + const outFile = `perf_${safe}.csv` + const flags = [ + '--tags ' + cell.tag, + '--live-output', + '-o ' + outFile, + ] + if (timeout) flags.push('--timeout ' + timeout) + // Thread the user-supplied --additional-context verbatim into every cell. If a + // per-cell n_gpus axis is set, try to merge it into the JSON; fall back to the + // raw string + a separate context if merge isn't possible. + let ctx = '' + if (addlCtx) { + let merged = addlCtx + if (cell.nGpus) { + try { + const obj = JSON.parse(addlCtx) + obj.n_gpus = cell.nGpus + merged = JSON.stringify(obj) + } catch (e) { /* leave raw; n_gpus axis ignored for this cell */ } + } + ctx = ` --additional-context '${merged}'` + } else if (cell.nGpus) { + ctx = ` --additional-context '{"n_gpus": "${cell.nGpus}"}'` + } + const cmd = `madengine run ${flags.join(' ')}${ctx}` + + const action = execute + ? `If AMD GPUs are present (check rocm-smi/amd-smi), RUN this command from the /home/ysha/MAD repo root and parse results. This model may emit a "performance: " stdout line OR (if it is a multiple_results model) write per-metric rows to its own CSV — in that case report the primary throughput metric. Results land in ${outFile}. If no GPUs, set status "skipped".` + : `Do NOT execute. Validate the command is well-formed and the tag resolves via "madengine discover --tags ${cell.tag}". Set status "planned".` + + return agent( + `MAD benchmark sweep cell "${label}". +Pre-flight: before running madengine, verify it is installed: + if ! command -v madengine &>/dev/null; then + if [ -f requirements.txt ] && grep -q madengine requirements.txt; then + pip install -r requirements.txt + else + echo "[pre-flight] madengine not found. Install: pip install git+https://github.com/ROCm/madengine.git@main"; exit 1 + fi + fi + [ -f models.json ] || echo "[pre-flight] Warning: not in MAD repo root." +Command: ${cmd} +${action} +Return the cell label, the exact command, status, and performance/metric if available.`, + { label: `bench:${label}`, phase: 'Run', schema: CELL_SCHEMA } + ).then(r => ({ ...r, _label: label, _output: outFile })) +})) + +phase('Synthesize') +const clean = results.filter(Boolean) +const summary = await agent( + `Synthesize a MAD benchmark sweep into a markdown comparison table. +Cells (JSON, each with its own _output perf file): ${JSON.stringify(clean)} +Produce: (1) a table with columns Cell | Command | Status | Performance | Metric; +(2) a short headline (how many planned/ran/skipped/errored); +(3) explicitly flag any cell whose status is "error" or whose tag did not resolve — +do not let a failed/unresolved cell read as success; +(4) if any ran, the fastest and slowest cells (respecting that metric units differ); +(5) note that each cell wrote a separate perf_.csv (to avoid clobbering) and +that they can be concatenated for a combined report. +Do not invent numbers — only use provided data.`, + { phase: 'Synthesize' } +) + +return { execute, cells: clean, report: summary } diff --git a/.claude/workflows/mad-tune-search.js b/.claude/workflows/mad-tune-search.js new file mode 100644 index 0000000..9e1fd52 --- /dev/null +++ b/.claude/workflows/mad-tune-search.js @@ -0,0 +1,344 @@ +export const meta = { + name: 'mad-tune-search', + description: 'Profile a MAD model to diagnose its bottleneck, generate profiling-informed tuning candidates, measure each cleanly + adversarially verify, and synthesize the winning config', + phases: [ + { title: 'Baseline', detail: 'read the model + establish a clean baseline number' }, + { title: 'Diagnose', detail: 'profile once to find hotspots and classify the bottleneck' }, + { title: 'Propose', detail: 'generate distinct candidates, each citing profiling evidence' }, + { title: 'Evaluate', detail: 'measure each candidate cleanly (sequential) and verify its delta' }, + { title: 'Synthesize', detail: 'pick the winning configuration' }, + ], +} + +// ARGS — accepts EITHER: +// (a) a structured object: { tag, target?, candidates?, execute?, +// additionalContext?, timeout?, profileTool? } +// (b) a CLI-style string mirroring `/mad-benchmark` / `/mad-profile`, e.g. +// --tag pyt_vllm_qwen3-8b --candidates 6 \ +// --additional-context '{"gpu_vendor":"AMD","tools":[{"name":"rocm_trace_lite"}],...}' +// +// WHY profiling is split from measurement (the core design): +// - DIAGNOSE uses a profiler (tools[]) to find WHAT to tune: kernel hotspots, +// and whether the workload is compute- / memory- / communication- / launch-bound. +// - EVALUATE measures WHETHER a change helped using CLEAN runs (profiler OFF), +// because tracing overhead (e.g. rocm_trace_lite intercepts ~all GPU ops) buries +// the small perf deltas tuning produces and would make verify reject real wins. +// So we run the profiler exactly ONCE (Diagnose), then strip the `tools` key from +// the context for the baseline and every candidate measurement. +// +// EXECUTE semantics: runs for real by default. Pass --no-execute / --plan to only +// diagnose-from-static + produce the ranked candidate plan without GPU measurement. +// +// SEQUENTIAL evaluation: candidates are measured one at a time — parallel GPU runs +// contend (corrupting deltas) and parallel run.sh/config edits race on the same files. + +// --- tokenizer: split a CLI string respecting single/double quotes --- +function tokenize(s) { + const out = [] + let cur = '', q = null + for (let i = 0; i < s.length; i++) { + const c = s[i] + if (q) { + if (c === q) q = null + else cur += c + } else if (c === '"' || c === "'") { + q = c + } else if (c === ' ' || c === '\t' || c === '\n') { + if (cur) { out.push(cur); cur = '' } + } else { + cur += c + } + } + if (cur) out.push(cur) + return out +} + +// --- parse a CLI string into the structured cfg the search uses --- +function parseCli(s) { + const toks = tokenize(s) + const cfg = { execute: true } + const tags = [] + for (let i = 0; i < toks.length; i++) { + const t = toks[i] + const next = () => toks[++i] + if (t === '--tags' || t === '--tag' || t === '-t') { + while (i + 1 < toks.length && !toks[i + 1].startsWith('-')) { + next().split(',').forEach(x => { if (x) tags.push(x) }) + } + } else if (t === '--additional-context' || t === '-c') { + cfg.additionalContext = next() + } else if (t === '--target') { + cfg.target = next() + } else if (t === '--candidates' || t === '-n') { + cfg.candidates = parseInt(next(), 10) + } else if (t === '--profile-tool') { + cfg.profileTool = next() + } else if (t === '--timeout') { + cfg.timeout = next() + } else if (t === '--no-execute' || t === '--plan' || t === '--dry-run') { + cfg.execute = false + } else if (t === '--execute') { + cfg.execute = true + } + } + if (tags.length) cfg.tag = tags[0] + if (tags.length > 1) cfg._extraTags = tags.slice(1) + return cfg +} + +// --- normalize args (string | object | undefined) into one cfg --- +let cfg +if (typeof args === 'string') { + cfg = parseCli(args) +} else if (args && typeof args === 'object') { + cfg = { execute: true, ...args } +} else { + cfg = { execute: true } +} + +const tag = cfg.tag || 'bert' +const target = cfg.target || 'throughput' +const nCandidates = Math.max(1, Math.min(cfg.candidates || 4, 8)) +const execute = cfg.execute !== false +const timeout = cfg.timeout || null + +// --- Split the user's --additional-context into a PROFILED and a CLEAN variant. --- +// profiledCtx: keeps (or adds) a `tools` entry → used once in Diagnose. +// cleanCtx: `tools` key removed → used for baseline + every candidate measurement. +let ctxObj = null +if (cfg.additionalContext) { + try { ctxObj = JSON.parse(cfg.additionalContext) } + catch (e) { log(`Warning: --additional-context is not valid JSON; passing it through verbatim and skipping clean/profiled split.`) } +} + +let cleanCtx = null // JSON string, tools stripped +let profiledCtx = null // JSON string, tools present +let profileToolName = cfg.profileTool || null + +if (ctxObj && typeof ctxObj === 'object') { + // derive the profiling tool name from the context if not given explicitly + if (!profileToolName && Array.isArray(ctxObj.tools) && ctxObj.tools[0] && ctxObj.tools[0].name) + profileToolName = ctxObj.tools[0].name + if (!profileToolName) profileToolName = 'rocm_trace_lite' + + const cleanObj = { ...ctxObj } + delete cleanObj.tools + cleanCtx = JSON.stringify(cleanObj) + + const profObj = { ...ctxObj, tools: [{ name: profileToolName }] } + profiledCtx = JSON.stringify(profObj) +} else if (cfg.additionalContext) { + // not parseable as object — use verbatim for clean; still attempt a profiled variant + cleanCtx = cfg.additionalContext + if (!profileToolName) profileToolName = 'rocm_trace_lite' +} + +const cleanCtxFlag = cleanCtx ? ` --additional-context '${cleanCtx}'` : '' +const profCtxFlag = profiledCtx ? ` --additional-context '${profiledCtx}'` + : ` --additional-context '{"tools": [{"name": "${profileToolName || 'rocm_trace_lite'}"}]}'` + +if (cfg._extraTags && cfg._extraTags.length) + log(`Note: mad-tune-search tunes ONE model. Using "${tag}"; ignoring extra tags [${cfg._extraTags.join(', ')}]. Use mad-benchmark-sweep to compare multiple models.`) +log(`Tuning "${tag}" for ${target}. execute=${execute}. profile tool=${profileToolName || 'rocm_trace_lite'}. ${cleanCtx ? 'context split into clean+profiled' : 'no additional-context'}.`) + +// ── Phase 1: Baseline (CLEAN) ────────────────────────────────────────────── +phase('Baseline') +const BASELINE_SCHEMA = { + type: 'object', + required: ['model', 'scriptPath', 'baselineSummary'], + properties: { + model: { type: 'string' }, + scriptPath: { type: 'string' }, + baselineSummary: { type: 'string' }, + multipleResults: { type: ['string', 'null'] }, + baselinePerf: { type: ['number', 'null'] }, + baselineMetric: { type: ['string', 'null'] }, + levers: { type: 'array', items: { type: 'string' } }, + }, +} +const baseAction = execute + ? `If AMD GPUs are present, establish a CLEAN baseline number (profiler OFF): either reuse the most recent clean perf row for this model if one exists in perf.csv / the model's results CSV, or run "madengine run --tags ${tag} --live-output -o perf_tune_baseline.csv${cleanCtxFlag}" once and parse it. Report baselinePerf + baselineMetric.` + : `Do NOT run anything. Read the static config only; leave baselinePerf null.` +const baseline = await agent( + `In the MAD repo, establish the tuning baseline for tag "${tag}" (target: ${target}). +Pre-flight: before running madengine, verify it is installed: + if ! command -v madengine &>/dev/null; then + if [ -f requirements.txt ] && grep -q madengine requirements.txt; then + pip install -r requirements.txt + else + echo "[pre-flight] madengine not found. Install: pip install git+https://github.com/ROCm/madengine.git@main"; exit 1 + fi + fi + [ -f models.json ] || echo "[pre-flight] Warning: not in MAD repo root." +Find its models.json entry, its scripts/.../run.sh, and any config it references. +Summarize the current configuration and list the tuning levers available for this +stack (env vars like MAD_MODEL_BATCH_SIZE / PYTORCH_TUNABLEOP_ENABLED / NCCL_*/RCCL_*, +or args like tensor-parallel size, precision, gpu-memory-utilization, max-num-seqs). +Report whether the entry sets "multiple_results" (its CSV filename, or null). +${baseAction}`, + { phase: 'Baseline', schema: BASELINE_SCHEMA } +) +const multiCsv = baseline.multipleResults || null +const parseHint = multiCsv + ? `This is a multiple_results model — performance is NOT a stdout line; after the run read the per-metric CSV "${multiCsv}" (and the -o file) and report the primary ${target} metric.` + : `Parse the "performance: " line from stdout.` + +// ── Phase 2: Diagnose (PROFILED, once) ───────────────────────────────────── +phase('Diagnose') +const DIAG_SCHEMA = { + type: 'object', + required: ['bottleneck', 'evidence', 'hotspots'], + properties: { + bottleneck: { type: 'string', enum: ['compute', 'memory', 'communication', 'launch', 'mixed', 'unknown'] }, + evidence: { type: 'string' }, + hotspots: { type: 'array', items: { + type: 'object', + required: ['name', 'pct'], + properties: { name: { type: 'string' }, pct: { type: ['number', 'null'] }, kind: { type: 'string' } }, + } }, + recommendedLevers: { type: 'array', items: { type: 'string' } }, + tracePath: { type: ['string', 'null'] }, + }, +} +const diagAction = execute + ? `If AMD GPUs are present, profile the model ONCE: prefer reusing an existing recent trace dir (e.g. rocm_trace_lite_output/) if present and current; otherwise run "madengine run --tags ${tag} --live-output -o perf_tune_profiled.csv${profCtxFlag}". Then read the trace summary (e.g. trace_summary.txt / the trace db) to extract the top GPU-kernel hotspots with their % of GPU time. If no GPUs, classify from static config + known model characteristics and set tracePath null.` + : `Do NOT run anything. Classify the likely bottleneck from the static config, model architecture, and any EXISTING trace output already on disk (e.g. rocm_trace_lite_output/trace_summary.txt). Set tracePath to that file if found, else null.` +const diag = await agent( + `Diagnose the performance bottleneck of "${baseline.model}" to guide ${target} tuning. +Baseline config: ${baseline.baselineSummary} +${diagAction} + +Classify the bottleneck as exactly one of: compute, memory, communication, launch, mixed, unknown — using these best-practice signals: +- compute: GEMM/conv kernels (e.g. Cijk*, *gemm*, wvSplitK) dominate GPU time; high occupancy. + Levers: PYTORCH_TUNABLEOP_ENABLED, hipBLASLt tuning, fp8/lower precision, larger batch. +- memory: attention/elementwise/norm/copy kernels dominate; low arithmetic intensity / bandwidth-bound. + Levers: KV-cache dtype, kernel fusion, gpu-memory-utilization, batch size. +- communication: RCCL/NCCL collectives (AllReduce/AllGather) are a large share (multi-GPU). + Levers: NCCL_*/RCCL_* tuning, tensor-parallel vs pipeline-parallel topology. +- launch: many tiny kernels with timeline gaps / low GPU utilization. + Levers: HIP/CUDA graphs, larger batch, fewer sync points. +Return bottleneck, evidence (cite the hotspot %s), hotspots[], recommendedLevers[], tracePath.`, + { phase: 'Diagnose', schema: DIAG_SCHEMA } +) +log(`Diagnosis: ${diag.bottleneck}-bound. Top hotspots: ${(diag.hotspots || []).slice(0, 3).map(h => `${h.name} ${h.pct != null ? h.pct + '%' : ''}`).join(', ')}`) + +// ── Phase 3: Propose (profiling-informed) ────────────────────────────────── +phase('Propose') +const CANDS_SCHEMA = { + type: 'object', + required: ['candidates'], + properties: { + candidates: { type: 'array', items: { + type: 'object', + required: ['id', 'change', 'hypothesis', 'evidence'], + properties: { + id: { type: 'string' }, + change: { type: 'string' }, + hypothesis: { type: 'string' }, + evidence: { type: 'string' }, + leverKind: { type: 'string', enum: ['env', 'arg', 'config', 'other'] }, + }, + } }, + }, +} +const proposed = await agent( + `Propose ${nCandidates} DISTINCT tuning candidates to improve ${target} for "${baseline.model}". +Baseline: ${baseline.baselineSummary} +Diagnosis: ${diag.bottleneck}-bound. Evidence: ${diag.evidence} +Top hotspots: ${JSON.stringify((diag.hotspots || []).slice(0, 5))} +Recommended levers from diagnosis: ${(diag.recommendedLevers || []).join(', ')} +Generic available levers: ${(baseline.levers || []).join(', ')} + +Each candidate changes ONE lever (so its effect is attributable) and should TARGET the +diagnosed bottleneck. Return id, change (concrete value + how to apply it), hypothesis, +evidence (which hotspot/diagnosis finding motivates it), and leverKind: +"env" (environment variable), "arg" (run-script/CLI arg), "config" (config-file value), or "other".`, + { phase: 'Propose', schema: CANDS_SCHEMA } +) +const candidates = (proposed.candidates || []).slice(0, nCandidates) +log(`Evaluating ${candidates.length} candidate(s) sequentially, CLEAN (profiler off). execute=${execute}`) + +// ── Phase 4: Evaluate (CLEAN, sequential) + adversarial verify ───────────── +const EVAL_SCHEMA = { + type: 'object', + required: ['id', 'status'], + properties: { + id: { type: 'string' }, + status: { type: 'string', enum: ['planned', 'ran', 'skipped', 'error'] }, + command: { type: 'string' }, + performance: { type: ['number', 'null'] }, + metric: { type: ['string', 'null'] }, + notes: { type: 'string' }, + }, +} +const VERDICT_SCHEMA = { + type: 'object', + required: ['id', 'trustworthy'], + properties: { + id: { type: 'string' }, + trustworthy: { type: 'boolean' }, + reason: { type: 'string' }, + }, +} + +const ctxHint = cleanCtx + ? `A CLEAN base --additional-context (profiler OFF) is in the command. If this candidate's leverKind is "env", apply the change by MERGING the env var into that context's "docker_env_vars" object (show the merged command you actually run); do not export it separately.` + : `If leverKind is "env", apply by adding --additional-context '{"docker_env_vars": {"VAR": "value"}}' (or exporting the var).` + +const evaluated = [] +for (const cand of candidates) { + const safeId = String(cand.id).replace(/[^A-Za-z0-9._-]/g, '_') + const outFile = `perf_tune_${safeId}.csv` + const flags = ['--tags ' + tag, '--live-output', '-o ' + outFile] + if (timeout) flags.push('--timeout ' + timeout) + const cmd = `madengine run ${flags.join(' ')}${cleanCtxFlag}` + const action = execute + ? `If AMD GPUs are present (check rocm-smi/amd-smi), apply the change, RUN the command CLEAN (profiler OFF), parse the result, then REVERT any file/config edit so the next candidate starts clean. ${parseHint} Results land in ${outFile}. If no GPUs, set status "skipped".` + : `Do NOT execute. Produce the exact command (plus precisely how to apply the change) and set status "planned".` + + const evalResult = await agent( + `Evaluate tuning candidate ${cand.id} for "${baseline.model}" (leverKind: ${cand.leverKind || 'other'}). +Change: ${cand.change} +Hypothesis: ${cand.hypothesis} +Motivating evidence: ${cand.evidence} +Base command (CLEAN): ${cmd} +${ctxHint} +${action} +Return id, status, the exact command you ran, and performance/metric if measured.`, + { label: `eval:${cand.id}`, phase: 'Evaluate', schema: EVAL_SCHEMA } + ) + + const verdict = await agent( + `Adversarially review tuning candidate ${cand.id}. +Baseline: ${baseline.baselinePerf != null ? baseline.baselinePerf + ' ' + (baseline.baselineMetric || '') : baseline.baselineSummary} +Claimed result: ${JSON.stringify(evalResult)} +Hypothesis was: ${cand.hypothesis} +Is the conclusion trustworthy? Default to trustworthy=false if the change was not actually +measured (planned/skipped), if only one sample was taken, if it ran under a profiler, or if +the delta vs baseline is within run-to-run noise (~1-2%). Return id, trustworthy, reason.`, + { label: `verify:${cand.id}`, phase: 'Evaluate', schema: VERDICT_SCHEMA } + ) + evaluated.push({ ...evalResult, _change: cand.change, _evidence: cand.evidence, verdict }) +} + +// ── Phase 5: Synthesize ──────────────────────────────────────────────────── +phase('Synthesize') +const clean = evaluated.filter(Boolean) +const report = await agent( + `Synthesize a MAD tuning search for "${baseline.model}" (target: ${target}). +Baseline: ${baseline.baselineSummary}${baseline.baselinePerf != null ? ` (clean baseline: ${baseline.baselinePerf} ${baseline.baselineMetric || ''})` : ''} +Diagnosis: ${diag.bottleneck}-bound — ${diag.evidence} +Top hotspots: ${JSON.stringify((diag.hotspots || []).slice(0, 5))} +Evaluated candidates (JSON, each with a verdict): ${JSON.stringify(clean)} + +Produce: +(1) a one-line headline: the bottleneck and the best trustworthy improvement (or that none beat baseline); +(2) the diagnosis summary (what's the bottleneck and why, citing hotspot %s); +(3) a table: Candidate | Change | Evidence | Status | Performance | vs Baseline | Trustworthy; +(4) the recommended configuration — ONLY from candidates whose verdict is trustworthy; if none ran/were trustworthy, present the ranked plan to test on a GPU host; +(5) the exact next command(s) to run. +Each candidate wrote its own perf_tune_.csv (no shared-file clobbering). Do not invent measurements.`, + { phase: 'Synthesize' } +) + +return { model: baseline.model, target, execute, bottleneck: diag.bottleneck, diagnosis: diag, candidates: clean, report } diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..0c5648f --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,130 @@ +# CLAUDE.md + +Guidance for Claude Code when working in the MAD repository. + +## What MAD is + +MAD (Model Automation and Dashboarding) is a curated catalog of ~142 AI model +workloads that run on AMD Instinct GPUs via the **`madengine`** CLI. Each model +is one JSON entry in `models.json`. `madengine run --tags ` builds a Docker +image, runs the model in a container, parses a performance number from stdout, +and appends a row to `perf.csv`. + +The four common user tasks here are: **benchmarking**, **adding a new model**, +**tuning** a model/kernel for better perf, and **development** on the repo +itself. Dedicated subagents and `/mad-*` slash commands exist for each — see +`.claude/agents/` and `.claude/commands/`. + +## The performance contract (most important convention) + +A model's run script MUST print exactly one line to stdout: + +``` +performance: +``` + +madengine greps this line to populate the `performance` and `metric` columns of +`perf.csv`. A run that does not emit it is recorded with no performance value. +Real example (`scripts/huggingface_bert/run.sh`): + +```bash +performance=$(grep -Eo "train_samples_per_second':[^,]+" log.txt | sed "s/.*: //g" | head -n 1) +echo "performance: $performance samples_per_second" +``` + +If a script produces many results, set `multiple_results` in `models.json` to the +CSV filename the script writes (e.g. `"multiple_results": "perf_DeepSeek-R1.csv"`). + +## models.json schema + +One object per model. Required fields: `name`, `url`, `dockerfile`, `scripts`, +`n_gpus`, `owner`, `training_precision`, `tags`. Optional: `data`, `timeout`, +`multiple_results`, `args`, `skip_gpu_arch`. + +| Field | Notes | +|-------|-------| +| `name` | Unique. Convention: `{framework}_{project}_{workload}`, e.g. `pyt_vllm_deepseek-r1` | +| `url` | Git repo cloned into the container; `""` if none | +| `dockerfile` | Path prefix; engine appends `.ubuntu.amd.Dockerfile` → `docker/.ubuntu.amd.Dockerfile` | +| `scripts` | Path to a `run.sh` (or script dir) executed inside the container | +| `n_gpus` | String. `"-1"` = all available | +| `tags` | List; `--tags` matches any tag OR the exact `name` | +| `training_precision` | e.g. `fp16`, `bf16`, `fp8`, `fp32`, or `""` | +| `timeout` | Seconds; overrides the 7200s default. `-1` = no timeout | +| `skip_gpu_arch` | Skip on these archs, e.g. `"gfx942"` | +| `args` | Extra args appended to the run script | + +`models.json` must stay valid JSON — validate with `python3 -m json.tool models.json` +after editing. + +## Adding a new model (4 steps) + +1. **Name** it `{framework}_{project}_{workload}`. +2. **Add** an entry to `models.json` (copy the closest existing model of the same + framework as a template). +3. **Dockerfile** `docker/.ubuntu.amd.Dockerfile`. First line MUST be the + context header, then a base image: + ```dockerfile + # CONTEXT {'gpu_vendor': 'AMD', 'guest_os': 'UBUNTU'} + ARG BASE_DOCKER=rocm/pytorch + FROM $BASE_DOCKER + ``` + Reuse an existing `docker/*.Dockerfile` of the same stack when possible rather + than inventing a new base. +4. **Script** `scripts//run.sh` that runs the workload and ends by echoing + the `performance: ` line. + +Verify (needs a GPU host): `madengine run --tags --live-output`. + +## madengine commands + +`requirements.txt` pins madengine to `@main` (Typer CLI, v2.1.0). It exposes +five commands: `build`, `run`, `discover`, `report`, `database`. Verify exact +flags with `madengine --help`. + +- `madengine discover --tags ` — list matching models (read-only, no GPU). +- `madengine build --tags [-r REGISTRY] [-a gfx942,gfx90a] [-l]` — build images, write `build_manifest.json`. `--use-image` skips the build and uses a prebuilt image. +- `madengine run --tags [-l/--live-output] [-o out.csv] [--timeout S] [--keep-alive] [--skip-model-run] [-c '']` — full build+run (or, with `-m manifest.json`, execution-only). Writes `perf.csv`. +- `madengine report to-html --csv-file-path perf.csv` — HTML report. +- `madengine database -f perf_entry_super.json --db DB -c COLLECTION` — upload CSV/JSON to MongoDB (`-k model,timestamp` sets the unique key; `--dry-run` to preview). + +`--skip-model-run` builds the image without running the workload. +`tools/run_models.py` is the **deprecated** legacy runner — prefer `madengine`. + +## Deployment target (convention over configuration) + +Pass deployment intent via `--additional-context` (a Python-dict/JSON string, +parsed with `ast.literal_eval` so Python dict syntax is fine). The target is +inferred from which key is present: + +- `"slurm"` key → SLURM. `"k8s"`/`"kubernetes"` key → Kubernetes. Neither → local Docker. + +## Profiling and tracing + +Add a `tools` list to `--additional-context`: + +```bash +madengine run --tags --additional-context '{"tools": [{"name": "rocprofv3_compute"}]}' +``` + +Common tool names: `rpd`, `rocprofv3`, `rocprofv3_compute`, `rocprofv3_memory`, +`rocprofv3_communication`, `rocm_trace_lite`, `rccl_trace`, `gpu_info_power_profiler`. +Pre-built profiling context files live in the madengine package's +`examples/profiling-configs/`. + +## Outputs + +- `perf.csv` — one row per run (model, n_gpus, precision, performance, metric, status, durations, git_commit, gpu_architecture, ...). +- `perf_entry_super.json` — enriched record incl. a `configs` block; this is what gets pushed to MongoDB. + +## Environment variables + +`MAD_SECRETS_HFTOKEN` (HuggingFace token), `MAD_MODEL_NAME`, `MAD_RUNTIME_NGPUS`, +`MAD_SYSTEM_GPU_ARCHITECTURE`, `MAD_MODEL_BATCH_SIZE`. + +## Future (not yet wired) + +`reference_db/mad_agent.db` (tables: `model_baselines`, `optimization_history`, +`best_configurations`, `learned_patterns`) and `knowledge_base/` are planned +for a future persistent optimization-memory layer. They are not present or +populated yet — do not assume data is available. diff --git a/mad-automation-howto.html b/mad-automation-howto.html new file mode 100644 index 0000000..c4ab4be --- /dev/null +++ b/mad-automation-howto.html @@ -0,0 +1,2097 @@ + + + + + +MAD Claude Automation — How-To Guide + + + + + + + + +
+ + +
+
+

MAD Claude Automation

+

+ A practical guide to benchmarking, adding, tuning, and analyzing AI model workloads + on AMD Instinct GPUs — using Claude Code's slash commands, agents, and workflows. +

+
+
+
+
142
+
models
+
+
+
6
+
commands
+
+
+
4
+
agents
+
+
+
2
+
workflows
+
+
+ + +
+

1 What is MAD

+ +

+ MAD (Model Automation and Dashboarding) is a curated catalog of 142 AI model + workloads designed to run on AMD Instinct GPUs. Each model is one JSON entry in + models.json. The madengine CLI handles the full lifecycle: + building a Docker image, running the model inside a container, parsing a performance number + from stdout, and appending a row to perf.csv. +

+ +

+ Claude Code automation sits on top of madengine. Instead of memorizing flags and file + conventions, you type a slash command — Claude dispatches the right subagent which constructs + and (when a GPU host is available) executes the correct madengine invocation. +

+ +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
User taskSlash commandSubagent
Run a benchmark/mad-benchmarkmad-benchmark-runner
Add a new model/mad-add-modelmad-model-author
Profile / trace a model/mad-profilemad-benchmark-runner
Analyze results/mad-reportmad-perf-analyst
Tune for better perf/mad-tunemad-tuner
Validate model entries/mad-validate(inline script)
+
+ +

+ Two multi-agent workflows handle matrix benchmarking and systematic + tuning search: mad-benchmark-sweep and mad-tune-search. + These fan out parallel subagent calls and synthesize results into comparison tables. +

+
+ + +
+

2 Architecture

+ +

+ The system has three layers: the user interface (slash commands you type in + Claude Code), subagents (specialized Claude instances with tool access), and + the madengine CLI plus the MAD repository files. +

+ +
+ + + + + + + + + + + + + + + + + + + + + + + + USER + AGENTS + ENGINE + + + + + + + SLASH COMMANDS + WORKFLOWS + + + + + /mad-benchmark + build · run + + + + /mad-profile + + trace tool + + + + /mad-add-model + scaffold + + + + /mad-report + analyze csv + + + + /mad-tune + measure-loop + + + + /mad-validate + GPU-free lint + + + + mad-benchmark-sweep + matrix → compare + + + mad-tune-search + diagnose → tune + + + + + benchmark-runner + Bash · Read · Grep · Glob + run & profile + + + + model-author + Read · Write · Edit · Glob + scaffold files + + + + perf-analyst + Read · Grep · Bash (r/o) + parse & compare + + + + tuner + Read · Edit · Bash · Grep + measure-change + + + + inline script + json · paths · header + no subagent · GPU-free + + + + + madengine + build · run · discover · report · database + Docker / SLURM / Kubernetes + + + + MAD repo files + models.json · docker/*.Dockerfile · scripts/run.sh + perf.csv · perf_entry_super.json (performance: N unit) + + + + AMD GPU + required for build / run + + + + + + + + + + + + + + fan-out + + + + + + + + + + + + + + + + + + + + + +
+ +

+ command → agent + workflow fan-out + agent → engine / files + GPU-free path + needs AMD GPU +

+ +

+ Five of the six slash commands dispatch a subagent; /mad-validate runs an + inline script (no subagent, no GPU). The two workflows fan out to + multiple subagent calls in parallel — mad-benchmark-sweep drives the + benchmark-runner across a tag matrix, mad-tune-search drives the tuner through a + diagnose-then-measure search — then synthesize the results. Agents that need to run madengine + always check for AMD GPUs first; if none are present they produce a plan (the exact commands + to run on a GPU host) instead of executing. +

+
+ + +
+

3 Quick-Start: What Do You Want to Do?

+ +
+
+
Run one benchmark
+
+
/mad-benchmark pyt_vllm_llama-3.1-8b
+
+
+
Benchmark multiple tags at once
+
+
/mad-benchmark-sweep --tags bert,deepseek
+
+
+
Add a brand-new model
+
+
/mad-add-model pyt_vllm_gemma-3-27b https://github.com/...
+
+
+
Attach a GPU profiler
+
+
/mad-profile bert rocprofv3_compute
+
+
+
Read or compare benchmark results
+
+
/mad-report perf.csv perf_old.csv
+
+
+
Improve throughput of a model
+
+
/mad-tune bert throughput
+
+
+
Validate files before a GPU run
+
+
/mad-validate bert
+
+
+
Systematic tuning search (N candidates)
+
+
/mad-tune-search --tag bert --candidates 6
+
+
+ +
+
GPU host check
+

+ madengine build and run require AMD GPUs. On a machine without GPUs, + agents automatically switch to plan mode — they validate the command and print + the exact invocation for you to run on a GPU host. GPU-free commands (discover, validate) + always work. +

+
+
+ + +
+

4 The Performance Contract

+ +

+ This is the single most important MAD convention. Every model's run.sh must + emit exactly one line to stdout that madengine can grep. Without it, + madengine records the run with no performance value. +

+ +
+
Required output — single result
+
performance: <value> <unit>
+

+ For example: performance: 3847.2 tokens_per_second or + performance: 124 samples_per_second. +

+
+ +

Implementing the line in run.sh

+ +
copy# Parse from the workload's own log output, then emit the contract line
+performance=$(grep -Eo "train_samples_per_second':[^,]+" log.txt \
+  | sed "s/.*: //g" | head -n 1)
+echo "performance: $performance samples_per_second"
+ +

Alternative: multiple results

+ +

+ When a script produces many rows (e.g. one per concurrency level), have the script write + its own CSV and declare "multiple_results" in models.json: +

+ +
copy# models.json entry
+{
+  "name": "pyt_vllm_deepseek-r1",
+  ...
+  "multiple_results": "perf_DeepSeek-R1.csv"
+}
+ +
+
Hard rule
+

+ A run script must satisfy exactly one of the two contracts above. + Satisfying neither means madengine records a run with no performance value. + The /mad-validate command checks this statically before any GPU run. +

+
+
+ + +
+

5 Slash Command Reference

+ +
+
+
+ +
+
+
+ /mad-benchmark + command + benchmark-runner +
+
Build and run a madengine benchmark for the given tag or model name.
+
+
+
+

Usage

+
copy/mad-benchmark <tag-or-model> [extra options]
+ +

Examples

+
copy/mad-benchmark bert
+/mad-benchmark pyt_vllm_llama-3.1-8b --timeout 3600
+/mad-benchmark deepseek --output results/deepseek.csv
+ +

What the agent does

+
    +
  1. Runs madengine discover --tags <tag> to confirm the tag resolves (GPU-free).
  2. +
  3. Assembles: madengine run --tags <tag> --live-output [-o path] [--timeout S].
  4. +
  5. Checks for AMD GPUs via rocm-smi / amd-smi.
  6. +
  7. On GPU host: runs the command and reports where perf.csv landed.
  8. +
  9. No GPU: prints the exact command to run on a GPU host and stops.
  10. +
+ +

Expected output

+
    +
  • A perf.csv row with the performance value and unit.
  • +
  • Live stdout from the model during the run (via --live-output).
  • +
  • Summary of resolved model(s), command used, and output file path.
  • +
+
+
+
+ +
+
+
+
+ +
+
+
+ /mad-add-model + command + model-author +
+
Scaffold a new MAD model — models.json entry, Dockerfile, and run.sh with the performance line.
+
+
+
+

Usage

+
copy/mad-add-model <framework_project_workload> [base notes / repo url]
+ +

Examples

+
copy/mad-add-model pyt_vllm_gemma-3-27b https://github.com/google/gemma
+/mad-add-model jax_maxtext_llama-4 bf16 inference, context=8192
+ +

What the agent does

+
    +
  1. Identifies the closest existing model of the same framework to use as template.
  2. +
  3. Creates three files: +
      +
    • models.json entry (required fields: name, url, dockerfile, scripts, n_gpus, owner, training_precision, tags)
    • +
    • docker/<name>.ubuntu.amd.Dockerfile with the required context header
    • +
    • scripts/<dir>/run.sh ending in echo "performance: $performance <unit>"
    • +
    +
  4. +
  5. Validates models.json with python3 -m json.tool models.json.
  6. +
  7. Confirms the entry is selectable: madengine discover --tags <name>.
  8. +
+ +

Dockerfile required header (first line)

+
copy# CONTEXT {'gpu_vendor': 'AMD', 'guest_os': 'UBUNTU'}
+ARG BASE_DOCKER=rocm/pytorch
+FROM $BASE_DOCKER
+ +

Expected output

+
    +
  • Three file paths created/edited.
  • +
  • The template model that was used.
  • +
  • The verification command: madengine run --tags <name> --live-output (requires GPU).
  • +
+
+
+
+ +
+
+
+
+ +
+
+
+ /mad-profile + command + benchmark-runner +
+
Run a madengine benchmark with a GPU profiling or tracing tool attached.
+
+
+
+

Usage

+
copy/mad-profile <tag-or-model> [tool: rocprofv3_compute|rpd|rccl_trace|...]
+ +

Examples

+
copy/mad-profile bert
+/mad-profile pyt_vllm_llama-3.1-8b rccl_trace
+/mad-profile deepseek rocprofv3_memory
+ +

What the agent does

+
    +
  1. Picks the profiling tool (defaults to rocprofv3_compute if unspecified).
  2. +
  3. Builds the command with --additional-context:
  4. +
+
copymadengine run --tags <tag> --live-output \
+  --additional-context '{"tools": [{"name": "rocprofv3_compute"}]}'
+ +

Available profiling tools (common subset)

+
+ + + + + + + + + + + + +
Tool nameWhat it captures
rocprofv3_computeKernel execution, occupancy, compute metrics
rocprofv3_memoryMemory bandwidth, cache hit rates
rocprofv3_communicationGPU-to-GPU communication metrics
rpdROCm profiling data, timeline view
rccl_traceCollective communication (AllReduce, etc.)
rocm_trace_liteLightweight system trace
gpu_info_power_profilerPower draw over time
rocblas_trace / hipblaslt_traceBLAS kernel calls
+
+

The complete authoritative list (23+ tools) is in the madengine package at scripts/common/tools.json.

+ +
+
Performance note
+

Profiling adds overhead. The perf number recorded under profiling is not a clean benchmark — use /mad-benchmark for baseline measurements.

+
+
+
+
+ +
+
+
+
+ +
+
+
+ /mad-report + command + perf-analyst +
+
Parse and summarize MAD benchmark results from perf.csv or perf_entry_super.json. Compare two result sets to find regressions.
+
+
+
+

Usage

+
copy/mad-report [csv-or-json-path] [compare-to-path]
+ +

Examples

+
copy/mad-report                               # defaults to perf.csv
+/mad-report results/after.csv
+/mad-report perf_new.csv perf_baseline.csv  # comparison mode
+ +

What the agent does

+
    +
  • Locates the result file(s) — defaults to perf.csv.
  • +
  • Single file: summarizes models run, performance + unit, pass/fail status.
  • +
  • Two files: per-model delta and % change; flags regressions (slower) vs improvements; respects each row's metric unit.
  • +
  • Leads with the headline finding, then a compact table.
  • +
+ +

Generate an HTML dashboard instead

+
copymadengine report to-html --csv-file-path perf.csv
+ +

perf.csv columns

+
+ + + + + + + + + + + + + +
ColumnMeaning
modelModel name from models.json
performanceNumeric result parsed from stdout
metricUnit (tokens_per_second, samples_per_second, …)
statusSUCCESS or failure reason
n_gpusNumber of GPUs used
training_precisionfp16, bf16, fp8, fp32, …
gpu_architecturegfx942, gfx950, …
git_commitCommit hash at run time
build_duration / test_durationWall time in seconds
+
+
+
+
+ +
+
+
+
+ +
+
+
+ /mad-tune + command + tuner +
+
Iteratively tune a MAD model for better performance using a disciplined measure-change-measure loop.
+
+
+
+

Usage

+
copy/mad-tune <tag-or-model> [target: throughput|latency] [lever hints]
+ +

Examples

+
copy/mad-tune bert
+/mad-tune pyt_vllm_llama-3.1-8b throughput
+/mad-tune deepseek latency try tensor-parallel=4
+ +

What the agent does

+
    +
  1. Reads the model's run.sh, config files, and current perf.csv row to establish a baseline.
  2. +
  3. Identifies tuning levers for the stack: +
      +
    • Env vars: MAD_MODEL_BATCH_SIZE, HIP_VISIBLE_DEVICES, NCCL_*/RCCL_*, PYTORCH_TUNABLEOP_ENABLED
    • +
    • Script/config args: tensor-parallel size, precision (fp16/bf16/fp8), sequence length, gpu-memory-utilization, max-num-seqs
    • +
    +
  4. +
  5. Proposes one change at a time, applies it, re-measures, keeps improvements, reverts regressions.
  6. +
  7. No GPU: produces a ranked list of candidate changes with rationale and the exact madengine run commands to test each.
  8. +
+ +

Expected output

+
    +
  • Baseline performance value + unit.
  • +
  • Table of each change tried with measured delta.
  • +
  • Final recommended configuration with before/after numbers.
  • +
+
+
+
+ +
+
+
+
+ +
+
+
+ /mad-validate + command + GPU-free +
+
Statically validate MAD model entries — JSON, paths, Dockerfile header, output contract. No GPU required.
+
+
+
+

Usage

+
copy/mad-validate [tag-or-model | all]
+ +

Examples

+
copy/mad-validate                    # checks all 142 entries
+/mad-validate bert               # checks only models matching "bert"
+/mad-validate pyt_vllm_deepseek-r1
+ +

Checks performed

+ +

Errors (break a run)

+
    +
  • models.json parses as valid JSON.
  • +
  • All entries have a name field and names are unique.
  • +
  • Dockerfile exists at <dockerfile>.ubuntu.amd.Dockerfile and first line is the context header.
  • +
  • Scripts path exists (a run.sh or directory).
  • +
  • Output contract satisfied: run script contains a performance: line, or entry sets multiple_results.
  • +
+ +

Warnings (convention metadata, non-fatal)

+
    +
  • Missing any of: url, owner, training_precision, tags, n_gpus.
  • +
+ +
+
Workflow tip
+

Run /mad-validate <name> immediately after /mad-add-model and before any GPU run. It catches tag typos, missing files, and missing performance lines in seconds.

+
+
+
+
+ + +
+

6 Workflow Reference

+ +
+
+
+ +
+
+
+ mad-benchmark-sweep + workflow +
+
Fan out a benchmark matrix across multiple MAD tags, run (or plan) each in parallel, and synthesize a comparison table. Accepts the same CLI-style flags as /mad-benchmark.
+
+
+
+

When to use

+

When you want to compare performance across multiple model tags or GPU counts in one operation.

+ +

Phases

+
+
+
Phase 1
+
Resolve
+
Expand the sweep matrix into concrete madengine invocations
+
+
+
Phase 2
+
Run
+
Execute or plan each benchmark in parallel subagents
+
+
+
Phase 3
+
Synthesize
+
Parse results into a comparison table with headline finding
+
+
+ +

Arguments

+

+ Accepts CLI-style flags — the same syntax as /mad-benchmark. + Pass one or more tags after --tags (comma- and/or space-separated); each tag + becomes one sweep cell. A structured JSON object is also still accepted. +

+
+ + + + + + + + + +
FlagMeaning
--tags a,b cOne or more model tags (comma and/or space separated). Required.
--additional-context '{...}'Threaded verbatim into every cell (HF token, docker_env_vars, deployment keys). A per-cell n_gpus is merged in when --ngpus is set.
--ngpus 4,8Optional GPU-count axis — multiplies the matrix (each tag × each count).
--timeout SOptional per-cell timeout in seconds.
--no-execute / --planValidate + resolve tags only, no GPU. Executes by default.
+
+ +

Example invocations

+
copy# Execute mode (default) — AMD GPU required, context threaded into every cell
+/mad-benchmark-sweep --tags bert,pyt_vllm_qwen3-8b --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}' --live-output
+
+# Plan mode — safe, no GPU required
+/mad-benchmark-sweep --tags bert pyt_vllm_deepseek-r1 --plan
+
+# GPU-count axis: each tag × each count
+/mad-benchmark-sweep --tags bert --ngpus 4,8
+ +
+
Executes by default
+

+ On a GPU host this workflow runs immediately. Pass --plan + (or --no-execute) to only validate commands and resolve tags. Each cell + still self-skips if no AMD GPU is present. +

+
+ +
+
perf.csv isolation
+

+ Each sweep cell writes to its own perf_<tag>.csv — never a shared + perf.csv. This prevents parallel runs from clobbering each other. + Concatenate the files afterward for a combined report. +

+
+ +
+
Precision is not a sweep axis
+

+ In MAD, precision (fp16/bf16/fp8) is fixed per model in + models.json — it is not a runtime flag. To compare precisions, + list the distinct model tags (e.g. pyt_vllm_llama-fp16 and + pyt_vllm_llama-fp8) in the tags array instead. +

+
+
+
+
+ + + + +
+

7 Agent Reference

+ +

+ Agents are specialized Claude instances with a defined tool set and a focused role. + They are dispatched automatically by slash commands and workflows — you generally never + call them directly. +

+ +
+
+
+ +
+
+
+ mad-benchmark-runner + agent +
+
Constructs and runs the correct madengine benchmark command from a model/tag or plain-English intent.
+
+
+
+

Tools: Bash, Read, Grep, Glob

+

Called by: /mad-benchmark, /mad-profile, mad-benchmark-sweep (workflow)

+
    +
  • Resolves tags with madengine discover (read-only, no GPU).
  • +
  • Builds the full madengine run command including profiling context if requested.
  • +
  • Checks for GPUs — plan mode if absent, execute if present.
  • +
  • Reports resolved models, exact command, required env vars, and output file path.
  • +
+
+
+ +
+
+
+ +
+
+
+ mad-model-author + agent +
+
Scaffolds a new MAD model — the models.json entry, the Dockerfile, and the run.sh with the performance line.
+
+
+
+

Tools: Read, Write, Edit, Grep, Glob

+

Called by: /mad-add-model

+
    +
  • Finds the closest existing model of the same framework to use as a template.
  • +
  • Creates all three required files following MAD conventions.
  • +
  • Validates models.json and confirms discoverability — never runs a GPU benchmark.
  • +
  • Reports file paths, template used, and the GPU verification command.
  • +
+
+
+ +
+
+
+ +
+
+
+ mad-perf-analyst + agent + read-only +
+
Reads and analyzes MAD benchmark results from perf.csv and perf_entry_super.json. Never edits files or runs workloads.
+
+
+
+

Tools: Read, Grep, Glob, Bash (read-only)

+

Called by: /mad-report

+
    +
  • Parses perf.csv, perf_entry_super.json, or any named CSV.
  • +
  • Summarizes or compares results; precise about units — never compares across different metric values.
  • +
  • States direction assumption (higher = better for throughput, lower = better for latency).
  • +
  • Uses python3 for parsing — never installs packages.
  • +
+
+
+ +
+
+
+ +
+
+
+ mad-tuner + agent +
+
Iteratively tunes an existing MAD model using a measure-change-measure loop. Reverts regressions automatically.
+
+
+
+

Tools: Read, Edit, Bash, Grep, Glob

+

Called by: /mad-tune, mad-tune-search (workflow)

+
    +
  • Establishes baseline from run.sh, config files, and current perf.csv.
  • +
  • Proposes and applies one lever change at a time — keeps improvements, reverts regressions.
  • +
  • Requires AMD GPUs to measure. Without GPUs, produces a ranked candidate list with exact commands.
  • +
  • Never alters the performance: <value> <unit> output contract.
  • +
+
+
+
+ + +
+

8 Environment Variables

+ +

Pass these environment variables to control model behavior and authentication.

+ +
+
+
MAD_SECRETS_HFTOKEN
+
HuggingFace API token. Required for gated models (e.g. Llama 3, Gemma). Set before running any HF model.
+
+
+
MAD_MODEL_BATCH_SIZE
+
Override the batch size for a model run. Primary tuning lever for throughput experiments.
+
+
+
MAD_RUNTIME_NGPUS
+
Override the number of GPUs used at runtime (also controllable via n_gpus in models.json).
+
+
+
MAD_SYSTEM_GPU_ARCHITECTURE
+
GPU architecture identifier (e.g. gfx942, gfx950). Used for arch-specific builds and skip_gpu_arch filtering.
+
+
+
MAD_MODEL_NAME
+
Model name passed into the container at runtime. Some scripts use this to select between model checkpoints.
+
+
+
PYTORCH_TUNABLEOP_ENABLED
+
Enable PyTorch TunableOp for auto-tuned GEMM kernels. A common throughput tuning lever for PyTorch models. First run is very slow (online GEMM tuning collapses throughput to ~0.2–1 tok/s); results are only meaningful on a warm second run with cached kernels. Set --timeout generously or do a throwaway warmup pass before measuring.
+
+
+
NCCL_* / RCCL_*
+
Collective communication tuning (chunk sizes, algorithms, socket buffers). Relevant for multi-GPU runs.
+
+
+
HIP_VISIBLE_DEVICES
+
Which GPU indices are visible to the container. Use to pin a run to specific GPUs.
+
+
+ +

Export pattern (shell)

+
copyexport MAD_SECRETS_HFTOKEN=hf_...
+export MAD_MODEL_BATCH_SIZE=32
+madengine run --tags pyt_vllm_llama-3.1-8b --live-output
+
+ + +
+

9 Tips and Gotchas

+ +
+ GPU guard — plan mode vs execute mode +
+

+ All agents check for AMD GPUs (rocm-smi or amd-smi) before + attempting to build or run. On a machine without GPUs: +

+
    +
  • The agent prints the exact madengine command to run on a GPU host.
  • +
  • It lists required env vars (e.g. MAD_SECRETS_HFTOKEN).
  • +
  • It stops — it does not attempt to run the command.
  • +
+

+ GPU-free operations always work: madengine discover, /mad-validate, + and plan mode for benchmarks and tuning. +

+
copy# GPU-free: confirm a tag resolves
+madengine discover --tags bert
+
+
+ +
+ Precision is not a runtime flag +
+

+ In MAD, training_precision (fp16, bf16, fp8, fp32) is fixed per model in + models.json. It is baked into the Dockerfile and run script — you cannot + change it with a madengine flag at runtime. +

+

+ To compare precisions in a sweep, use separate model tags that differ + in precision. For example: pyt_vllm_llama-fp16 and + pyt_vllm_llama-fp8 as two entries in the tags array of + mad-benchmark-sweep. +

+
+
+ +
+ perf.csv isolation in parallel sweeps +
+

+ The mad-benchmark-sweep workflow gives each cell its own output file + (perf_<tag>.csv) to prevent parallel runs from clobbering a shared + perf.csv. After a sweep, concatenate for a combined view: +

+
copy# Merge sweep results (keep one header row)
+head -1 perf_bert.csv > perf_combined.csv
+tail -n +2 perf_bert.csv perf_deepseek.csv >> perf_combined.csv
+madengine report to-html --csv-file-path perf_combined.csv
+
+
+ +
+ madengine discover — GPU-free tag validation +
+

+ Before any GPU run, confirm that your tags resolve to real models. + madengine discover reads models.json without touching Docker + or GPUs. +

+
copymadengine discover --tags pyt_vllm_deepseek-r1
+madengine discover --tags vllm   # matches all tags containing "vllm"
+

+ The tags field in a model entry supports any-match: madengine + selects a model if --tags matches the exact name OR any + string in the tags list. +

+
+
+ +
+ TunableOp needs a warmup — don't measure cold +
+

+ PYTORCH_TUNABLEOP_ENABLED=1 triggers online GEMM auto-tuning on + the first run. During this phase throughput collapses to roughly + 0.2–1 tok/s as PyTorch benchmarks every matrix-multiply shape. The actual improvement + only shows on subsequent runs using the cached kernel selection. +

+

+ Measuring TunableOp cold produces a misleading (and huge) regression that the verify + step will correctly mark untrustworthy — but only after burning 30+ minutes of GPU time. + Best practice: +

+
    +
  • Run a throwaway warmup pass first (--skip-model-run then one full run) to populate the cache.
  • +
  • Always set --timeout so a single candidate cannot stall the whole tune-search.
  • +
  • The profiling bottleneck data guides whether TunableOp is even worth trying — only relevant for compute-bound (GEMM-dominant) workloads.
  • +
+
+
+ +
+ Qwen3-8B live tuning findings (MI350X / gfx950) +
+

+ From a live profiling + tuning run on 8× AMD Instinct MI350X: +

+

Profiling result (rocm_trace_lite, 397,292 GPU ops, 8.6% GPU utilization):

+
+ + + + + + + + +
Kernel% GPU timeImplication
Cijk_Alik_Bljk_BBS_BH_Bias_HA… (GEMM)27.1%Compute-bound — GEMM/wvSplitK dominate → target with TunableOp, hipBLASLt, higher concurrency
wvSplitK20.6%
_fwd_kernel (attention)8.3%Memory-bound secondary; paged attention
__amd_rocclr_copyBuffer7.0%Data movement overhead
+
+

Best candidate resultmax_concurrency=32 (vs baseline concurrency=1):

+
+ + + + + + +
MetricBaselineTunedDelta
throughput_tot~422 tok/s7,993 tok/s+18.9×
throughput_gen~211 tok/s3,996 tok/s+18.9×
+
+

+ Baseline was run at max_concurrency=1 (10 prompts) — a latency-optimized + config. Raising concurrency is the primary throughput lever for vLLM serving workloads. +

+
+
+ +
+ tools.json — authoritative profiler name list +
+

+ The /mad-profile command accepts any profiler tool name. The complete + authoritative list (23+ tools) lives in the madengine package at + scripts/common/tools.json. Common names include: +

+
copyrocprofv3_compute    rocprofv3_memory     rocprofv3_communication
+rocprofv3_full       rocprofv3            rpd
+rocm_trace_lite      rccl_trace           gpu_info_power_profiler
+rocblas_trace        hipblaslt_trace      miopen_trace
+rocprof_sys
+

+ If an agent gives you an unknown tool error, consult tools.json directly + for the exact name. +

+
+
+ +
+ Deployment targets — local Docker vs SLURM vs Kubernetes +
+

+ madengine infers the deployment target from which key is present in + --additional-context. No special flag needed. +

+
copy# Local Docker (default — no key)
+madengine run --tags bert --live-output
+
+# SLURM (presence of "slurm" key)
+madengine run --tags bert -c '{"slurm": {"partition": "gpu", "nodes": 2}}'
+
+# Kubernetes (presence of "k8s" or "kubernetes" key)
+madengine run --tags bert -c '{"k8s": {"namespace": "ml-jobs"}}'
+
+
+ +
+ Build-once, run-many pattern +
+

+ When running the same model many times (e.g. during tuning), build the image once and + reuse it to save time. +

+
copy# Step 1: build only, writes build_manifest.json
+madengine build --tags pyt_vllm_llama-3.1-8b
+
+# Step 2: run from manifest (skips rebuild)
+madengine run -m build_manifest.json --live-output
+
+# Alternatively: skip the model run but verify image builds
+madengine run --tags pyt_vllm_llama-3.1-8b --skip-model-run
+
+
+ +
+ models.json naming convention +
+

+ Model names follow {framework}_{project}_{workload}. Examples: +

+
+ + + + + + + + +
NameFrameworkProjectWorkload
pyt_vllm_deepseek-r1pyt (PyTorch)vllmdeepseek-r1
jax_maxtext_llama-4jaxmaxtextllama-4
pyt_primus_qwen3-1.7bpytprimusqwen3-1.7b
huggingface_berthuggingfacebert
+
+

+ The dockerfile field stores the path prefix — madengine automatically + appends .ubuntu.amd.Dockerfile. +

+
+
+ +
+ Uploading results to MongoDB +
+

Use madengine database to push results to MongoDB after a run.

+
copy# Dry run first to preview
+madengine database -f perf_entry_super.json --db mad_results -c benchmarks \
+  -k model,timestamp --dry-run
+
+# Upload for real
+madengine database -f perf_entry_super.json --db mad_results -c benchmarks \
+  -k model,timestamp
+

+ perf_entry_super.json is the enriched record (includes a configs + block) that gets pushed. -k model,timestamp sets the unique key for upsert. +

+
+
+
+ +
+ + + + + \ No newline at end of file