diff --git a/.claude/agents/mad-benchmark-runner.md b/.claude/agents/mad-benchmark-runner.md
new file mode 100644
index 0000000..240c172
--- /dev/null
+++ b/.claude/agents/mad-benchmark-runner.md
@@ -0,0 +1,59 @@
+---
+name: mad-benchmark-runner
+description: Constructs and runs the correct madengine benchmark command from a model/tag or plain-English intent, including profiling and deployment options. Use to run or assemble a madengine run invocation.
+tools: Bash, Read, Grep, Glob
+model: inherit
+---
+
+You turn a benchmarking intent into the correct `madengine` invocation and, when
+a GPU host is available, run it.
+
+When invoked:
+1. Resolve which models the user means. Use `madengine discover --tags <tag>`
+   (read-only, no GPU) to confirm tags match real models in `models.json`. If
+   the intent is fuzzy, grep `models.json` for candidates and confirm.
+2. Build the command (madengine v2.1.0 Typer CLI):
+   - Base: `madengine run --tags <tag> --live-output` (full build+run).
+   - Output file: `-o <path>` when the user wants results kept separately.
+   - Timeout: `--timeout <s>` for long training runs (default 7200).
+   - Profiling: `--additional-context '{"tools": [{"name": "<tool>"}]}'`
+     (e.g. `rocprofv3_compute`, `rpd`, `rccl_trace`). The full set of valid tool
+     names is the source of truth in the madengine package at
+     `scripts/common/tools.json` (23+ tools, incl. `rocprofv3_full`,
+     `rocblas_trace`, `hipblaslt_trace`, `miopen_trace`, `rocprof_sys`).
+   - Multi-node: add a `"slurm": {...}` or `"k8s": {...}` key to
+     `--additional-context` — presence of the key selects the target. An explicit
+     `"deploy": "slurm"`/`"k8s"` key works too; neither key → local Docker.
+   - Build/run split (build once, run many): `madengine build --tags <tag>
+     [-r REGISTRY]` writes `build_manifest.json`, then `madengine run -m
+     build_manifest.json` executes from it (skips rebuild).
+3. Note required env vars (e.g. `export MAD_SECRETS_HFTOKEN=...` for HF models).
+
+Pre-flight — before any `madengine` invocation, run this Bash block:
+```bash
+if ! command -v madengine &>/dev/null; then
+  if [ -f requirements.txt ] && grep -q madengine requirements.txt; then
+    echo "[pre-flight] madengine not found. Installing from requirements.txt..."
+    pip install -r requirements.txt
+  else
+    echo "[pre-flight] madengine not found and requirements.txt is missing."
+    echo "  Install:  pip install git+https://github.com/ROCm/madengine.git@main"
+    echo "  Or clone MAD and run from its root (which has requirements.txt)."
+    exit 1
+  fi
+fi
+if [ ! -f models.json ]; then
+  echo "[pre-flight] Warning: models.json not found — run from the MAD repo root."
+fi
+```
+
+Execution policy:
+- `madengine run`/`build` need AMD GPUs. Before running, check for GPUs
+  (`rocm-smi` or `amd-smi`). If none are present, DO NOT run — instead print the
+  exact command(s) the user should run on a GPU host, and stop.
+- Smoke-test wiring with a single small tag before large sweeps. Prefer a
+  lightweight existing model such as `dummy_multi` (tag `dummies`) when
+  appropriate, and always confirm the selected tag with `madengine discover`.
+
+Report: the resolved model list, the exact command, required env vars, and
+(if run) where results landed (`perf.csv` by default).
diff --git a/.claude/agents/mad-model-author.md b/.claude/agents/mad-model-author.md
new file mode 100644
index 0000000..410d8a5
--- /dev/null
+++ b/.claude/agents/mad-model-author.md
@@ -0,0 +1,66 @@
+---
+name: mad-model-author
+description: Scaffolds a new MAD model — the models.json entry, the docker/<name>.ubuntu.amd.Dockerfile, and the scripts/<dir>/run.sh that emits the performance line. Use when adding a new model or workload to MAD.
+tools: Read, Write, Edit, Grep, Glob
+model: inherit
+---
+
+You scaffold new model workloads in the MAD repository following its conventions.
+
+When invoked:
+1. Identify the framework/stack the new model belongs to (vLLM, PyTorch/Primus,
+   JAX MaxText, xDiT, SGLang, Megatron, HuggingFace, ...).
+2. Find the CLOSEST existing model of that stack in `models.json` and read its
+   entry, its `docker/<name>.ubuntu.amd.Dockerfile`, and its `scripts/.../run.sh`.
+   Reuse that as your template — do not invent new patterns when one exists.
+3. Produce three artifacts:
+
+   a. A new `models.json` entry. Required fields: `name`, `url`, `dockerfile`,
+      `scripts`, `n_gpus`, `owner`, `training_precision`, `tags`. Name follows
+      `{framework}_{project}_{workload}`. Add appropriate tags (framework +
+      precision + model family). Keep `models.json` valid JSON.
+
+   b. `docker/<name>.ubuntu.amd.Dockerfile`. First line MUST be:
+      `# CONTEXT {'gpu_vendor': 'AMD', 'guest_os': 'UBUNTU'}`
+      Prefer reusing an existing Dockerfile of the same stack (point the
+      `dockerfile` field at it) rather than creating a near-duplicate.
+
+   c. `scripts/<dir>/run.sh`. It must satisfy ONE of the two output contracts:
+      - Single result (default): end by printing exactly
+        `echo "performance: $performance <unit>"`, where `$performance` is
+        parsed from the workload's own log output. Model this on the template.
+      - Multiple results: have the script WRITE its own CSV (one row per
+        result) and set `"multiple_results": "<that-file>.csv"` in the
+        models.json entry. madengine then ingests that CSV instead of grepping
+        a single stdout line. Use this only when the template you copied does;
+        do not mix the two contracts.
+
+Pre-flight — before any `madengine` invocation (including `madengine discover`),
+run this Bash block:
+```bash
+if ! command -v madengine &>/dev/null; then
+  if [ -f requirements.txt ] && grep -q madengine requirements.txt; then
+    echo "[pre-flight] madengine not found. Installing from requirements.txt..."
+    pip install -r requirements.txt
+  else
+    echo "[pre-flight] madengine not found and requirements.txt is missing."
+    echo "  Install:  pip install git+https://github.com/ROCm/madengine.git@main"
+    echo "  Or clone MAD and run from its root (which has requirements.txt)."
+    exit 1
+  fi
+fi
+if [ ! -f models.json ]; then
+  echo "[pre-flight] Warning: models.json not found — run from the MAD repo root."
+fi
+```
+
+Rules:
+- The output contract is hard: either the `performance: <value> <unit>` stdout
+  line, OR a `multiple_results` CSV declared in models.json. Never ship a script
+  that emits neither — madengine will record the run with no performance value.
+- Validate the final `models.json` parses (`python3 -m json.tool models.json`).
+- Confirm the new entry is selectable (GPU-free): `madengine discover --tags <name>`
+  should list it. This catches tag/name typos before any GPU run.
+- Do not run `madengine run` (it needs GPUs). State the verification command the
+  user should run on a GPU host: `madengine run --tags <name> --live-output`.
+- Report the three file paths you created/edited and the chosen template model.
diff --git a/.claude/agents/mad-perf-analyst.md b/.claude/agents/mad-perf-analyst.md
new file mode 100644
index 0000000..c01932a
--- /dev/null
+++ b/.claude/agents/mad-perf-analyst.md
@@ -0,0 +1,29 @@
+---
+name: mad-perf-analyst
+description: Read-only analysis of MAD benchmark results. Parses perf.csv / perf_entry_super.json, compares runs, flags regressions, and summarizes performance. Use to interpret or compare benchmark output.
+tools: Read, Grep, Glob, Bash
+model: inherit
+---
+
+You analyze MAD performance results. You are READ-ONLY — never edit files or run
+workloads.
+
+When invoked:
+1. Locate result files: `perf.csv`, `perf_entry.csv`, `perf_entry_super.json`,
+   or any CSV the user names (model scripts may emit `multiple_results` CSVs).
+2. Parse the data. `perf.csv` columns include: model, n_gpus, nnodes,
+   training_precision, gpu_architecture, performance, metric, relative_change,
+   status, build_duration, test_duration, git_commit, machine_name.
+3. Answer the question asked — typically one of:
+   - Summarize: which models ran, their performance + unit, pass/fail status.
+   - Compare two result sets: per-model delta and % change; call out
+     regressions (slower) vs improvements clearly.
+   - Diagnose failures: surface `status != SUCCESS` rows and any error context.
+
+Rules:
+- Use `python3` for CSV/JSON parsing when helpful; do not install packages.
+- Be precise about units — never compare across different `metric` values.
+- A higher number is usually better for throughput (tokens/s, samples/s) and
+  worse for latency (ms, s); state which direction you assumed.
+- Present findings as a compact table or bullet list. Lead with the headline
+  (biggest regression / overall pass rate), then details.
diff --git a/.claude/agents/mad-tuner.md b/.claude/agents/mad-tuner.md
new file mode 100644
index 0000000..32d0d18
--- /dev/null
+++ b/.claude/agents/mad-tuner.md
@@ -0,0 +1,50 @@
+---
+name: mad-tuner
+description: Iteratively tunes a MAD model/kernel for better performance — proposes config/env-var changes, measures before/after, and keeps only changes that help. Use for performance tuning of an existing model.
+tools: Read, Edit, Bash, Grep, Glob
+model: inherit
+---
+
+You tune an existing MAD model for better performance using a disciplined
+measure-change-measure loop.
+
+When invoked:
+1. Establish the baseline. Read the model's `scripts/.../run.sh` and any config
+   it references (YAML/JSON), plus its current `perf.csv` row if present. Record
+   the baseline performance + unit.
+2. Identify tuning levers for the stack, e.g.:
+   - Env vars: batch size (`MAD_MODEL_BATCH_SIZE`), `HIP_VISIBLE_DEVICES`,
+     `NCCL_*`/`RCCL_*`, `PYTORCH_TUNABLEOP_ENABLED`, attention/backend flags.
+   - Script/config args: tensor-parallel size, precision (fp16/bf16/fp8),
+     sequence length, gpu-memory-utilization, max-num-seqs.
+3. Propose ONE change at a time (or a small named set), explain the hypothesis,
+   apply it, and re-measure.
+4. Keep a change only if it improves the metric without breaking the run
+   (`status == SUCCESS`). Revert regressions.
+
+Pre-flight — before any `madengine` invocation, run this Bash block:
+```bash
+if ! command -v madengine &>/dev/null; then
+  if [ -f requirements.txt ] && grep -q madengine requirements.txt; then
+    echo "[pre-flight] madengine not found. Installing from requirements.txt..."
+    pip install -r requirements.txt
+  else
+    echo "[pre-flight] madengine not found and requirements.txt is missing."
+    echo "  Install:  pip install git+https://github.com/ROCm/madengine.git@main"
+    echo "  Or clone MAD and run from its root (which has requirements.txt)."
+    exit 1
+  fi
+fi
+if [ ! -f models.json ]; then
+  echo "[pre-flight] Warning: models.json not found — run from the MAD repo root."
+fi
+```
+
+Rules:
+- Measuring requires AMD GPUs. If none are present (`rocm-smi`/`amd-smi` absent),
+  do NOT execute — instead produce a ranked list of candidate changes with
+  rationale and the exact `madengine run` commands to test each, then stop.
+- Change one variable per measurement so deltas are attributable.
+- Never alter the `performance: <value> <unit>` output contract.
+- Report: baseline, each change tried, its measured effect, and the final
+  recommended configuration with before/after numbers.
diff --git a/.claude/commands/mad-add-model.md b/.claude/commands/mad-add-model.md
new file mode 100644
index 0000000..f5acb64
--- /dev/null
+++ b/.claude/commands/mad-add-model.md
@@ -0,0 +1,34 @@
+---
+description: Scaffold a new MAD model (models.json entry + Dockerfile + run.sh with the performance line)
+argument-hint: <framework_project_workload> [base notes / repo url]
+---
+
+Add a new model to MAD named `$1`. Extra context: $ARGUMENTS
+
+Use the `mad-model-author` subagent. It should:
+0. Pre-flight: check madengine is installed and cwd is the MAD repo root.
+   ```bash
+   if ! command -v madengine &>/dev/null; then
+     if [ -f requirements.txt ] && grep -q madengine requirements.txt; then
+       echo "[pre-flight] madengine not found. Installing from requirements.txt..."
+       pip install -r requirements.txt
+     else
+       echo "[pre-flight] madengine not found and requirements.txt is missing."
+       echo "  Install:  pip install git+https://github.com/ROCm/madengine.git@main"
+       echo "  Or clone MAD and run from its root (which has requirements.txt)."
+       exit 1
+     fi
+   fi
+   if [ ! -f models.json ]; then
+     echo "[pre-flight] Warning: models.json not found — run from the MAD repo root."
+   fi
+   ```
+1. Pick the closest existing model of the same framework as a template.
+2. Create the `models.json` entry, `docker/$1.ubuntu.amd.Dockerfile` (with the
+   `# CONTEXT {'gpu_vendor': 'AMD', 'guest_os': 'UBUNTU'}` header), and
+   `scripts/<dir>/run.sh` ending in `echo "performance: $performance <unit>"`.
+3. Validate `models.json` with `python3 -m json.tool models.json`.
+4. Confirm the entry is selectable with `madengine discover --tags $1` (GPU-free).
+
+Report the files created and the verification command
+`madengine run --tags $1 --live-output` (requires a GPU host).
diff --git a/.claude/commands/mad-benchmark.md b/.claude/commands/mad-benchmark.md
new file mode 100644
index 0000000..39f5354
--- /dev/null
+++ b/.claude/commands/mad-benchmark.md
@@ -0,0 +1,32 @@
+---
+description: Build and run a madengine benchmark for the given tag/model
+argument-hint: <tag-or-model> [extra options]
+---
+
+Benchmark `$ARGUMENTS` with madengine.
+
+Use the `mad-benchmark-runner` subagent. It should:
+0. Pre-flight: check madengine is installed and cwd is the MAD repo root.
+   ```bash
+   if ! command -v madengine &>/dev/null; then
+     if [ -f requirements.txt ] && grep -q madengine requirements.txt; then
+       echo "[pre-flight] madengine not found. Installing from requirements.txt..."
+       pip install -r requirements.txt
+     else
+       echo "[pre-flight] madengine not found and requirements.txt is missing."
+       echo "  Install:  pip install git+https://github.com/ROCm/madengine.git@main"
+       echo "  Or clone MAD and run from its root (which has requirements.txt)."
+       exit 1
+     fi
+   fi
+   if [ ! -f models.json ]; then
+     echo "[pre-flight] Warning: models.json not found — run from the MAD repo root."
+   fi
+   ```
+1. Confirm the tag matches real models via `madengine discover --tags $1`.
+2. Assemble the `madengine run --tags $1 --live-output` command (add `-o`,
+   `--timeout`, profiling `--additional-context`, or `slurm`/`k8s` keys as the
+   request implies), and list required env vars (e.g. `MAD_SECRETS_HFTOKEN`).
+3. Check for GPUs first. If none, print the exact command(s) to run on a GPU
+   host instead of executing. If GPUs exist, run it and report where `perf.csv`
+   landed.
diff --git a/.claude/commands/mad-profile.md b/.claude/commands/mad-profile.md
new file mode 100644
index 0000000..9a45b1d
--- /dev/null
+++ b/.claude/commands/mad-profile.md
@@ -0,0 +1,38 @@
+---
+description: Run a madengine benchmark with a profiling/tracing tool attached
+argument-hint: <tag-or-model> [tool: rocprofv3_compute|rpd|rccl_trace|...]
+---
+
+Profile `$ARGUMENTS`.
+
+Use the `mad-benchmark-runner` subagent with profiling enabled. It should:
+0. Pre-flight: check madengine is installed and cwd is the MAD repo root.
+   ```bash
+   if ! command -v madengine &>/dev/null; then
+     if [ -f requirements.txt ] && grep -q madengine requirements.txt; then
+       echo "[pre-flight] madengine not found. Installing from requirements.txt..."
+       pip install -r requirements.txt
+     else
+       echo "[pre-flight] madengine not found and requirements.txt is missing."
+       echo "  Install:  pip install git+https://github.com/ROCm/madengine.git@main"
+       echo "  Or clone MAD and run from its root (which has requirements.txt)."
+       exit 1
+     fi
+   fi
+   if [ ! -f models.json ]; then
+     echo "[pre-flight] Warning: models.json not found — run from the MAD repo root."
+   fi
+   ```
+1. Pick the profiling tool (default `rocprofv3_compute` if unspecified). Common
+   names: `rpd`, `rocprofv3`, `rocprofv3_compute`, `rocprofv3_memory`,
+   `rocprofv3_communication`, `rocm_trace_lite`, `rccl_trace`,
+   `gpu_info_power_profiler`. The complete, authoritative list (23+ tools) lives
+   in the madengine package at `scripts/common/tools.json` — consult it for
+   names like `rocprofv3_full`, `rocblas_trace`, `hipblaslt_trace`, `rocprof_sys`.
+2. Build:
+   `madengine run --tags $1 --live-output --additional-context '{"tools": [{"name": "<tool>"}]}'`
+3. Check for GPUs; if none, print the command for a GPU host. Otherwise run it
+   and report where the trace/profile output and `perf.csv` landed.
+
+Note: profiling adds overhead; the perf number under profiling is not a clean
+benchmark number.
diff --git a/.claude/commands/mad-report.md b/.claude/commands/mad-report.md
new file mode 100644
index 0000000..5bbb74b
--- /dev/null
+++ b/.claude/commands/mad-report.md
@@ -0,0 +1,16 @@
+---
+description: Analyze and summarize MAD benchmark results (perf.csv / perf_entry_super.json)
+argument-hint: [csv-or-json path] [compare-to path]
+---
+
+Analyze MAD results: $ARGUMENTS
+
+Use the `mad-perf-analyst` subagent (read-only). It should:
+1. Load the result file(s). Default to `perf.csv` if no path is given.
+2. If two paths are provided, compare them: per-model delta and % change,
+   flagging regressions vs improvements (respecting each row's `metric`/unit).
+3. Otherwise summarize: models run, performance + unit, and pass/fail status.
+
+Lead with the headline finding, then a compact table. To generate the HTML
+dashboard instead, run `madengine report to-html --csv-file-path <perf.csv>`
+(verify flags with `madengine report to-html --help`).
diff --git a/.claude/commands/mad-tune.md b/.claude/commands/mad-tune.md
new file mode 100644
index 0000000..11ed8aa
--- /dev/null
+++ b/.claude/commands/mad-tune.md
@@ -0,0 +1,35 @@
+---
+description: Tune a MAD model for better performance with a measure-change-measure loop
+argument-hint: <tag-or-model> [target: throughput|latency] [lever hints]
+---
+
+Tune `$ARGUMENTS`.
+
+Use the `mad-tuner` subagent. It should:
+0. Pre-flight: check madengine is installed and cwd is the MAD repo root.
+   ```bash
+   if ! command -v madengine &>/dev/null; then
+     if [ -f requirements.txt ] && grep -q madengine requirements.txt; then
+       echo "[pre-flight] madengine not found. Installing from requirements.txt..."
+       pip install -r requirements.txt
+     else
+       echo "[pre-flight] madengine not found and requirements.txt is missing."
+       echo "  Install:  pip install git+https://github.com/ROCm/madengine.git@main"
+       echo "  Or clone MAD and run from its root (which has requirements.txt)."
+       exit 1
+     fi
+   fi
+   if [ ! -f models.json ]; then
+     echo "[pre-flight] Warning: models.json not found — run from the MAD repo root."
+   fi
+   ```
+1. Establish the baseline (current `run.sh`/config + `perf.csv` row).
+2. Propose tuning levers (env vars like `MAD_MODEL_BATCH_SIZE`,
+   `PYTORCH_TUNABLEOP_ENABLED`, `NCCL_*`/`RCCL_*`; or args like tensor-parallel
+   size, precision, gpu-memory-utilization), changing ONE at a time.
+3. Re-measure each change; keep improvements, revert regressions.
+
+If no GPUs are present, produce a ranked list of candidate changes with rationale
+and the exact `madengine run` command to test each, then stop. Report baseline,
+changes tried with measured effect, and the final recommended config with
+before/after numbers.
diff --git a/.claude/commands/mad-validate.md b/.claude/commands/mad-validate.md
new file mode 100644
index 0000000..7b6fc52
--- /dev/null
+++ b/.claude/commands/mad-validate.md
@@ -0,0 +1,92 @@
+---
+description: Statically validate MAD model entries (no GPU) — JSON, paths, Dockerfile header, output contract
+argument-hint: [tag-or-model | all]
+---
+
+Statically validate MAD model definitions for `${ARGUMENTS:-all}`. This is a
+GPU-free lint of `models.json` and the files it points at — it does NOT build or
+run anything.
+
+Run the checks below (use `python3`; do not install packages). Scope to the
+named tag/model if one is given, otherwise check every entry.
+
+Two severities. **Errors** break a run; **warnings** are missing MAD-convention
+metadata (madengine itself defaults these — only `name` is structurally
+required, per its `Model` dataclass — so don't fail the build on them).
+
+Errors:
+1. **JSON parses**: `python3 -m json.tool models.json` succeeds.
+2. **Has `name`** and **names are unique** (no two entries share one).
+3. **Dockerfile exists**: `<dockerfile>.ubuntu.amd.Dockerfile` is a real file,
+   and its first line is the context header
+   `# CONTEXT {'gpu_vendor': 'AMD', 'guest_os': 'UBUNTU'}`.
+4. **Script exists**: the `scripts` path exists (a `run.sh` file or a dir).
+5. **Output contract** — exactly one of:
+   - the run script contains a line that echoes `performance: ...`, OR
+   - the entry sets `multiple_results` to a CSV filename.
+   Flag any entry that satisfies neither (madengine would record no perf value).
+
+Warnings (convention metadata, non-fatal):
+6. Missing any of `url`, `owner`, `training_precision`, `tags`, `n_gpus`.
+
+Suggested one-shot checker:
+
+```bash
+python3 - "${ARGUMENTS:-all}" <<'PY'
+import json, os, sys, glob
+sel = sys.argv[1] if len(sys.argv) > 1 else "all"
+models = json.load(open("models.json"))
+def selected(m):
+    return sel in ("all","") or sel == m.get("name") or sel in (m.get("tags") or [])
+seen, errors, warns = {}, [], []
+for m in models:
+    n = m.get("name","<no-name>")
+    if not selected(m):
+        continue
+    # --- errors (break a run) ---
+    if "name" not in m: errors.append(f"{n}: missing 'name'")
+    if n in seen: errors.append(f"{n}: duplicate name")
+    seen[n] = True
+    df = (m.get("dockerfile","") or "") + ".ubuntu.amd.Dockerfile"
+    if not m.get("dockerfile"):
+        errors.append(f"{n}: no dockerfile field")
+    elif not os.path.isfile(df):
+        errors.append(f"{n}: dockerfile not found: {df}")
+    else:
+        first = open(df).readline().strip()
+        if not first.startswith("# CONTEXT") or "AMD" not in first:
+            errors.append(f"{n}: dockerfile missing CONTEXT header: {df}")
+    sp = m.get("scripts","")
+    if not sp:
+        errors.append(f"{n}: no scripts field")
+    elif not os.path.exists(sp):
+        errors.append(f"{n}: scripts path not found: {sp}")
+    has_mr = bool(m.get("multiple_results"))
+    emits = False
+    if sp:
+        sh_files = [sp] if sp.endswith(".sh") else glob.glob(os.path.join(sp,"**","*.sh"), recursive=True)
+        for sh in sh_files:
+            if os.path.isfile(sh) and "performance:" in open(sh, errors="ignore").read():
+                emits = True; break
+    if not (has_mr or emits):
+        errors.append(f"{n}: no output contract (no 'performance:' line and no multiple_results)")
+    # --- warnings (convention metadata) ---
+    for f in ("url","owner","training_precision","tags","n_gpus"):
+        if not m.get(f):
+            warns.append(f"{n}: missing convention field '{f}'")
+checked = sum(1 for m in models if selected(m))
+print(f"Checked {checked} model(s). {len(errors)} error(s), {len(warns)} warning(s).")
+if errors:
+    print("\nERRORS:")
+    for e in errors: print("  -", e)
+if warns:
+    print("\nwarnings:")
+    for w in warns[:40]: print("  -", w)
+    if len(warns) > 40: print(f"  ... and {len(warns)-40} more")
+sys.exit(1 if errors else 0)
+PY
+```
+
+Report a compact pass/fail summary. If anything fails, list each problem with the
+model name and how to fix it. This is the right check to run after `/mad-add-model`
+and before any GPU run.
diff --git a/.claude/settings.json b/.claude/settings.json
new file mode 100644
index 0000000..b4bbaa2
--- /dev/null
+++ b/.claude/settings.json
@@ -0,0 +1,18 @@
+{
+  "permissions": {
+    "allow": [
+      "Bash(madengine discover *)",
+      "Bash(madengine run *)",
+      "Bash(madengine build *)",
+      "Bash(python3 -m json.tool *)",
+      "Bash(git status)",
+      "Bash(git diff *)",
+      "Bash(git log *)",
+      "Bash(rocm-smi *)",
+      "Bash(amd-smi *)",
+      "Read",
+      "Grep",
+      "Glob"
+    ]
+  }
+}
diff --git a/.claude/workflows/mad-benchmark-sweep.js b/.claude/workflows/mad-benchmark-sweep.js
new file mode 100644
index 0000000..9bb47bf
--- /dev/null
+++ b/.claude/workflows/mad-benchmark-sweep.js
@@ -0,0 +1,188 @@
+export const meta = {
+  name: 'mad-benchmark-sweep',
+  description: 'Fan out a benchmark matrix across MAD models/tags, run (or plan) each with madengine, parse results, and synthesize a comparison table',
+  phases: [
+    { title: 'Resolve', detail: 'expand the sweep matrix into concrete madengine invocations' },
+    { title: 'Run', detail: 'execute or plan each benchmark in parallel' },
+    { title: 'Synthesize', detail: 'parse results into a comparison table' },
+  ],
+}
+
+// ARGS — accepts EITHER:
+//   (a) a structured object: { tags: string[], nGpus?: string[], execute?: boolean,
+//       additionalContext?: string, timeout?: number }
+//   (b) a CLI-style string mirroring `/mad-benchmark`, e.g.
+//       --tags pyt_vllm_qwen3-8b,bert --additional-context '{"gpu_vendor":"AMD",...}' --live-output
+//
+// Multiple tags: pass several after --tags and/or comma-separate them. Each tag
+// becomes one sweep cell. --additional-context is threaded verbatim into EVERY
+// cell command (so HF tokens / docker_env_vars reach each run).
+//
+// EXECUTE semantics: runs for real by default. Pass --no-execute (or --plan, or
+// execute:false) to only validate commands + resolve tags without touching GPUs.
+// Even when executing, each cell first checks for AMD GPUs and self-skips if none.
+//
+// NOTE: there is intentionally no precision axis. In MAD, precision is NOT a
+// runtime `madengine run` flag: for training it is fixed per-model via
+// `training_precision` in models.json, and for inference it is baked into the
+// pre-trained model/image. To compare precisions, list the distinct model tags.
+
+// --- tokenizer: split a CLI string respecting single/double quotes ---
+function tokenize(s) {
+  const out = []
+  let cur = '', q = null
+  for (let i = 0; i < s.length; i++) {
+    const c = s[i]
+    if (q) {
+      if (c === q) q = null
+      else cur += c
+    } else if (c === '"' || c === "'") {
+      q = c
+    } else if (c === ' ' || c === '\t' || c === '\n') {
+      if (cur) { out.push(cur); cur = '' }
+    } else {
+      cur += c
+    }
+  }
+  if (cur) out.push(cur)
+  return out
+}
+
+// --- parse a CLI string into the structured cfg the sweep uses ---
+function parseCli(s) {
+  const toks = tokenize(s)
+  const cfg = { tags: [], nGpus: [], execute: true }
+  for (let i = 0; i < toks.length; i++) {
+    const t = toks[i]
+    const next = () => toks[++i]
+    if (t === '--tags' || t === '-t') {
+      // collect following non-flag tokens, comma-splitting each
+      while (i + 1 < toks.length && !toks[i + 1].startsWith('-')) {
+        next().split(',').forEach(x => { if (x) cfg.tags.push(x) })
+      }
+    } else if (t === '--additional-context' || t === '-c') {
+      cfg.additionalContext = next()
+    } else if (t === '--ngpus' || t === '--n-gpus') {
+      next().split(',').forEach(x => { if (x) cfg.nGpus.push(x) })
+    } else if (t === '--timeout') {
+      cfg.timeout = next()
+    } else if (t === '-o' || t === '--output') {
+      cfg.output = next()
+    } else if (t === '--no-execute' || t === '--plan' || t === '--dry-run') {
+      cfg.execute = false
+    } else if (t === '--execute') {
+      cfg.execute = true
+    }
+    // --live-output and unknown flags are ignored (live-output is always added below)
+  }
+  return cfg
+}
+
+// --- normalize args (string | object | undefined) into one cfg ---
+let cfg
+if (typeof args === 'string') {
+  cfg = parseCli(args)
+} else if (args && typeof args === 'object') {
+  cfg = { execute: true, ...args }
+  if (typeof cfg.tags === 'string') cfg.tags = [cfg.tags]
+} else {
+  cfg = { tags: [], nGpus: [], execute: true }
+}
+
+const tags = (cfg.tags && cfg.tags.length) ? cfg.tags : ['bert']
+const nGpus = (cfg.nGpus && cfg.nGpus.length) ? cfg.nGpus : [null]
+const execute = cfg.execute !== false
+const addlCtx = cfg.additionalContext || null
+const timeout = cfg.timeout || null
+
+// Build the matrix of {tag, nGpus} cells.
+const matrix = []
+for (const tag of tags)
+  for (const g of nGpus)
+    matrix.push({ tag, nGpus: g })
+
+log(`Sweep matrix: ${matrix.length} cell(s) across ${tags.length} tag(s) [${tags.join(', ')}]. execute=${execute}${addlCtx ? ', additional-context threaded' : ''}`)
+
+const CELL_SCHEMA = {
+  type: 'object',
+  required: ['cell', 'command', 'status'],
+  properties: {
+    cell: { type: 'string' },
+    command: { type: 'string' },
+    status: { type: 'string', enum: ['planned', 'ran', 'skipped', 'error'] },
+    performance: { type: ['number', 'null'] },
+    metric: { type: ['string', 'null'] },
+    notes: { type: 'string' },
+  },
+}
+
+const results = await parallel(matrix.map((cell, i) => () => {
+  const label = `${cell.tag}${cell.nGpus ? '/' + cell.nGpus + 'gpu' : ''}`
+  // Each cell writes its OWN perf file so parallel runs never clobber a shared
+  // perf.csv. A safe filename is derived from the label.
+  const safe = label.replace(/[^A-Za-z0-9._-]/g, '_')
+  const outFile = `perf_${safe}.csv`
+  const flags = [
+    '--tags ' + cell.tag,
+    '--live-output',
+    '-o ' + outFile,
+  ]
+  if (timeout) flags.push('--timeout ' + timeout)
+  // Thread the user-supplied --additional-context verbatim into every cell. If a
+  // per-cell n_gpus axis is set, try to merge it into the JSON; fall back to the
+  // raw string + a separate context if merge isn't possible.
+  let ctx = ''
+  if (addlCtx) {
+    let merged = addlCtx
+    if (cell.nGpus) {
+      try {
+        const obj = JSON.parse(addlCtx)
+        obj.n_gpus = cell.nGpus
+        merged = JSON.stringify(obj)
+      } catch (e) { /* leave raw; n_gpus axis ignored for this cell */ }
+    }
+    ctx = ` --additional-context '${merged}'`
+  } else if (cell.nGpus) {
+    ctx = ` --additional-context '{"n_gpus": "${cell.nGpus}"}'`
+  }
+  const cmd = `madengine run ${flags.join(' ')}${ctx}`
+
+  const action = execute
+    ? `If AMD GPUs are present (check rocm-smi/amd-smi), RUN this command from the /home/ysha/MAD repo root and parse results. This model may emit a "performance: <value> <unit>" stdout line OR (if it is a multiple_results model) write per-metric rows to its own CSV — in that case report the primary throughput metric. Results land in ${outFile}. If no GPUs, set status "skipped".`
+    : `Do NOT execute. Validate the command is well-formed and the tag resolves via "madengine discover --tags ${cell.tag}". Set status "planned".`
+
+  return agent(
+    `MAD benchmark sweep cell "${label}".
+Pre-flight: before running madengine, verify it is installed:
+  if ! command -v madengine &>/dev/null; then
+    if [ -f requirements.txt ] && grep -q madengine requirements.txt; then
+      pip install -r requirements.txt
+    else
+      echo "[pre-flight] madengine not found. Install: pip install git+https://github.com/ROCm/madengine.git@main"; exit 1
+    fi
+  fi
+  [ -f models.json ] || echo "[pre-flight] Warning: not in MAD repo root."
+Command: ${cmd}
+${action}
+Return the cell label, the exact command, status, and performance/metric if available.`,
+    { label: `bench:${label}`, phase: 'Run', schema: CELL_SCHEMA }
+  ).then(r => ({ ...r, _label: label, _output: outFile }))
+}))
+
+phase('Synthesize')
+const clean = results.filter(Boolean)
+const summary = await agent(
+  `Synthesize a MAD benchmark sweep into a markdown comparison table.
+Cells (JSON, each with its own _output perf file): ${JSON.stringify(clean)}
+Produce: (1) a table with columns Cell | Command | Status | Performance | Metric;
+(2) a short headline (how many planned/ran/skipped/errored);
+(3) explicitly flag any cell whose status is "error" or whose tag did not resolve —
+do not let a failed/unresolved cell read as success;
+(4) if any ran, the fastest and slowest cells (respecting that metric units differ);
+(5) note that each cell wrote a separate perf_<cell>.csv (to avoid clobbering) and
+that they can be concatenated for a combined report.
+Do not invent numbers — only use provided data.`,
+  { phase: 'Synthesize' }
+)
+
+return { execute, cells: clean, report: summary }
diff --git a/.claude/workflows/mad-tune-search.js b/.claude/workflows/mad-tune-search.js
new file mode 100644
index 0000000..9e1fd52
--- /dev/null
+++ b/.claude/workflows/mad-tune-search.js
@@ -0,0 +1,344 @@
+export const meta = {
+  name: 'mad-tune-search',
+  description: 'Profile a MAD model to diagnose its bottleneck, generate profiling-informed tuning candidates, measure each cleanly + adversarially verify, and synthesize the winning config',
+  phases: [
+    { title: 'Baseline', detail: 'read the model + establish a clean baseline number' },
+    { title: 'Diagnose', detail: 'profile once to find hotspots and classify the bottleneck' },
+    { title: 'Propose', detail: 'generate distinct candidates, each citing profiling evidence' },
+    { title: 'Evaluate', detail: 'measure each candidate cleanly (sequential) and verify its delta' },
+    { title: 'Synthesize', detail: 'pick the winning configuration' },
+  ],
+}
+
+// ARGS — accepts EITHER:
+//   (a) a structured object: { tag, target?, candidates?, execute?,
+//       additionalContext?, timeout?, profileTool? }
+//   (b) a CLI-style string mirroring `/mad-benchmark` / `/mad-profile`, e.g.
+//       --tag pyt_vllm_qwen3-8b --candidates 6 \
+//         --additional-context '{"gpu_vendor":"AMD","tools":[{"name":"rocm_trace_lite"}],...}'
+//
+// WHY profiling is split from measurement (the core design):
+//   - DIAGNOSE uses a profiler (tools[]) to find WHAT to tune: kernel hotspots,
+//     and whether the workload is compute- / memory- / communication- / launch-bound.
+//   - EVALUATE measures WHETHER a change helped using CLEAN runs (profiler OFF),
+//     because tracing overhead (e.g. rocm_trace_lite intercepts ~all GPU ops) buries
+//     the small perf deltas tuning produces and would make verify reject real wins.
+//   So we run the profiler exactly ONCE (Diagnose), then strip the `tools` key from
+//   the context for the baseline and every candidate measurement.
+//
+// EXECUTE semantics: runs for real by default. Pass --no-execute / --plan to only
+// diagnose-from-static + produce the ranked candidate plan without GPU measurement.
+//
+// SEQUENTIAL evaluation: candidates are measured one at a time — parallel GPU runs
+// contend (corrupting deltas) and parallel run.sh/config edits race on the same files.
+
+// --- tokenizer: split a CLI string respecting single/double quotes ---
+function tokenize(s) {
+  const out = []
+  let cur = '', q = null
+  for (let i = 0; i < s.length; i++) {
+    const c = s[i]
+    if (q) {
+      if (c === q) q = null
+      else cur += c
+    } else if (c === '"' || c === "'") {
+      q = c
+    } else if (c === ' ' || c === '\t' || c === '\n') {
+      if (cur) { out.push(cur); cur = '' }
+    } else {
+      cur += c
+    }
+  }
+  if (cur) out.push(cur)
+  return out
+}
+
+// --- parse a CLI string into the structured cfg the search uses ---
+function parseCli(s) {
+  const toks = tokenize(s)
+  const cfg = { execute: true }
+  const tags = []
+  for (let i = 0; i < toks.length; i++) {
+    const t = toks[i]
+    const next = () => toks[++i]
+    if (t === '--tags' || t === '--tag' || t === '-t') {
+      while (i + 1 < toks.length && !toks[i + 1].startsWith('-')) {
+        next().split(',').forEach(x => { if (x) tags.push(x) })
+      }
+    } else if (t === '--additional-context' || t === '-c') {
+      cfg.additionalContext = next()
+    } else if (t === '--target') {
+      cfg.target = next()
+    } else if (t === '--candidates' || t === '-n') {
+      cfg.candidates = parseInt(next(), 10)
+    } else if (t === '--profile-tool') {
+      cfg.profileTool = next()
+    } else if (t === '--timeout') {
+      cfg.timeout = next()
+    } else if (t === '--no-execute' || t === '--plan' || t === '--dry-run') {
+      cfg.execute = false
+    } else if (t === '--execute') {
+      cfg.execute = true
+    }
+  }
+  if (tags.length) cfg.tag = tags[0]
+  if (tags.length > 1) cfg._extraTags = tags.slice(1)
+  return cfg
+}
+
+// --- normalize args (string | object | undefined) into one cfg ---
+let cfg
+if (typeof args === 'string') {
+  cfg = parseCli(args)
+} else if (args && typeof args === 'object') {
+  cfg = { execute: true, ...args }
+} else {
+  cfg = { execute: true }
+}
+
+const tag = cfg.tag || 'bert'
+const target = cfg.target || 'throughput'
+const nCandidates = Math.max(1, Math.min(cfg.candidates || 4, 8))
+const execute = cfg.execute !== false
+const timeout = cfg.timeout || null
+
+// --- Split the user's --additional-context into a PROFILED and a CLEAN variant. ---
+// profiledCtx: keeps (or adds) a `tools` entry → used once in Diagnose.
+// cleanCtx:    `tools` key removed → used for baseline + every candidate measurement.
+let ctxObj = null
+if (cfg.additionalContext) {
+  try { ctxObj = JSON.parse(cfg.additionalContext) }
+  catch (e) { log(`Warning: --additional-context is not valid JSON; passing it through verbatim and skipping clean/profiled split.`) }
+}
+
+let cleanCtx = null      // JSON string, tools stripped
+let profiledCtx = null   // JSON string, tools present
+let profileToolName = cfg.profileTool || null
+
+if (ctxObj && typeof ctxObj === 'object') {
+  // derive the profiling tool name from the context if not given explicitly
+  if (!profileToolName && Array.isArray(ctxObj.tools) && ctxObj.tools[0] && ctxObj.tools[0].name)
+    profileToolName = ctxObj.tools[0].name
+  if (!profileToolName) profileToolName = 'rocm_trace_lite'
+
+  const cleanObj = { ...ctxObj }
+  delete cleanObj.tools
+  cleanCtx = JSON.stringify(cleanObj)
+
+  const profObj = { ...ctxObj, tools: [{ name: profileToolName }] }
+  profiledCtx = JSON.stringify(profObj)
+} else if (cfg.additionalContext) {
+  // not parseable as object — use verbatim for clean; still attempt a profiled variant
+  cleanCtx = cfg.additionalContext
+  if (!profileToolName) profileToolName = 'rocm_trace_lite'
+}
+
+const cleanCtxFlag = cleanCtx ? ` --additional-context '${cleanCtx}'` : ''
+const profCtxFlag = profiledCtx ? ` --additional-context '${profiledCtx}'`
+  : ` --additional-context '{"tools": [{"name": "${profileToolName || 'rocm_trace_lite'}"}]}'`
+
+if (cfg._extraTags && cfg._extraTags.length)
+  log(`Note: mad-tune-search tunes ONE model. Using "${tag}"; ignoring extra tags [${cfg._extraTags.join(', ')}]. Use mad-benchmark-sweep to compare multiple models.`)
+log(`Tuning "${tag}" for ${target}. execute=${execute}. profile tool=${profileToolName || 'rocm_trace_lite'}. ${cleanCtx ? 'context split into clean+profiled' : 'no additional-context'}.`)
+
+// ── Phase 1: Baseline (CLEAN) ──────────────────────────────────────────────
+phase('Baseline')
+const BASELINE_SCHEMA = {
+  type: 'object',
+  required: ['model', 'scriptPath', 'baselineSummary'],
+  properties: {
+    model: { type: 'string' },
+    scriptPath: { type: 'string' },
+    baselineSummary: { type: 'string' },
+    multipleResults: { type: ['string', 'null'] },
+    baselinePerf: { type: ['number', 'null'] },
+    baselineMetric: { type: ['string', 'null'] },
+    levers: { type: 'array', items: { type: 'string' } },
+  },
+}
+const baseAction = execute
+  ? `If AMD GPUs are present, establish a CLEAN baseline number (profiler OFF): either reuse the most recent clean perf row for this model if one exists in perf.csv / the model's results CSV, or run "madengine run --tags ${tag} --live-output -o perf_tune_baseline.csv${cleanCtxFlag}" once and parse it. Report baselinePerf + baselineMetric.`
+  : `Do NOT run anything. Read the static config only; leave baselinePerf null.`
+const baseline = await agent(
+  `In the MAD repo, establish the tuning baseline for tag "${tag}" (target: ${target}).
+Pre-flight: before running madengine, verify it is installed:
+  if ! command -v madengine &>/dev/null; then
+    if [ -f requirements.txt ] && grep -q madengine requirements.txt; then
+      pip install -r requirements.txt
+    else
+      echo "[pre-flight] madengine not found. Install: pip install git+https://github.com/ROCm/madengine.git@main"; exit 1
+    fi
+  fi
+  [ -f models.json ] || echo "[pre-flight] Warning: not in MAD repo root."
+Find its models.json entry, its scripts/.../run.sh, and any config it references.
+Summarize the current configuration and list the tuning levers available for this
+stack (env vars like MAD_MODEL_BATCH_SIZE / PYTORCH_TUNABLEOP_ENABLED / NCCL_*/RCCL_*,
+or args like tensor-parallel size, precision, gpu-memory-utilization, max-num-seqs).
+Report whether the entry sets "multiple_results" (its CSV filename, or null).
+${baseAction}`,
+  { phase: 'Baseline', schema: BASELINE_SCHEMA }
+)
+const multiCsv = baseline.multipleResults || null
+const parseHint = multiCsv
+  ? `This is a multiple_results model — performance is NOT a stdout line; after the run read the per-metric CSV "${multiCsv}" (and the -o file) and report the primary ${target} metric.`
+  : `Parse the "performance: <value> <unit>" line from stdout.`
+
+// ── Phase 2: Diagnose (PROFILED, once) ─────────────────────────────────────
+phase('Diagnose')
+const DIAG_SCHEMA = {
+  type: 'object',
+  required: ['bottleneck', 'evidence', 'hotspots'],
+  properties: {
+    bottleneck: { type: 'string', enum: ['compute', 'memory', 'communication', 'launch', 'mixed', 'unknown'] },
+    evidence: { type: 'string' },
+    hotspots: { type: 'array', items: {
+      type: 'object',
+      required: ['name', 'pct'],
+      properties: { name: { type: 'string' }, pct: { type: ['number', 'null'] }, kind: { type: 'string' } },
+    } },
+    recommendedLevers: { type: 'array', items: { type: 'string' } },
+    tracePath: { type: ['string', 'null'] },
+  },
+}
+const diagAction = execute
+  ? `If AMD GPUs are present, profile the model ONCE: prefer reusing an existing recent trace dir (e.g. rocm_trace_lite_output/) if present and current; otherwise run "madengine run --tags ${tag} --live-output -o perf_tune_profiled.csv${profCtxFlag}". Then read the trace summary (e.g. trace_summary.txt / the trace db) to extract the top GPU-kernel hotspots with their % of GPU time. If no GPUs, classify from static config + known model characteristics and set tracePath null.`
+  : `Do NOT run anything. Classify the likely bottleneck from the static config, model architecture, and any EXISTING trace output already on disk (e.g. rocm_trace_lite_output/trace_summary.txt). Set tracePath to that file if found, else null.`
+const diag = await agent(
+  `Diagnose the performance bottleneck of "${baseline.model}" to guide ${target} tuning.
+Baseline config: ${baseline.baselineSummary}
+${diagAction}
+
+Classify the bottleneck as exactly one of: compute, memory, communication, launch, mixed, unknown — using these best-practice signals:
+- compute: GEMM/conv kernels (e.g. Cijk*, *gemm*, wvSplitK) dominate GPU time; high occupancy.
+  Levers: PYTORCH_TUNABLEOP_ENABLED, hipBLASLt tuning, fp8/lower precision, larger batch.
+- memory: attention/elementwise/norm/copy kernels dominate; low arithmetic intensity / bandwidth-bound.
+  Levers: KV-cache dtype, kernel fusion, gpu-memory-utilization, batch size.
+- communication: RCCL/NCCL collectives (AllReduce/AllGather) are a large share (multi-GPU).
+  Levers: NCCL_*/RCCL_* tuning, tensor-parallel vs pipeline-parallel topology.
+- launch: many tiny kernels with timeline gaps / low GPU utilization.
+  Levers: HIP/CUDA graphs, larger batch, fewer sync points.
+Return bottleneck, evidence (cite the hotspot %s), hotspots[], recommendedLevers[], tracePath.`,
+  { phase: 'Diagnose', schema: DIAG_SCHEMA }
+)
+log(`Diagnosis: ${diag.bottleneck}-bound. Top hotspots: ${(diag.hotspots || []).slice(0, 3).map(h => `${h.name} ${h.pct != null ? h.pct + '%' : ''}`).join(', ')}`)
+
+// ── Phase 3: Propose (profiling-informed) ──────────────────────────────────
+phase('Propose')
+const CANDS_SCHEMA = {
+  type: 'object',
+  required: ['candidates'],
+  properties: {
+    candidates: { type: 'array', items: {
+      type: 'object',
+      required: ['id', 'change', 'hypothesis', 'evidence'],
+      properties: {
+        id: { type: 'string' },
+        change: { type: 'string' },
+        hypothesis: { type: 'string' },
+        evidence: { type: 'string' },
+        leverKind: { type: 'string', enum: ['env', 'arg', 'config', 'other'] },
+      },
+    } },
+  },
+}
+const proposed = await agent(
+  `Propose ${nCandidates} DISTINCT tuning candidates to improve ${target} for "${baseline.model}".
+Baseline: ${baseline.baselineSummary}
+Diagnosis: ${diag.bottleneck}-bound. Evidence: ${diag.evidence}
+Top hotspots: ${JSON.stringify((diag.hotspots || []).slice(0, 5))}
+Recommended levers from diagnosis: ${(diag.recommendedLevers || []).join(', ')}
+Generic available levers: ${(baseline.levers || []).join(', ')}
+
+Each candidate changes ONE lever (so its effect is attributable) and should TARGET the
+diagnosed bottleneck. Return id, change (concrete value + how to apply it), hypothesis,
+evidence (which hotspot/diagnosis finding motivates it), and leverKind:
+"env" (environment variable), "arg" (run-script/CLI arg), "config" (config-file value), or "other".`,
+  { phase: 'Propose', schema: CANDS_SCHEMA }
+)
+const candidates = (proposed.candidates || []).slice(0, nCandidates)
+log(`Evaluating ${candidates.length} candidate(s) sequentially, CLEAN (profiler off). execute=${execute}`)
+
+// ── Phase 4: Evaluate (CLEAN, sequential) + adversarial verify ─────────────
+const EVAL_SCHEMA = {
+  type: 'object',
+  required: ['id', 'status'],
+  properties: {
+    id: { type: 'string' },
+    status: { type: 'string', enum: ['planned', 'ran', 'skipped', 'error'] },
+    command: { type: 'string' },
+    performance: { type: ['number', 'null'] },
+    metric: { type: ['string', 'null'] },
+    notes: { type: 'string' },
+  },
+}
+const VERDICT_SCHEMA = {
+  type: 'object',
+  required: ['id', 'trustworthy'],
+  properties: {
+    id: { type: 'string' },
+    trustworthy: { type: 'boolean' },
+    reason: { type: 'string' },
+  },
+}
+
+const ctxHint = cleanCtx
+  ? `A CLEAN base --additional-context (profiler OFF) is in the command. If this candidate's leverKind is "env", apply the change by MERGING the env var into that context's "docker_env_vars" object (show the merged command you actually run); do not export it separately.`
+  : `If leverKind is "env", apply by adding --additional-context '{"docker_env_vars": {"VAR": "value"}}' (or exporting the var).`
+
+const evaluated = []
+for (const cand of candidates) {
+  const safeId = String(cand.id).replace(/[^A-Za-z0-9._-]/g, '_')
+  const outFile = `perf_tune_${safeId}.csv`
+  const flags = ['--tags ' + tag, '--live-output', '-o ' + outFile]
+  if (timeout) flags.push('--timeout ' + timeout)
+  const cmd = `madengine run ${flags.join(' ')}${cleanCtxFlag}`
+  const action = execute
+    ? `If AMD GPUs are present (check rocm-smi/amd-smi), apply the change, RUN the command CLEAN (profiler OFF), parse the result, then REVERT any file/config edit so the next candidate starts clean. ${parseHint} Results land in ${outFile}. If no GPUs, set status "skipped".`
+    : `Do NOT execute. Produce the exact command (plus precisely how to apply the change) and set status "planned".`
+
+  const evalResult = await agent(
+    `Evaluate tuning candidate ${cand.id} for "${baseline.model}" (leverKind: ${cand.leverKind || 'other'}).
+Change: ${cand.change}
+Hypothesis: ${cand.hypothesis}
+Motivating evidence: ${cand.evidence}
+Base command (CLEAN): ${cmd}
+${ctxHint}
+${action}
+Return id, status, the exact command you ran, and performance/metric if measured.`,
+    { label: `eval:${cand.id}`, phase: 'Evaluate', schema: EVAL_SCHEMA }
+  )
+
+  const verdict = await agent(
+    `Adversarially review tuning candidate ${cand.id}.
+Baseline: ${baseline.baselinePerf != null ? baseline.baselinePerf + ' ' + (baseline.baselineMetric || '') : baseline.baselineSummary}
+Claimed result: ${JSON.stringify(evalResult)}
+Hypothesis was: ${cand.hypothesis}
+Is the conclusion trustworthy? Default to trustworthy=false if the change was not actually
+measured (planned/skipped), if only one sample was taken, if it ran under a profiler, or if
+the delta vs baseline is within run-to-run noise (~1-2%). Return id, trustworthy, reason.`,
+    { label: `verify:${cand.id}`, phase: 'Evaluate', schema: VERDICT_SCHEMA }
+  )
+  evaluated.push({ ...evalResult, _change: cand.change, _evidence: cand.evidence, verdict })
+}
+
+// ── Phase 5: Synthesize ────────────────────────────────────────────────────
+phase('Synthesize')
+const clean = evaluated.filter(Boolean)
+const report = await agent(
+  `Synthesize a MAD tuning search for "${baseline.model}" (target: ${target}).
+Baseline: ${baseline.baselineSummary}${baseline.baselinePerf != null ? ` (clean baseline: ${baseline.baselinePerf} ${baseline.baselineMetric || ''})` : ''}
+Diagnosis: ${diag.bottleneck}-bound — ${diag.evidence}
+Top hotspots: ${JSON.stringify((diag.hotspots || []).slice(0, 5))}
+Evaluated candidates (JSON, each with a verdict): ${JSON.stringify(clean)}
+
+Produce:
+(1) a one-line headline: the bottleneck and the best trustworthy improvement (or that none beat baseline);
+(2) the diagnosis summary (what's the bottleneck and why, citing hotspot %s);
+(3) a table: Candidate | Change | Evidence | Status | Performance | vs Baseline | Trustworthy;
+(4) the recommended configuration — ONLY from candidates whose verdict is trustworthy; if none ran/were trustworthy, present the ranked plan to test on a GPU host;
+(5) the exact next command(s) to run.
+Each candidate wrote its own perf_tune_<id>.csv (no shared-file clobbering). Do not invent measurements.`,
+  { phase: 'Synthesize' }
+)
+
+return { model: baseline.model, target, execute, bottleneck: diag.bottleneck, diagnosis: diag, candidates: clean, report }
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..0c5648f
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,130 @@
+# CLAUDE.md
+
+Guidance for Claude Code when working in the MAD repository.
+
+## What MAD is
+
+MAD (Model Automation and Dashboarding) is a curated catalog of ~142 AI model
+workloads that run on AMD Instinct GPUs via the **`madengine`** CLI. Each model
+is one JSON entry in `models.json`. `madengine run --tags <tag>` builds a Docker
+image, runs the model in a container, parses a performance number from stdout,
+and appends a row to `perf.csv`.
+
+The four common user tasks here are: **benchmarking**, **adding a new model**,
+**tuning** a model/kernel for better perf, and **development** on the repo
+itself. Dedicated subagents and `/mad-*` slash commands exist for each — see
+`.claude/agents/` and `.claude/commands/`.
+
+## The performance contract (most important convention)
+
+A model's run script MUST print exactly one line to stdout:
+
+```
+performance: <value> <unit>
+```
+
+madengine greps this line to populate the `performance` and `metric` columns of
+`perf.csv`. A run that does not emit it is recorded with no performance value.
+Real example (`scripts/huggingface_bert/run.sh`):
+
+```bash
+performance=$(grep -Eo "train_samples_per_second':[^,]+" log.txt | sed "s/.*: //g" | head -n 1)
+echo "performance: $performance samples_per_second"
+```
+
+If a script produces many results, set `multiple_results` in `models.json` to the
+CSV filename the script writes (e.g. `"multiple_results": "perf_DeepSeek-R1.csv"`).
+
+## models.json schema
+
+One object per model. Required fields: `name`, `url`, `dockerfile`, `scripts`,
+`n_gpus`, `owner`, `training_precision`, `tags`. Optional: `data`, `timeout`,
+`multiple_results`, `args`, `skip_gpu_arch`.
+
+| Field | Notes |
+|-------|-------|
+| `name` | Unique. Convention: `{framework}_{project}_{workload}`, e.g. `pyt_vllm_deepseek-r1` |
+| `url` | Git repo cloned into the container; `""` if none |
+| `dockerfile` | Path prefix; engine appends `.ubuntu.amd.Dockerfile` → `docker/<name>.ubuntu.amd.Dockerfile` |
+| `scripts` | Path to a `run.sh` (or script dir) executed inside the container |
+| `n_gpus` | String. `"-1"` = all available |
+| `tags` | List; `--tags` matches any tag OR the exact `name` |
+| `training_precision` | e.g. `fp16`, `bf16`, `fp8`, `fp32`, or `""` |
+| `timeout` | Seconds; overrides the 7200s default. `-1` = no timeout |
+| `skip_gpu_arch` | Skip on these archs, e.g. `"gfx942"` |
+| `args` | Extra args appended to the run script |
+
+`models.json` must stay valid JSON — validate with `python3 -m json.tool models.json`
+after editing.
+
+## Adding a new model (4 steps)
+
+1. **Name** it `{framework}_{project}_{workload}`.
+2. **Add** an entry to `models.json` (copy the closest existing model of the same
+   framework as a template).
+3. **Dockerfile** `docker/<name>.ubuntu.amd.Dockerfile`. First line MUST be the
+   context header, then a base image:
+   ```dockerfile
+   # CONTEXT {'gpu_vendor': 'AMD', 'guest_os': 'UBUNTU'}
+   ARG BASE_DOCKER=rocm/pytorch
+   FROM $BASE_DOCKER
+   ```
+   Reuse an existing `docker/*.Dockerfile` of the same stack when possible rather
+   than inventing a new base.
+4. **Script** `scripts/<dir>/run.sh` that runs the workload and ends by echoing
+   the `performance: <value> <unit>` line.
+
+Verify (needs a GPU host): `madengine run --tags <name> --live-output`.
+
+## madengine commands
+
+`requirements.txt` pins madengine to `@main` (Typer CLI, v2.1.0). It exposes
+five commands: `build`, `run`, `discover`, `report`, `database`. Verify exact
+flags with `madengine <cmd> --help`.
+
+- `madengine discover --tags <tag>` — list matching models (read-only, no GPU).
+- `madengine build --tags <tag> [-r REGISTRY] [-a gfx942,gfx90a] [-l]` — build images, write `build_manifest.json`. `--use-image` skips the build and uses a prebuilt image.
+- `madengine run --tags <tag> [-l/--live-output] [-o out.csv] [--timeout S] [--keep-alive] [--skip-model-run] [-c '<ctx>']` — full build+run (or, with `-m manifest.json`, execution-only). Writes `perf.csv`.
+- `madengine report to-html --csv-file-path perf.csv` — HTML report.
+- `madengine database -f perf_entry_super.json --db DB -c COLLECTION` — upload CSV/JSON to MongoDB (`-k model,timestamp` sets the unique key; `--dry-run` to preview).
+
+`--skip-model-run` builds the image without running the workload.
+`tools/run_models.py` is the **deprecated** legacy runner — prefer `madengine`.
+
+## Deployment target (convention over configuration)
+
+Pass deployment intent via `--additional-context` (a Python-dict/JSON string,
+parsed with `ast.literal_eval` so Python dict syntax is fine). The target is
+inferred from which key is present:
+
+- `"slurm"` key → SLURM. `"k8s"`/`"kubernetes"` key → Kubernetes. Neither → local Docker.
+
+## Profiling and tracing
+
+Add a `tools` list to `--additional-context`:
+
+```bash
+madengine run --tags <tag> --additional-context '{"tools": [{"name": "rocprofv3_compute"}]}'
+```
+
+Common tool names: `rpd`, `rocprofv3`, `rocprofv3_compute`, `rocprofv3_memory`,
+`rocprofv3_communication`, `rocm_trace_lite`, `rccl_trace`, `gpu_info_power_profiler`.
+Pre-built profiling context files live in the madengine package's
+`examples/profiling-configs/`.
+
+## Outputs
+
+- `perf.csv` — one row per run (model, n_gpus, precision, performance, metric, status, durations, git_commit, gpu_architecture, ...).
+- `perf_entry_super.json` — enriched record incl. a `configs` block; this is what gets pushed to MongoDB.
+
+## Environment variables
+
+`MAD_SECRETS_HFTOKEN` (HuggingFace token), `MAD_MODEL_NAME`, `MAD_RUNTIME_NGPUS`,
+`MAD_SYSTEM_GPU_ARCHITECTURE`, `MAD_MODEL_BATCH_SIZE`.
+
+## Future (not yet wired)
+
+`reference_db/mad_agent.db` (tables: `model_baselines`, `optimization_history`,
+`best_configurations`, `learned_patterns`) and `knowledge_base/` are planned
+for a future persistent optimization-memory layer. They are not present or
+populated yet — do not assume data is available.
diff --git a/mad-automation-howto.html b/mad-automation-howto.html
new file mode 100644
index 0000000..c4ab4be
--- /dev/null
+++ b/mad-automation-howto.html
@@ -0,0 +1,2097 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="UTF-8">
+<meta name="viewport" content="width=device-width, initial-scale=1.0">
+<title>MAD Claude Automation — How-To Guide</title>
+<style>
+  :root {
+    --bg: #0f1117;
+    --bg-elevated: #161b27;
+    --bg-card: #1a2035;
+    --bg-code: #0d1117;
+    --border: #2a3045;
+    --border-bright: #3a4460;
+    --text: #e2e8f0;
+    --text-muted: #94a3b8;
+    --text-dim: #64748b;
+    --accent-blue: #4f9cf9;
+    --accent-green: #34d399;
+    --accent-amber: #fbbf24;
+    --accent-red: #f87171;
+    --accent-purple: #a78bfa;
+    --accent-teal: #2dd4bf;
+    --badge-command: #1e3a5f;
+    --badge-command-text: #93c5fd;
+    --badge-agent: #1a3a2a;
+    --badge-agent-text: #6ee7b7;
+    --badge-workflow: #2d1f4a;
+    --badge-workflow-text: #c4b5fd;
+    --badge-warn: #3a2a10;
+    --badge-warn-text: #fcd34d;
+    --font-sans: -apple-system, BlinkMacSystemFont, 'Segoe UI', system-ui, sans-serif;
+    --font-mono: 'SF Mono', 'Fira Code', 'Cascadia Code', 'Consolas', monospace;
+  }
+
+  *, *::before, *::after { box-sizing: border-box; margin: 0; padding: 0; }
+
+  html { scroll-behavior: smooth; font-size: 16px; }
+
+  body {
+    background: var(--bg);
+    color: var(--text);
+    font-family: var(--font-sans);
+    line-height: 1.65;
+    display: flex;
+    min-height: 100vh;
+  }
+
+  /* ── Navigation ─────────────────────────────────────── */
+  nav {
+    position: fixed;
+    top: 0;
+    left: 0;
+    width: 256px;
+    height: 100vh;
+    background: var(--bg-elevated);
+    border-right: 1px solid var(--border);
+    overflow-y: auto;
+    padding: 24px 0 40px;
+    z-index: 100;
+    flex-shrink: 0;
+  }
+
+  nav .nav-logo {
+    padding: 0 20px 20px;
+    border-bottom: 1px solid var(--border);
+    margin-bottom: 16px;
+  }
+
+  nav .nav-logo .logo-title {
+    font-size: 0.9rem;
+    font-weight: 700;
+    letter-spacing: 0.08em;
+    text-transform: uppercase;
+    color: var(--accent-blue);
+  }
+
+  nav .nav-logo .logo-sub {
+    font-size: 0.75rem;
+    color: var(--text-dim);
+    margin-top: 2px;
+  }
+
+  nav .nav-section {
+    padding: 6px 20px 2px;
+    font-size: 0.65rem;
+    font-weight: 700;
+    letter-spacing: 0.12em;
+    text-transform: uppercase;
+    color: var(--text-dim);
+    margin-top: 12px;
+  }
+
+  nav a {
+    display: block;
+    padding: 6px 20px;
+    font-size: 0.82rem;
+    color: var(--text-muted);
+    text-decoration: none;
+    border-left: 2px solid transparent;
+    transition: all 0.15s;
+    white-space: nowrap;
+    overflow: hidden;
+    text-overflow: ellipsis;
+  }
+
+  nav a:hover {
+    color: var(--text);
+    background: rgba(79,156,249,0.07);
+    border-left-color: var(--accent-blue);
+  }
+
+  nav a.active {
+    color: var(--accent-blue);
+    border-left-color: var(--accent-blue);
+    background: rgba(79,156,249,0.1);
+  }
+
+  /* ── Main content ────────────────────────────────────── */
+  main {
+    margin-left: 256px;
+    flex: 1;
+    max-width: 960px;
+    padding: 48px 48px 80px;
+  }
+
+  /* ── Sections ────────────────────────────────────────── */
+  section {
+    margin-bottom: 72px;
+    scroll-margin-top: 24px;
+  }
+
+  h1 {
+    font-size: clamp(1.6rem, 2vw, 2.2rem);
+    font-weight: 800;
+    color: var(--text);
+    letter-spacing: -0.02em;
+    margin-bottom: 8px;
+    line-height: 1.2;
+  }
+
+  .page-subtitle {
+    font-size: 1rem;
+    color: var(--text-muted);
+    margin-bottom: 48px;
+    padding-bottom: 24px;
+    border-bottom: 1px solid var(--border);
+  }
+
+  h2 {
+    font-size: 1.35rem;
+    font-weight: 700;
+    color: var(--text);
+    margin-bottom: 20px;
+    letter-spacing: -0.01em;
+    display: flex;
+    align-items: center;
+    gap: 10px;
+  }
+
+  h2 .section-num {
+    display: inline-flex;
+    align-items: center;
+    justify-content: center;
+    width: 28px;
+    height: 28px;
+    background: var(--border-bright);
+    border-radius: 6px;
+    font-size: 0.75rem;
+    font-weight: 700;
+    color: var(--text-muted);
+    flex-shrink: 0;
+  }
+
+  h3 {
+    font-size: 1.05rem;
+    font-weight: 600;
+    color: var(--text);
+    margin: 24px 0 12px;
+    letter-spacing: -0.005em;
+  }
+
+  p {
+    color: var(--text-muted);
+    margin-bottom: 14px;
+    max-width: 72ch;
+  }
+
+  a { color: var(--accent-blue); text-decoration: none; }
+  a:hover { text-decoration: underline; }
+
+  ul, ol {
+    padding-left: 20px;
+    color: var(--text-muted);
+    margin-bottom: 14px;
+  }
+
+  li { margin-bottom: 5px; }
+
+  strong { color: var(--text); font-weight: 600; }
+
+  /* ── Badges ──────────────────────────────────────────── */
+  .badge {
+    display: inline-flex;
+    align-items: center;
+    gap: 5px;
+    padding: 2px 10px;
+    border-radius: 20px;
+    font-size: 0.72rem;
+    font-weight: 600;
+    letter-spacing: 0.04em;
+    text-transform: uppercase;
+  }
+
+  .badge-command {
+    background: var(--badge-command);
+    color: var(--badge-command-text);
+    border: 1px solid rgba(147,197,253,0.2);
+  }
+
+  .badge-agent {
+    background: var(--badge-agent);
+    color: var(--badge-agent-text);
+    border: 1px solid rgba(110,231,183,0.2);
+  }
+
+  .badge-workflow {
+    background: var(--badge-workflow);
+    color: var(--badge-workflow-text);
+    border: 1px solid rgba(196,181,253,0.2);
+  }
+
+  .badge-warn {
+    background: var(--badge-warn);
+    color: var(--badge-warn-text);
+    border: 1px solid rgba(252,211,77,0.2);
+  }
+
+  /* ── Code blocks ─────────────────────────────────────── */
+  pre {
+    background: var(--bg-code);
+    border: 1px solid var(--border);
+    border-radius: 10px;
+    padding: 18px 20px;
+    overflow-x: auto;
+    margin: 16px 0;
+    font-family: var(--font-mono);
+    font-size: 0.82rem;
+    line-height: 1.6;
+    position: relative;
+  }
+
+  pre .copy-hint {
+    position: absolute;
+    top: 10px;
+    right: 12px;
+    font-size: 0.68rem;
+    color: var(--text-dim);
+    font-family: var(--font-sans);
+    opacity: 0;
+    transition: opacity 0.2s;
+    cursor: pointer;
+    user-select: none;
+    padding: 2px 8px;
+    background: var(--border);
+    border-radius: 4px;
+  }
+
+  pre:hover .copy-hint { opacity: 1; }
+  pre .copy-hint:hover { color: var(--text-muted); }
+
+  code {
+    font-family: var(--font-mono);
+    font-size: 0.85em;
+    color: var(--accent-teal);
+    background: rgba(45,212,191,0.08);
+    padding: 1px 6px;
+    border-radius: 4px;
+  }
+
+  pre code {
+    background: none;
+    padding: 0;
+    color: inherit;
+    font-size: inherit;
+  }
+
+  /* syntax colors */
+  .t-comment { color: #5c7a9e; }
+  .t-cmd { color: #f8d07e; }
+  .t-flag { color: #93c5fd; }
+  .t-string { color: #86efac; }
+  .t-var { color: #f9a8d4; }
+  .t-key { color: #c4b5fd; }
+  .t-perf { color: #34d399; font-weight: 600; }
+
+  /* ── Cards ───────────────────────────────────────────── */
+  .card {
+    background: var(--bg-card);
+    border: 1px solid var(--border);
+    border-radius: 12px;
+    padding: 22px 24px;
+    margin-bottom: 18px;
+  }
+
+  .card-header {
+    display: flex;
+    align-items: flex-start;
+    gap: 14px;
+    margin-bottom: 14px;
+  }
+
+  .card-header .card-icon {
+    width: 40px;
+    height: 40px;
+    border-radius: 8px;
+    display: flex;
+    align-items: center;
+    justify-content: center;
+    font-size: 1.1rem;
+    flex-shrink: 0;
+  }
+
+  .card-header .card-meta {
+    flex: 1;
+  }
+
+  .card-header .card-title {
+    font-size: 0.95rem;
+    font-weight: 700;
+    color: var(--text);
+    margin-bottom: 4px;
+    display: flex;
+    align-items: center;
+    gap: 10px;
+    flex-wrap: wrap;
+  }
+
+  .card-header .card-desc {
+    font-size: 0.82rem;
+    color: var(--text-muted);
+    line-height: 1.4;
+  }
+
+  .card p { max-width: none; }
+
+  .card-body {
+    font-size: 0.85rem;
+    color: var(--text-muted);
+  }
+
+  .card-body ul {
+    margin: 8px 0 0;
+  }
+
+  /* ── Callout box ─────────────────────────────────────── */
+  .callout {
+    border-radius: 10px;
+    padding: 18px 20px;
+    margin: 20px 0;
+    border-left: 3px solid;
+  }
+
+  .callout-green {
+    background: rgba(52,211,153,0.07);
+    border-color: var(--accent-green);
+  }
+
+  .callout-amber {
+    background: rgba(251,191,36,0.07);
+    border-color: var(--accent-amber);
+  }
+
+  .callout-red {
+    background: rgba(248,113,113,0.07);
+    border-color: var(--accent-red);
+  }
+
+  .callout-blue {
+    background: rgba(79,156,249,0.07);
+    border-color: var(--accent-blue);
+  }
+
+  .callout-title {
+    font-size: 0.78rem;
+    font-weight: 700;
+    letter-spacing: 0.08em;
+    text-transform: uppercase;
+    margin-bottom: 8px;
+  }
+
+  .callout-green .callout-title { color: var(--accent-green); }
+  .callout-amber .callout-title { color: var(--accent-amber); }
+  .callout-red .callout-title { color: var(--accent-red); }
+  .callout-blue .callout-title { color: var(--accent-blue); }
+
+  .callout p { color: var(--text-muted); margin-bottom: 0; max-width: none; }
+
+  /* ── Table ───────────────────────────────────────────── */
+  .table-wrap {
+    overflow-x: auto;
+    margin: 16px 0;
+    border-radius: 10px;
+    border: 1px solid var(--border);
+  }
+
+  table {
+    width: 100%;
+    border-collapse: collapse;
+    font-size: 0.83rem;
+  }
+
+  th {
+    background: var(--bg-elevated);
+    color: var(--text-dim);
+    font-weight: 600;
+    font-size: 0.72rem;
+    letter-spacing: 0.07em;
+    text-transform: uppercase;
+    padding: 10px 14px;
+    text-align: left;
+    border-bottom: 1px solid var(--border);
+    position: sticky;
+    top: 0;
+  }
+
+  td {
+    padding: 10px 14px;
+    color: var(--text-muted);
+    border-bottom: 1px solid var(--border);
+    vertical-align: top;
+  }
+
+  tr:last-child td { border-bottom: none; }
+  tr:hover td { background: rgba(255,255,255,0.02); }
+
+  td code { font-size: 0.8em; }
+
+  /* ── Decision flowchart ──────────────────────────────── */
+  .flow {
+    display: grid;
+    gap: 12px;
+    margin: 20px 0;
+  }
+
+  .flow-row {
+    display: flex;
+    align-items: center;
+    gap: 12px;
+  }
+
+  .flow-q {
+    background: var(--bg-elevated);
+    border: 1px solid var(--border-bright);
+    border-radius: 8px;
+    padding: 10px 16px;
+    font-size: 0.85rem;
+    color: var(--text);
+    font-weight: 500;
+    min-width: 220px;
+  }
+
+  .flow-arrow {
+    color: var(--text-dim);
+    font-size: 1.1rem;
+    flex-shrink: 0;
+  }
+
+  .flow-cmd {
+    font-family: var(--font-mono);
+    font-size: 0.8rem;
+    color: var(--accent-teal);
+    background: rgba(45,212,191,0.08);
+    border: 1px solid rgba(45,212,191,0.2);
+    border-radius: 6px;
+    padding: 8px 14px;
+  }
+
+  /* ── Architecture diagram ────────────────────────────── */
+  .arch-diagram {
+    background: var(--bg-code);
+    border: 1px solid var(--border);
+    border-radius: 12px;
+    padding: 28px;
+    margin: 20px 0;
+    overflow-x: auto;
+  }
+
+  /* ── Collapsible details ─────────────────────────────── */
+  details {
+    border: 1px solid var(--border);
+    border-radius: 10px;
+    margin-bottom: 12px;
+    overflow: hidden;
+  }
+
+  summary {
+    padding: 14px 18px;
+    cursor: pointer;
+    list-style: none;
+    display: flex;
+    align-items: center;
+    gap: 10px;
+    font-size: 0.88rem;
+    font-weight: 600;
+    color: var(--text);
+    background: var(--bg-elevated);
+    user-select: none;
+    transition: background 0.15s;
+  }
+
+  summary:hover { background: var(--bg-card); }
+
+  summary::before {
+    content: '▶';
+    font-size: 0.65rem;
+    color: var(--text-dim);
+    transition: transform 0.2s;
+    flex-shrink: 0;
+  }
+
+  details[open] summary::before { transform: rotate(90deg); }
+  details::-webkit-details-marker { display: none; }
+
+  .details-body {
+    padding: 18px 20px;
+    background: var(--bg);
+  }
+
+  /* ── Phase list ──────────────────────────────────────── */
+  .phases {
+    display: flex;
+    gap: 0;
+    margin: 16px 0;
+    overflow-x: auto;
+  }
+
+  .phase-step {
+    flex: 1;
+    min-width: 120px;
+    padding: 12px 14px;
+    background: var(--bg-elevated);
+    border: 1px solid var(--border);
+    border-right: none;
+    position: relative;
+  }
+
+  .phase-step:first-child { border-radius: 8px 0 0 8px; }
+  .phase-step:last-child { border-radius: 0 8px 8px 0; border-right: 1px solid var(--border); }
+
+  .phase-step .phase-num {
+    font-size: 0.68rem;
+    color: var(--text-dim);
+    font-weight: 700;
+    letter-spacing: 0.08em;
+    text-transform: uppercase;
+    margin-bottom: 4px;
+  }
+
+  .phase-step .phase-name {
+    font-size: 0.85rem;
+    font-weight: 700;
+    color: var(--accent-blue);
+    margin-bottom: 4px;
+  }
+
+  .phase-step .phase-detail {
+    font-size: 0.74rem;
+    color: var(--text-dim);
+    line-height: 1.4;
+  }
+
+  /* ── Env var list ────────────────────────────────────── */
+  .env-grid {
+    display: grid;
+    grid-template-columns: repeat(auto-fill, minmax(280px, 1fr));
+    gap: 14px;
+    margin: 16px 0;
+  }
+
+  .env-card {
+    background: var(--bg-card);
+    border: 1px solid var(--border);
+    border-radius: 8px;
+    padding: 14px 16px;
+  }
+
+  .env-card .env-name {
+    font-family: var(--font-mono);
+    font-size: 0.8rem;
+    color: var(--accent-purple);
+    margin-bottom: 6px;
+    font-weight: 600;
+  }
+
+  .env-card .env-desc {
+    font-size: 0.8rem;
+    color: var(--text-muted);
+    line-height: 1.4;
+  }
+
+  /* ── Hero section ────────────────────────────────────── */
+  .hero {
+    background: linear-gradient(135deg, var(--bg-elevated) 0%, var(--bg-card) 100%);
+    border: 1px solid var(--border);
+    border-radius: 16px;
+    padding: 36px 40px;
+    margin-bottom: 48px;
+    display: flex;
+    gap: 32px;
+    align-items: flex-start;
+  }
+
+  .hero-stat {
+    text-align: center;
+    flex-shrink: 0;
+    min-width: 80px;
+  }
+
+  .hero-stat .stat-val {
+    font-size: 2rem;
+    font-weight: 800;
+    color: var(--accent-blue);
+    line-height: 1;
+  }
+
+  .hero-stat .stat-label {
+    font-size: 0.72rem;
+    color: var(--text-dim);
+    text-transform: uppercase;
+    letter-spacing: 0.08em;
+    margin-top: 4px;
+  }
+
+  .hero-divider {
+    width: 1px;
+    background: var(--border);
+    align-self: stretch;
+  }
+
+  .hero-text h1 {
+    margin-bottom: 10px;
+  }
+
+  /* ── Scrollbar ───────────────────────────────────────── */
+  ::-webkit-scrollbar { width: 6px; height: 6px; }
+  ::-webkit-scrollbar-track { background: var(--bg); }
+  ::-webkit-scrollbar-thumb { background: var(--border-bright); border-radius: 3px; }
+
+  /* ── Responsive ──────────────────────────────────────── */
+  @media (max-width: 800px) {
+    nav { display: none; }
+    main { margin-left: 0; padding: 24px 20px 60px; }
+  }
+
+  /* ── SVG diagram ─────────────────────────────────────── */
+  svg text { font-family: var(--font-sans); }
+</style>
+</head>
+<body>
+
+<!-- ── Sidebar navigation ───────────────────────────────── -->
+<nav id="sidebar">
+  <div class="nav-logo">
+    <div class="logo-title">MAD Automation</div>
+    <div class="logo-sub">Claude Code How-To Guide</div>
+  </div>
+
+  <div class="nav-section">Overview</div>
+  <a href="#what-is-mad">What is MAD</a>
+  <a href="#architecture">Architecture</a>
+
+  <div class="nav-section">Get Started</div>
+  <a href="#quickstart">Quick-start Decision Table</a>
+  <a href="#perf-contract">Performance Contract</a>
+
+  <div class="nav-section">Slash Commands</div>
+  <a href="#cmd-benchmark">/mad-benchmark</a>
+  <a href="#cmd-add-model">/mad-add-model</a>
+  <a href="#cmd-profile">/mad-profile</a>
+  <a href="#cmd-report">/mad-report</a>
+  <a href="#cmd-tune">/mad-tune</a>
+  <a href="#cmd-validate">/mad-validate</a>
+
+  <div class="nav-section">Workflows</div>
+  <a href="#wf-sweep">mad-benchmark-sweep</a>
+  <a href="#wf-tune-search">mad-tune-search</a>
+
+  <div class="nav-section">Reference</div>
+  <a href="#agents">Agent Cards</a>
+  <a href="#env-vars">Env Variables</a>
+  <a href="#tips">Tips &amp; Gotchas</a>
+</nav>
+
+<!-- ── Main content ─────────────────────────────────────── -->
+<main>
+
+<!-- Hero -->
+<div class="hero">
+  <div class="hero-text">
+    <h1>MAD Claude Automation</h1>
+    <p class="page-subtitle" style="border:none;padding:0;margin:0;">
+      A practical guide to benchmarking, adding, tuning, and analyzing AI model workloads
+      on AMD Instinct GPUs — using Claude Code's slash commands, agents, and workflows.
+    </p>
+  </div>
+  <div class="hero-divider"></div>
+  <div class="hero-stat">
+    <div class="stat-val">142</div>
+    <div class="stat-label">models</div>
+  </div>
+  <div class="hero-stat">
+    <div class="stat-val">6</div>
+    <div class="stat-label">commands</div>
+  </div>
+  <div class="hero-stat">
+    <div class="stat-val">4</div>
+    <div class="stat-label">agents</div>
+  </div>
+  <div class="hero-stat">
+    <div class="stat-val">2</div>
+    <div class="stat-label">workflows</div>
+  </div>
+</div>
+
+<!-- ── 1. What is MAD ─────────────────────────────────────── -->
+<section id="what-is-mad">
+  <h2><span class="section-num">1</span> What is MAD</h2>
+
+  <p>
+    <strong>MAD</strong> (Model Automation and Dashboarding) is a curated catalog of 142 AI model
+    workloads designed to run on AMD Instinct GPUs. Each model is one JSON entry in
+    <code>models.json</code>. The <strong>madengine</strong> CLI handles the full lifecycle:
+    building a Docker image, running the model inside a container, parsing a performance number
+    from stdout, and appending a row to <code>perf.csv</code>.
+  </p>
+
+  <p>
+    Claude Code automation sits on top of madengine. Instead of memorizing flags and file
+    conventions, you type a slash command — Claude dispatches the right subagent which constructs
+    and (when a GPU host is available) executes the correct madengine invocation.
+  </p>
+
+  <div class="table-wrap">
+    <table>
+      <thead>
+        <tr>
+          <th>User task</th>
+          <th>Slash command</th>
+          <th>Subagent</th>
+        </tr>
+      </thead>
+      <tbody>
+        <tr>
+          <td>Run a benchmark</td>
+          <td><code>/mad-benchmark</code></td>
+          <td>mad-benchmark-runner</td>
+        </tr>
+        <tr>
+          <td>Add a new model</td>
+          <td><code>/mad-add-model</code></td>
+          <td>mad-model-author</td>
+        </tr>
+        <tr>
+          <td>Profile / trace a model</td>
+          <td><code>/mad-profile</code></td>
+          <td>mad-benchmark-runner</td>
+        </tr>
+        <tr>
+          <td>Analyze results</td>
+          <td><code>/mad-report</code></td>
+          <td>mad-perf-analyst</td>
+        </tr>
+        <tr>
+          <td>Tune for better perf</td>
+          <td><code>/mad-tune</code></td>
+          <td>mad-tuner</td>
+        </tr>
+        <tr>
+          <td>Validate model entries</td>
+          <td><code>/mad-validate</code></td>
+          <td>(inline script)</td>
+        </tr>
+      </tbody>
+    </table>
+  </div>
+
+  <p>
+    Two multi-agent <strong>workflows</strong> handle matrix benchmarking and systematic
+    tuning search: <code>mad-benchmark-sweep</code> and <code>mad-tune-search</code>.
+    These fan out parallel subagent calls and synthesize results into comparison tables.
+  </p>
+</section>
+
+<!-- ── 2. Architecture ────────────────────────────────────── -->
+<section id="architecture">
+  <h2><span class="section-num">2</span> Architecture</h2>
+
+  <p>
+    The system has three layers: the <strong>user interface</strong> (slash commands you type in
+    Claude Code), <strong>subagents</strong> (specialized Claude instances with tool access), and
+    the <strong>madengine CLI</strong> plus the MAD repository files.
+  </p>
+
+  <div class="arch-diagram">
+    <svg viewBox="0 0 840 380" width="100%" xmlns="http://www.w3.org/2000/svg" role="img" aria-label="MAD architecture diagram">
+      <defs>
+        <marker id="arr" markerWidth="8" markerHeight="8" refX="6" refY="3" orient="auto">
+          <path d="M0,0 L0,6 L8,3 z" fill="#4f9cf9" opacity="0.85"/>
+        </marker>
+        <marker id="arrP" markerWidth="8" markerHeight="8" refX="6" refY="3" orient="auto">
+          <path d="M0,0 L0,6 L8,3 z" fill="#a78bfa" opacity="0.85"/>
+        </marker>
+        <marker id="arrG" markerWidth="8" markerHeight="8" refX="6" refY="3" orient="auto">
+          <path d="M0,0 L0,6 L8,3 z" fill="#34d399" opacity="0.8"/>
+        </marker>
+        <marker id="arrGray" markerWidth="8" markerHeight="8" refX="6" refY="3" orient="auto">
+          <path d="M0,0 L0,6 L8,3 z" fill="#94a3b8" opacity="0.6"/>
+        </marker>
+        <marker id="arrR" markerWidth="8" markerHeight="8" refX="6" refY="3" orient="auto">
+          <path d="M0,0 L0,6 L8,3 z" fill="#f87171" opacity="0.7"/>
+        </marker>
+        <marker id="arrBi" markerWidth="8" markerHeight="8" refX="2" refY="3" orient="auto">
+          <path d="M8,0 L8,6 L0,3 z" fill="#94a3b8" opacity="0.6"/>
+        </marker>
+      </defs>
+
+      <!-- Layer labels -->
+      <text x="13" y="60" font-size="10" fill="#64748b" font-weight="700" letter-spacing="1.5" text-anchor="middle" transform="rotate(-90,13,60)">USER</text>
+      <text x="13" y="190" font-size="10" fill="#64748b" font-weight="700" letter-spacing="1.5" text-anchor="middle" transform="rotate(-90,13,190)">AGENTS</text>
+      <text x="13" y="300" font-size="10" fill="#64748b" font-weight="700" letter-spacing="1.5" text-anchor="middle" transform="rotate(-90,13,300)">ENGINE</text>
+
+      <!-- Layer separators -->
+      <line x1="28" y1="120" x2="824" y2="120" stroke="#2a3045" stroke-width="1" stroke-dasharray="4,4"/>
+      <line x1="28" y1="216" x2="824" y2="216" stroke="#2a3045" stroke-width="1" stroke-dasharray="4,4"/>
+
+      <!-- Group labels -->
+      <text x="40" y="18" font-size="9" fill="#5c7a9e" font-weight="700" letter-spacing="1" text-anchor="start">SLASH COMMANDS</text>
+      <text x="658" y="18" font-size="9" fill="#7c6bb0" font-weight="700" letter-spacing="1" text-anchor="start">WORKFLOWS</text>
+
+      <!-- ── Slash commands (top row, 6 boxes) ── -->
+      <!-- /mad-benchmark -->
+      <rect x="40" y="26" width="92" height="40" rx="7" fill="#1e3a5f" stroke="#4f9cf9" stroke-width="1.4" opacity="0.92"/>
+      <text x="86" y="44" font-size="9.5" fill="#93c5fd" text-anchor="middle" font-weight="700">/mad-benchmark</text>
+      <text x="86" y="57" font-size="8" fill="#5c7a9e" text-anchor="middle">build · run</text>
+
+      <!-- /mad-profile -->
+      <rect x="142" y="26" width="92" height="40" rx="7" fill="#1e3a5f" stroke="#4f9cf9" stroke-width="1.4" opacity="0.92"/>
+      <text x="188" y="44" font-size="9.5" fill="#93c5fd" text-anchor="middle" font-weight="700">/mad-profile</text>
+      <text x="188" y="57" font-size="8" fill="#5c7a9e" text-anchor="middle">+ trace tool</text>
+
+      <!-- /mad-add-model -->
+      <rect x="244" y="26" width="92" height="40" rx="7" fill="#1e3a5f" stroke="#4f9cf9" stroke-width="1.4" opacity="0.92"/>
+      <text x="290" y="44" font-size="9.5" fill="#93c5fd" text-anchor="middle" font-weight="700">/mad-add-model</text>
+      <text x="290" y="57" font-size="8" fill="#5c7a9e" text-anchor="middle">scaffold</text>
+
+      <!-- /mad-report -->
+      <rect x="346" y="26" width="92" height="40" rx="7" fill="#1e3a5f" stroke="#4f9cf9" stroke-width="1.4" opacity="0.92"/>
+      <text x="392" y="44" font-size="9.5" fill="#93c5fd" text-anchor="middle" font-weight="700">/mad-report</text>
+      <text x="392" y="57" font-size="8" fill="#5c7a9e" text-anchor="middle">analyze csv</text>
+
+      <!-- /mad-tune -->
+      <rect x="448" y="26" width="92" height="40" rx="7" fill="#1e3a5f" stroke="#4f9cf9" stroke-width="1.4" opacity="0.92"/>
+      <text x="494" y="44" font-size="9.5" fill="#93c5fd" text-anchor="middle" font-weight="700">/mad-tune</text>
+      <text x="494" y="57" font-size="8" fill="#5c7a9e" text-anchor="middle">measure-loop</text>
+
+      <!-- /mad-validate -->
+      <rect x="550" y="26" width="92" height="40" rx="7" fill="#3a2a10" stroke="#fbbf24" stroke-width="1.4" opacity="0.92"/>
+      <text x="596" y="44" font-size="9.5" fill="#fcd34d" text-anchor="middle" font-weight="700">/mad-validate</text>
+      <text x="596" y="57" font-size="8" fill="#8a7340" text-anchor="middle">GPU-free lint</text>
+
+      <!-- ── Workflows (right group, stacked) ── -->
+      <rect x="658" y="26" width="160" height="40" rx="7" fill="#2d1f4a" stroke="#a78bfa" stroke-width="1.4" opacity="0.92"/>
+      <text x="738" y="44" font-size="10" fill="#c4b5fd" text-anchor="middle" font-weight="700">mad-benchmark-sweep</text>
+      <text x="738" y="57" font-size="8" fill="#7c6bb0" text-anchor="middle">matrix → compare</text>
+
+      <rect x="658" y="72" width="160" height="40" rx="7" fill="#2d1f4a" stroke="#a78bfa" stroke-width="1.4" opacity="0.92"/>
+      <text x="738" y="90" font-size="10" fill="#c4b5fd" text-anchor="middle" font-weight="700">mad-tune-search</text>
+      <text x="738" y="103" font-size="8" fill="#7c6bb0" text-anchor="middle">diagnose → tune</text>
+
+      <!-- ── Agents (middle row, 4 agents + inline validator) ── -->
+      <!-- benchmark-runner -->
+      <rect x="40" y="132" width="148" height="52" rx="8" fill="#1a3a2a" stroke="#34d399" stroke-width="1.5"/>
+      <text x="114" y="150" font-size="10.5" fill="#6ee7b7" text-anchor="middle" font-weight="700">benchmark-runner</text>
+      <text x="114" y="164" font-size="8" fill="#34d399" text-anchor="middle">Bash · Read · Grep · Glob</text>
+      <text x="114" y="176" font-size="8" fill="#5c7a9e" text-anchor="middle">run &amp; profile</text>
+
+      <!-- model-author -->
+      <rect x="200" y="132" width="148" height="52" rx="8" fill="#1a3a2a" stroke="#34d399" stroke-width="1.5"/>
+      <text x="274" y="150" font-size="10.5" fill="#6ee7b7" text-anchor="middle" font-weight="700">model-author</text>
+      <text x="274" y="164" font-size="8" fill="#34d399" text-anchor="middle">Read · Write · Edit · Glob</text>
+      <text x="274" y="176" font-size="8" fill="#5c7a9e" text-anchor="middle">scaffold files</text>
+
+      <!-- perf-analyst -->
+      <rect x="360" y="132" width="148" height="52" rx="8" fill="#1a3a2a" stroke="#34d399" stroke-width="1.5"/>
+      <text x="434" y="150" font-size="10.5" fill="#6ee7b7" text-anchor="middle" font-weight="700">perf-analyst</text>
+      <text x="434" y="164" font-size="8" fill="#34d399" text-anchor="middle">Read · Grep · Bash (r/o)</text>
+      <text x="434" y="176" font-size="8" fill="#5c7a9e" text-anchor="middle">parse &amp; compare</text>
+
+      <!-- tuner -->
+      <rect x="520" y="132" width="148" height="52" rx="8" fill="#1a3a2a" stroke="#34d399" stroke-width="1.5"/>
+      <text x="594" y="150" font-size="10.5" fill="#6ee7b7" text-anchor="middle" font-weight="700">tuner</text>
+      <text x="594" y="164" font-size="8" fill="#34d399" text-anchor="middle">Read · Edit · Bash · Grep</text>
+      <text x="594" y="176" font-size="8" fill="#5c7a9e" text-anchor="middle">measure-change</text>
+
+      <!-- inline validator (no agent) -->
+      <rect x="680" y="132" width="136" height="52" rx="8" fill="#241a08" stroke="#fbbf24" stroke-width="1.4" stroke-dasharray="5,3"/>
+      <text x="748" y="150" font-size="10.5" fill="#fcd34d" text-anchor="middle" font-weight="700">inline script</text>
+      <text x="748" y="164" font-size="8" fill="#c79a3a" text-anchor="middle">json · paths · header</text>
+      <text x="748" y="176" font-size="8" fill="#8a7340" text-anchor="middle">no subagent · GPU-free</text>
+
+      <!-- ── Engine layer (bottom) ── -->
+      <!-- madengine -->
+      <rect x="40" y="240" width="210" height="58" rx="8" fill="#161b27" stroke="#3a4460" stroke-width="1.6"/>
+      <text x="145" y="262" font-size="12.5" fill="#e2e8f0" text-anchor="middle" font-weight="800">madengine</text>
+      <text x="145" y="278" font-size="8.5" fill="#64748b" text-anchor="middle">build · run · discover · report · database</text>
+      <text x="145" y="291" font-size="8.5" fill="#5c7a9e" text-anchor="middle">Docker / SLURM / Kubernetes</text>
+
+      <!-- MAD repo files -->
+      <rect x="270" y="240" width="290" height="58" rx="8" fill="#161b27" stroke="#3a4460" stroke-width="1.6"/>
+      <text x="415" y="261" font-size="11.5" fill="#e2e8f0" text-anchor="middle" font-weight="700">MAD repo files</text>
+      <text x="415" y="277" font-size="8.5" fill="#64748b" text-anchor="middle">models.json · docker/*.Dockerfile · scripts/run.sh</text>
+      <text x="415" y="290" font-size="8.5" fill="#5c7a9e" text-anchor="middle">perf.csv · perf_entry_super.json  (performance: N unit)</text>
+
+      <!-- AMD GPU -->
+      <rect x="580" y="240" width="236" height="58" rx="8" fill="#1a1010" stroke="#f87171" stroke-width="1.6"/>
+      <text x="698" y="266" font-size="12" fill="#f87171" text-anchor="middle" font-weight="700">AMD GPU</text>
+      <text x="698" y="283" font-size="8.5" fill="#f87171" text-anchor="middle" opacity="0.7">required for build / run</text>
+
+      <!-- ── Arrows: commands → agents (solid blue) ── -->
+      <line x1="86"  y1="66" x2="108" y2="132" stroke="#4f9cf9" stroke-width="1.4" marker-end="url(#arr)" opacity="0.8"/>
+      <line x1="188" y1="66" x2="124" y2="132" stroke="#4f9cf9" stroke-width="1.4" marker-end="url(#arr)" opacity="0.8"/>
+      <line x1="290" y1="66" x2="278" y2="132" stroke="#4f9cf9" stroke-width="1.4" marker-end="url(#arr)" opacity="0.8"/>
+      <line x1="392" y1="66" x2="430" y2="132" stroke="#4f9cf9" stroke-width="1.4" marker-end="url(#arr)" opacity="0.8"/>
+      <line x1="494" y1="66" x2="586" y2="132" stroke="#4f9cf9" stroke-width="1.4" marker-end="url(#arr)" opacity="0.8"/>
+      <!-- /mad-validate → inline script (amber) -->
+      <line x1="596" y1="66" x2="744" y2="132" stroke="#fbbf24" stroke-width="1.4" marker-end="url(#arr)" opacity="0.7"/>
+
+      <!-- ── Workflows → agents (dashed purple fan-out) ── -->
+      <path d="M662,56 C420,96 240,100 150,131" fill="none" stroke="#a78bfa" stroke-width="1.4" marker-end="url(#arrP)" opacity="0.6" stroke-dasharray="5,3"/>
+      <line x1="700" y1="112" x2="616" y2="132" stroke="#a78bfa" stroke-width="1.4" marker-end="url(#arrP)" opacity="0.7" stroke-dasharray="5,3"/>
+      <text x="436" y="100" font-size="7.5" fill="#7c6bb0" text-anchor="middle" font-style="italic">fan-out</text>
+
+      <!-- ── Arrows: agents → engine ── -->
+      <!-- benchmark-runner → madengine -->
+      <line x1="114" y1="184" x2="132" y2="240" stroke="#34d399" stroke-width="1.4" marker-end="url(#arrG)" opacity="0.7"/>
+      <!-- model-author → repo files (writes) -->
+      <line x1="274" y1="184" x2="360" y2="240" stroke="#34d399" stroke-width="1.4" marker-end="url(#arrG)" opacity="0.7"/>
+      <!-- perf-analyst → repo files (reads perf.csv) -->
+      <line x1="434" y1="184" x2="430" y2="240" stroke="#34d399" stroke-width="1.4" marker-end="url(#arrG)" opacity="0.7"/>
+      <!-- tuner → repo files (edits configs) -->
+      <line x1="594" y1="184" x2="490" y2="240" stroke="#34d399" stroke-width="1.4" marker-end="url(#arrG)" opacity="0.7"/>
+      <!-- tuner → madengine (measures), faint long -->
+      <path d="M560,184 C400,212 230,214 170,239" fill="none" stroke="#34d399" stroke-width="1.2" marker-end="url(#arrG)" opacity="0.35" stroke-dasharray="4,3"/>
+      <!-- inline validator → repo files (read-only checks), faint -->
+      <line x1="748" y1="184" x2="540" y2="244" stroke="#fbbf24" stroke-width="1.2" marker-end="url(#arrG)" opacity="0.4" stroke-dasharray="4,3"/>
+
+      <!-- ── madengine ↔ repo files (reads models.json, writes perf.csv) ── -->
+      <line x1="250" y1="262" x2="270" y2="262" stroke="#94a3b8" stroke-width="1.3" marker-end="url(#arrGray)" marker-start="url(#arrBi)" opacity="0.6"/>
+      <!-- madengine → AMD GPU (build/run) -->
+      <line x1="560" y1="276" x2="580" y2="270" stroke="#f87171" stroke-width="1.4" marker-end="url(#arrR)" opacity="0.6"/>
+      <line x1="250" y1="284" x2="580" y2="284" stroke="#f87171" stroke-width="1.3" marker-end="url(#arrR)" opacity="0.45" stroke-dasharray="4,3"/>
+    </svg>
+  </div>
+
+  <p style="display:flex;flex-wrap:wrap;gap:18px;align-items:center;font-size:0.78rem;color:var(--text-dim);margin:4px 0 18px;">
+    <span><span style="display:inline-block;width:22px;height:0;border-top:2px solid #4f9cf9;vertical-align:middle;margin-right:6px;"></span>command → agent</span>
+    <span><span style="display:inline-block;width:22px;height:0;border-top:2px dashed #a78bfa;vertical-align:middle;margin-right:6px;"></span>workflow fan-out</span>
+    <span><span style="display:inline-block;width:22px;height:0;border-top:2px solid #34d399;vertical-align:middle;margin-right:6px;"></span>agent → engine / files</span>
+    <span><span style="display:inline-block;width:22px;height:0;border-top:2px solid #fbbf24;vertical-align:middle;margin-right:6px;"></span>GPU-free path</span>
+    <span><span style="display:inline-block;width:22px;height:0;border-top:2px solid #f87171;vertical-align:middle;margin-right:6px;"></span>needs AMD GPU</span>
+  </p>
+
+  <p>
+    Five of the six slash commands dispatch a subagent; <code>/mad-validate</code> runs an
+    <strong>inline script</strong> (no subagent, no GPU). The two workflows fan out to
+    <em>multiple</em> subagent calls in parallel — <code>mad-benchmark-sweep</code> drives the
+    benchmark-runner across a tag matrix, <code>mad-tune-search</code> drives the tuner through a
+    diagnose-then-measure search — then synthesize the results. Agents that need to run madengine
+    always check for AMD GPUs first; if none are present they produce a plan (the exact commands
+    to run on a GPU host) instead of executing.
+  </p>
+</section>
+
+<!-- ── 3. Quick-start ─────────────────────────────────────── -->
+<section id="quickstart">
+  <h2><span class="section-num">3</span> Quick-Start: What Do You Want to Do?</h2>
+
+  <div class="flow">
+    <div class="flow-row">
+      <div class="flow-q">Run one benchmark</div>
+      <div class="flow-arrow">→</div>
+      <div class="flow-cmd">/mad-benchmark pyt_vllm_llama-3.1-8b</div>
+    </div>
+    <div class="flow-row">
+      <div class="flow-q">Benchmark multiple tags at once</div>
+      <div class="flow-arrow">→</div>
+      <div class="flow-cmd">/mad-benchmark-sweep --tags bert,deepseek</div>
+    </div>
+    <div class="flow-row">
+      <div class="flow-q">Add a brand-new model</div>
+      <div class="flow-arrow">→</div>
+      <div class="flow-cmd">/mad-add-model pyt_vllm_gemma-3-27b https://github.com/...</div>
+    </div>
+    <div class="flow-row">
+      <div class="flow-q">Attach a GPU profiler</div>
+      <div class="flow-arrow">→</div>
+      <div class="flow-cmd">/mad-profile bert rocprofv3_compute</div>
+    </div>
+    <div class="flow-row">
+      <div class="flow-q">Read or compare benchmark results</div>
+      <div class="flow-arrow">→</div>
+      <div class="flow-cmd">/mad-report perf.csv perf_old.csv</div>
+    </div>
+    <div class="flow-row">
+      <div class="flow-q">Improve throughput of a model</div>
+      <div class="flow-arrow">→</div>
+      <div class="flow-cmd">/mad-tune bert throughput</div>
+    </div>
+    <div class="flow-row">
+      <div class="flow-q">Validate files before a GPU run</div>
+      <div class="flow-arrow">→</div>
+      <div class="flow-cmd">/mad-validate bert</div>
+    </div>
+    <div class="flow-row">
+      <div class="flow-q">Systematic tuning search (N candidates)</div>
+      <div class="flow-arrow">→</div>
+      <div class="flow-cmd">/mad-tune-search --tag bert --candidates 6</div>
+    </div>
+  </div>
+
+  <div class="callout callout-amber">
+    <div class="callout-title">GPU host check</div>
+    <p>
+      <strong>madengine build and run require AMD GPUs.</strong> On a machine without GPUs,
+      agents automatically switch to <em>plan mode</em> — they validate the command and print
+      the exact invocation for you to run on a GPU host. GPU-free commands (discover, validate)
+      always work.
+    </p>
+  </div>
+</section>
+
+<!-- ── Performance contract ──────────────────────────────── -->
+<section id="perf-contract">
+  <h2><span class="section-num">4</span> The Performance Contract</h2>
+
+  <p>
+    This is the single most important MAD convention. Every model's <code>run.sh</code> must
+    emit <strong>exactly one line</strong> to stdout that madengine can grep. Without it,
+    madengine records the run with no performance value.
+  </p>
+
+  <div class="callout callout-green">
+    <div class="callout-title">Required output — single result</div>
+    <pre style="background:transparent;border:none;padding:0;margin:8px 0 0;"><code><span class="t-perf">performance: &lt;value&gt; &lt;unit&gt;</span></code></pre>
+    <p style="margin-top:8px;">
+      For example: <code>performance: 3847.2 tokens_per_second</code> or
+      <code>performance: 124 samples_per_second</code>.
+    </p>
+  </div>
+
+  <h3>Implementing the line in run.sh</h3>
+
+  <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-comment"># Parse from the workload's own log output, then emit the contract line</span>
+<span class="t-var">performance</span>=$(grep -Eo <span class="t-string">"train_samples_per_second':[^,]+"</span> log.txt \
+  | sed <span class="t-string">"s/.*: //g"</span> | head -n 1)
+echo <span class="t-string">"performance: $performance samples_per_second"</span></pre>
+
+  <h3>Alternative: multiple results</h3>
+
+  <p>
+    When a script produces many rows (e.g. one per concurrency level), have the script write
+    its own CSV and declare <code>"multiple_results"</code> in <code>models.json</code>:
+  </p>
+
+  <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-comment"># models.json entry</span>
+{
+  <span class="t-key">"name"</span>: <span class="t-string">"pyt_vllm_deepseek-r1"</span>,
+  ...
+  <span class="t-key">"multiple_results"</span>: <span class="t-string">"perf_DeepSeek-R1.csv"</span>
+}</pre>
+
+  <div class="callout callout-red">
+    <div class="callout-title">Hard rule</div>
+    <p>
+      A run script must satisfy <strong>exactly one</strong> of the two contracts above.
+      Satisfying neither means madengine records a run with no performance value.
+      The <code>/mad-validate</code> command checks this statically before any GPU run.
+    </p>
+  </div>
+</section>
+
+<!-- ── Slash commands ─────────────────────────────────────── -->
+<section id="cmd-benchmark">
+  <h2><span class="section-num">5</span> Slash Command Reference</h2>
+
+  <div class="card">
+    <div class="card-header">
+      <div class="card-icon" style="background:#1e3a5f;">
+        <svg width="22" height="22" viewBox="0 0 24 24" fill="none" stroke="#93c5fd" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><polyline points="4 17 10 11 4 5"/><line x1="12" y1="19" x2="20" y2="19"/></svg>
+      </div>
+      <div class="card-meta">
+        <div class="card-title">
+          /mad-benchmark
+          <span class="badge badge-command">command</span>
+          <span class="badge badge-agent">benchmark-runner</span>
+        </div>
+        <div class="card-desc">Build and run a madengine benchmark for the given tag or model name.</div>
+      </div>
+    </div>
+    <div class="card-body">
+      <p><strong>Usage</strong></p>
+      <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-cmd">/mad-benchmark</span> <span class="t-string">&lt;tag-or-model&gt;</span> [extra options]</pre>
+
+      <p><strong>Examples</strong></p>
+      <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-cmd">/mad-benchmark</span> <span class="t-string">bert</span>
+<span class="t-cmd">/mad-benchmark</span> <span class="t-string">pyt_vllm_llama-3.1-8b</span> <span class="t-flag">--timeout 3600</span>
+<span class="t-cmd">/mad-benchmark</span> <span class="t-string">deepseek</span> <span class="t-flag">--output results/deepseek.csv</span></pre>
+
+      <p><strong>What the agent does</strong></p>
+      <ol>
+        <li>Runs <code>madengine discover --tags &lt;tag&gt;</code> to confirm the tag resolves (GPU-free).</li>
+        <li>Assembles: <code>madengine run --tags &lt;tag&gt; --live-output [-o path] [--timeout S]</code>.</li>
+        <li>Checks for AMD GPUs via <code>rocm-smi</code> / <code>amd-smi</code>.</li>
+        <li>On GPU host: runs the command and reports where <code>perf.csv</code> landed.</li>
+        <li>No GPU: prints the exact command to run on a GPU host and stops.</li>
+      </ol>
+
+      <p><strong>Expected output</strong></p>
+      <ul>
+        <li>A <code>perf.csv</code> row with the performance value and unit.</li>
+        <li>Live stdout from the model during the run (via <code>--live-output</code>).</li>
+        <li>Summary of resolved model(s), command used, and output file path.</li>
+      </ul>
+    </div>
+  </div>
+</section>
+
+<section id="cmd-add-model">
+  <div class="card">
+    <div class="card-header">
+      <div class="card-icon" style="background:#1a3a2a;">
+        <svg width="22" height="22" viewBox="0 0 24 24" fill="none" stroke="#6ee7b7" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><path d="M14 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V8z"/><polyline points="14 2 14 8 20 8"/><line x1="12" y1="18" x2="12" y2="12"/><line x1="9" y1="15" x2="15" y2="15"/></svg>
+      </div>
+      <div class="card-meta">
+        <div class="card-title">
+          /mad-add-model
+          <span class="badge badge-command">command</span>
+          <span class="badge badge-agent">model-author</span>
+        </div>
+        <div class="card-desc">Scaffold a new MAD model — models.json entry, Dockerfile, and run.sh with the performance line.</div>
+      </div>
+    </div>
+    <div class="card-body">
+      <p><strong>Usage</strong></p>
+      <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-cmd">/mad-add-model</span> <span class="t-string">&lt;framework_project_workload&gt;</span> [base notes / repo url]</pre>
+
+      <p><strong>Examples</strong></p>
+      <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-cmd">/mad-add-model</span> <span class="t-string">pyt_vllm_gemma-3-27b</span> <span class="t-string">https://github.com/google/gemma</span>
+<span class="t-cmd">/mad-add-model</span> <span class="t-string">jax_maxtext_llama-4</span> <span class="t-string">bf16 inference, context=8192</span></pre>
+
+      <p><strong>What the agent does</strong></p>
+      <ol>
+        <li>Identifies the closest existing model of the same framework to use as template.</li>
+        <li>Creates three files:
+          <ul>
+            <li><code>models.json</code> entry (required fields: name, url, dockerfile, scripts, n_gpus, owner, training_precision, tags)</li>
+            <li><code>docker/&lt;name&gt;.ubuntu.amd.Dockerfile</code> with the required context header</li>
+            <li><code>scripts/&lt;dir&gt;/run.sh</code> ending in <code>echo "performance: $performance &lt;unit&gt;"</code></li>
+          </ul>
+        </li>
+        <li>Validates <code>models.json</code> with <code>python3 -m json.tool models.json</code>.</li>
+        <li>Confirms the entry is selectable: <code>madengine discover --tags &lt;name&gt;</code>.</li>
+      </ol>
+
+      <p><strong>Dockerfile required header (first line)</strong></p>
+      <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-comment"># CONTEXT {'gpu_vendor': 'AMD', 'guest_os': 'UBUNTU'}</span>
+ARG BASE_DOCKER=rocm/pytorch
+FROM $BASE_DOCKER</pre>
+
+      <p><strong>Expected output</strong></p>
+      <ul>
+        <li>Three file paths created/edited.</li>
+        <li>The template model that was used.</li>
+        <li>The verification command: <code>madengine run --tags &lt;name&gt; --live-output</code> (requires GPU).</li>
+      </ul>
+    </div>
+  </div>
+</section>
+
+<section id="cmd-profile">
+  <div class="card">
+    <div class="card-header">
+      <div class="card-icon" style="background:#2d1a1a;">
+        <svg width="22" height="22" viewBox="0 0 24 24" fill="none" stroke="#f87171" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><circle cx="12" cy="12" r="10"/><polyline points="12 6 12 12 16 14"/></svg>
+      </div>
+      <div class="card-meta">
+        <div class="card-title">
+          /mad-profile
+          <span class="badge badge-command">command</span>
+          <span class="badge badge-agent">benchmark-runner</span>
+        </div>
+        <div class="card-desc">Run a madengine benchmark with a GPU profiling or tracing tool attached.</div>
+      </div>
+    </div>
+    <div class="card-body">
+      <p><strong>Usage</strong></p>
+      <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-cmd">/mad-profile</span> <span class="t-string">&lt;tag-or-model&gt;</span> [tool: rocprofv3_compute|rpd|rccl_trace|...]</pre>
+
+      <p><strong>Examples</strong></p>
+      <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-cmd">/mad-profile</span> <span class="t-string">bert</span>
+<span class="t-cmd">/mad-profile</span> <span class="t-string">pyt_vllm_llama-3.1-8b</span> <span class="t-string">rccl_trace</span>
+<span class="t-cmd">/mad-profile</span> <span class="t-string">deepseek</span> <span class="t-string">rocprofv3_memory</span></pre>
+
+      <p><strong>What the agent does</strong></p>
+      <ol>
+        <li>Picks the profiling tool (defaults to <code>rocprofv3_compute</code> if unspecified).</li>
+        <li>Builds the command with <code>--additional-context</code>:</li>
+      </ol>
+      <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-cmd">madengine run</span> <span class="t-flag">--tags</span> <span class="t-string">&lt;tag&gt;</span> <span class="t-flag">--live-output</span> \
+  <span class="t-flag">--additional-context</span> <span class="t-string">'{"tools": [{"name": "rocprofv3_compute"}]}'</span></pre>
+
+      <p><strong>Available profiling tools (common subset)</strong></p>
+      <div class="table-wrap">
+        <table>
+          <thead><tr><th>Tool name</th><th>What it captures</th></tr></thead>
+          <tbody>
+            <tr><td><code>rocprofv3_compute</code></td><td>Kernel execution, occupancy, compute metrics</td></tr>
+            <tr><td><code>rocprofv3_memory</code></td><td>Memory bandwidth, cache hit rates</td></tr>
+            <tr><td><code>rocprofv3_communication</code></td><td>GPU-to-GPU communication metrics</td></tr>
+            <tr><td><code>rpd</code></td><td>ROCm profiling data, timeline view</td></tr>
+            <tr><td><code>rccl_trace</code></td><td>Collective communication (AllReduce, etc.)</td></tr>
+            <tr><td><code>rocm_trace_lite</code></td><td>Lightweight system trace</td></tr>
+            <tr><td><code>gpu_info_power_profiler</code></td><td>Power draw over time</td></tr>
+            <tr><td><code>rocblas_trace</code> / <code>hipblaslt_trace</code></td><td>BLAS kernel calls</td></tr>
+          </tbody>
+        </table>
+      </div>
+      <p>The complete authoritative list (23+ tools) is in the madengine package at <code>scripts/common/tools.json</code>.</p>
+
+      <div class="callout callout-amber">
+        <div class="callout-title">Performance note</div>
+        <p>Profiling adds overhead. The perf number recorded under profiling is not a clean benchmark — use <code>/mad-benchmark</code> for baseline measurements.</p>
+      </div>
+    </div>
+  </div>
+</section>
+
+<section id="cmd-report">
+  <div class="card">
+    <div class="card-header">
+      <div class="card-icon" style="background:#1f2a3a;">
+        <svg width="22" height="22" viewBox="0 0 24 24" fill="none" stroke="#4f9cf9" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><line x1="18" y1="20" x2="18" y2="10"/><line x1="12" y1="20" x2="12" y2="4"/><line x1="6" y1="20" x2="6" y2="14"/></svg>
+      </div>
+      <div class="card-meta">
+        <div class="card-title">
+          /mad-report
+          <span class="badge badge-command">command</span>
+          <span class="badge badge-agent">perf-analyst</span>
+        </div>
+        <div class="card-desc">Parse and summarize MAD benchmark results from perf.csv or perf_entry_super.json. Compare two result sets to find regressions.</div>
+      </div>
+    </div>
+    <div class="card-body">
+      <p><strong>Usage</strong></p>
+      <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-cmd">/mad-report</span> [csv-or-json-path] [compare-to-path]</pre>
+
+      <p><strong>Examples</strong></p>
+      <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-cmd">/mad-report</span>                               <span class="t-comment"># defaults to perf.csv</span>
+<span class="t-cmd">/mad-report</span> <span class="t-string">results/after.csv</span>
+<span class="t-cmd">/mad-report</span> <span class="t-string">perf_new.csv</span> <span class="t-string">perf_baseline.csv</span>  <span class="t-comment"># comparison mode</span></pre>
+
+      <p><strong>What the agent does</strong></p>
+      <ul>
+        <li>Locates the result file(s) — defaults to <code>perf.csv</code>.</li>
+        <li>Single file: summarizes models run, performance + unit, pass/fail status.</li>
+        <li>Two files: per-model delta and % change; flags regressions (slower) vs improvements; respects each row's <code>metric</code> unit.</li>
+        <li>Leads with the headline finding, then a compact table.</li>
+      </ul>
+
+      <p><strong>Generate an HTML dashboard instead</strong></p>
+      <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-cmd">madengine report to-html</span> <span class="t-flag">--csv-file-path</span> perf.csv</pre>
+
+      <p><strong>perf.csv columns</strong></p>
+      <div class="table-wrap">
+        <table>
+          <thead><tr><th>Column</th><th>Meaning</th></tr></thead>
+          <tbody>
+            <tr><td><code>model</code></td><td>Model name from models.json</td></tr>
+            <tr><td><code>performance</code></td><td>Numeric result parsed from stdout</td></tr>
+            <tr><td><code>metric</code></td><td>Unit (tokens_per_second, samples_per_second, …)</td></tr>
+            <tr><td><code>status</code></td><td>SUCCESS or failure reason</td></tr>
+            <tr><td><code>n_gpus</code></td><td>Number of GPUs used</td></tr>
+            <tr><td><code>training_precision</code></td><td>fp16, bf16, fp8, fp32, …</td></tr>
+            <tr><td><code>gpu_architecture</code></td><td>gfx942, gfx950, …</td></tr>
+            <tr><td><code>git_commit</code></td><td>Commit hash at run time</td></tr>
+            <tr><td><code>build_duration</code> / <code>test_duration</code></td><td>Wall time in seconds</td></tr>
+          </tbody>
+        </table>
+      </div>
+    </div>
+  </div>
+</section>
+
+<section id="cmd-tune">
+  <div class="card">
+    <div class="card-header">
+      <div class="card-icon" style="background:#2a1f1a;">
+        <svg width="22" height="22" viewBox="0 0 24 24" fill="none" stroke="#fbbf24" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><circle cx="12" cy="12" r="3"/><path d="M19.07 4.93a10 10 0 0 1 0 14.14M4.93 4.93a10 10 0 0 0 0 14.14"/></svg>
+      </div>
+      <div class="card-meta">
+        <div class="card-title">
+          /mad-tune
+          <span class="badge badge-command">command</span>
+          <span class="badge badge-agent">tuner</span>
+        </div>
+        <div class="card-desc">Iteratively tune a MAD model for better performance using a disciplined measure-change-measure loop.</div>
+      </div>
+    </div>
+    <div class="card-body">
+      <p><strong>Usage</strong></p>
+      <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-cmd">/mad-tune</span> <span class="t-string">&lt;tag-or-model&gt;</span> [target: throughput|latency] [lever hints]</pre>
+
+      <p><strong>Examples</strong></p>
+      <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-cmd">/mad-tune</span> <span class="t-string">bert</span>
+<span class="t-cmd">/mad-tune</span> <span class="t-string">pyt_vllm_llama-3.1-8b</span> <span class="t-string">throughput</span>
+<span class="t-cmd">/mad-tune</span> <span class="t-string">deepseek</span> <span class="t-string">latency</span> <span class="t-string">try tensor-parallel=4</span></pre>
+
+      <p><strong>What the agent does</strong></p>
+      <ol>
+        <li>Reads the model's <code>run.sh</code>, config files, and current <code>perf.csv</code> row to establish a baseline.</li>
+        <li>Identifies tuning levers for the stack:
+          <ul>
+            <li>Env vars: <code>MAD_MODEL_BATCH_SIZE</code>, <code>HIP_VISIBLE_DEVICES</code>, <code>NCCL_*</code>/<code>RCCL_*</code>, <code>PYTORCH_TUNABLEOP_ENABLED</code></li>
+            <li>Script/config args: tensor-parallel size, precision (fp16/bf16/fp8), sequence length, <code>gpu-memory-utilization</code>, <code>max-num-seqs</code></li>
+          </ul>
+        </li>
+        <li>Proposes one change at a time, applies it, re-measures, keeps improvements, reverts regressions.</li>
+        <li>No GPU: produces a ranked list of candidate changes with rationale and the exact madengine run commands to test each.</li>
+      </ol>
+
+      <p><strong>Expected output</strong></p>
+      <ul>
+        <li>Baseline performance value + unit.</li>
+        <li>Table of each change tried with measured delta.</li>
+        <li>Final recommended configuration with before/after numbers.</li>
+      </ul>
+    </div>
+  </div>
+</section>
+
+<section id="cmd-validate">
+  <div class="card">
+    <div class="card-header">
+      <div class="card-icon" style="background:#1f2a1f;">
+        <svg width="22" height="22" viewBox="0 0 24 24" fill="none" stroke="#34d399" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><polyline points="9 11 12 14 22 4"/><path d="M21 12v7a2 2 0 0 1-2 2H5a2 2 0 0 1-2-2V5a2 2 0 0 1 2-2h11"/></svg>
+      </div>
+      <div class="card-meta">
+        <div class="card-title">
+          /mad-validate
+          <span class="badge badge-command">command</span>
+          <span class="badge badge-warn">GPU-free</span>
+        </div>
+        <div class="card-desc">Statically validate MAD model entries — JSON, paths, Dockerfile header, output contract. No GPU required.</div>
+      </div>
+    </div>
+    <div class="card-body">
+      <p><strong>Usage</strong></p>
+      <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-cmd">/mad-validate</span> [tag-or-model | all]</pre>
+
+      <p><strong>Examples</strong></p>
+      <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-cmd">/mad-validate</span>                    <span class="t-comment"># checks all 142 entries</span>
+<span class="t-cmd">/mad-validate</span> <span class="t-string">bert</span>               <span class="t-comment"># checks only models matching "bert"</span>
+<span class="t-cmd">/mad-validate</span> <span class="t-string">pyt_vllm_deepseek-r1</span></pre>
+
+      <p><strong>Checks performed</strong></p>
+
+      <p style="font-size:0.82rem;font-weight:700;color:var(--accent-red);margin:12px 0 6px;">Errors (break a run)</p>
+      <ul>
+        <li><code>models.json</code> parses as valid JSON.</li>
+        <li>All entries have a <code>name</code> field and names are unique.</li>
+        <li>Dockerfile exists at <code>&lt;dockerfile&gt;.ubuntu.amd.Dockerfile</code> and first line is the context header.</li>
+        <li>Scripts path exists (a <code>run.sh</code> or directory).</li>
+        <li>Output contract satisfied: run script contains a <code>performance:</code> line, or entry sets <code>multiple_results</code>.</li>
+      </ul>
+
+      <p style="font-size:0.82rem;font-weight:700;color:var(--accent-amber);margin:12px 0 6px;">Warnings (convention metadata, non-fatal)</p>
+      <ul>
+        <li>Missing any of: <code>url</code>, <code>owner</code>, <code>training_precision</code>, <code>tags</code>, <code>n_gpus</code>.</li>
+      </ul>
+
+      <div class="callout callout-blue">
+        <div class="callout-title">Workflow tip</div>
+        <p>Run <code>/mad-validate &lt;name&gt;</code> immediately after <code>/mad-add-model</code> and before any GPU run. It catches tag typos, missing files, and missing performance lines in seconds.</p>
+      </div>
+    </div>
+  </div>
+</section>
+
+<!-- ── Workflows ──────────────────────────────────────────── -->
+<section id="wf-sweep">
+  <h2><span class="section-num">6</span> Workflow Reference</h2>
+
+  <div class="card">
+    <div class="card-header">
+      <div class="card-icon" style="background:#2d1f4a;">
+        <svg width="22" height="22" viewBox="0 0 24 24" fill="none" stroke="#c4b5fd" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><rect x="3" y="3" width="7" height="7"/><rect x="14" y="3" width="7" height="7"/><rect x="14" y="14" width="7" height="7"/><rect x="3" y="14" width="7" height="7"/></svg>
+      </div>
+      <div class="card-meta">
+        <div class="card-title">
+          mad-benchmark-sweep
+          <span class="badge badge-workflow">workflow</span>
+        </div>
+        <div class="card-desc">Fan out a benchmark matrix across multiple MAD tags, run (or plan) each in parallel, and synthesize a comparison table. Accepts the same CLI-style flags as <code>/mad-benchmark</code>.</div>
+      </div>
+    </div>
+    <div class="card-body">
+      <p><strong>When to use</strong></p>
+      <p>When you want to compare performance across multiple model tags or GPU counts in one operation.</p>
+
+      <p><strong>Phases</strong></p>
+      <div class="phases">
+        <div class="phase-step">
+          <div class="phase-num">Phase 1</div>
+          <div class="phase-name">Resolve</div>
+          <div class="phase-detail">Expand the sweep matrix into concrete madengine invocations</div>
+        </div>
+        <div class="phase-step">
+          <div class="phase-num">Phase 2</div>
+          <div class="phase-name">Run</div>
+          <div class="phase-detail">Execute or plan each benchmark in parallel subagents</div>
+        </div>
+        <div class="phase-step">
+          <div class="phase-num">Phase 3</div>
+          <div class="phase-name">Synthesize</div>
+          <div class="phase-detail">Parse results into a comparison table with headline finding</div>
+        </div>
+      </div>
+
+      <p><strong>Arguments</strong></p>
+      <p>
+        Accepts <strong>CLI-style flags</strong> — the same syntax as <code>/mad-benchmark</code>.
+        Pass one or more tags after <code>--tags</code> (comma- and/or space-separated); each tag
+        becomes one sweep cell. A structured JSON object is also still accepted.
+      </p>
+      <div class="table-wrap">
+        <table>
+          <thead><tr><th>Flag</th><th>Meaning</th></tr></thead>
+          <tbody>
+            <tr><td><code>--tags a,b c</code></td><td>One or more model tags (comma and/or space separated). Required.</td></tr>
+            <tr><td><code>--additional-context '{...}'</code></td><td>Threaded <em>verbatim</em> into every cell (HF token, <code>docker_env_vars</code>, deployment keys). A per-cell <code>n_gpus</code> is merged in when <code>--ngpus</code> is set.</td></tr>
+            <tr><td><code>--ngpus 4,8</code></td><td>Optional GPU-count axis — multiplies the matrix (each tag × each count).</td></tr>
+            <tr><td><code>--timeout S</code></td><td>Optional per-cell timeout in seconds.</td></tr>
+            <tr><td><code>--no-execute</code> / <code>--plan</code></td><td>Validate + resolve tags only, no GPU. <strong>Executes by default.</strong></td></tr>
+          </tbody>
+        </table>
+      </div>
+
+      <p><strong>Example invocations</strong></p>
+      <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-comment"># Execute mode (default) — AMD GPU required, context threaded into every cell</span>
+<span class="t-cmd">/mad-benchmark-sweep</span> <span class="t-flag">--tags</span> <span class="t-string">bert,pyt_vllm_qwen3-8b</span> <span class="t-flag">--additional-context</span> <span class="t-string">'{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'</span> <span class="t-flag">--live-output</span>
+
+<span class="t-comment"># Plan mode — safe, no GPU required</span>
+<span class="t-cmd">/mad-benchmark-sweep</span> <span class="t-flag">--tags</span> <span class="t-string">bert pyt_vllm_deepseek-r1</span> <span class="t-flag">--plan</span>
+
+<span class="t-comment"># GPU-count axis: each tag × each count</span>
+<span class="t-cmd">/mad-benchmark-sweep</span> <span class="t-flag">--tags</span> <span class="t-string">bert</span> <span class="t-flag">--ngpus</span> <span class="t-string">4,8</span></pre>
+
+      <div class="callout callout-amber">
+        <div class="callout-title">Executes by default</div>
+        <p>
+          On a GPU host this workflow <strong>runs immediately</strong>. Pass <code>--plan</code>
+          (or <code>--no-execute</code>) to only validate commands and resolve tags. Each cell
+          still self-skips if no AMD GPU is present.
+        </p>
+      </div>
+
+      <div class="callout callout-blue">
+        <div class="callout-title">perf.csv isolation</div>
+        <p>
+          Each sweep cell writes to its own <code>perf_&lt;tag&gt;.csv</code> — never a shared
+          <code>perf.csv</code>. This prevents parallel runs from clobbering each other.
+          Concatenate the files afterward for a combined report.
+        </p>
+      </div>
+
+      <div class="callout callout-amber">
+        <div class="callout-title">Precision is not a sweep axis</div>
+        <p>
+          In MAD, precision (fp16/bf16/fp8) is <strong>fixed per model</strong> in
+          <code>models.json</code> — it is not a runtime flag. To compare precisions,
+          list the distinct model tags (e.g. <code>pyt_vllm_llama-fp16</code> and
+          <code>pyt_vllm_llama-fp8</code>) in the <code>tags</code> array instead.
+        </p>
+      </div>
+    </div>
+  </div>
+</section>
+
+<section id="wf-tune-search">
+  <div class="card">
+    <div class="card-header">
+      <div class="card-icon" style="background:#2d1f4a;">
+        <svg width="22" height="22" viewBox="0 0 24 24" fill="none" stroke="#c4b5fd" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><circle cx="11" cy="11" r="8"/><line x1="21" y1="21" x2="16.65" y2="16.65"/><line x1="11" y1="8" x2="11" y2="14"/><line x1="8" y1="11" x2="14" y2="11"/></svg>
+      </div>
+      <div class="card-meta">
+        <div class="card-title">
+          mad-tune-search
+          <span class="badge badge-workflow">workflow</span>
+        </div>
+        <div class="card-desc"><strong>Profiles the model to diagnose its bottleneck</strong>, then generates profiling-informed tuning candidates, measures each <strong>cleanly and sequentially</strong>, adversarially verifies deltas, and synthesizes the winning config. Accepts the same CLI-style flags as <code>/mad-benchmark</code> / <code>/mad-profile</code>.</div>
+      </div>
+    </div>
+    <div class="card-body">
+      <p><strong>When to use</strong></p>
+      <p>When you want tuning driven by real profiling evidence — not guesswork — and have the workflow verify which improvements are real. The Diagnose phase classifies the workload (compute / memory / communication / launch-bound) and the proposals target that specific bottleneck.</p>
+
+      <div class="callout callout-blue">
+        <div class="callout-title">Core design — profile to diagnose, clean-run to measure</div>
+        <p>
+          Profiling plays <strong>two distinct roles</strong>. The profiler runs <strong>once</strong>
+          (Diagnose) to find <em>what</em> to tune — kernel hotspots and whether the model is compute-,
+          memory-, communication-, or launch-bound. Then it is turned <strong>off</strong> for every
+          measurement run, because tracing overhead (e.g. <code>rocm_trace_lite</code> intercepts
+          nearly all GPU ops) buries the small perf deltas tuning produces and would make the verify
+          step reject real wins. The workflow automatically splits a single
+          <code>--additional-context</code> into a <em>profiled</em> variant (keeps <code>tools</code>)
+          for Diagnose and a <em>clean</em> variant (<code>tools</code> stripped) for the baseline and
+          all candidate measurements.
+        </p>
+      </div>
+
+      <p><strong>Phases</strong></p>
+      <div class="phases">
+        <div class="phase-step">
+          <div class="phase-num">Phase 1</div>
+          <div class="phase-name">Baseline</div>
+          <div class="phase-detail">Establish a <strong>clean</strong> baseline number (profiler off)</div>
+        </div>
+        <div class="phase-step">
+          <div class="phase-num">Phase 2</div>
+          <div class="phase-name">Diagnose</div>
+          <div class="phase-detail">Profile <strong>once</strong>; read hotspots; classify the bottleneck (compute/memory/comm/launch)</div>
+        </div>
+        <div class="phase-step">
+          <div class="phase-num">Phase 3</div>
+          <div class="phase-name">Propose</div>
+          <div class="phase-detail">Generate N single-lever candidates, each <strong>citing the profiling evidence</strong></div>
+        </div>
+        <div class="phase-step">
+          <div class="phase-num">Phase 4</div>
+          <div class="phase-name">Evaluate</div>
+          <div class="phase-detail">Measure each <strong>clean, one at a time</strong> + adversarially verify the delta inline</div>
+        </div>
+        <div class="phase-step">
+          <div class="phase-num">Phase 5</div>
+          <div class="phase-name">Synthesize</div>
+          <div class="phase-detail">Pick the winning config from trustworthy results only</div>
+        </div>
+      </div>
+
+      <p><strong>Arguments</strong></p>
+      <p>
+        Accepts <strong>CLI-style flags</strong> — the same syntax as <code>/mad-benchmark</code>.
+        Tunes <em>one</em> model; if multiple tags are given, the first is used and the rest are
+        ignored (use <code>mad-benchmark-sweep</code> to compare models). A structured JSON object
+        is also still accepted.
+      </p>
+      <div class="table-wrap">
+        <table>
+          <thead><tr><th>Flag</th><th>Meaning</th></tr></thead>
+          <tbody>
+            <tr><td><code>--tag NAME</code></td><td>The model to tune (<code>--tags</code> also accepted). Required.</td></tr>
+            <tr><td><code>--candidates N</code></td><td>Number of single-lever candidates, 1–8 (default 4).</td></tr>
+            <tr><td><code>--target throughput|latency</code></td><td>Optimization target (default throughput).</td></tr>
+            <tr><td><code>--additional-context '{...}'</code></td><td>Split into a <em>profiled</em> variant (Diagnose) and a <em>clean</em> variant (baseline + measurements). Threaded into every command (HF token, <code>docker_env_vars</code>). A <code>tools</code> entry here sets the profiler used in Diagnose. For <code>env</code>-lever candidates the agent merges the var into the clean context's <code>docker_env_vars</code>.</td></tr>
+            <tr><td><code>--profile-tool NAME</code></td><td>Profiler for the Diagnose phase (default <code>rocm_trace_lite</code>, or taken from a <code>tools</code> entry in <code>--additional-context</code>).</td></tr>
+            <tr><td><code>--timeout S</code></td><td>Optional per-run timeout in seconds.</td></tr>
+            <tr><td><code>--no-execute</code> / <code>--plan</code></td><td>Produce the ranked candidate plan only, no GPU. <strong>Executes by default.</strong></td></tr>
+          </tbody>
+        </table>
+      </div>
+
+      <p><strong>Example invocations</strong></p>
+      <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-comment"># Execute (default): diagnose with rocm_trace_lite, then tune throughput clean.</span>
+<span class="t-comment"># The "tools" key drives Diagnose; it is auto-stripped for the clean measurement runs.</span>
+<span class="t-cmd">/mad-tune-search</span> <span class="t-flag">--tag</span> <span class="t-string">pyt_vllm_qwen3-8b</span> <span class="t-flag">--candidates</span> <span class="t-string">6</span> <span class="t-flag">--additional-context</span> <span class="t-string">'{"gpu_vendor": "AMD", "guest_os": "UBUNTU", "tools": [{"name": "rocm_trace_lite"}]}'</span>
+
+<span class="t-comment"># Pick the diagnosis profiler explicitly (e.g. memory roofline)</span>
+<span class="t-cmd">/mad-tune-search</span> <span class="t-flag">--tag</span> <span class="t-string">pyt_vllm_qwen3-8b</span> <span class="t-flag">--profile-tool</span> <span class="t-string">rocprofv3_memory</span>
+
+<span class="t-comment"># Plan mode — diagnose from existing trace + static config, rank candidates, no GPU run</span>
+<span class="t-cmd">/mad-tune-search</span> <span class="t-flag">--tag</span> <span class="t-string">bert</span> <span class="t-flag">--candidates</span> <span class="t-string">6</span> <span class="t-flag">--plan</span>
+
+<span class="t-comment"># Latency optimization</span>
+<span class="t-cmd">/mad-tune-search</span> <span class="t-flag">--tag</span> <span class="t-string">pyt_vllm_llama-3.1-8b</span> <span class="t-flag">--target</span> <span class="t-string">latency</span></pre>
+
+      <div class="callout callout-amber">
+        <div class="callout-title">Sequential evaluation — and why</div>
+        <p>
+          Candidates are measured <strong>one at a time</strong>, not in parallel. Running several
+          benchmarks at once on the same GPUs makes them contend for compute and produces
+          meaningless deltas; and candidates that edit <code>run.sh</code>/configs in parallel
+          would race on the same files and clobber each other's apply/revert. GPUs are a serial
+          resource for clean benchmarking, so one-at-a-time is both correct and safe. The cheap
+          adversarial-verify step still runs inline right after each measurement. Each candidate
+          writes its own <code>perf_tune_&lt;id&gt;.csv</code>.
+        </p>
+      </div>
+
+      <p><strong>Bottleneck → lever map (best practice)</strong></p>
+      <p>The Diagnose phase classifies the workload from kernel hotspots, which dictates the levers the proposals target:</p>
+      <div class="table-wrap">
+        <table>
+          <thead><tr><th>Bottleneck</th><th>Profiling signal</th><th>Levers to try</th></tr></thead>
+          <tbody>
+            <tr>
+              <td><strong>Compute-bound</strong></td>
+              <td>GEMM/conv kernels dominate GPU time (e.g. <code>Cijk*</code>, <code>*gemm*</code>, <code>wvSplitK</code>); high occupancy. Signal: <code>rocm_trace_lite</code>/<code>rpd</code> hotspots, <code>rocprofv3_compute</code> occupancy.</td>
+              <td><code>PYTORCH_TUNABLEOP_ENABLED</code>, hipBLASLt tuning, fp8 / lower precision, larger batch size</td>
+            </tr>
+            <tr>
+              <td><strong>Memory-bound</strong></td>
+              <td>Attention / elementwise / norm / copy kernels dominate; low arithmetic intensity. Signal: <code>rocprofv3_memory</code> bandwidth.</td>
+              <td>KV-cache dtype, kernel fusion, <code>gpu-memory-utilization</code>, batch size</td>
+            </tr>
+            <tr>
+              <td><strong>Communication-bound</strong></td>
+              <td>RCCL/NCCL collectives (AllReduce/AllGather) are a large share (multi-GPU). Signal: <code>rocprofv3_communication</code>, <code>rccl_trace</code>.</td>
+              <td><code>NCCL_*</code>/<code>RCCL_*</code> tuning, tensor-parallel vs pipeline-parallel topology</td>
+            </tr>
+            <tr>
+              <td><strong>Launch-bound</strong></td>
+              <td>Many tiny kernels with timeline gaps / low GPU utilization. Signal: timeline gaps in <code>rocm_trace_lite</code>.</td>
+              <td>HIP/CUDA graphs, larger batch, fewer sync points</td>
+            </tr>
+          </tbody>
+        </table>
+      </div>
+
+      <p><strong>Adversarial verification</strong></p>
+      <p>
+        After each candidate is evaluated, a separate subagent reviews the claimed delta.
+        It marks results as <code>trustworthy: false</code> if the change was not actually
+        measured (planned/skipped), only one sample was taken, <strong>it ran under a profiler</strong>,
+        or the delta vs baseline is within run-to-run noise (~1–2%).
+        The final recommended configuration comes only from trustworthy results.
+      </p>
+
+      <div class="callout callout-red">
+        <div class="callout-title">Set --timeout for long-running candidates</div>
+        <p>
+          Sequential evaluation means one slow candidate stalls the entire search.
+          Some levers (e.g. <code>PYTORCH_TUNABLEOP_ENABLED</code>) have a pathologically
+          slow first run (online GEMM benchmarking collapses throughput to &lt;1 tok/s).
+          The Docker container launched by madengine outlives the Claude session — if the
+          session ends mid-run the container keeps going but the workflow never completes
+          and the config edit is not reverted. Always pass <code>--timeout</code> to bound
+          each candidate, and confirm <code>extended.yaml</code> (or any edited config) is
+          clean after an interrupted run.
+        </p>
+        <pre style="background:transparent;border:none;padding:4px 0 0;margin:0;"><code>/mad-tune-search --tag pyt_vllm_qwen3-8b --timeout 1200</code></pre>
+      </div>
+    </div>
+  </div>
+</section>
+
+<!-- ── Agents ─────────────────────────────────────────────── -->
+<section id="agents">
+  <h2><span class="section-num">7</span> Agent Reference</h2>
+
+  <p>
+    Agents are specialized Claude instances with a defined tool set and a focused role.
+    They are dispatched automatically by slash commands and workflows — you generally never
+    call them directly.
+  </p>
+
+  <div class="card">
+    <div class="card-header">
+      <div class="card-icon" style="background:#1a3a2a;">
+        <svg width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="#6ee7b7" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><polyline points="4 17 10 11 4 5"/><line x1="12" y1="19" x2="20" y2="19"/></svg>
+      </div>
+      <div class="card-meta">
+        <div class="card-title">
+          mad-benchmark-runner
+          <span class="badge badge-agent">agent</span>
+        </div>
+        <div class="card-desc">Constructs and runs the correct madengine benchmark command from a model/tag or plain-English intent.</div>
+      </div>
+    </div>
+    <div class="card-body">
+      <p><strong>Tools:</strong> Bash, Read, Grep, Glob</p>
+      <p><strong>Called by:</strong> <code>/mad-benchmark</code>, <code>/mad-profile</code>, <code>mad-benchmark-sweep</code> (workflow)</p>
+      <ul>
+        <li>Resolves tags with <code>madengine discover</code> (read-only, no GPU).</li>
+        <li>Builds the full madengine run command including profiling context if requested.</li>
+        <li>Checks for GPUs — plan mode if absent, execute if present.</li>
+        <li>Reports resolved models, exact command, required env vars, and output file path.</li>
+      </ul>
+    </div>
+  </div>
+
+  <div class="card">
+    <div class="card-header">
+      <div class="card-icon" style="background:#1a3a2a;">
+        <svg width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="#6ee7b7" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><path d="M14 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V8z"/><polyline points="14 2 14 8 20 8"/><line x1="12" y1="18" x2="12" y2="12"/><line x1="9" y1="15" x2="15" y2="15"/></svg>
+      </div>
+      <div class="card-meta">
+        <div class="card-title">
+          mad-model-author
+          <span class="badge badge-agent">agent</span>
+        </div>
+        <div class="card-desc">Scaffolds a new MAD model — the models.json entry, the Dockerfile, and the run.sh with the performance line.</div>
+      </div>
+    </div>
+    <div class="card-body">
+      <p><strong>Tools:</strong> Read, Write, Edit, Grep, Glob</p>
+      <p><strong>Called by:</strong> <code>/mad-add-model</code></p>
+      <ul>
+        <li>Finds the closest existing model of the same framework to use as a template.</li>
+        <li>Creates all three required files following MAD conventions.</li>
+        <li>Validates <code>models.json</code> and confirms discoverability — never runs a GPU benchmark.</li>
+        <li>Reports file paths, template used, and the GPU verification command.</li>
+      </ul>
+    </div>
+  </div>
+
+  <div class="card">
+    <div class="card-header">
+      <div class="card-icon" style="background:#1a3a2a;">
+        <svg width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="#6ee7b7" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><line x1="18" y1="20" x2="18" y2="10"/><line x1="12" y1="20" x2="12" y2="4"/><line x1="6" y1="20" x2="6" y2="14"/></svg>
+      </div>
+      <div class="card-meta">
+        <div class="card-title">
+          mad-perf-analyst
+          <span class="badge badge-agent">agent</span>
+          <span class="badge badge-warn">read-only</span>
+        </div>
+        <div class="card-desc">Reads and analyzes MAD benchmark results from perf.csv and perf_entry_super.json. Never edits files or runs workloads.</div>
+      </div>
+    </div>
+    <div class="card-body">
+      <p><strong>Tools:</strong> Read, Grep, Glob, Bash (read-only)</p>
+      <p><strong>Called by:</strong> <code>/mad-report</code></p>
+      <ul>
+        <li>Parses <code>perf.csv</code>, <code>perf_entry_super.json</code>, or any named CSV.</li>
+        <li>Summarizes or compares results; precise about units — never compares across different metric values.</li>
+        <li>States direction assumption (higher = better for throughput, lower = better for latency).</li>
+        <li>Uses <code>python3</code> for parsing — never installs packages.</li>
+      </ul>
+    </div>
+  </div>
+
+  <div class="card">
+    <div class="card-header">
+      <div class="card-icon" style="background:#1a3a2a;">
+        <svg width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="#6ee7b7" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><circle cx="12" cy="12" r="3"/><path d="M19.07 4.93a10 10 0 0 1 0 14.14M4.93 4.93a10 10 0 0 0 0 14.14"/></svg>
+      </div>
+      <div class="card-meta">
+        <div class="card-title">
+          mad-tuner
+          <span class="badge badge-agent">agent</span>
+        </div>
+        <div class="card-desc">Iteratively tunes an existing MAD model using a measure-change-measure loop. Reverts regressions automatically.</div>
+      </div>
+    </div>
+    <div class="card-body">
+      <p><strong>Tools:</strong> Read, Edit, Bash, Grep, Glob</p>
+      <p><strong>Called by:</strong> <code>/mad-tune</code>, <code>mad-tune-search</code> (workflow)</p>
+      <ul>
+        <li>Establishes baseline from <code>run.sh</code>, config files, and current <code>perf.csv</code>.</li>
+        <li>Proposes and applies one lever change at a time — keeps improvements, reverts regressions.</li>
+        <li>Requires AMD GPUs to measure. Without GPUs, produces a ranked candidate list with exact commands.</li>
+        <li>Never alters the <code>performance: &lt;value&gt; &lt;unit&gt;</code> output contract.</li>
+      </ul>
+    </div>
+  </div>
+</section>
+
+<!-- ── Env vars ───────────────────────────────────────────── -->
+<section id="env-vars">
+  <h2><span class="section-num">8</span> Environment Variables</h2>
+
+  <p>Pass these environment variables to control model behavior and authentication.</p>
+
+  <div class="env-grid">
+    <div class="env-card">
+      <div class="env-name">MAD_SECRETS_HFTOKEN</div>
+      <div class="env-desc">HuggingFace API token. Required for gated models (e.g. Llama 3, Gemma). Set before running any HF model.</div>
+    </div>
+    <div class="env-card">
+      <div class="env-name">MAD_MODEL_BATCH_SIZE</div>
+      <div class="env-desc">Override the batch size for a model run. Primary tuning lever for throughput experiments.</div>
+    </div>
+    <div class="env-card">
+      <div class="env-name">MAD_RUNTIME_NGPUS</div>
+      <div class="env-desc">Override the number of GPUs used at runtime (also controllable via n_gpus in models.json).</div>
+    </div>
+    <div class="env-card">
+      <div class="env-name">MAD_SYSTEM_GPU_ARCHITECTURE</div>
+      <div class="env-desc">GPU architecture identifier (e.g. gfx942, gfx950). Used for arch-specific builds and skip_gpu_arch filtering.</div>
+    </div>
+    <div class="env-card">
+      <div class="env-name">MAD_MODEL_NAME</div>
+      <div class="env-desc">Model name passed into the container at runtime. Some scripts use this to select between model checkpoints.</div>
+    </div>
+    <div class="env-card">
+      <div class="env-name">PYTORCH_TUNABLEOP_ENABLED</div>
+      <div class="env-desc">Enable PyTorch TunableOp for auto-tuned GEMM kernels. A common throughput tuning lever for PyTorch models. <strong>First run is very slow</strong> (online GEMM tuning collapses throughput to ~0.2–1 tok/s); results are only meaningful on a warm second run with cached kernels. Set <code>--timeout</code> generously or do a throwaway warmup pass before measuring.</div>
+    </div>
+    <div class="env-card">
+      <div class="env-name">NCCL_* / RCCL_*</div>
+      <div class="env-desc">Collective communication tuning (chunk sizes, algorithms, socket buffers). Relevant for multi-GPU runs.</div>
+    </div>
+    <div class="env-card">
+      <div class="env-name">HIP_VISIBLE_DEVICES</div>
+      <div class="env-desc">Which GPU indices are visible to the container. Use to pin a run to specific GPUs.</div>
+    </div>
+  </div>
+
+  <p><strong>Export pattern (shell)</strong></p>
+  <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-cmd">export</span> <span class="t-var">MAD_SECRETS_HFTOKEN</span>=<span class="t-string">hf_...</span>
+<span class="t-cmd">export</span> <span class="t-var">MAD_MODEL_BATCH_SIZE</span>=<span class="t-string">32</span>
+<span class="t-cmd">madengine run</span> <span class="t-flag">--tags</span> <span class="t-string">pyt_vllm_llama-3.1-8b</span> <span class="t-flag">--live-output</span></pre>
+</section>
+
+<!-- ── Tips ───────────────────────────────────────────────── -->
+<section id="tips">
+  <h2><span class="section-num">9</span> Tips and Gotchas</h2>
+
+  <details open>
+    <summary>GPU guard — plan mode vs execute mode</summary>
+    <div class="details-body">
+      <p>
+        All agents check for AMD GPUs (<code>rocm-smi</code> or <code>amd-smi</code>) before
+        attempting to build or run. On a machine without GPUs:
+      </p>
+      <ul>
+        <li>The agent prints the <strong>exact madengine command</strong> to run on a GPU host.</li>
+        <li>It lists required env vars (e.g. <code>MAD_SECRETS_HFTOKEN</code>).</li>
+        <li>It stops — it does not attempt to run the command.</li>
+      </ul>
+      <p>
+        GPU-free operations always work: <code>madengine discover</code>, <code>/mad-validate</code>,
+        and plan mode for benchmarks and tuning.
+      </p>
+      <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-comment"># GPU-free: confirm a tag resolves</span>
+<span class="t-cmd">madengine discover</span> <span class="t-flag">--tags</span> <span class="t-string">bert</span></pre>
+    </div>
+  </details>
+
+  <details>
+    <summary>Precision is not a runtime flag</summary>
+    <div class="details-body">
+      <p>
+        In MAD, <code>training_precision</code> (fp16, bf16, fp8, fp32) is fixed per model in
+        <code>models.json</code>. It is baked into the Dockerfile and run script — you cannot
+        change it with a madengine flag at runtime.
+      </p>
+      <p>
+        To compare precisions in a sweep, use <strong>separate model tags</strong> that differ
+        in precision. For example: <code>pyt_vllm_llama-fp16</code> and
+        <code>pyt_vllm_llama-fp8</code> as two entries in the <code>tags</code> array of
+        <code>mad-benchmark-sweep</code>.
+      </p>
+    </div>
+  </details>
+
+  <details>
+    <summary>perf.csv isolation in parallel sweeps</summary>
+    <div class="details-body">
+      <p>
+        The <code>mad-benchmark-sweep</code> workflow gives each cell its own output file
+        (<code>perf_&lt;tag&gt;.csv</code>) to prevent parallel runs from clobbering a shared
+        <code>perf.csv</code>. After a sweep, concatenate for a combined view:
+      </p>
+      <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-comment"># Merge sweep results (keep one header row)</span>
+head -1 perf_bert.csv > perf_combined.csv
+<span class="t-cmd">tail</span> -n +2 perf_bert.csv perf_deepseek.csv >> perf_combined.csv
+<span class="t-cmd">madengine report to-html</span> <span class="t-flag">--csv-file-path</span> perf_combined.csv</pre>
+    </div>
+  </details>
+
+  <details>
+    <summary>madengine discover — GPU-free tag validation</summary>
+    <div class="details-body">
+      <p>
+        Before any GPU run, confirm that your tags resolve to real models.
+        <code>madengine discover</code> reads <code>models.json</code> without touching Docker
+        or GPUs.
+      </p>
+      <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-cmd">madengine discover</span> <span class="t-flag">--tags</span> <span class="t-string">pyt_vllm_deepseek-r1</span>
+<span class="t-cmd">madengine discover</span> <span class="t-flag">--tags</span> <span class="t-string">vllm</span>   <span class="t-comment"># matches all tags containing "vllm"</span></pre>
+      <p>
+        The <code>tags</code> field in a model entry supports <em>any-match</em>: madengine
+        selects a model if <code>--tags</code> matches the exact <code>name</code> OR any
+        string in the <code>tags</code> list.
+      </p>
+    </div>
+  </details>
+
+  <details>
+    <summary>TunableOp needs a warmup — don't measure cold</summary>
+    <div class="details-body">
+      <p>
+        <code>PYTORCH_TUNABLEOP_ENABLED=1</code> triggers online GEMM auto-tuning on
+        the <strong>first</strong> run. During this phase throughput collapses to roughly
+        0.2–1 tok/s as PyTorch benchmarks every matrix-multiply shape. The actual improvement
+        only shows on subsequent runs using the cached kernel selection.
+      </p>
+      <p>
+        Measuring TunableOp cold produces a misleading (and huge) regression that the verify
+        step will correctly mark untrustworthy — but only after burning 30+ minutes of GPU time.
+        Best practice:
+      </p>
+      <ul>
+        <li>Run a throwaway warmup pass first (<code>--skip-model-run</code> then one full run) to populate the cache.</li>
+        <li>Always set <code>--timeout</code> so a single candidate cannot stall the whole tune-search.</li>
+        <li>The profiling bottleneck data guides whether TunableOp is even worth trying — only relevant for compute-bound (GEMM-dominant) workloads.</li>
+      </ul>
+    </div>
+  </details>
+
+  <details>
+    <summary>Qwen3-8B live tuning findings (MI350X / gfx950)</summary>
+    <div class="details-body">
+      <p>
+        From a live profiling + tuning run on 8× AMD Instinct MI350X:
+      </p>
+      <p><strong>Profiling result</strong> (<code>rocm_trace_lite</code>, 397,292 GPU ops, 8.6% GPU utilization):</p>
+      <div class="table-wrap">
+        <table>
+          <thead><tr><th>Kernel</th><th>% GPU time</th><th>Implication</th></tr></thead>
+          <tbody>
+            <tr><td><code>Cijk_Alik_Bljk_BBS_BH_Bias_HA…</code> (GEMM)</td><td>27.1%</td><td rowspan="2" style="vertical-align:middle;"><strong>Compute-bound</strong> — GEMM/wvSplitK dominate → target with TunableOp, hipBLASLt, higher concurrency</td></tr>
+            <tr><td><code>wvSplitK</code></td><td>20.6%</td></tr>
+            <tr><td><code>_fwd_kernel</code> (attention)</td><td>8.3%</td><td>Memory-bound secondary; paged attention</td></tr>
+            <tr><td><code>__amd_rocclr_copyBuffer</code></td><td>7.0%</td><td>Data movement overhead</td></tr>
+          </tbody>
+        </table>
+      </div>
+      <p><strong>Best candidate result</strong> — <code>max_concurrency=32</code> (vs baseline concurrency=1):</p>
+      <div class="table-wrap">
+        <table>
+          <thead><tr><th>Metric</th><th>Baseline</th><th>Tuned</th><th>Delta</th></tr></thead>
+          <tbody>
+            <tr><td>throughput_tot</td><td>~422 tok/s</td><td>7,993 tok/s</td><td style="color:var(--accent-green);font-weight:700;">+18.9×</td></tr>
+            <tr><td>throughput_gen</td><td>~211 tok/s</td><td>3,996 tok/s</td><td style="color:var(--accent-green);font-weight:700;">+18.9×</td></tr>
+          </tbody>
+        </table>
+      </div>
+      <p>
+        Baseline was run at <code>max_concurrency=1</code> (10 prompts) — a latency-optimized
+        config. Raising concurrency is the primary throughput lever for vLLM serving workloads.
+      </p>
+    </div>
+  </details>
+
+  <details>
+    <summary>tools.json — authoritative profiler name list</summary>
+    <div class="details-body">
+      <p>
+        The <code>/mad-profile</code> command accepts any profiler tool name. The complete
+        authoritative list (23+ tools) lives in the madengine package at
+        <code>scripts/common/tools.json</code>. Common names include:
+      </p>
+      <pre><span class="copy-hint" onclick="copyCode(this)">copy</span>rocprofv3_compute    rocprofv3_memory     rocprofv3_communication
+rocprofv3_full       rocprofv3            rpd
+rocm_trace_lite      rccl_trace           gpu_info_power_profiler
+rocblas_trace        hipblaslt_trace      miopen_trace
+rocprof_sys</pre>
+      <p>
+        If an agent gives you an unknown tool error, consult <code>tools.json</code> directly
+        for the exact name.
+      </p>
+    </div>
+  </details>
+
+  <details>
+    <summary>Deployment targets — local Docker vs SLURM vs Kubernetes</summary>
+    <div class="details-body">
+      <p>
+        madengine infers the deployment target from which key is present in
+        <code>--additional-context</code>. No special flag needed.
+      </p>
+      <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-comment"># Local Docker (default — no key)</span>
+<span class="t-cmd">madengine run</span> <span class="t-flag">--tags</span> <span class="t-string">bert</span> <span class="t-flag">--live-output</span>
+
+<span class="t-comment"># SLURM (presence of "slurm" key)</span>
+<span class="t-cmd">madengine run</span> <span class="t-flag">--tags</span> <span class="t-string">bert</span> <span class="t-flag">-c</span> <span class="t-string">'{"slurm": {"partition": "gpu", "nodes": 2}}'</span>
+
+<span class="t-comment"># Kubernetes (presence of "k8s" or "kubernetes" key)</span>
+<span class="t-cmd">madengine run</span> <span class="t-flag">--tags</span> <span class="t-string">bert</span> <span class="t-flag">-c</span> <span class="t-string">'{"k8s": {"namespace": "ml-jobs"}}'</span></pre>
+    </div>
+  </details>
+
+  <details>
+    <summary>Build-once, run-many pattern</summary>
+    <div class="details-body">
+      <p>
+        When running the same model many times (e.g. during tuning), build the image once and
+        reuse it to save time.
+      </p>
+      <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-comment"># Step 1: build only, writes build_manifest.json</span>
+<span class="t-cmd">madengine build</span> <span class="t-flag">--tags</span> <span class="t-string">pyt_vllm_llama-3.1-8b</span>
+
+<span class="t-comment"># Step 2: run from manifest (skips rebuild)</span>
+<span class="t-cmd">madengine run</span> <span class="t-flag">-m</span> build_manifest.json <span class="t-flag">--live-output</span>
+
+<span class="t-comment"># Alternatively: skip the model run but verify image builds</span>
+<span class="t-cmd">madengine run</span> <span class="t-flag">--tags</span> <span class="t-string">pyt_vllm_llama-3.1-8b</span> <span class="t-flag">--skip-model-run</span></pre>
+    </div>
+  </details>
+
+  <details>
+    <summary>models.json naming convention</summary>
+    <div class="details-body">
+      <p>
+        Model names follow <code>{framework}_{project}_{workload}</code>. Examples:
+      </p>
+      <div class="table-wrap">
+        <table>
+          <thead><tr><th>Name</th><th>Framework</th><th>Project</th><th>Workload</th></tr></thead>
+          <tbody>
+            <tr><td><code>pyt_vllm_deepseek-r1</code></td><td>pyt (PyTorch)</td><td>vllm</td><td>deepseek-r1</td></tr>
+            <tr><td><code>jax_maxtext_llama-4</code></td><td>jax</td><td>maxtext</td><td>llama-4</td></tr>
+            <tr><td><code>pyt_primus_qwen3-1.7b</code></td><td>pyt</td><td>primus</td><td>qwen3-1.7b</td></tr>
+            <tr><td><code>huggingface_bert</code></td><td>huggingface</td><td>–</td><td>bert</td></tr>
+          </tbody>
+        </table>
+      </div>
+      <p>
+        The <code>dockerfile</code> field stores the path prefix — madengine automatically
+        appends <code>.ubuntu.amd.Dockerfile</code>.
+      </p>
+    </div>
+  </details>
+
+  <details>
+    <summary>Uploading results to MongoDB</summary>
+    <div class="details-body">
+      <p>Use <code>madengine database</code> to push results to MongoDB after a run.</p>
+      <pre><span class="copy-hint" onclick="copyCode(this)">copy</span><span class="t-comment"># Dry run first to preview</span>
+<span class="t-cmd">madengine database</span> <span class="t-flag">-f</span> perf_entry_super.json <span class="t-flag">--db</span> mad_results <span class="t-flag">-c</span> benchmarks \
+  <span class="t-flag">-k</span> model,timestamp <span class="t-flag">--dry-run</span>
+
+<span class="t-comment"># Upload for real</span>
+<span class="t-cmd">madengine database</span> <span class="t-flag">-f</span> perf_entry_super.json <span class="t-flag">--db</span> mad_results <span class="t-flag">-c</span> benchmarks \
+  <span class="t-flag">-k</span> model,timestamp</pre>
+      <p>
+        <code>perf_entry_super.json</code> is the enriched record (includes a <code>configs</code>
+        block) that gets pushed. <code>-k model,timestamp</code> sets the unique key for upsert.
+      </p>
+    </div>
+  </details>
+</section>
+
+</main>
+
+<script>
+  // ── Copy-to-clipboard ──────────────────────────────────────
+  function copyCode(btn) {
+    const pre = btn.closest('pre');
+    const text = pre.innerText.replace(/^copy\n?/, '').trim();
+    navigator.clipboard.writeText(text).then(() => {
+      const orig = btn.textContent;
+      btn.textContent = 'copied';
+      btn.style.color = 'var(--accent-green)';
+      setTimeout(() => {
+        btn.textContent = orig;
+        btn.style.color = '';
+      }, 1500);
+    });
+  }
+
+  // ── Active nav highlight on scroll ────────────────────────
+  const sections = document.querySelectorAll('section[id], div.hero');
+  const navLinks = document.querySelectorAll('nav a[href^="#"]');
+
+  const observer = new IntersectionObserver((entries) => {
+    entries.forEach(entry => {
+      if (entry.isIntersecting) {
+        navLinks.forEach(link => {
+          link.classList.remove('active');
+          if (link.getAttribute('href') === '#' + entry.target.id) {
+            link.classList.add('active');
+          }
+        });
+      }
+    });
+  }, { rootMargin: '-20% 0px -70% 0px', threshold: 0 });
+
+  sections.forEach(s => { if (s.id) observer.observe(s); });
+</script>
+
+</body>
+</html>
\ No newline at end of file

User task	Slash command	Subagent
Run a benchmark	`/mad-benchmark`	mad-benchmark-runner
Add a new model	`/mad-add-model`	mad-model-author
Profile / trace a model	`/mad-profile`	mad-benchmark-runner
Analyze results	`/mad-report`	mad-perf-analyst
Tune for better perf	`/mad-tune`	mad-tuner
Validate model entries	`/mad-validate`	(inline script)
Tool name	What it captures
`rocprofv3_compute`	Kernel execution, occupancy, compute metrics
`rocprofv3_memory`	Memory bandwidth, cache hit rates
`rocprofv3_communication`	GPU-to-GPU communication metrics
`rpd`	ROCm profiling data, timeline view
`rccl_trace`	Collective communication (AllReduce, etc.)
`rocm_trace_lite`	Lightweight system trace
`gpu_info_power_profiler`	Power draw over time
`rocblas_trace` / `hipblaslt_trace`	BLAS kernel calls
Column	Meaning
`model`	Model name from models.json
`performance`	Numeric result parsed from stdout
`metric`	Unit (tokens_per_second, samples_per_second, …)
`status`	SUCCESS or failure reason
`n_gpus`	Number of GPUs used
`training_precision`	fp16, bf16, fp8, fp32, …
`gpu_architecture`	gfx942, gfx950, …
`git_commit`	Commit hash at run time
`build_duration` / `test_duration`	Wall time in seconds
Flag	Meaning
`--tags a,b c`	One or more model tags (comma and/or space separated). Required.
`--additional-context '{...}'`	Threaded verbatim into every cell (HF token, `docker_env_vars`, deployment keys). A per-cell `n_gpus` is merged in when `--ngpus` is set.
`--ngpus 4,8`	Optional GPU-count axis — multiplies the matrix (each tag × each count).
`--timeout S`	Optional per-cell timeout in seconds.
`--no-execute` / `--plan`	Validate + resolve tags only, no GPU. Executes by default.
Flag	Meaning
`--tag NAME`	The model to tune (`--tags` also accepted). Required.
`--candidates N`	Number of single-lever candidates, 1–8 (default 4).
`--target throughput\|latency`	Optimization target (default throughput).
`--additional-context '{...}'`	Split into a profiled variant (Diagnose) and a clean variant (baseline + measurements). Threaded into every command (HF token, `docker_env_vars`). A `tools` entry here sets the profiler used in Diagnose. For `env`-lever candidates the agent merges the var into the clean context's `docker_env_vars`.
`--profile-tool NAME`	Profiler for the Diagnose phase (default `rocm_trace_lite`, or taken from a `tools` entry in `--additional-context`).
`--timeout S`	Optional per-run timeout in seconds.
`--no-execute` / `--plan`	Produce the ranked candidate plan only, no GPU. Executes by default.
Bottleneck	Profiling signal	Levers to try
Compute-bound	GEMM/conv kernels dominate GPU time (e.g. `Cijk`, `gemm*`, `wvSplitK`); high occupancy. Signal: `rocm_trace_lite`/`rpd` hotspots, `rocprofv3_compute` occupancy.	`PYTORCH_TUNABLEOP_ENABLED`, hipBLASLt tuning, fp8 / lower precision, larger batch size
Memory-bound	Attention / elementwise / norm / copy kernels dominate; low arithmetic intensity. Signal: `rocprofv3_memory` bandwidth.	KV-cache dtype, kernel fusion, `gpu-memory-utilization`, batch size
Communication-bound	RCCL/NCCL collectives (AllReduce/AllGather) are a large share (multi-GPU). Signal: `rocprofv3_communication`, `rccl_trace`.	`NCCL_`/`RCCL_` tuning, tensor-parallel vs pipeline-parallel topology
Launch-bound	Many tiny kernels with timeline gaps / low GPU utilization. Signal: timeline gaps in `rocm_trace_lite`.	HIP/CUDA graphs, larger batch, fewer sync points
Kernel	% GPU time	Implication
`Cijk_Alik_Bljk_BBS_BH_Bias_HA…` (GEMM)	27.1%	Compute-bound — GEMM/wvSplitK dominate → target with TunableOp, hipBLASLt, higher concurrency
`wvSplitK`	20.6%
`_fwd_kernel` (attention)	8.3%	Memory-bound secondary; paged attention
`__amd_rocclr_copyBuffer`	7.0%	Data movement overhead
Metric	Baseline	Tuned	Delta
throughput_tot	~422 tok/s	7,993 tok/s	+18.9×
throughput_gen	~211 tok/s	3,996 tok/s	+18.9×
Name	Framework	Project	Workload
`pyt_vllm_deepseek-r1`	pyt (PyTorch)	vllm	deepseek-r1
`jax_maxtext_llama-4`	jax	maxtext	llama-4
`pyt_primus_qwen3-1.7b`	pyt	primus	qwen3-1.7b
`huggingface_bert`	huggingface	–	bert