Merged
11 changes: 7 additions & 4 deletions .claude/skills/evaluation/SKILL.md
@@ -138,6 +138,8 @@ deployment:

Conventions: always start `vllm serve /checkpoint` (NEL mounts here); always `--host 0.0.0.0 --port ${deployment.port}`; use folded scalar (`>-`) for one flag per line. Example fallback `--max-model-len 131072` covers AA-LCR (~120K + 16K gen) and SciCode (≥ 65536) — prefer `config.json` / recipe value.

For how to choose `--tensor-parallel-size` / `--data-parallel-size` / `--pipeline-parallel-size` (and EP) from the model size and your GPU count, read `references/parallelism.md` — cross-check the layout against `recipes.vllm.ai`, then adapt to the GPUs you actually have via the fit math there.

**Image / vLLM version.** Default `image: vllm/vllm-openai:v0.19.1` (pinned for reproducibility). If `recipes.vllm.ai` states a higher minimum version for the chosen variant (e.g. "vLLM >= 0.20.0"), bump the image tag accordingly (e.g. `v0.20.0`) — do **not** stay on `0.19.1` when the recipe explicitly requires newer. Do **not** use `:latest` (drifts across re-runs, breaks reproducibility). The version is part of the cross-check: surface to the user when bumping.

#### vLLM-backend defaults — always include unless the recipe *contradicts*
@@ -146,15 +148,15 @@ Silence is not contradiction. Drop/override only when the recipe sets a differen

- `--max-num-batched-tokens 8192` — caps per-step batched tokens; prevents long-prefill stalls.
- `--enable-chunked-prefill` — interleaves long prefills with decode steps (required for AA-LCR's ~120K input). Modern vLLM defaults this on for many models; set explicitly to avoid drift.
-- `--enable-expert-parallel` — **MoE-only default.** Detect MoE from handle suffix (`-A10B`, `-A3B`, etc.), `num_experts` / `num_local_experts` / `n_routed_experts` in `config.json`, or card. No-op when TP=DP=1, safe to always include for MoE. Do not add for dense models.
+- `--enable-expert-parallel` — **MoE-only default.** Detect MoE from handle suffix (`-A10B`, `-A3B`, etc.), `num_experts` / `num_local_experts` / `n_routed_experts` in `config.json`, or card. No-op when TP=DP=1, safe to always include for MoE. Do not add for dense models. See `references/parallelism.md` for what EP does and the DP-attention + EP-MoE throughput pattern.
- `--max-num-seqs N` — **omit at generation time** (top-level `parallelism` is `???`). Add this comment above `command:`:

```text
# After filling in `parallelism` values (top-level + per-task overrides),
# append `--max-num-seqs N` where N = ceil(max_parallelism / data_parallel_size).
```

-In Step 4 compute and append. Example: top-level=16, Tau2=128, DP=8 → `ceil(128/8)=16`. Too small → request queuing; too large → wasted KV reservation.
+In Step 4 compute and append. Example: top-level=16, Tau2=128, DP=8 → `ceil(128/8)=16`. Too small → request queuing; too large → wasted KV reservation. For how to choose the `parallelism` it derives from, read `references/parallelism.md`.

#### Evaluation params template (top-level params)

@@ -164,7 +166,7 @@ The top-level `nemo_evaluator_config.config.params` must contain **exactly these
 nemo_evaluator_config:
   config:
     params:
-      parallelism: ??? # Required — ask user in Step 4 (depends on cluster + judge rate limits)
+      parallelism: ??? # Required — size per references/parallelism.md (bounded by total request count vs GPU serving capacity); ask user in Step 4 if still unclear
       request_timeout: 3600
       max_retries: 10
       max_new_tokens: 65536 # see rule below
@@ -192,6 +194,7 @@ Reasoning models: prefer reasoning mode (highest scores). For lower variance / c
### Step 4 — Fill remaining ??? values

- Find every `???` left. Ask the user only for what can't be inferred (SLURM hostname/account/output_dir, MLflow tracking URI, etc.). Don't propose defaults; let them give plain text.
- **`parallelism`** — size it yourself from the run shape (total requests = `dataset_size × repeats` vs GPU serving capacity), and set `--max-num-seqs` to match. Read `references/parallelism.md` for the decision rule and worked examples; only ask the user if a non-GPU cap (e.g. judge rate limit) is unknown.
- Ask about other defaults they may want to change (partition, walltime, MLflow tags).

**Walltime cap: 4 hours.** Always `execution.walltime: "04:00:00"`. The cluster does not schedule jobs longer than 4h — this is a hard limit, not a preference.
@@ -305,7 +308,7 @@ nel info <id> --logs
ssh <user>@<host> "grep -i 'traceback\|exception\|error\|failed\|oom\|killed\|timeout\|unauthorized\|rate limit\|sandbox\|container\|judge\|parse\|scoring' <log_path>/*.log"
```

-Canary each risky task class separately (judge-scored, code-execution, model-only). Start `parallelism` conservatively; raise only after judge/sandbox logs are clean — they bottleneck before the model.
+Canary each risky task class separately (judge-scored, code-execution, model-only). Start `parallelism` conservatively; raise only after judge/sandbox logs are clean — they bottleneck before the model. For capacity-bound runs, tune `parallelism`/`--max-num-seqs` here against vLLM's reported max concurrency + preemption — see `references/parallelism.md`.

Single-task rerun: `nel run --config <path> -t <task_name>` (combine with `-o ++...limit_samples=10` for canary).

14 changes: 14 additions & 0 deletions .claude/skills/evaluation/recipes/tasks/aa/lcr.md
@@ -13,6 +13,19 @@ AA-LCR needs long context: plan for roughly 120K input tokens plus 16K
generation tokens. Set deployment `--max-model-len` to at least `131072`, and
use a larger value when the model supports it.

**Parallelism — set this *lower* than the top-level default.** AA-LCR is the
suite's most concurrency-sensitive task on two fronts at once. (1) *KV-bound:* each
request carries ~120K input tokens, so its KV footprint is large and a high
`parallelism` triggers preemption — and recomputing 120K-token prefills is hugely
wasteful, so over-parallelizing here makes the run *slower*, not faster (see
`references/parallelism.md`, "Balanced sizing"). (2) *Judge-bound:* the
equality-checker endpoint rate-limits before your served model does. So give it an
explicit per-task `parallelism` well below the model/GPU-bound tasks' value: start
small (≈16–32 for GQA models; MLA models such as Kimi tolerate several× more) and
raise only while preemption ≈ 0 and the judge shows no 429s. The field is left as
`???`; after choosing a value, recompute the deployment's `--max-num-seqs` per
SKILL.md Step 3 (sized off the *max* parallelism across all tasks).

## YAML Fragment

LCR has a deployment-side requirement (`--max-model-len 131072`) and a task
@@ -35,6 +48,7 @@ block. Per SKILL.md Step 3, the deployment flag must live inside
use_response_logging: false
config:
params:
parallelism: ??? # set LOWER than top-level: long-context (KV-bound) + judge-bound; see body above. Recompute --max-num-seqs after setting.
extra:
num_repeats: 16
judge:
165 changes: 165 additions & 0 deletions .claude/skills/evaluation/references/parallelism.md
@@ -0,0 +1,165 @@
# Parallelism: topology (TP/DP/PP/EP) + concurrency (`parallelism` / `--max-num-seqs`)

Two decisions, in order — both affect **throughput only, never scores**:

1. **Topology** — how the model is laid out across GPUs (sets the replica count).
2. **Concurrency** — requests in flight (`parallelism`) and per replica
(`--max-num-seqs`), sized on top of the topology.

## Layer 1 — topology (TP / DP / PP)

- **TP** shards each layer (weights+KV) within one replica → fits a too-big model /
splits KV for long context; costs an all-reduce **every layer** (keep intra-node).
- **DP** replicates the model → N independent replicas = N× concurrency; N× weight memory.
- **PP** shards layer ranges → very large / multi-node; pipeline bubbles. See `multi-node.md`.

**Decide (single node, G GPUs):**

1. **TP = smallest that fits** with KV headroom. Weights ≈ `params × bytes/param`
(NVFP4 ≈0.5–0.6, FP8 ≈1, BF16 ≈2); need
`weights/TP + KV + activations + overhead < GPU_mem × util`. Fits on one GPU → TP=1.
TP must divide `num_attention_heads` (ideally `num_key_value_heads`), be a power of
2, and never cross nodes.
2. **DP = floor(G / (TP×PP))** — maximize for throughput (a 1-GPU-fit model runs
`TP=1,DP=G`, not `TP=G,DP=1`).
3. **PP** only if it won't fit at max intra-node TP, or multi-node.
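
The fit rule in the list above can be sketched in Python. This is a minimal illustration under stated assumptions, not part of the skill: `choose_topology` and its parameters are hypothetical names, the single `headroom_gb` figure stands in for KV + activations + overhead per GPU, `util=0.9` is an assumed default, PP is fixed at 1, and bytes/param uses the approximate values listed above (midpoint for NVFP4).

```python
from typing import Optional, Tuple

# Approximate bytes per parameter, from the ranges in the list above
# (NVFP4 uses the midpoint of 0.5-0.6). Illustrative values only.
BYTES_PER_PARAM = {"nvfp4": 0.55, "fp8": 1.0, "bf16": 2.0}


def choose_topology(
    params_billions: float,
    precision: str,
    num_gpus: int,
    gpu_mem_gb: float,
    num_attention_heads: int,
    headroom_gb: float,
    util: float = 0.9,
) -> Optional[Tuple[int, int]]:
    """Return (TP, DP) for the smallest power-of-two TP that fits, else None.

    Assumes a single node and PP=1. None means the model does not fit at
    any intra-node TP, so PP or multi-node would be needed.
    """
    weights_gb = params_billions * BYTES_PER_PARAM[precision]
    budget_gb = gpu_mem_gb * util
    tp = 1
    while tp <= num_gpus:
        divides_heads = num_attention_heads % tp == 0
        fits = weights_gb / tp + headroom_gb < budget_gb
        if divides_heads and fits:
            return tp, num_gpus // tp  # DP = floor(G / TP)
        tp *= 2
    return None


# Hypothetical inputs: memory sizes, head counts, and headroom are made up.
print(choose_topology(9, "nvfp4", 8, 180, 32, 20))   # (1, 8)
print(choose_topology(30, "bf16", 8, 80, 64, 15))    # (2, 4)
```

The result is only as good as the `headroom_gb` estimate; the startup log and a canary run remain the real check.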

> **Gotcha — bit-width sets the topology, not the model name.** Read precision from
> `config.json` (`quantization_config`/`quant_algo`/dtype); don't infer from the
> handle. Same arch + same bit-width → same TP/DP/EP regardless of vendor (INT4 vs
> NVFP4 differ only in auto-detected kernel flags). The split changes only when
> bit-width changes *size*.

**Choosing the TP/DP split** (e.g. on 8 GPUs: `1/8`, `2/4`, `4/2`, `8/1`, all EP=8):
default **smallest TP, largest DP** — DP scales throughput ~linearly with no extra
comm; TP adds an all-reduce per attention layer. Raise TP **only** to relieve memory
DP can't:

1. a single request's KV won't fit one replica's HBM (long context — AA-LCR ~120K / 262K);
2. preemption at your target per-replica `max-num-seqs` (TP=2 doubles per-replica KV);
3. weights don't fit one GPU even after EP-sharding.

Else higher TP wastes KV and gives up replicas. **Verify:** vLLM startup
`Maximum concurrency for <max-model-len> tokens` ≳ `parallelism/DP` with no canary
preemption → smaller TP wins.

## Layer 1b — Expert parallelism (EP), MoE only

`--enable-expert-parallel` is a **boolean** (no `--expert-parallel-size`); experts
are partitioned across the whole world size:

```text
EP = tensor_parallel_size × data_parallel_size (EP = TP only when DP=1)
```

So on a fixed node you don't tune EP — you tune the TP/DP split, which only changes
the *attention* side:

| Layout (8 GPUs, all EP=8) | Attention | Best when |
| --- | --- | --- |
| `TP=1 DP=8` | 8 replicas, comm-free | **default** — one request's KV fits 1 GPU |
| `TP=2 DP=4` | 4 replicas | need ~2× per-replica KV (long ctx) |
| `TP=4 DP=2` | 2 replicas | ~4× per-replica KV, or weights too big for TP≤2 |
| `TP=8 DP=1` | 1 replica | trillion-scale weights / one huge KV pool |

Down the table = more per-replica KV/weight room, fewer replicas, higher all-reduce
cost; pick the **topmost row that fits**.
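
The layout table above can be generated for any GPU count. A small sketch, assuming power-of-two TP values and a caller-supplied `fits` predicate (a hypothetical stand-in for the memory check); `layouts` and `pick_layout` are illustrative names, not vLLM APIs.

```python
from typing import Callable, Dict, List, Optional


def layouts(num_gpus: int) -> List[Dict[str, int]]:
    """Enumerate TP/DP splits on one node; EP spans the whole world size."""
    rows = []
    tp = 1
    while tp <= num_gpus:
        if num_gpus % tp == 0:
            dp = num_gpus // tp
            rows.append({"tp": tp, "dp": dp, "ep": tp * dp})
        tp *= 2
    return rows


def pick_layout(num_gpus: int, fits: Callable[[int], bool]) -> Optional[Dict[str, int]]:
    """Return the topmost row (smallest TP, largest DP) accepted by `fits`."""
    for row in layouts(num_gpus):
        if fits(row["tp"]):
            return row
    return None


for row in layouts(8):
    print(row)
# {'tp': 1, 'dp': 8, 'ep': 8}
# {'tp': 2, 'dp': 4, 'ep': 8}
# {'tp': 4, 'dp': 2, 'ep': 8}
# {'tp': 8, 'dp': 1, 'ep': 8}

# Hypothetical case: one request's KV needs at least 2 GPUs per replica.
print(pick_layout(8, lambda tp: tp >= 2))  # {'tp': 2, 'dp': 4, 'ep': 8}
```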

**Dataflow (DP-attention + EP-MoE):** the DP and EP groups are the **same GPUs**.
Attention is DP-local (no cross-rank comm); each MoE layer does a dispatch+combine
**all-to-all** to route tokens to the rank owning their expert. So comm is all-to-all
*only at MoE layers* (vs TP's per-layer all-reduce) — keep it **intra-node (NVLink)**.
Data-dependent routing → uneven load; vLLM runs dummy passes on idle ranks, so spread
load evenly.

**Enable for any MoE** (detect via `-A10B`/`-A3B`/`-A22B` handle, `num_experts` /
`n_routed_experts` in `config.json`); **not for dense**; no-op at `TP=DP=1`.
Cross-check `recipes.vllm.ai` for the validated layout, then adapt to your GPU count
via the fit math.

## Layer 2 — concurrency (`parallelism` / `--max-num-seqs`)

- **`parallelism`** = requests the client keeps in flight *per benchmark*.
- **`--max-num-seqs`** = sequences one replica decodes at once.

```text
serving_capacity = max-num-seqs × DP × num_instances
max-num-seqs = ceil(parallelism / (DP × num_instances)) # keep matched
```

(TP/PP don't add capacity; replicas = DP, × `num_instances` for HAProxy — see
`multi-node.md`.) `parallelism` above capacity just queues in vLLM (and risks
`request_timeout`).
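
The two formulas above translate directly into Python. A minimal sketch; the function names are illustrative, and `queued_requests` is an added helper that only restates the note about `parallelism` above capacity queuing in vLLM.

```python
import math


def max_num_seqs(parallelism: int, dp: int, num_instances: int = 1) -> int:
    """Per-replica --max-num-seqs matched to the client's in-flight requests."""
    return math.ceil(parallelism / (dp * num_instances))


def serving_capacity(max_num_seqs_value: int, dp: int, num_instances: int = 1) -> int:
    """Total sequences the deployment decodes at once across all replicas."""
    return max_num_seqs_value * dp * num_instances


def queued_requests(parallelism: int, max_num_seqs_value: int, dp: int,
                    num_instances: int = 1) -> int:
    """Requests left waiting inside vLLM when parallelism exceeds capacity."""
    capacity = serving_capacity(max_num_seqs_value, dp, num_instances)
    return max(0, parallelism - capacity)


print(max_num_seqs(128, dp=8))                   # 16
print(max_num_seqs(128, dp=8, num_instances=2))  # 8
print(queued_requests(256, 16, dp=8))            # 128
```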

**`parallelism` ceiling = the smaller of:**

1. **total requests** = `dataset_size × repeats` (`n_samples` for simple-evals/tau2,
`num_repeats` for nemo-skills) — can't have more in flight than exist;
2. **preemption-free capacity at the task's context** (KV-bound; below).

| Run | Set `parallelism` to |
| --- | --- |
| `total_requests ≤ capacity` (small) | `total_requests` (round up for uneven DP routing) → one wave |
| `total_requests ≫ capacity` (large) | the **preemption-free** capacity at the task's context (often *below* nominal) |
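
The ceiling rule and the table above reduce to a `min` plus a label for which side is binding. A sketch with hypothetical names; the capacity figure of 512 is an illustrative input, since the real value comes from the startup log and a canary run.

```python
from typing import Tuple


def parallelism_ceiling(dataset_size: int, repeats: int,
                        preemption_free_capacity: int) -> Tuple[int, str]:
    """Return (ceiling, regime): the smaller of total requests and capacity."""
    total_requests = dataset_size * repeats
    if total_requests <= preemption_free_capacity:
        return total_requests, "request-bound"
    return preemption_free_capacity, "capacity-bound"


print(parallelism_ceiling(198, 1, 512))  # (198, 'request-bound')
print(parallelism_ceiling(198, 8, 512))  # (512, 'capacity-bound')
```

In the request-bound case the table suggests rounding up to absorb uneven DP routing; that rounding is left to the caller here.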

**Sizing `--max-num-seqs` vs KV** — capped by `context × concurrent seqs`; high
`max_new_tokens` shrinks the batch. Read vLLM startup `# GPU blocks` /
`Maximum concurrency for <max-model-len> tokens` (full-length floor — you fit more at
shorter context). Canary: `Preempted N` → lower; KV usage ≪100% with no preemption →
raise. **Relaxed by:** low-precision weights; **KV-cache quantization** — checkpoint
`kv_cache_scheme` **or serve-time `--kv-cache-dtype fp8`** (`fp8_e4m3`/`fp8_e5m2`) in
`deployment.command`, ~halving KV → ~2× concurrency/context (verify support; small
accuracy effect); and **hybrid/linear-attention** (near-constant KV).
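
The "full-length floor" remark can be made concrete with rough arithmetic. This sketch assumes a plain paged KV cache where the token capacity is `num_gpu_blocks × block_size` and each sequence occupies its full context; the block count and the block size of 16 are illustrative values, not taken from a real log, and hybrid or linear-attention models will not follow this linear relation.

```python
def max_concurrency(num_gpu_blocks: int, context_tokens: int,
                    block_size: int = 16) -> float:
    """Approximate sequences of `context_tokens` that fit in the KV pool."""
    kv_token_capacity = num_gpu_blocks * block_size
    return kv_token_capacity / context_tokens


blocks = 65536  # hypothetical "# GPU blocks" value for one replica

print(max_concurrency(blocks, 131072))      # 8.0  (full --max-model-len)
print(max_concurrency(blocks, 16384))       # 64.0 (shorter working context)
print(max_concurrency(blocks * 2, 131072))  # 16.0 (pool doubled, e.g. KV roughly halved)
```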

## Balanced sizing — bigger is NOT always faster (esp. long context)

Past the KV-fit point throughput doesn't just plateau, it **regresses** — worst for
long-context / long-output:

1. **Preemption thrash** — over-admitted seqs get preempted; recomputing a ~120K
prefill is huge wasted work, so a modest preemption-free concurrency finishes *sooner*.
2. **Prefill/decode contention** — many long prefills split `--max-num-batched-tokens`
and starve decode.
3. **Timeout cascade** — too many in-flight → p99 > `request_timeout` → `max_retries`
resubmissions pile on more load.

Sustainable concurrency is **context-dependent** — a `parallelism` good for GPQA
(short) thrashes AA-LCR (~120K). **Rule:** target ~**70–80% of the preemption-free
KV-fit concurrency at the task's working context × DP**; give long-context/long-output
tasks a **lower per-task override**; canary-tune up only while throughput↑,
preemption≈0, p99 < `request_timeout`; **err low** for long context (too-small mildly
underutilizes; too-large is *multiples* slower).
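
The rule above can be sketched as a starting point plus a gate for raising it. The names are hypothetical, the 0.75 default is just the middle of the 70-80% range, and treating "preemption≈0" as exactly zero is a simplifying assumption.

```python
import math


def target_parallelism(kv_fit_concurrency_per_replica: float, dp: int,
                       fraction: float = 0.75) -> int:
    """Starting parallelism: a fraction of preemption-free KV-fit concurrency x DP."""
    return math.floor(kv_fit_concurrency_per_replica * fraction * dp)


def can_raise(throughput_gain: float, preempted: int,
              p99_seconds: float, request_timeout_seconds: float) -> bool:
    """Raise only while throughput improves, nothing is preempted, and p99 is safe."""
    return (throughput_gain > 0
            and preempted == 0
            and p99_seconds < request_timeout_seconds)


# Hypothetical canary numbers for a long-context task on DP=8.
print(target_parallelism(8, dp=8))                                  # 48
print(can_raise(0.1, preempted=0, p99_seconds=1200,
                request_timeout_seconds=3600))                      # True
print(can_raise(0.1, preempted=3, p99_seconds=1200,
                request_timeout_seconds=3600))                      # False
```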

## Suites — set `parallelism` per task, not per run

Suite tasks hit **different bottlenecks** against one deployment; use a top-level
default for short model-bound tasks and override the outliers:

| Bottleneck | AA tasks | Cap by |
| --- | --- | --- |
| Model / GPU KV (short) | `gpqa_diamond_aa_v3`, `ns_ifbench` | top-level default (preemption-free KV-fit) |
| Long-context KV (~120K) | `ns_aa_lcr` | **low** override — prefill thrash; MLA ≫ GQA |
| Judge / user-sim rate limit | `ns_hle_aa`, `ns_aa_lcr`, `tau2_bench_telecom` | judge endpoint 429s, **not** the model |
| Sandbox execution | `ns_scicode` | sandbox slots |

- Judge/sandbox tasks bottleneck **before** the model — over-parallelizing yields
429s/retries, not speed; cap to the endpoint, tune by *its* errors.
- `--max-num-seqs = ceil(max parallelism across tasks / DP)` (deployment must serve the
busiest task) even if long-context tasks run lower.
Comment on lines +147 to +148

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix --max-num-seqs suite formula to include num_instances.

This line conflicts with the capacity definition above (Lines 86–91). For HAProxy/multi-instance setups, omitting num_instances will oversize --max-num-seqs per replica and can induce avoidable KV pressure/preemption.

Suggested doc fix
-- `--max-num-seqs = ceil(max parallelism across tasks / DP)` (deployment must serve the
+- `--max-num-seqs = ceil(max parallelism across tasks / (DP × num_instances))` (deployment must serve the
   busiest task) even if long-context tasks run lower.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/evaluation/references/parallelism.md around lines 147 - 148,
The current documentation formula for `--max-num-seqs` is missing
`num_instances`; update the formula so it divides the busiest-task parallelism
by both DP and num_instances (e.g., `--max-num-seqs = ceil(max parallelism
across tasks / (DP * num_instances))`) so each replica is sized correctly in
HAProxy/multi-instance deployments; change the line referencing `--max-num-seqs`
and ensure any explanatory text mentions `num_instances` to avoid oversizing per
replica and extra KV pressure/preemption.

- Canary each class (model / judge / sandbox) separately. Endpoint/context-dependent
tasks (`ns_aa_lcr`, `tau2_bench_telecom`) ship `parallelism: ???` to force a choice.

## Worked examples (8×B200)

- **Dense 9B NVFP4** (~5–6 GB) → **TP=1/DP=8, no EP**. GPQA `n_samples=1` = 198 reqs
(request-bound) → `parallelism=256`, `max-num-seqs=32`. `n_samples=8` = 1584
(capacity-bound) → start 512; tune up only while preemption≈0 (~82K reasoning output
→ knee may be <1024).
- **Dense ~70B BF16, 8×H100/80GB** (~140 GB) → won't fit 1 GPU → **TP=2/DP=4, no EP**.
- **Large MoE ~235B-A22B** → EP on; layout `DP=8 + EP` (or `TP=8 + EP` if one replica
needs the full node for KV).
- **Trillion-scale MoE (Kimi-class ~1T, MLA) — bit-width flips the split:** FP8
(~1040 GB) is weight-bound → forced **TP=8/DP=1/EP**; 4-bit INT4/NVFP4 (~520–572 GB)
frees room → **TP=1/DP=8/EP**. INT4 ≈ NVFP4 → same layout (don't let `moonshotai/…`
vs `nvidia/…-NVFP4` mislead) — same reason a 4-bit Kimi needing TP=8 on 8×H200/640GB
switches to TP=1/DP=8 on 8×B200.