Commit 8ac3342

cjluo-nv and claude committed
AA-LCR: add per-task parallelism field; document per-task suite strategy
AA-LCR is both KV-bound (~120K input -> preemption thrash if over-parallelized) and judge-bound, so it needs a parallelism well below the model/GPU-bound tasks.

- lcr.md: add `parallelism: ???` at the params level (like tau2_bench_telecom), with a Params note that it must be set LOWER than the top-level default and how to tune it (start ~16-32 for GQA, several x more for MLA; watch preemption + 429s).
- parallelism.md: add a "Running a suite" section: parallelism is per-task, not per-run; top-level default for short model-bound tasks, per-task overrides for long-context (KV), judge/user-sim (rate limit), and sandbox (execution) bottlenecks; --max-num-seqs sized off the max across tasks; canary each class separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
1 parent 386b498 commit 8ac3342

2 files changed

Lines changed: 43 additions & 0 deletions

.claude/skills/evaluation/recipes/tasks/aa/lcr.md

Lines changed: 14 additions & 0 deletions
@@ -13,6 +13,19 @@ AA-LCR needs long context: plan for roughly 120K input tokens plus 16K
 generation tokens. Set deployment `--max-model-len` to at least `131072`, and
 use a larger value when the model supports it.
 
+**Parallelism: set this *lower* than the top-level default.** AA-LCR is the
+suite's most concurrency-sensitive task on two fronts at once. (1) *KV-bound:* each
+request carries ~120K input tokens, so its KV footprint is large and a high
+`parallelism` triggers preemption, and recomputing 120K-token prefills is hugely
+wasteful, so over-parallelizing here makes the run *slower*, not faster (see
+`references/parallelism.md`, "Balanced sizing"). (2) *Judge-bound:* the
+equality-checker endpoint rate-limits before your served model does. So give it an
+explicit per-task `parallelism` well below the model/GPU-bound tasks' value: start
+small (≈16–32 for GQA models; MLA models such as Kimi tolerate several× more) and
+raise only while preemption ≈ 0 and the judge shows no 429s. The field is left as
+`???`; after choosing a value, recompute the deployment's `--max-num-seqs` per
+SKILL.md Step 3 (sized off the *max* parallelism across all tasks).
+
 ## YAML Fragment
 
 LCR has a deployment-side requirement (`--max-model-len 131072`) and a task
@@ -35,6 +48,7 @@ block. Per SKILL.md Step 3, the deployment flag must live inside
 use_response_logging: false
 config:
 params:
+parallelism: ??? # set LOWER than top-level: long-context (KV-bound) + judge-bound; see body above. Recompute --max-num-seqs after setting.
 extra:
 num_repeats: 16
 judge:
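
The KV-bound argument in the lcr.md hunk above can be made concrete with a rough back-of-the-envelope estimate. The sketch below is not from the commit: the model shapes, the 800 GiB KV budget, and all function names are hypothetical, and it assumes a GQA cache stores K and V per KV head while an MLA cache stores one compressed latent plus one decoupled RoPE key per token per layer. It ignores paging overhead, so treat the result as an upper bound to start canarying from, not a value to set.

```python
GIB = 1024 ** 3


def gqa_kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """K and V each hold num_kv_heads * head_dim values per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def mla_kv_bytes_per_token(num_layers, latent_dim, rope_dim, dtype_bytes=2):
    """Assumed MLA layout: one compressed latent plus one RoPE key per layer."""
    return num_layers * (latent_dim + rope_dim) * dtype_bytes


def resident_requests(kv_budget_bytes, tokens_per_request, bytes_per_token):
    """How many full-length requests fit in the KV budget at once."""
    return kv_budget_bytes // (tokens_per_request * bytes_per_token)


if __name__ == "__main__":
    tokens = 120_000 + 16_000   # ~120K input plus 16K generation per request
    budget = 800 * GIB          # hypothetical KV space left after weights

    gqa = gqa_kv_bytes_per_token(num_layers=64, num_kv_heads=8, head_dim=128)
    mla = mla_kv_bytes_per_token(num_layers=61, latent_dim=512, rope_dim=64)

    print("GQA resident requests:", resident_requests(budget, tokens, gqa))
    print("MLA resident requests:", resident_requests(budget, tokens, mla))
```

With these illustrative shapes the GQA model fits about 24 full-length requests and the MLA model about 89, which is consistent with the "≈16–32 for GQA, several× more for MLA" starting guidance. Real deployments should still be confirmed by watching preemption during a canary run.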

.claude/skills/evaluation/references/parallelism.md

Lines changed: 29 additions & 0 deletions
@@ -272,6 +272,35 @@ AA-LCR (~120K input). **Never inherit a short task's `parallelism` for a long on
 the top-level and all per-task overrides (the deployment must support the busiest
 task), even though long-context tasks themselves run at a lower `parallelism`.
 
+## Running a suite: `parallelism` is per-task, not per-run
+
+A benchmark suite (e.g. AA) runs tasks with **different bottlenecks** against one
+deployment, so a single suite-wide `parallelism` is wrong. Set a top-level **default**
+for the model/GPU-bound short tasks, then **override the outliers**:
+
+| Bottleneck | Example AA tasks | Set `parallelism` by |
+| --- | --- | --- |
+| Model / GPU KV (short in) | `gpqa_diamond_aa_v3`, `ns_ifbench` | top-level default (preemption-free KV-fit) |
+| **Long-context KV** (~120K in) | `ns_aa_lcr` | **LOW** per-task override: prefill thrash; model-dependent (MLA ≫ GQA) |
+| **Judge / user-sim rate limit** | `ns_hle_aa`, `ns_aa_lcr`, `tau2_bench_telecom` | the judge endpoint (429s), **not** your model |
+| **Sandbox execution** | `ns_scicode` | concurrent sandbox slots |
+
+Rules:
+
+- **Judge/sandbox-bound tasks bottleneck *before* the model**: over-parallelizing
+  them yields 429s/retries (the timeout cascade), not speed. Cap to the endpoint and
+  tune by *its* errors, independent of GPU KV.
+- **Long-context tasks (AA-LCR) are KV-bound**: give them an explicit **low**
+  per-task `parallelism` (see Balanced sizing); never inherit the short-task default.
+- `--max-num-seqs` is sized off the **max** `parallelism` across all tasks (the
+  deployment must serve the busiest one), even though the long-context / judge-bound
+  tasks themselves run lower.
+- **Canary each bottleneck class separately** (model-only / judge-scored / sandbox)
+  and tune that task by its own signal: preemption, judge 429s, or sandbox saturation.
+
+Tasks whose `parallelism` is endpoint- or context-dependent ship with the field as
+`???` in their recipe (`ns_aa_lcr`, `tau2_bench_telecom`) so it's a conscious choice.
+
 ---
 
 ## Worked examples
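
The suite rules added in the parallelism.md hunk above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not code from the repository: the function names, the numeric values, and the double-or-halve tuning step are all hypothetical, and SKILL.md Step 3 may add headroom on top of the plain maximum used here. Only the task names and the `???` placeholder come from the diff.

```python
UNSET = "???"  # placeholder the recipes ship with for endpoint/context-dependent tasks


def effective_parallelism(top_level_default, overrides, tasks):
    """Resolve each task's parallelism: per-task override if present, else the default."""
    resolved = {}
    for task in tasks:
        value = overrides.get(task, top_level_default)
        if value == UNSET:
            # Force the conscious choice the recipe asks for.
            raise ValueError(f"{task}: parallelism is still '???'; choose a value")
        resolved[task] = int(value)
    return resolved


def max_num_seqs_floor(resolved):
    """The deployment must serve the busiest task, so size off the maximum."""
    return max(resolved.values())


def next_parallelism(current, preemptions, judge_429s, ceiling):
    """One canary step (assumed policy): raise only while both signals are clean."""
    if preemptions > 0 or judge_429s > 0:
        return max(1, current // 2)   # back off on preemption or rate limiting
    return min(ceiling, current * 2)  # otherwise probe higher, up to a cap


if __name__ == "__main__":
    tasks = ["gpqa_diamond_aa_v3", "ns_ifbench", "ns_aa_lcr", "ns_hle_aa"]
    overrides = {"ns_aa_lcr": 16, "ns_hle_aa": 32}  # illustrative values only
    resolved = effective_parallelism(128, overrides, tasks)
    print(resolved)
    print("--max-num-seqs >=", max_num_seqs_floor(resolved))
```

Keeping resolution and tuning separate mirrors the rule that each bottleneck class is canaried on its own signal: `next_parallelism` would be driven by preemption counts for `ns_aa_lcr` but by judge 429s for `ns_hle_aa`, while `--max-num-seqs` is derived once from the resolved maximum.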
