AA-LCR: add per-task parallelism field; document per-task suite strategy

cjluo-nv · claude · cjluo-nv · commit 8ac3342f9dff · 2026-06-01T13:31:37.000-07:00
AA-LCR is both KV-bound (~120K input -&gt; preemption thrash if over-parallelized)
and judge-bound, so it needs a parallelism well below the model/GPU-bound tasks.

- lcr.md: add `parallelism: ???` at the params level (like tau2_bench_telecom),
  with a Params note that it must be set LOWER than the top-level default and how
  to tune it (start ~16-32 for GQA, several x more for MLA; watch preemption + 429s).
- parallelism.md: add a "Running a suite" section — parallelism is per-task, not
  per-run; top-level default for short model-bound tasks, per-task overrides for
  long-context (KV), judge/user-sim (rate limit), and sandbox (execution) bottlenecks;
  --max-num-seqs sized off the max across tasks; canary each class separately.

Co-Authored-By: Claude Opus 4.8 (1M context) &lt;noreply@anthropic.com&gt;
Signed-off-by: Chenjie Luo &lt;chenjiel@nvidia.com&gt;
diff --git a/.claude/skills/evaluation/recipes/tasks/aa/lcr.md b/.claude/skills/evaluation/recipes/tasks/aa/lcr.md
@@ -13,6 +13,19 @@ AA-LCR needs long context: plan for roughly 120K input tokens plus 16K
 generation tokens. Set deployment `--max-model-len` to at least `131072`, and
 use a larger value when the model supports it.
 
+**Parallelism — set this *lower* than the top-level default.** AA-LCR is the
+suite's most concurrency-sensitive task on two fronts at once. (1) *KV-bound:* each
+request carries ~120K input tokens, so its KV footprint is large and a high
+`parallelism` triggers preemption — and recomputing 120K-token prefills is hugely
+wasteful, so over-parallelizing here makes the run *slower*, not faster (see
+`references/parallelism.md`, "Balanced sizing"). (2) *Judge-bound:* the
+equality-checker endpoint rate-limits before your served model does. So give it an
+explicit per-task `parallelism` well below the model/GPU-bound tasks' value: start
+small (≈16–32 for GQA models; MLA models such as Kimi tolerate several× more) and
+raise only while preemption ≈ 0 and the judge shows no 429s. The field is left as
+`???`; after choosing a value, recompute the deployment's `--max-num-seqs` per
+SKILL.md Step 3 (sized off the *max* parallelism across all tasks).
+
 ## YAML Fragment
 
 LCR has a deployment-side requirement (`--max-model-len 131072`) and a task
@@ -35,6 +48,7 @@ block. Per SKILL.md Step 3, the deployment flag must live inside
           use_response_logging: false
     config:
       params:
+        parallelism: ???   # set LOWER than top-level: long-context (KV-bound) + judge-bound; see body above. Recompute --max-num-seqs after setting.
         extra:
           num_repeats: 16
           judge:
diff --git a/.claude/skills/evaluation/references/parallelism.md b/.claude/skills/evaluation/references/parallelism.md
@@ -272,6 +272,35 @@ AA-LCR (~120K input). **Never inherit a short task's `parallelism` for a long on
   the top-level and all per-task overrides (the deployment must support the busiest
   task), even though long-context tasks themselves run at a lower `parallelism`.
 
+## Running a suite: `parallelism` is per-task, not per-run
+
+A benchmark suite (e.g. AA) runs tasks with **different bottlenecks** against one
+deployment, so a single suite-wide `parallelism` is wrong. Set a top-level **default**
+for the model/GPU-bound short tasks, then **override the outliers**:
+
+| Bottleneck | Example AA tasks | Set `parallelism` by |
+| --- | --- | --- |
+| Model / GPU KV (short in) | `gpqa_diamond_aa_v3`, `ns_ifbench` | top-level default (preemption-free KV-fit) |
+| **Long-context KV** (~120K in) | `ns_aa_lcr` | **LOW** per-task override — prefill thrash; model-dependent (MLA ≫ GQA) |
+| **Judge / user-sim rate limit** | `ns_hle_aa`, `ns_aa_lcr`, `tau2_bench_telecom` | the judge endpoint (429s), **not** your model |
+| **Sandbox execution** | `ns_scicode` | concurrent sandbox slots |
+
+Rules:
+
+- **Judge/sandbox-bound tasks bottleneck *before* the model** — over-parallelizing
+  them yields 429s/retries (the timeout cascade), not speed. Cap to the endpoint and
+  tune by *its* errors, independent of GPU KV.
+- **Long-context tasks (AA-LCR) are KV-bound** — give them an explicit **low**
+  per-task `parallelism` (see Balanced sizing); never inherit the short-task default.
+- `--max-num-seqs` is sized off the **max** `parallelism` across all tasks (the
+  deployment must serve the busiest one), even though the long-context / judge-bound
+  tasks themselves run lower.
+- **Canary each bottleneck class separately** (model-only / judge-scored / sandbox)
+  and tune that task by its own signal — preemption, judge 429s, or sandbox saturation.
+
+Tasks whose `parallelism` is endpoint- or context-dependent ship with the field as
+`???` in their recipe (`ns_aa_lcr`, `tau2_bench_telecom`) so it's a conscious choice.
+
 ---
 
 ## Worked examples