Add parallelism sizing reference to evaluation skill#1583
Conversation
📝 WalkthroughWalkthroughAdds a new parallelism reference and updates SKILL.md plus an AA-LCR recipe to use it, clarifying GPU topology (TP/DP/PP/EP), concurrency ( ChangesvLLM Parallelism Documentation & Workflow Integration
🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 6✅ Passed checks (6 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
meenchen
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Pure documentation addition for the agent-facing evaluation skill. Adds a new references/parallelism.md (consistent in style and cross-referencing with sibling refs like multi-node.md and quantization-benchmarks.md) and wires SKILL.md to it from the relevant sections (deployment defaults, EP flag, evaluation params, Step 4, canary). Content is internally consistent (EP=TP×DP, max-num-seqs=ceil(parallelism/(DP×num_instances)), agrees with the existing multi-node.md definitions of num_instances vs data_parallel_size). No code/runtime behavior change, no licensing concerns, tests N/A.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1583 +/- ##
=======================================
Coverage 73.22% 73.22%
=======================================
Files 478 478
Lines 52421 52421
=======================================
Hits 38387 38387
Misses 14034 14034
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| @@ -0,0 +1,273 @@ | |||
| # Parallelism: GPU topology (TP / DP / PP / EP) and concurrency (`parallelism` / `--max-num-seqs`) | |||
There was a problem hiding this comment.
Do you think the model already knows about this background info from pre-training? Is it useful to repeat it here?
There was a problem hiding this comment.
I think it's better to be explicit here
There was a problem hiding this comment.
Good point — trimmed. I compressed the doc ~52% (345→165 lines, commit e4368c1): cut the generic TP/DP/PP/EP background the model already knows and merged the overlapping sections, keeping only the non-obvious decision rules, formulas, gotchas (e.g. bit-width sets the topology), and the worked examples. Happy to cut further if any of what's left still reads as restated background.
| | `total_requests ≤ serving_capacity` (small run) | `total_requests` (round up a little for uneven DP routing) | All requests dispatch at once → one wave → finishes in ~one generation-time. Higher is wasted. | | ||
| | `total_requests ≫ serving_capacity` (large run) | `serving_capacity` (largest the GPUs sustain) | Throughput-bound: keep every decode slot full until the queue drains. Request count no longer matters. | | ||
|
|
||
| So "set it higher" is right **only up to the request count**; past that you just |
There was a problem hiding this comment.
I think we mention it somewhere else, but maximizing the throughput isn't necessarily the correct goal because we can overwhelm dependent services in the evaluation, like user simulator, code execution envs, etc.
There was a problem hiding this comment.
for tau2, we will apply a specific cap in the tau2 yaml description.
There was a problem hiding this comment.
actually I think you are right. I removed this from the latest commit.
There was a problem hiding this comment.
Agreed, and addressed directly. The doc no longer treats throughput-max as the goal:
- "Balanced sizing — bigger is NOT always faster" (commit 386b498) spells out that over-parallelizing regresses — preemption thrash, prefill/decode contention, and the latency→timeout→retry cascade — and says to err low for long context.
- "Suites — set parallelism per task" (commit 8ac3342) calls out exactly your point: judge / user-simulator / sandbox tasks bottleneck before the served model, so cap them to the dependent endpoint (429s), not the GPU. AA-LCR and tau2 ship
parallelism: ???to force a conscious, lower value.
| > **Gotcha — bit-width sets the topology, not the model name.** The weight estimate | ||
| > hinges on `bytes_per_param`, so **read the actual precision from `config.json` | ||
| > (`quantization_config` / `quant_algo` / dtype) before sizing** — do not infer it | ||
| > from the org/handle. Two checkpoints of the *same architecture at the same | ||
| > bit-width* have the same footprint → the same TP/DP/EP, regardless of vendor or | ||
| > quant scheme (INT4 vs NVFP4 differ only in kernel/quant-method flags, which vLLM | ||
| > auto-detects — and a negligible effective-bit difference). The split only changes | ||
| > when the bit-width changes the *size* (see the Kimi example below). |
There was a problem hiding this comment.
We sometimes quantize kv cache using the arguments in vllm serve. Could you also add it to the note?
There was a problem hiding this comment.
Good point — added. The "Factors that relax the KV limit" note now covers serve-time KV-cache quantization via vLLM's --kv-cache-dtype fp8 (fp8_e4m3 / fp8_e5m2) in deployment.command, in addition to the checkpoint's kv_cache_scheme — noting it ~halves KV vs bf16/fp16 (~2× sustainable concurrency/context) and to verify model/recipe support. Done in 0839159.
### What does this PR do?
Type of change: documentation
Fixes two issues in the evaluation skill that caused completed eval runs
to silently **not** upload to MLflow (found while running a real GPQA
eval — the run finished successfully but never appeared in MLflow):
1. **Missing auto-export trigger.** NEL only auto-exports when
`execution.auto_export.destinations` is set; the `export.mlflow` block
alone just *configures* the exporter. The skill's shortcut said to "copy
the `export.mlflow` block" without calling out the trigger, so a config
could end up with the block but no upload.
2. **Interpolated export values break at submit time.** With auto-export
enabled, NEL resolves the `export.mlflow` block at submit time in a
scope that does **not** include `deployment` / `evaluation`, so
`${deployment.served_model_name}` / `${evaluation.*}` cross-references
fail hard with `Interpolation key '...' not found`. (`example_eval.yaml`
shipped this interpolated pattern *with* auto-export, so it would hit
the error.)
Changes:
- **`example_eval.yaml`**: mark the `auto_export.destinations` trigger
as REQUIRED; switch the `export.mlflow` block to **literal** values
(`CHANGEME-served-model-name` placeholders) with a comment explaining
why cross-refs break. `${oc.env:USER}` (env interpolation) is retained
since it resolves fine.
- **`SKILL.md`**: rewrite the MLflow shortcut step to require **both**
the trigger and literal export values, and warn against
`${deployment.*}` / `${evaluation.*}` cross-references.
### Usage
N/A — documentation/skill-template only.
### Testing
`pre-commit run` passes on both files (markdownlint + YAML format).
Verified the literal-export approach uploads correctly by exporting a
real run to MLflow.
### Before your PR is "*Ready for review*"
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A (documentation)
- Did you update Changelog?: N/A (skill docs)
- Did you get Claude approval on this PR?: ❌ (pending)
### Additional Information
Separate from #1583 (parallelism reference); both touch the evaluation
skill but are independent.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Documentation**
* Clarified MLflow auto-export: auto-export must be explicitly enabled
for uploads and the MLflow export block is ignored otherwise.
* Example config now uses literal experiment name, description, and tags
(no cross-scope interpolation) and shows concrete sampling values.
* Warned to keep sampling params consistent between tags and runtime
params and to set the tracking URI.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add references/parallelism.md documenting how to choose deployment topology (TP / DP / PP, expert parallelism) and concurrency (parallelism / --max-num-seqs) for NEL evaluation runs: - GPU topology decision procedure: smallest-TP-then-max-DP, the TP-up triggers, and the TP/DP split tradeoff for a fixed world size. - Expert parallelism: EP = TP x DP (boolean flag, no direct EP size), the DP-attention + EP-MoE dataflow, and when to enable vs not. - Concurrency sizing: request-count vs serving-capacity ceiling, KV-driven --max-num-seqs, empirical tuning from vLLM logs. - Gotcha: bit-width (read from config.json), not the model name, sets the topology; worked examples incl. FP8-vs-4-bit Kimi. Wire SKILL.md to the new reference from the deployment-command, expert-parallel, evaluation-params, Step 4, and canary sections. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
…ster) Higher parallelism is not universally better — past the KV-fit point it regresses, worst for long-context / long-output tasks. Document the three mechanisms (preemption thrash on huge prefills, prefill/decode contention, latency->timeout->retry cascade), that sustainable concurrency is context-dependent, and a balanced rule: target ~70-80% of the preemption-free KV-fit concurrency at the task's working context, give long-context tasks a lower per-task parallelism override, tune by canary while throughput rises / preemption ~0 / p99 within request_timeout, and err low when unsure. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
AA-LCR is both KV-bound (~120K input -> preemption thrash if over-parallelized) and judge-bound, so it needs a parallelism well below the model/GPU-bound tasks. - lcr.md: add `parallelism: ???` at the params level (like tau2_bench_telecom), with a Params note that it must be set LOWER than the top-level default and how to tune it (start ~16-32 for GQA, several x more for MLA; watch preemption + 429s). - parallelism.md: add a "Running a suite" section — parallelism is per-task, not per-run; top-level default for short model-bound tasks, per-task overrides for long-context (KV), judge/user-sim (rate limit), and sandbox (execution) bottlenecks; --max-num-seqs sized off the max across tasks; canary each class separately. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
7a1e110 to
8ac3342
Compare
Address PR review: KV-cache quantization can also be enabled at serve time via vLLM's --kv-cache-dtype fp8 (fp8_e4m3 / fp8_e5m2), not only baked into the checkpoint's kv_cache_scheme. It ~halves KV vs bf16/fp16 (~2x sustainable concurrency/context) and is a knob in deployment.command for KV-bound runs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Cut generic TP/DP/PP/EP background the model already knows (per PR review) and merged the overlapping binding-constraint / balanced-sizing / non-GPU-caps / suite sections. All formulas, the bit-width gotcha, TP-up triggers, the layout table, serve-time KV-cache quantization, balanced-sizing regression, the per-task suite table, and worked examples are preserved. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.claude/skills/evaluation/references/parallelism.md:
- Around line 147-148: The current documentation formula for `--max-num-seqs` is
missing `num_instances`; update the formula so it divides the busiest-task
parallelism by both DP and num_instances (e.g., `--max-num-seqs = ceil(max
parallelism across tasks / (DP * num_instances))`) so each replica is sized
correctly in HAProxy/multi-instance deployments; change the line referencing
`--max-num-seqs` and ensure any explanatory text mentions `num_instances` to
avoid oversizing per replica and extra KV pressure/preemption.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 360d7b8e-c1b0-496b-8bc3-a60e67362e81
📒 Files selected for processing (1)
.claude/skills/evaluation/references/parallelism.md
| - `--max-num-seqs = ceil(max parallelism across tasks / DP)` (deployment must serve the | ||
| busiest task) even if long-context tasks run lower. |
There was a problem hiding this comment.
Fix --max-num-seqs suite formula to include num_instances.
This line conflicts with the capacity definition above (Lines 86–91). For HAProxy/multi-instance setups, omitting num_instances will oversize --max-num-seqs per replica and can induce avoidable KV pressure/preemption.
Suggested doc fix
-- `--max-num-seqs = ceil(max parallelism across tasks / DP)` (deployment must serve the
+- `--max-num-seqs = ceil(max parallelism across tasks / (DP × num_instances))` (deployment must serve the
busiest task) even if long-context tasks run lower.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - `--max-num-seqs = ceil(max parallelism across tasks / DP)` (deployment must serve the | |
| busiest task) even if long-context tasks run lower. | |
| - `--max-num-seqs = ceil(max parallelism across tasks / (DP × num_instances))` (deployment must serve the | |
| busiest task) even if long-context tasks run lower. |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.claude/skills/evaluation/references/parallelism.md around lines 147 - 148,
The current documentation formula for `--max-num-seqs` is missing
`num_instances`; update the formula so it divides the busiest-task parallelism
by both DP and num_instances (e.g., `--max-num-seqs = ceil(max parallelism
across tasks / (DP * num_instances))`) so each replica is sized correctly in
HAProxy/multi-instance deployments; change the line referencing `--max-num-seqs`
and ensure any explanatory text mentions `num_instances` to avoid oversizing per
replica and extra KV pressure/preemption.
|
What does this PR do?
Type of change: documentation
Adds
.claude/skills/evaluation/references/parallelism.md, a reference for sizing both the GPU topology and the request concurrency of NEL evaluation runs, and wires the mainSKILL.mdto it. Pure docs for the agent-facing evaluation skill — no code or runtime behavior changes.The new reference covers:
TP×DP=8factorizations on an 8-GPU node).--enable-expert-parallelis a boolean andEP = TP × DP(no direct EP-size flag); the DP-attention + EP-MoE dataflow (per-layer dispatch/combine all-to-all); when to enable vs not.parallelism/--max-num-seqs): the request-count-vs-serving-capacity ceiling, KV-driven--max-num-seqs, and empirical tuning from vLLM startup logs + preemption.config.json(not the model name) sets the topology; includes the FP8-vs-4-bit Kimi example.SKILL.mdgains pointers to the reference from the deployment-command, expert-parallel, evaluation-params, Step 4, and canary sections.Usage
N/A — documentation only (agent skill guidance).
Testing
pre-commit runpasses on both files (markdownlint included).Before your PR is "Ready for review"
CONTRIBUTING.md: N/AAdditional Information
Commits are signed (
git commit -s -S).🤖 Generated with Claude Code
Summary by CodeRabbit