Add parallelism sizing reference to evaluation skill by cjluo-nv · Pull Request #1583 · NVIDIA/Model-Optimizer

cjluo-nv · 2026-06-01T18:28:31Z

What does this PR do?

Type of change: documentation

Adds .claude/skills/evaluation/references/parallelism.md, a reference for sizing both the GPU topology and the request concurrency of NEL evaluation runs, and wires the main SKILL.md to it. Pure docs for the agent-facing evaluation skill — no code or runtime behavior changes.

The new reference covers:

GPU topology (TP / DP / PP): decision procedure (smallest-TP-then-max-DP), the TP-up triggers, and the TP/DP-split tradeoff for a fixed world size (e.g. all TP×DP=8 factorizations on an 8-GPU node).
Expert parallelism (EP): --enable-expert-parallel is a boolean and EP = TP × DP (no direct EP-size flag); the DP-attention + EP-MoE dataflow (per-layer dispatch/combine all-to-all); when to enable vs not.
Concurrency (parallelism / --max-num-seqs): the request-count-vs-serving-capacity ceiling, KV-driven --max-num-seqs, and empirical tuning from vLLM startup logs + preemption.
Gotcha + worked examples: bit-width read from config.json (not the model name) sets the topology; includes the FP8-vs-4-bit Kimi example.

SKILL.md gains pointers to the reference from the deployment-command, expert-parallel, evaluation-params, Step 4, and canary sections.

Usage

N/A — documentation only (agent skill guidance).

Testing

pre-commit run passes on both files (markdownlint included).

Before your PR is "Ready for review"

Is this change backward compatible?: ✅
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
Did you write any new necessary tests?: N/A (documentation)
Did you update Changelog?: N/A (skill docs, not a library feature)
Did you get Claude approval on this PR?: ❌ (pending)

Additional Information

Commits are signed (git commit -s -S).

🤖 Generated with Claude Code

Summary by CodeRabbit

Documentation
- Expanded deployment guidance with clearer GPU topology, MoE/EP semantics, and bit‑width caveats.
- Refined concurrency guidance: explicit rules for sizing top-level parallelism and computing max concurrent sequences, plus canary-run tuning to watch preemption and KV utilization.
- Added task-level guidance for long‑context/judge‑bound jobs with recomputation advice and worked examples.

coderabbitai · 2026-06-01T18:28:44Z

📝 Walkthrough

Walkthrough

Adds a new parallelism reference and updates SKILL.md plus an AA-LCR recipe to use it, clarifying GPU topology (TP/DP/PP/EP), concurrency (parallelism and --max-num-seqs = ceil(max_parallelism / data_parallel_size)), and canary tuning with vLLM concurrency/preemption signals.

Changes

vLLM Parallelism Documentation & Workflow Integration

Layer / File(s)	Summary
GPU topology decision framework `.claude/skills/evaluation/references/parallelism.md`	Introduced the two-layer decision model and TP/DP/PP roles, costs, divisibility/headroom constraints, and bit-width-driven topology guidance.
Single-node TP/DP/PP decision procedure `.claude/skills/evaluation/references/parallelism.md`	Procedure to pick TP (fit/KV headroom), DP (replica factor), and use PP only when necessary; validate via vLLM canaries and derive topology from `config.json` precision.
MoE expert parallelism (EP) `.claude/skills/evaluation/references/parallelism.md`	Documents `--enable-expert-parallel`, EP = TP×DP, MoE-layer all-to-all vs DP-local attention, detection cues, and layout examples.
Concurrency semantics & sizing `.claude/skills/evaluation/references/parallelism.md`, `.claude/skills/evaluation/SKILL.md`	Defines `parallelism` vs `--max-num-seqs`, capacity relationship, KV-cache-driven sizing using vLLM startup logs/canaries, and the rule to compute `--max-num-seqs = ceil(max_parallelism / data_parallel_size)` during Step 4.
Balanced sizing & canary tuning `.claude/skills/evaluation/references/parallelism.md`, `.claude/skills/evaluation/SKILL.md`	Adds balanced-sizing heuristics, preemption/thrash pitfalls, and canary tuning guidance (start conservative, raise only after judge/sandbox logs are clean and monitor vLLM max concurrency/preemption).
Per-task overrides & worked examples `.claude/skills/evaluation/references/parallelism.md`	Specifies per-task `parallelism` overrides for suites, mapping bottlenecks, and provides worked dense and MoE examples with concrete topology and concurrency calculations.
SKILL.md integration and workflow edits `.claude/skills/evaluation/SKILL.md`	Inserted references to `references/parallelism.md`, tightened `parallelism: ???` sizing to run shape vs GPU capacity, and clarified when Step 4 should ask users and how it appends `--max-num-seqs`.
AA-LCR task recipe parallelism `.claude/skills/evaluation/recipes/tasks/aa/lcr.md`	Adds AA-LCR-specific guidance to lower task `parallelism` for KV-bound and judge-bound cases, suggests starting ranges, and adds `parallelism: ???` with a comment to recompute `--max-num-seqs`.

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 6

✅ Passed checks (6 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and concisely describes the main objective of the PR: adding a new parallelism sizing reference document to the evaluation skill.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	PR is documentation-only (parallelism.md reference and SKILL.md/LCR.md updates). No Python code changes in evaluation skill; security anti-patterns check not applicable.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch docs/eval-parallelism-reference

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

meenchen

Bot review — DM the bot to share feedback.

Pure documentation addition for the agent-facing evaluation skill. Adds a new references/parallelism.md (consistent in style and cross-referencing with sibling refs like multi-node.md and quantization-benchmarks.md) and wires SKILL.md to it from the relevant sections (deployment defaults, EP flag, evaluation params, Step 4, canary). Content is internally consistent (EP=TP×DP, max-num-seqs=ceil(parallelism/(DP×num_instances)), agrees with the existing multi-node.md definitions of num_instances vs data_parallel_size). No code/runtime behavior change, no licensing concerns, tests N/A.

codecov · 2026-06-01T18:42:08Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.22%. Comparing base (54fb87e) to head (e4368c1).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1583   +/-   ##
=======================================
  Coverage   73.22%   73.22%           
=======================================
  Files         478      478           
  Lines       52421    52421           
=======================================
  Hits        38387    38387           
  Misses      14034    14034

Flag	Coverage Δ
unit	`53.61% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

chadvoegele · 2026-06-01T18:49:46Z

@@ -0,0 +1,273 @@
+# Parallelism: GPU topology (TP / DP / PP / EP) and concurrency (`parallelism` / `--max-num-seqs`)


Do you think the model already knows about this background info from pre-training? Is it useful to repeat it here?

I think it's better to be explicit here

Good point — trimmed. I compressed the doc ~52% (345→165 lines, commit e4368c1): cut the generic TP/DP/PP/EP background the model already knows and merged the overlapping sections, keeping only the non-obvious decision rules, formulas, gotchas (e.g. bit-width sets the topology), and the worked examples. Happy to cut further if any of what's left still reads as restated background.

chadvoegele · 2026-06-01T18:51:14Z

+| `total_requests ≤ serving_capacity` (small run) | `total_requests` (round up a little for uneven DP routing) | All requests dispatch at once → one wave → finishes in ~one generation-time. Higher is wasted. |
+| `total_requests ≫ serving_capacity` (large run) | `serving_capacity` (largest the GPUs sustain) | Throughput-bound: keep every decode slot full until the queue drains. Request count no longer matters. |
+
+So "set it higher" is right **only up to the request count**; past that you just


I think we mention it somewhere else, but maximizing the throughput isn't necessarily the correct goal because we can overwhelm dependent services in the evaluation, like user simulator, code execution envs, etc.

for tau2, we will apply a specific cap in the tau2 yaml description.

actually I think you are right. I removed this from the latest commit.

Agreed, and addressed directly. The doc no longer treats throughput-max as the goal:

"Balanced sizing — bigger is NOT always faster" (commit 386b498) spells out that over-parallelizing regresses — preemption thrash, prefill/decode contention, and the latency→timeout→retry cascade — and says to err low for long context.

"Suites — set parallelism per task" (commit 8ac3342) calls out exactly your point: judge / user-simulator / sandbox tasks bottleneck before the served model, so cap them to the dependent endpoint (429s), not the GPU. AA-LCR and tau2 ship parallelism: ??? to force a conscious, lower value.

meenchen · 2026-06-01T19:57:55Z

+> **Gotcha — bit-width sets the topology, not the model name.** The weight estimate
+> hinges on `bytes_per_param`, so **read the actual precision from `config.json`
+> (`quantization_config` / `quant_algo` / dtype) before sizing** — do not infer it
+> from the org/handle. Two checkpoints of the *same architecture at the same
+> bit-width* have the same footprint → the same TP/DP/EP, regardless of vendor or
+> quant scheme (INT4 vs NVFP4 differ only in kernel/quant-method flags, which vLLM
+> auto-detects — and a negligible effective-bit difference). The split only changes
+> when the bit-width changes the *size* (see the Kimi example below).


We sometimes quantize kv cache using the arguments in vllm serve. Could you also add it to the note?

Good point — added. The "Factors that relax the KV limit" note now covers serve-time KV-cache quantization via vLLM's --kv-cache-dtype fp8 (fp8_e4m3 / fp8_e5m2) in deployment.command, in addition to the checkpoint's kv_cache_scheme — noting it ~halves KV vs bf16/fp16 (~2× sustainable concurrency/context) and to verify model/recipe support. Done in 0839159.

### What does this PR do? Type of change: documentation Fixes two issues in the evaluation skill that caused completed eval runs to silently **not** upload to MLflow (found while running a real GPQA eval — the run finished successfully but never appeared in MLflow): 1. **Missing auto-export trigger.** NEL only auto-exports when `execution.auto_export.destinations` is set; the `export.mlflow` block alone just *configures* the exporter. The skill's shortcut said to "copy the `export.mlflow` block" without calling out the trigger, so a config could end up with the block but no upload. 2. **Interpolated export values break at submit time.** With auto-export enabled, NEL resolves the `export.mlflow` block at submit time in a scope that does **not** include `deployment` / `evaluation`, so `${deployment.served_model_name}` / `${evaluation.*}` cross-references fail hard with `Interpolation key '...' not found`. (`example_eval.yaml` shipped this interpolated pattern *with* auto-export, so it would hit the error.) Changes: - **`example_eval.yaml`**: mark the `auto_export.destinations` trigger as REQUIRED; switch the `export.mlflow` block to **literal** values (`CHANGEME-served-model-name` placeholders) with a comment explaining why cross-refs break. `${oc.env:USER}` (env interpolation) is retained since it resolves fine. - **`SKILL.md`**: rewrite the MLflow shortcut step to require **both** the trigger and literal export values, and warn against `${deployment.*}` / `${evaluation.*}` cross-references. ### Usage N/A — documentation/skill-template only. ### Testing `pre-commit run` passes on both files (markdownlint + YAML format). Verified the literal-export approach uploads correctly by exporting a real run to MLflow. ### Before your PR is "*Ready for review*" - Is this change backward compatible?: ✅ - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A - Did you write any new necessary tests?: N/A (documentation) - Did you update Changelog?: N/A (skill docs) - Did you get Claude approval on this PR?: ❌ (pending) ### Additional Information Separate from #1583 (parallelism reference); both touch the evaluation skill but are independent. 🤖 Generated with [Claude Code](https://claude.com/claude-code)  ## Summary by CodeRabbit * **Documentation** * Clarified MLflow auto-export: auto-export must be explicitly enabled for uploads and the MLflow export block is ignored otherwise. * Example config now uses literal experiment name, description, and tags (no cross-scope interpolation) and shows concrete sampling values. * Warned to keep sampling params consistent between tags and runtime params and to set the tracking URI.  --------- Signed-off-by: Chenjie Luo <chenjiel@nvidia.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add references/parallelism.md documenting how to choose deployment topology (TP / DP / PP, expert parallelism) and concurrency (parallelism / --max-num-seqs) for NEL evaluation runs: - GPU topology decision procedure: smallest-TP-then-max-DP, the TP-up triggers, and the TP/DP split tradeoff for a fixed world size. - Expert parallelism: EP = TP x DP (boolean flag, no direct EP size), the DP-attention + EP-MoE dataflow, and when to enable vs not. - Concurrency sizing: request-count vs serving-capacity ceiling, KV-driven --max-num-seqs, empirical tuning from vLLM logs. - Gotcha: bit-width (read from config.json), not the model name, sets the topology; worked examples incl. FP8-vs-4-bit Kimi. Wire SKILL.md to the new reference from the deployment-command, expert-parallel, evaluation-params, Step 4, and canary sections. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

…ster) Higher parallelism is not universally better — past the KV-fit point it regresses, worst for long-context / long-output tasks. Document the three mechanisms (preemption thrash on huge prefills, prefill/decode contention, latency->timeout->retry cascade), that sustainable concurrency is context-dependent, and a balanced rule: target ~70-80% of the preemption-free KV-fit concurrency at the task's working context, give long-context tasks a lower per-task parallelism override, tune by canary while throughput rises / preemption ~0 / p99 within request_timeout, and err low when unsure. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

AA-LCR is both KV-bound (~120K input -> preemption thrash if over-parallelized) and judge-bound, so it needs a parallelism well below the model/GPU-bound tasks. - lcr.md: add `parallelism: ???` at the params level (like tau2_bench_telecom), with a Params note that it must be set LOWER than the top-level default and how to tune it (start ~16-32 for GQA, several x more for MLA; watch preemption + 429s). - parallelism.md: add a "Running a suite" section — parallelism is per-task, not per-run; top-level default for short model-bound tasks, per-task overrides for long-context (KV), judge/user-sim (rate limit), and sandbox (execution) bottlenecks; --max-num-seqs sized off the max across tasks; canary each class separately. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

Address PR review: KV-cache quantization can also be enabled at serve time via vLLM's --kv-cache-dtype fp8 (fp8_e4m3 / fp8_e5m2), not only baked into the checkpoint's kv_cache_scheme. It ~halves KV vs bf16/fp16 (~2x sustainable concurrency/context) and is a knob in deployment.command for KV-bound runs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

Cut generic TP/DP/PP/EP background the model already knows (per PR review) and merged the overlapping binding-constraint / balanced-sizing / non-GPU-caps / suite sections. All formulas, the bit-width gotcha, TP-up triggers, the layout table, serve-time KV-cache quantization, balanced-sizing regression, the per-task suite table, and worked examples are preserved. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/skills/evaluation/references/parallelism.md:
- Around line 147-148: The current documentation formula for `--max-num-seqs` is
missing `num_instances`; update the formula so it divides the busiest-task
parallelism by both DP and num_instances (e.g., `--max-num-seqs = ceil(max
parallelism across tasks / (DP * num_instances))`) so each replica is sized
correctly in HAProxy/multi-instance deployments; change the line referencing
`--max-num-seqs` and ensure any explanatory text mentions `num_instances` to
avoid oversizing per replica and extra KV pressure/preemption.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 360d7b8e-c1b0-496b-8bc3-a60e67362e81

📥 Commits

Reviewing files that changed from the base of the PR and between 0839159 and e4368c1.

📒 Files selected for processing (1)

.claude/skills/evaluation/references/parallelism.md

coderabbitai · 2026-06-01T20:47:22Z

+- `--max-num-seqs = ceil(max parallelism across tasks / DP)` (deployment must serve the
+  busiest task) even if long-context tasks run lower.


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix --max-num-seqs suite formula to include num_instances.

This line conflicts with the capacity definition above (Lines 86–91). For HAProxy/multi-instance setups, omitting num_instances will oversize --max-num-seqs per replica and can induce avoidable KV pressure/preemption.

Suggested doc fix

-- `--max-num-seqs = ceil(max parallelism across tasks / DP)` (deployment must serve the +- `--max-num-seqs = ceil(max parallelism across tasks / (DP × num_instances))` (deployment must serve the busiest task) even if long-context tasks run lower.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

- `--max-num-seqs = ceil(max parallelism across tasks / DP)` (deployment must serve the

busiest task) even if long-context tasks run lower.

- `--max-num-seqs = ceil(max parallelism across tasks / (DP × num_instances))` (deployment must serve the

busiest task) even if long-context tasks run lower.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.claude/skills/evaluation/references/parallelism.md around lines 147 - 148, The current documentation formula for `--max-num-seqs` is missing `num_instances`; update the formula so it divides the busiest-task parallelism by both DP and num_instances (e.g., `--max-num-seqs = ceil(max parallelism across tasks / (DP * num_instances))`) so each replica is sized correctly in HAProxy/multi-instance deployments; change the line referencing `--max-num-seqs` and ensure any explanatory text mentions `num_instances` to avoid oversizing per replica and extra KV pressure/preemption.

github-actions · 2026-06-01T21:34:06Z

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-01 21:33 UTC

cjluo-nv requested review from chadvoegele, kaix-nv and meenchen June 1, 2026 18:29

meenchen approved these changes Jun 1, 2026

View reviewed changes

coderabbitai Bot approved these changes Jun 1, 2026

View reviewed changes

cjluo-nv mentioned this pull request Jun 1, 2026

Fix MLflow auto-export in evaluation skill template #1586

Merged

chadvoegele reviewed Jun 1, 2026

View reviewed changes

meenchen reviewed Jun 1, 2026

View reviewed changes

cjluo-nv and others added 3 commits June 1, 2026 13:18

cjluo-nv force-pushed the docs/eval-parallelism-reference branch from 7a1e110 to 8ac3342 Compare June 1, 2026 20:36

cjluo-nv and others added 2 commits June 1, 2026 13:39

coderabbitai Bot reviewed Jun 1, 2026

View reviewed changes

meenchen approved these changes Jun 1, 2026

View reviewed changes

cjluo-nv merged commit 38c7843 into main Jun 1, 2026
39 checks passed

cjluo-nv deleted the docs/eval-parallelism-reference branch June 1, 2026 21:33

cjluo-nv mentioned this pull request Jun 1, 2026

eval skill: parameterize external judge/user-sim endpoints via .env #1591

Open

coderabbitai Bot mentioned this pull request Jun 2, 2026

[skills] Fix eval/deploy defaults for NVFP4-on-Blackwell + nemo-skills evals #1595

Open

		@@ -0,0 +1,273 @@
		# Parallelism: GPU topology (TP / DP / PP / EP) and concurrency (`parallelism` / `--max-num-seqs`)

		- `--max-num-seqs = ceil(max parallelism across tasks / DP)` (deployment must serve the
		busiest task) even if long-context tasks run lower.

Conversation

cjluo-nv commented Jun 1, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Usage

Testing

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Uh oh!

meenchen left a comment

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions Bot commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

cjluo-nv commented Jun 1, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 1, 2026 •

edited

Loading

codecov Bot commented Jun 1, 2026 •

edited

Loading