Skip to content

Add parallelism sizing reference to evaluation skill#1583

Merged
cjluo-nv merged 5 commits into
mainfrom
docs/eval-parallelism-reference
Jun 1, 2026
Merged

Add parallelism sizing reference to evaluation skill#1583
cjluo-nv merged 5 commits into
mainfrom
docs/eval-parallelism-reference

Conversation

@cjluo-nv
Copy link
Copy Markdown
Collaborator

@cjluo-nv cjluo-nv commented Jun 1, 2026

What does this PR do?

Type of change: documentation

Adds .claude/skills/evaluation/references/parallelism.md, a reference for sizing both the GPU topology and the request concurrency of NEL evaluation runs, and wires the main SKILL.md to it. Pure docs for the agent-facing evaluation skill — no code or runtime behavior changes.

The new reference covers:

  • GPU topology (TP / DP / PP): decision procedure (smallest-TP-then-max-DP), the TP-up triggers, and the TP/DP-split tradeoff for a fixed world size (e.g. all TP×DP=8 factorizations on an 8-GPU node).
  • Expert parallelism (EP): --enable-expert-parallel is a boolean and EP = TP × DP (no direct EP-size flag); the DP-attention + EP-MoE dataflow (per-layer dispatch/combine all-to-all); when to enable vs not.
  • Concurrency (parallelism / --max-num-seqs): the request-count-vs-serving-capacity ceiling, KV-driven --max-num-seqs, and empirical tuning from vLLM startup logs + preemption.
  • Gotcha + worked examples: bit-width read from config.json (not the model name) sets the topology; includes the FP8-vs-4-bit Kimi example.

SKILL.md gains pointers to the reference from the deployment-command, expert-parallel, evaluation-params, Step 4, and canary sections.

Usage

N/A — documentation only (agent skill guidance).

Testing

pre-commit run passes on both files (markdownlint included).

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A (documentation)
  • Did you update Changelog?: N/A (skill docs, not a library feature)
  • Did you get Claude approval on this PR?: ❌ (pending)

Additional Information

Commits are signed (git commit -s -S).

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Documentation
    • Expanded deployment guidance with clearer GPU topology, MoE/EP semantics, and bit‑width caveats.
    • Refined concurrency guidance: explicit rules for sizing top-level parallelism and computing max concurrent sequences, plus canary-run tuning to watch preemption and KV utilization.
    • Added task-level guidance for long‑context/judge‑bound jobs with recomputation advice and worked examples.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 1, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

Adds a new parallelism reference and updates SKILL.md plus an AA-LCR recipe to use it, clarifying GPU topology (TP/DP/PP/EP), concurrency (parallelism and --max-num-seqs = ceil(max_parallelism / data_parallel_size)), and canary tuning with vLLM concurrency/preemption signals.

Changes

vLLM Parallelism Documentation & Workflow Integration

Layer / File(s) Summary
GPU topology decision framework
.claude/skills/evaluation/references/parallelism.md
Introduced the two-layer decision model and TP/DP/PP roles, costs, divisibility/headroom constraints, and bit-width-driven topology guidance.
Single-node TP/DP/PP decision procedure
.claude/skills/evaluation/references/parallelism.md
Procedure to pick TP (fit/KV headroom), DP (replica factor), and use PP only when necessary; validate via vLLM canaries and derive topology from config.json precision.
MoE expert parallelism (EP)
.claude/skills/evaluation/references/parallelism.md
Documents --enable-expert-parallel, EP = TP×DP, MoE-layer all-to-all vs DP-local attention, detection cues, and layout examples.
Concurrency semantics & sizing
.claude/skills/evaluation/references/parallelism.md, .claude/skills/evaluation/SKILL.md
Defines parallelism vs --max-num-seqs, capacity relationship, KV-cache-driven sizing using vLLM startup logs/canaries, and the rule to compute --max-num-seqs = ceil(max_parallelism / data_parallel_size) during Step 4.
Balanced sizing & canary tuning
.claude/skills/evaluation/references/parallelism.md, .claude/skills/evaluation/SKILL.md
Adds balanced-sizing heuristics, preemption/thrash pitfalls, and canary tuning guidance (start conservative, raise only after judge/sandbox logs are clean and monitor vLLM max concurrency/preemption).
Per-task overrides & worked examples
.claude/skills/evaluation/references/parallelism.md
Specifies per-task parallelism overrides for suites, mapping bottlenecks, and provides worked dense and MoE examples with concrete topology and concurrency calculations.
SKILL.md integration and workflow edits
.claude/skills/evaluation/SKILL.md
Inserted references to references/parallelism.md, tightened parallelism: ??? sizing to run shape vs GPU capacity, and clarified when Step 4 should ask users and how it appends --max-num-seqs.
AA-LCR task recipe parallelism
.claude/skills/evaluation/recipes/tasks/aa/lcr.md
Adds AA-LCR-specific guidance to lower task parallelism for KV-bound and judge-bound cases, suggests starting ranges, and adds parallelism: ??? with a comment to recompute --max-num-seqs.

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely describes the main objective of the PR: adding a new parallelism sizing reference document to the evaluation skill.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed PR is documentation-only (parallelism.md reference and SKILL.md/LCR.md updates). No Python code changes in evaluation skill; security anti-patterns check not applicable.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch docs/eval-parallelism-reference

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@meenchen meenchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot review — DM the bot to share feedback.

Pure documentation addition for the agent-facing evaluation skill. Adds a new references/parallelism.md (consistent in style and cross-referencing with sibling refs like multi-node.md and quantization-benchmarks.md) and wires SKILL.md to it from the relevant sections (deployment defaults, EP flag, evaluation params, Step 4, canary). Content is internally consistent (EP=TP×DP, max-num-seqs=ceil(parallelism/(DP×num_instances)), agrees with the existing multi-node.md definitions of num_instances vs data_parallel_size). No code/runtime behavior change, no licensing concerns, tests N/A.

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.22%. Comparing base (54fb87e) to head (e4368c1).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1583   +/-   ##
=======================================
  Coverage   73.22%   73.22%           
=======================================
  Files         478      478           
  Lines       52421    52421           
=======================================
  Hits        38387    38387           
  Misses      14034    14034           
Flag Coverage Δ
unit 53.61% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@@ -0,0 +1,273 @@
# Parallelism: GPU topology (TP / DP / PP / EP) and concurrency (`parallelism` / `--max-num-seqs`)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think the model already knows about this background info from pre-training? Is it useful to repeat it here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to be explicit here

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point — trimmed. I compressed the doc ~52% (345→165 lines, commit e4368c1): cut the generic TP/DP/PP/EP background the model already knows and merged the overlapping sections, keeping only the non-obvious decision rules, formulas, gotchas (e.g. bit-width sets the topology), and the worked examples. Happy to cut further if any of what's left still reads as restated background.

| `total_requests ≤ serving_capacity` (small run) | `total_requests` (round up a little for uneven DP routing) | All requests dispatch at once → one wave → finishes in ~one generation-time. Higher is wasted. |
| `total_requests ≫ serving_capacity` (large run) | `serving_capacity` (largest the GPUs sustain) | Throughput-bound: keep every decode slot full until the queue drains. Request count no longer matters. |

So "set it higher" is right **only up to the request count**; past that you just
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we mention it somewhere else, but maximizing the throughput isn't necessarily the correct goal because we can overwhelm dependent services in the evaluation, like user simulator, code execution envs, etc.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for tau2, we will apply a specific cap in the tau2 yaml description.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually I think you are right. I removed this from the latest commit.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, and addressed directly. The doc no longer treats throughput-max as the goal:

  • "Balanced sizing — bigger is NOT always faster" (commit 386b498) spells out that over-parallelizing regresses — preemption thrash, prefill/decode contention, and the latency→timeout→retry cascade — and says to err low for long context.
  • "Suites — set parallelism per task" (commit 8ac3342) calls out exactly your point: judge / user-simulator / sandbox tasks bottleneck before the served model, so cap them to the dependent endpoint (429s), not the GPU. AA-LCR and tau2 ship parallelism: ??? to force a conscious, lower value.

Comment on lines +46 to +53
> **Gotcha — bit-width sets the topology, not the model name.** The weight estimate
> hinges on `bytes_per_param`, so **read the actual precision from `config.json`
> (`quantization_config` / `quant_algo` / dtype) before sizing** — do not infer it
> from the org/handle. Two checkpoints of the *same architecture at the same
> bit-width* have the same footprint → the same TP/DP/EP, regardless of vendor or
> quant scheme (INT4 vs NVFP4 differ only in kernel/quant-method flags, which vLLM
> auto-detects — and a negligible effective-bit difference). The split only changes
> when the bit-width changes the *size* (see the Kimi example below).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We sometimes quantize kv cache using the arguments in vllm serve. Could you also add it to the note?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point — added. The "Factors that relax the KV limit" note now covers serve-time KV-cache quantization via vLLM's --kv-cache-dtype fp8 (fp8_e4m3 / fp8_e5m2) in deployment.command, in addition to the checkpoint's kv_cache_scheme — noting it ~halves KV vs bf16/fp16 (~2× sustainable concurrency/context) and to verify model/recipe support. Done in 0839159.

cjluo-nv added a commit that referenced this pull request Jun 1, 2026
### What does this PR do?

Type of change: documentation

Fixes two issues in the evaluation skill that caused completed eval runs
to silently **not** upload to MLflow (found while running a real GPQA
eval — the run finished successfully but never appeared in MLflow):

1. **Missing auto-export trigger.** NEL only auto-exports when
`execution.auto_export.destinations` is set; the `export.mlflow` block
alone just *configures* the exporter. The skill's shortcut said to "copy
the `export.mlflow` block" without calling out the trigger, so a config
could end up with the block but no upload.

2. **Interpolated export values break at submit time.** With auto-export
enabled, NEL resolves the `export.mlflow` block at submit time in a
scope that does **not** include `deployment` / `evaluation`, so
`${deployment.served_model_name}` / `${evaluation.*}` cross-references
fail hard with `Interpolation key '...' not found`. (`example_eval.yaml`
shipped this interpolated pattern *with* auto-export, so it would hit
the error.)

Changes:

- **`example_eval.yaml`**: mark the `auto_export.destinations` trigger
as REQUIRED; switch the `export.mlflow` block to **literal** values
(`CHANGEME-served-model-name` placeholders) with a comment explaining
why cross-refs break. `${oc.env:USER}` (env interpolation) is retained
since it resolves fine.
- **`SKILL.md`**: rewrite the MLflow shortcut step to require **both**
the trigger and literal export values, and warn against
`${deployment.*}` / `${evaluation.*}` cross-references.

### Usage

N/A — documentation/skill-template only.

### Testing

`pre-commit run` passes on both files (markdownlint + YAML format).
Verified the literal-export approach uploads correctly by exporting a
real run to MLflow.

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A (documentation)
- Did you update Changelog?: N/A (skill docs)
- Did you get Claude approval on this PR?: ❌ (pending)

### Additional Information

Separate from #1583 (parallelism reference); both touch the evaluation
skill but are independent.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
* Clarified MLflow auto-export: auto-export must be explicitly enabled
for uploads and the MLflow export block is ignored otherwise.
* Example config now uses literal experiment name, description, and tags
(no cross-scope interpolation) and shows concrete sampling values.
* Warned to keep sampling params consistent between tags and runtime
params and to set the tracking URI.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cjluo-nv and others added 3 commits June 1, 2026 13:18
Add references/parallelism.md documenting how to choose deployment
topology (TP / DP / PP, expert parallelism) and concurrency
(parallelism / --max-num-seqs) for NEL evaluation runs:

- GPU topology decision procedure: smallest-TP-then-max-DP, the
  TP-up triggers, and the TP/DP split tradeoff for a fixed world size.
- Expert parallelism: EP = TP x DP (boolean flag, no direct EP size),
  the DP-attention + EP-MoE dataflow, and when to enable vs not.
- Concurrency sizing: request-count vs serving-capacity ceiling,
  KV-driven --max-num-seqs, empirical tuning from vLLM logs.
- Gotcha: bit-width (read from config.json), not the model name,
  sets the topology; worked examples incl. FP8-vs-4-bit Kimi.

Wire SKILL.md to the new reference from the deployment-command,
expert-parallel, evaluation-params, Step 4, and canary sections.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
…ster)

Higher parallelism is not universally better — past the KV-fit point it
regresses, worst for long-context / long-output tasks. Document the three
mechanisms (preemption thrash on huge prefills, prefill/decode contention,
latency->timeout->retry cascade), that sustainable concurrency is
context-dependent, and a balanced rule: target ~70-80% of the preemption-free
KV-fit concurrency at the task's working context, give long-context tasks a
lower per-task parallelism override, tune by canary while throughput rises /
preemption ~0 / p99 within request_timeout, and err low when unsure.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
AA-LCR is both KV-bound (~120K input -> preemption thrash if over-parallelized)
and judge-bound, so it needs a parallelism well below the model/GPU-bound tasks.

- lcr.md: add `parallelism: ???` at the params level (like tau2_bench_telecom),
  with a Params note that it must be set LOWER than the top-level default and how
  to tune it (start ~16-32 for GQA, several x more for MLA; watch preemption + 429s).
- parallelism.md: add a "Running a suite" section — parallelism is per-task, not
  per-run; top-level default for short model-bound tasks, per-task overrides for
  long-context (KV), judge/user-sim (rate limit), and sandbox (execution) bottlenecks;
  --max-num-seqs sized off the max across tasks; canary each class separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv force-pushed the docs/eval-parallelism-reference branch from 7a1e110 to 8ac3342 Compare June 1, 2026 20:36
cjluo-nv and others added 2 commits June 1, 2026 13:39
Address PR review: KV-cache quantization can also be enabled at serve time via
vLLM's --kv-cache-dtype fp8 (fp8_e4m3 / fp8_e5m2), not only baked into the
checkpoint's kv_cache_scheme. It ~halves KV vs bf16/fp16 (~2x sustainable
concurrency/context) and is a knob in deployment.command for KV-bound runs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Cut generic TP/DP/PP/EP background the model already knows (per PR review) and
merged the overlapping binding-constraint / balanced-sizing / non-GPU-caps /
suite sections. All formulas, the bit-width gotcha, TP-up triggers, the layout
table, serve-time KV-cache quantization, balanced-sizing regression, the
per-task suite table, and worked examples are preserved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/skills/evaluation/references/parallelism.md:
- Around line 147-148: The current documentation formula for `--max-num-seqs` is
missing `num_instances`; update the formula so it divides the busiest-task
parallelism by both DP and num_instances (e.g., `--max-num-seqs = ceil(max
parallelism across tasks / (DP * num_instances))`) so each replica is sized
correctly in HAProxy/multi-instance deployments; change the line referencing
`--max-num-seqs` and ensure any explanatory text mentions `num_instances` to
avoid oversizing per replica and extra KV pressure/preemption.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 360d7b8e-c1b0-496b-8bc3-a60e67362e81

📥 Commits

Reviewing files that changed from the base of the PR and between 0839159 and e4368c1.

📒 Files selected for processing (1)
  • .claude/skills/evaluation/references/parallelism.md

Comment on lines +147 to +148
- `--max-num-seqs = ceil(max parallelism across tasks / DP)` (deployment must serve the
busiest task) even if long-context tasks run lower.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix --max-num-seqs suite formula to include num_instances.

This line conflicts with the capacity definition above (Lines 86–91). For HAProxy/multi-instance setups, omitting num_instances will oversize --max-num-seqs per replica and can induce avoidable KV pressure/preemption.

Suggested doc fix
-- `--max-num-seqs = ceil(max parallelism across tasks / DP)` (deployment must serve the
+- `--max-num-seqs = ceil(max parallelism across tasks / (DP × num_instances))` (deployment must serve the
   busiest task) even if long-context tasks run lower.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- `--max-num-seqs = ceil(max parallelism across tasks / DP)` (deployment must serve the
busiest task) even if long-context tasks run lower.
- `--max-num-seqs = ceil(max parallelism across tasks / (DP × num_instances))` (deployment must serve the
busiest task) even if long-context tasks run lower.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/evaluation/references/parallelism.md around lines 147 - 148,
The current documentation formula for `--max-num-seqs` is missing
`num_instances`; update the formula so it divides the busiest-task parallelism
by both DP and num_instances (e.g., `--max-num-seqs = ceil(max parallelism
across tasks / (DP * num_instances))`) so each replica is sized correctly in
HAProxy/multi-instance deployments; change the line referencing `--max-num-seqs`
and ensure any explanatory text mentions `num_instances` to avoid oversizing per
replica and extra KV pressure/preemption.

@cjluo-nv cjluo-nv merged commit 38c7843 into main Jun 1, 2026
39 checks passed
@cjluo-nv cjluo-nv deleted the docs/eval-parallelism-reference branch June 1, 2026 21:33
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-01 21:33 UTC

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants