
Fix and Speedup megatron_mmlu by >10x via prefill scoring and global batching#1280

Merged
kevalmorabia97 merged 4 commits into main from kmorabia/speedup-megatron-mmlu
Apr 17, 2026

Conversation


@kevalmorabia97 kevalmorabia97 commented Apr 16, 2026

What does this PR do?

Type of change: new feature + bug fix

Two improvements to Megatron inference utilities:

1. Pipeline Parallel (PP) correctness fixes

PP inference was producing garbage output (MMLU ~0.24, random chance). Two root causes:

  • megatron_generate / megatron_prefill used get_forward_backward_func() (the training pipeline scheduler), which is not designed for inference. Rewrote both functions to use explicit P2P communication via recv_from_prev_pipeline_rank_ / send_to_next_pipeline_rank, matching the run_mcore_inference pattern.
  • import_mcore_gpt_from_hf loads HF weights into stage 0's embedding but never updates the output_layer on the last PP stage when share_embeddings_and_output_weights=True. At model init, setup_embeddings_and_output_layer() all-reduces from stage 0 to sync the output layer; after importing HF weights that all-reduce is stale. Fix: call model.setup_embeddings_and_output_layer() again after import.
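The explicit P2P flow in fix 1 — each stage receives activations from the previous rank, runs its local forward, and sends the result onward — can be sketched as a sequential simulation. This is a toy sketch under stated assumptions: there is no real process group here, and the names `run_pp_prefill` and `stage_forward` are illustrative, not the Megatron-Core API.

```python
def run_pp_prefill(stages, tokens):
    """Simulate PP prefill: each stage consumes the previous stage's
    activations and passes its output onward; the last stage's output
    plays the role of the logits."""
    activations = tokens  # stage 0 consumes token ids directly
    for rank, stage_forward in enumerate(stages):
        is_first = rank == 0
        # Non-first stages "receive" the previous stage's activations
        # (stand-in for recv_from_prev_pipeline_rank_).
        inp = tokens if is_first else activations
        # Run this stage's forward; assigning to `activations` plays the
        # role of send_to_next_pipeline_rank for the next iteration.
        activations = stage_forward(inp)
    return activations  # logits from the last PP stage

# Toy two-stage "model": stage 0 embeds, stage 1 projects to logits.
stage0 = lambda ids: [x * 2 for x in ids]
stage1 = lambda acts: [x + 1 for x in acts]
print(run_pp_prefill([stage0, stage1], [1, 2, 3]))  # [3, 5, 7]
```

The point of the structure is that each stage only ever talks to its neighbors, which is all inference needs — unlike the training scheduler, which interleaves forward and backward microbatches.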

2. megatron_mmlu speedup (>10x)

Replaces the megatron_mmlu implementation with a significantly faster approach that matches how lm-evaluation-harness scores multiple-choice questions.

Before: autoregressive generation (megatron_generate, osl=2) per example, 114 separate load_dataset calls, batch_size=1 — 260s for 5% data.

After: single prefill forward pass + argmax over {A,B,C,D} logits, 2 load_dataset calls, configurable batch_size — 18s for 5% data (~14x faster).

Changes

PP fixes:

  • megatron_generate / megatron_prefill: replace get_forward_backward_func with explicit P2P (recv_from_prev_pipeline_rank_ / send_to_next_pipeline_rank)
  • import_mcore_gpt_from_hf: call model.setup_embeddings_and_output_layer() after HF weight import when PP>1 and share_embeddings_and_output_weights=True
  • megatron_prefill: add skip_return_logits param and VLM support (needed for PP non-last stages)

MMLU speedup:

  • Log-likelihood scoring: replace megatron_generate with megatron_prefill — one forward pass per batch, no autoregressive decode loop
  • Global batching: collect all examples across all subjects, sort by descending sequence length, run in batch_size chunks
  • 2 dataset loads instead of 114: use load_dataset("cais/mmlu", "all") with per-subject grouping; skip dev load when few_shots=0
  • percentage → fraction parameter rename for clarity
  • tqdm progress bar (rank-0 only)
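The scoring and batching steps above can be sketched as a framework-free toy: sort examples by descending length, right-pad each batch, run one forward pass, and pick the answer by argmax over the four choice tokens' logits at each example's last real position. `score_batched`, `fake_prefill`, and the logit layout are illustrative assumptions, not the actual implementation.

```python
CHOICES = ["A", "B", "C", "D"]

def score_batched(encoded, logits_fn, choice_ids, batch_size):
    # Sort indices by descending sequence length to minimize padding waste.
    order = sorted(range(len(encoded)), key=lambda i: -len(encoded[i]))
    predictions = [None] * len(encoded)
    for start in range(0, len(order), batch_size):
        batch_idx = order[start:start + batch_size]
        batch = [encoded[i] for i in batch_idx]
        lens = [len(seq) for seq in batch]
        max_len = max(lens)
        padded = [seq + [0] * (max_len - len(seq)) for seq in batch]  # right-pad
        logits = logits_fn(padded)  # [batch, seq, vocab]: one prefill pass
        for row, (i, seq_len) in enumerate(zip(batch_idx, lens)):
            # Score only the four choice tokens at the last real position.
            answer = [logits[row][seq_len - 1][c] for c in choice_ids]
            predictions[i] = CHOICES[max(range(4), key=answer.__getitem__)]
    return predictions

# Toy "model": boosts one choice token deterministically per sequence.
def fake_prefill(padded):
    vocab = 8
    out = []
    for seq in padded:
        rows = [[0.0] * vocab for _ in seq]
        best = sum(seq) % 4  # deterministic "answer" for this toy
        for pos in range(len(seq)):
            rows[pos][4 + best] = 1.0
        out.append(rows)
    return out

preds = score_batched([[1, 2, 3], [2, 2], [5]], fake_prefill, [4, 5, 6, 7], batch_size=2)
print(preds)  # ['C', 'A', 'B']
```

Because predictions are written back through the original indices, the length-sorted batching is invisible to callers — results come out in dataset order regardless of batch_size.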

Testing

  • test_megatron_generate_and_mmlu parametrized over tp and pp. Accuracy assertion: 0.36 < score < 0.39. Manually checked generated text is coherent.
  • Re-ran M-Bridge Minitron MMLU-based pruning for Nano v2 9B -> 7B; all top-10 candidates' MMLU numbers are in the same ballpark as before

Before your PR is "Ready for review"

  • Is this change backward compatible?: ❌ — percentage parameter renamed to fraction; enable_kv_cache removed from megatron_mmlu
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: ✅ — existing test updated and parametrized for TP+PP
  • Did you update Changelog?: ✅

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Improved pipeline-parallel generation and MMLU evaluation reliability; fixed output-layer synchronization in shared-embedding + pipeline setups.
  • New Features

    • MMLU scoring now uses batched prefill logit scoring for faster, batched evaluation.
  • Behavior Changes

    • Default MMLU sampling increased from 5% to 10%; calibration batch sizing adjusted and related CLI/help text updated.
  • Tests

    • Distributed tests cover tensor- and pipeline-parallel modes and tighten MMLU validation ranges.
  • Documentation

    • Updated pruning example and benchmark timing to reflect new sampling and speedup.

@kevalmorabia97 kevalmorabia97 requested review from a team as code owners April 16, 2026 20:23

coderabbitai bot commented Apr 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Replaces per-example autoregressive MMLU scoring with batched prefill-logit scoring; adds explicit pipeline-parallel activation send/recv in generation paths; renames MMLU API parameter percentage → fraction and replaces the KV-cache flag with batch_size; updates examples, tests, importer sync, logging flush behavior, and changelog.

Changes

Cohort / File(s) Summary
MMLU Evaluation Core
modelopt/torch/utils/plugins/megatron_mmlu.py
Replaced per-example autoregressive generation with batched prefill-logit scoring. Loads cais/mmlu once, groups by subject, samples per-subject by fraction, builds few-shot prompts, tokenizes/sorts/batches (right-pad), calls megatron_prefill, selects choices via argmax over choice-token logits. Signature changed: percentage → fraction, enable_kv_cache → batch_size, tokenizer type annotation updated.
Generation / Prefill Engine
modelopt/torch/utils/plugins/megatron_generate.py
Reworked prefill/decode to call model(...) directly, removed get_forward_backward indirection, added explicit pipeline-parallel recv/send activation exchange and logits broadcast from last stage, adjusted attention/position/padding computation, and limited vision inputs to prefill step.
Importer: Embedding/Output Sync
modelopt/torch/export/plugins/megatron_importer.py
When share_embeddings_and_output_weights and pipeline_model_parallel_size > 1, call model.setup_embeddings_and_output_layer() after importing layers to re-synchronize shared embeddings/output across PP stages.
Examples & Docs
examples/megatron_bridge/prune_minitron.py, examples/pruning/README.md
Default MMLU sampling and CLI defaults changed from 5% to 10% (mmlu_5pct → mmlu_10pct); example score_func updated to megatron_mmlu(..., fraction=0.1, batch_size=4) and runtime note adjusted.
Tests: Distributed MMLU
tests/gpu_megatron/torch/utils/plugins/test_utils_megatron.py
Parametrized over parallelism ("tp", "pp"), helper updated to accept parallelism, conditional skip for insufficient GPUs, debug prints include rank, and MMLU assertion tightened to use fraction=0.1, batch_size=16 with bounded accuracy check.
Logging
modelopt/torch/utils/logging.py
print_rank_0 now sets flush=True by default via kwargs.setdefault before printing.
Changelog
CHANGELOG.rst
Added entry noting fixes to Megatron generation utilities and MMLU evaluation performance improvement.
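The logging tweak in the table above is one line but easy to get wrong; a minimal sketch, assuming the real helper checks torch.distributed for the rank (stubbed here as `_rank`):

```python
def print_rank_0(*args, **kwargs):
    """Print only on rank 0, flushing by default via kwargs.setdefault."""
    kwargs.setdefault("flush", True)  # default on, but a caller's flush= wins
    if _rank() == 0:                  # stand-in for the distributed rank check
        print(*args, **kwargs)

def _rank():
    return 0  # pretend this process is rank 0

print_rank_0("hello")            # flushed immediately
print_rank_0("hi", flush=False)  # explicit override is preserved
```

Using `setdefault` rather than hard-coding `flush=True` keeps the override path open, which matters for high-volume logging where unbuffered output is undesirable.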

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant Dataset
    participant Tokenizer
    participant Model
    participant Scorer

    Caller->>Dataset: megatron_mmlu(model, tokenizer, few_shots=0, fraction=0.1, batch_size=16)
    Dataset-->>Caller: load "cais/mmlu" (all), samples with subject labels
    Caller->>Caller: group by subject, sample fraction, build prompts
    Caller->>Tokenizer: tokenize prompts (batched, sorted by length)
    Tokenizer-->>Caller: token sequences
    Caller->>Model: megatron_prefill(batch_tokens, attention_mask, position_ids, ...)
    Model-->>Caller: logits (broadcasted across PP ranks)
    Caller->>Scorer: select choice-token via argmax per example
    Scorer-->>Caller: per-example predictions → compute accuracies
    Caller-->>Caller: return overall accuracy

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 38.46% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Security Anti-Patterns ❓ Inconclusive Code review for prohibited security patterns including unsanitized torch.load, numpy.load with allow_pickle, hardcoded trust_remote_code, eval/exec usage, and nosec comments. Manual security audit required for files: modelopt/torch/utils/plugins/megatron_mmlu.py, modelopt/torch/export/plugins/megatron_importer.py, examples, and tests directories.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately captures the two main changes: fixing pipeline-parallel (PP) correctness and achieving >10x speedup in megatron_mmlu through prefill scoring and global batching.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

Comment @coderabbitai help to get the list of available commands and usage tips.

@kevalmorabia97 kevalmorabia97 changed the title Speedup megatron_mmlu by ~6x via prefill scoring and global batching Fix and Speedup megatron_mmlu by >10x via prefill scoring and global batching Apr 16, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
modelopt/torch/utils/plugins/megatron_mmlu.py (2)

159-166: Consider explicitly deleting large tensors to reduce memory pressure.

The PR description mentions "explicitly delete logits / padded after each batch to avoid OOM from tensor lifetime overlap," but I don't see explicit del statements. While Python's garbage collector will eventually reclaim these, explicit deletion can help when processing many batches with large tensors.

♻️ Proposed fix
         for i, seq_len in enumerate(batch_len):
             answer_logits = logits[i, seq_len - 1, choice_ids]
             predictions[order[batch_start + i]] = _CHOICES[answer_logits.argmax().item()]
 
         examples_done = min(batch_start + batch_size, len(sorted_encoded))
         pbar.set_postfix(examples=f"{examples_done}/{len(sorted_encoded)}")
+
+        del logits, padded  # Free GPU memory before next batch

154-158: Padding with zeros may not match the tokenizer's pad token.

Right-padding with torch.zeros assumes pad_token_id is 0, but many tokenizers use different pad token IDs (e.g., Llama uses a special token). While the causal mask typically prevents padding from affecting the output at the last real token position, using the actual pad token is more robust.

♻️ Proposed fix
-        padded = torch.zeros(len(batch_enc), max_len, dtype=torch.long)
+        pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0
+        padded = torch.full((len(batch_enc), max_len), pad_id, dtype=torch.long)
tests/gpu_megatron/torch/utils/plugins/test_utils_megatron.py (1)

45-45: Widen assertion bounds to account for GPU floating-point precision variance.

The 0.01 range (0.37 < x < 0.38) is vulnerable to differences in floating-point arithmetic across GPU architectures and CUDA versions. Even with deterministic prefill scoring and explicit seeding, accumulated FP32 precision differences over ~500 examples can shift the accuracy metric. Widening to 0.35–0.40 provides reasonable tolerance while maintaining test integrity.

Suggested fix
-    assert 0.37 < megatron_mmlu(model, tokenizer, fraction=0.1, batch_size=16) < 0.38
+    assert 0.35 < megatron_mmlu(model, tokenizer, fraction=0.1, batch_size=16) < 0.40
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/utils/plugins/megatron_mmlu.py`:
- Around line 177-178: The "evaluation started..." message is printed after the
evaluation loop, causing confusion; move the dist.is_master() print statement to
just before the batch loop (before pbar = tqdm(...) / before iterating over
dataloader) so the message is shown when evaluation actually begins, or
alternatively change the message text at the current location to "evaluation
completed" to reflect its current timing; update references in this file to
dist.is_master(), the print call, and the pbar / batch loop in the function
around the MMLU evaluation block.
- Around line 173-175: The computation of avg using avg = sum(all_correct) /
n_total can raise ZeroDivisionError when n_total == 0 (e.g., empty dataset or
fraction==0); update the logic around variables all_correct, n_total, and avg to
guard against zero by checking if n_total == 0 and handling it explicitly (for
example set avg = 0.0 or raise a clear ValueError) before performing the
division, and optionally log or surface a warning so callers know no samples
were evaluated.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 91e73dfb-34ef-49fd-bb35-47fc41d3a912

📥 Commits

Reviewing files that changed from the base of the PR and between 6ded36b and 303e429.

📒 Files selected for processing (4)
  • examples/megatron_bridge/prune_minitron.py
  • examples/pruning/README.md
  • modelopt/torch/utils/plugins/megatron_mmlu.py
  • tests/gpu_megatron/torch/utils/plugins/test_utils_megatron.py


github-actions bot commented Apr 16, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-17 19:22 UTC


codecov bot commented Apr 16, 2026

Codecov Report

❌ Patch coverage is 78.98089% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.54%. Comparing base (4e33368) to head (d9b3f8a).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/torch/utils/plugins/megatron_generate.py 64.63% 29 Missing ⚠️
modelopt/torch/utils/plugins/megatron_mmlu.py 95.71% 3 Missing ⚠️
modelopt/torch/export/plugins/megatron_importer.py 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1280      +/-   ##
==========================================
+ Coverage   72.74%   76.54%   +3.80%     
==========================================
  Files         459      459              
  Lines       48611    48649      +38     
==========================================
+ Hits        35361    37238    +1877     
+ Misses      13250    11411    -1839     
Flag Coverage Δ
examples 41.33% <56.68%> (+1.89%) ⬆️
gpu 59.97% <78.98%> (+7.77%) ⬆️
unit 52.18% <4.45%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.


@kevalmorabia97 kevalmorabia97 requested a review from a team as a code owner April 16, 2026 21:33
@kevalmorabia97 kevalmorabia97 force-pushed the kmorabia/speedup-megatron-mmlu branch from f6a92b3 to c11b8bf on April 16, 2026 21:36

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
modelopt/torch/utils/plugins/megatron_generate.py (1)

111-122: Redundant pp_dtype computation.

pp_dtype is computed identically at lines 73-75, 113-115, and 148-150. Consider removing the redundant assignments at lines 113-115 and 148-150 since pp_dtype is already computed earlier in the function.

♻️ Suggested cleanup
     # For PP, receive activations from the previous stage before calling forward.
     if is_pp and not pp_first:
-        pp_dtype = model.config.pipeline_dtype or (
-            torch.bfloat16 if model.config.bf16 else torch.float32
-        )
         recv_buffer = torch.empty(
             (padded_seq_len, batch_size, model.config.hidden_size),
             dtype=pp_dtype,
     # For PP non-last stages, forward activations to the next stage and return early.
     if is_pp and not pp_last:
-        pp_dtype = model.config.pipeline_dtype or (
-            torch.bfloat16 if model.config.bf16 else torch.float32
-        )
         send_to_next_pipeline_rank(output.to(dtype=pp_dtype))

Also applies to: 146-151

tests/gpu_megatron/torch/utils/plugins/test_utils_megatron.py (1)

53-53: Consider documenting why the tight bounds are safe, or widen slightly for hardware variation tolerance.

This is a regression test (marked "TODO: move to regression test folder") with intentionally tight bounds (0.36, 0.39) to catch model accuracy changes. However, even with SEED=1234 and eval mode, GPU/CUDA numerical precision varies across hardware architectures. Either:

  1. Add an inline comment confirming the bounds are empirically validated across your CI GPU hardware, or
  2. Widen to (0.34, 0.41) to tolerate expected hardware-induced variance while still catching real regressions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 5ff5b50c-1f64-492a-a5e6-7b7aa7dad7eb

📥 Commits

Reviewing files that changed from the base of the PR and between 303e429 and f6a92b3.

📒 Files selected for processing (4)
  • CHANGELOG.rst
  • modelopt/torch/export/plugins/megatron_importer.py
  • modelopt/torch/utils/plugins/megatron_generate.py
  • tests/gpu_megatron/torch/utils/plugins/test_utils_megatron.py
✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.rst


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
modelopt/torch/utils/plugins/megatron_mmlu.py (1)

178-180: ⚠️ Potential issue | 🟡 Minor

Potential division by zero if no examples are evaluated.

If n_total is 0 (e.g., fraction=0 or all subject samples fall below the cutoff threshold), the division will raise ZeroDivisionError.

🛡️ Proposed fix
     all_correct = [pred == label for pred, label in zip(predictions, all_labels)]
     n_total = len(all_correct)
+    if n_total == 0:
+        print_rank_0("No examples to evaluate. Returning 0.0 accuracy.")
+        return 0.0
     avg = sum(all_correct) / n_total
🧹 Nitpick comments (1)
modelopt/torch/utils/plugins/megatron_mmlu.py (1)

164-171: Consider explicitly freeing large tensors to avoid OOM.

The PR objectives mention deleting logits and padded tensors to avoid OOM, but these are not explicitly freed in the loop. While Python's reference counting may handle this, explicit cleanup can help with GPU memory pressure during long evaluations.

♻️ Proposed fix to explicitly free tensors
         for i, seq_len in enumerate(batch_len):
             answer_logits = logits[i, seq_len - 1, choice_ids]
             predictions[order[batch_start + i]] = _CHOICES[answer_logits.argmax().item()]
 
         examples_done = min(batch_start + batch_size, len(sorted_encoded))
         pbar.set_postfix(examples=f"{examples_done}/{len(sorted_encoded)}")
+
+        del padded, logits

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 37dc9398-a492-4acc-b2ba-b5455a41bff8

📥 Commits

Reviewing files that changed from the base of the PR and between f6a92b3 and c11b8bf.

📒 Files selected for processing (5)
  • CHANGELOG.rst
  • modelopt/torch/export/plugins/megatron_importer.py
  • modelopt/torch/utils/plugins/megatron_generate.py
  • modelopt/torch/utils/plugins/megatron_mmlu.py
  • tests/gpu_megatron/torch/utils/plugins/test_utils_megatron.py
✅ Files skipped from review due to trivial changes (2)
  • CHANGELOG.rst
  • modelopt/torch/utils/plugins/megatron_generate.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/torch/export/plugins/megatron_importer.py

Replace autoregressive generation with a single prefill forward pass per
batch and argmax over the four answer-choice token logits. This matches
the log-likelihood approach used by lm-evaluation-harness and avoids the
autoregressive decode loop entirely.

Additional improvements:
- Load dataset once with the "all" config (2 calls) instead of once per
  subject (114 calls), eliminating the main CPU overhead bottleneck
- Batch globally across all subjects sorted by descending sequence length
  to minimise padding waste and fail-fast on OOM
- Skip dev dataset load when few_shots=0
- Rename percentage -> fraction for clearer semantics
- Fix few-shot answer formatting (was emitting integer index, now letter)
- Fix off-by-one: idx > threshold -> idx >= threshold
- Fix avg_correct reset bug inside subject loop
- Add tqdm progress bar (rank-0 only)
- Explicitly del logits/padded after each batch to avoid tensor lifetime
  overlap that caused OOM on long-sequence runs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 force-pushed the kmorabia/speedup-megatron-mmlu branch from c11b8bf to 9dfb516 on April 17, 2026 12:31

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/utils/plugins/megatron_generate.py (1)

348-349: ⚠️ Potential issue | 🟡 Minor

eager_ids.item() will fail for batch_size > 1.

eager_ids has shape [max_batch_size, 1] (from line 337, where max_batch_size = input_ids.shape[0]). Calling .item() on a multi-element tensor raises RuntimeError. The function does not enforce or check for batch_size == 1, so generation will fail if called with batch_size > 1.

Consider either:

  1. Adding a batch_size check at the start of the function to enforce batch_size == 1, or
  2. Handling batch EOS detection properly (e.g., per-sample tracking of which sequences have ended)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/utils/plugins/megatron_generate.py` around lines 348 - 349,
The current code calls eager_ids.item() which fails when batch_size > 1; update
the function containing eager_ids, max_batch_size and input_ids (the megatron
generation routine) to handle multi-example batches: either add an explicit
check at the start (verify input_ids.shape[0] == 1 and raise/return if not) or
implement per-sample EOS detection by comparing eager_ids.squeeze(-1) (or
eager_ids[:,0]) against eos_token_id(s), maintain a boolean finished mask for
each batch element, and break only when all entries in the finished mask are
true; ensure references to eager_ids, eos_token_id, max_batch_size and input_ids
are used to locate and change the logic accordingly.
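The per-sample tracking option suggested above can be sketched as a plain-Python stand-in for the tensor logic (names and EOS handling are illustrative):

```python
def update_finished(next_ids, eos_token_id, finished):
    """Mark each batch element finished once it emits EOS; the decode
    loop would break only when every element is finished."""
    return [done or (tok == eos_token_id) for done, tok in zip(finished, next_ids)]

finished = [False, False, False]
finished = update_finished([2, 5, 2], eos_token_id=2, finished=finished)
print(finished, all(finished))  # [True, False, True] False
```

In the tensor version this would be a boolean mask updated with `finished |= eager_ids.squeeze(-1) == eos_token_id`, breaking when `finished.all()`.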
🧹 Nitpick comments (2)
modelopt/torch/utils/plugins/megatron_generate.py (2)

146-151: Same redundant pp_dtype recomputation.

This is the third occurrence of the same pp_dtype assignment. Remove it to use the value computed at line 73-75.

♻️ Remove duplicate pp_dtype computation
     # For PP non-last stages, forward activations to the next stage and return early.
     if is_pp and not pp_last:
-        pp_dtype = model.config.pipeline_dtype or (
-            torch.bfloat16 if model.config.bf16 else torch.float32
-        )
         send_to_next_pipeline_rank(output.to(dtype=pp_dtype))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/utils/plugins/megatron_generate.py` around lines 146 - 151,
Remove the redundant recomputation of pp_dtype in the PP non-last stage block;
instead reuse the pp_dtype previously computed earlier in the same scope and
keep the call to send_to_next_pipeline_rank(output.to(dtype=pp_dtype)).
Specifically, delete the duplicate assignment to pp_dtype in the if is_pp and
not pp_last branch and ensure the branch uses the existing pp_dtype variable
when calling send_to_next_pipeline_rank so functions like
send_to_next_pipeline_rank and the output.to(dtype=pp_dtype) call remain
unchanged.

111-122: Redundant pp_dtype computation.

pp_dtype is already computed at lines 73-75 and is used unchanged here and again at lines 148-150. Consider removing the duplicate assignments.

♻️ Remove duplicate pp_dtype computation
     # For PP, receive activations from the previous stage before calling forward.
     if is_pp and not pp_first:
-        pp_dtype = model.config.pipeline_dtype or (
-            torch.bfloat16 if model.config.bf16 else torch.float32
-        )
         recv_buffer = torch.empty(
             (padded_seq_len, batch_size, model.config.hidden_size),
             dtype=pp_dtype,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/utils/plugins/megatron_generate.py` around lines 111 - 122,
The pp_dtype computation inside the PP receive block is redundant because
pp_dtype is already computed earlier; remove the duplicate assignment in the if
is_pp and not pp_first block and reuse the previously computed pp_dtype (derived
from model.config.pipeline_dtype / model.config.bf16) when creating recv_buffer
and calling recv_from_prev_pipeline_rank_ / get_attr_wrapped_model(model,
"set_input_tensor")(recv_buffer); keep the rest of the block (recv_buffer
creation, recv_from_prev_pipeline_rank_, set_input_tensor call) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: fb6a7b1e-a076-4b58-a41d-19842543c8e9

📥 Commits

Reviewing files that changed from the base of the PR and between c11b8bf and 9dfb516.

📒 Files selected for processing (7)
  • CHANGELOG.rst
  • examples/megatron_bridge/prune_minitron.py
  • examples/pruning/README.md
  • modelopt/torch/export/plugins/megatron_importer.py
  • modelopt/torch/utils/plugins/megatron_generate.py
  • modelopt/torch/utils/plugins/megatron_mmlu.py
  • tests/gpu_megatron/torch/utils/plugins/test_utils_megatron.py
✅ Files skipped from review due to trivial changes (4)
  • CHANGELOG.rst
  • modelopt/torch/export/plugins/megatron_importer.py
  • examples/pruning/README.md
  • modelopt/torch/utils/plugins/megatron_mmlu.py

@kevalmorabia97 kevalmorabia97 force-pushed the kmorabia/speedup-megatron-mmlu branch from 9dfb516 to 0b2aabc Compare April 17, 2026 16:19
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/gpu_megatron/torch/utils/plugins/test_utils_megatron.py (1)

51-51: Limit debug printing to rank 0 to reduce distributed log noise.

Line 51 currently prints on every rank; gating to rank 0 keeps CI logs readable.

Proposed tweak
-    print(rank, output_text)
+    if rank == 0:
+        print(rank, output_text)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu_megatron/torch/utils/plugins/test_utils_megatron.py` at line 51,
The test currently prints on every distributed rank using print(rank,
output_text); change this to only emit output on rank 0 by guarding the print
with a rank check (e.g., if rank == 0) so that only the primary process logs the
message (keep the same printed values rank and output_text). Ensure the
conditional wraps the existing print call (reference: print(rank, output_text)).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/gpu_megatron/torch/utils/plugins/test_utils_megatron.py`:
- Line 25: The MMLU accuracy check in the test calls megatron_mmlu(...,
fraction=0.1, batch_size=16) which is regression-scale and too slow for the fast
GPU test path; either move the heavy assertion into a dedicated regression/slow
test suite or keep a tiny smoke check here (e.g., fraction=0.001 or a
single-batch run) and add the full-fraction assertion under a new
slow/regression test. Locate the call to megatron_mmlu in test_utils_megatron.py
and replace the large-fraction assertion with a lightweight smoke-run, or
extract the current assertion into a new test (named to indicate
slow/regression) that will not run in the fast GPU test set.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 8166e3be-85cd-4679-8c90-c18b5de0d7bf

📥 Commits

Reviewing files that changed from the base of the PR and between 9dfb516 and 0b2aabc.

📒 Files selected for processing (6)
  • CHANGELOG.rst
  • modelopt/torch/export/plugins/megatron_importer.py
  • modelopt/torch/utils/logging.py
  • modelopt/torch/utils/plugins/megatron_generate.py
  • modelopt/torch/utils/plugins/megatron_mmlu.py
  • tests/gpu_megatron/torch/utils/plugins/test_utils_megatron.py
✅ Files skipped from review due to trivial changes (2)
  • CHANGELOG.rst
  • modelopt/torch/utils/plugins/megatron_mmlu.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • modelopt/torch/export/plugins/megatron_importer.py
  • modelopt/torch/utils/plugins/megatron_generate.py

Comment thread tests/gpu_megatron/torch/utils/plugins/test_utils_megatron.py
Two bugs caused pipeline-parallel inference to produce garbage output:

1. megatron_generate/megatron_prefill used get_forward_backward_func() (the
   training pipeline scheduler), which is not designed for inference. Rewrote
   both functions to use explicit P2P communication via
   recv_from_prev_pipeline_rank_ / send_to_next_pipeline_rank, matching the
   pattern from run_mcore_inference.

2. import_mcore_gpt_from_hf loads HF weights into stage 0's embedding but
   never updates the output_layer on the last PP stage when
   share_embeddings_and_output_weights=True. After import, call
   model.setup_embeddings_and_output_layer() to re-run the all-reduce that
   syncs the output layer from stage 0 to the last stage.

Also parametrize the megatron_generate test to cover both TP and PP.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 force-pushed the kmorabia/speedup-megatron-mmlu branch 4 times, most recently from 1cec8ec to 802ce85 Compare April 17, 2026 17:15
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 force-pushed the kmorabia/speedup-megatron-mmlu branch from 802ce85 to fdfa85f Compare April 17, 2026 17:32
Collaborator

@ChenhanYu ChenhanYu left a comment


Strong PR — fixes PP inference correctness (training scheduler → explicit P2P) and 6-10x MMLU speedup via batched prefill log-likelihood scoring (matching lm-evaluation-harness approach). Embedding re-sync after HF import with PP>1 is a real correctness bug fixed cleanly.

Minor (non-blocking):

  • pp_dtype computed redundantly 3x in megatron_prefill and 2x in megatron_generate — first computation suffices
  • skip_return_logits path still runs the broadcast (correct for PP sync, but worth a comment explaining why)
  • MMLU test bounds (0.36-0.39) are tight for a 0.6B model — may be flaky across hardware. Consider widening or using lower bound only
  • CandidateSubnet safe globals and percentage→fraction rename are unrelated — note in PR description

LGTM.

@kevalmorabia97 kevalmorabia97 enabled auto-merge (squash) April 17, 2026 18:30
@kevalmorabia97 kevalmorabia97 merged commit e4b054b into main Apr 17, 2026
45 checks passed
@kevalmorabia97 kevalmorabia97 deleted the kmorabia/speedup-megatron-mmlu branch April 17, 2026 19:20