Fix and Speedup megatron_mmlu by >10x via prefill scoring and global batching #1280

kevalmorabia97 merged 4 commits into main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in your review settings. Use the checkboxes below for quick actions to manage reviews.
📝 Walkthrough

Replaces per-example autoregressive MMLU scoring with batched prefill-logit scoring; adds explicit pipeline-parallel activation send/recv in generation paths; renames the MMLU API parameter percentage to fraction.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant Dataset
    participant Tokenizer
    participant Model
    participant Scorer
    Caller->>Dataset: megatron_mmlu(model, tokenizer, few_shots=0, fraction=0.1, batch_size=16)
    Dataset-->>Caller: load "cais/mmlu" (all), samples with subject labels
    Caller->>Caller: group by subject, sample fraction, build prompts
    Caller->>Tokenizer: tokenize prompts (batched, sorted by length)
    Tokenizer-->>Caller: token sequences
    Caller->>Model: megatron_prefill(batch_tokens, attention_mask, position_ids, ...)
    Model-->>Caller: logits (broadcasted across PP ranks)
    Caller->>Scorer: select choice-token via argmax per example
    Scorer-->>Caller: per-example predictions → compute accuracies
    Caller-->>Caller: return overall accuracy
```
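The argmax-over-choice-logits step in the diagram can be sketched as follows. This is an illustrative stand-alone snippet, not the PR's exact code; the shapes and the `choice_ids`/`batch_len` names mirror the review excerpts in this thread.

```python
import torch

_CHOICES = ["A", "B", "C", "D"]

def score_batch(logits: torch.Tensor, batch_len: list[int], choice_ids: list[int]) -> list[str]:
    """Prefill scoring: for each example, take the logits at the last real
    (non-padded) position, restrict them to the four answer-letter token ids,
    and pick the letter with the highest logit. logits has shape
    [batch, seq_len, vocab_size] from a single prefill forward pass."""
    preds = []
    for i, seq_len in enumerate(batch_len):
        answer_logits = logits[i, seq_len - 1, choice_ids]
        preds.append(_CHOICES[answer_logits.argmax().item()])
    return preds
```

This is what makes the speedup possible: one forward pass per batch replaces an autoregressive decode loop per example.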
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

Important: Pre-merge checks failed. Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (2 passed)
Actionable comments posted: 2
🧹 Nitpick comments (3)
modelopt/torch/utils/plugins/megatron_mmlu.py (2)
159-166: Consider explicitly deleting large tensors to reduce memory pressure.

The PR description mentions "explicitly delete logits/padded after each batch to avoid OOM from tensor lifetime overlap," but I don't see explicit del statements. While Python's garbage collector will eventually reclaim these, explicit deletion can help when processing many batches with large tensors.

♻️ Proposed fix

```diff
 for i, seq_len in enumerate(batch_len):
     answer_logits = logits[i, seq_len - 1, choice_ids]
     predictions[order[batch_start + i]] = _CHOICES[answer_logits.argmax().item()]
 examples_done = min(batch_start + batch_size, len(sorted_encoded))
 pbar.set_postfix(examples=f"{examples_done}/{len(sorted_encoded)}")
+
+del logits, padded  # Free GPU memory before next batch
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/plugins/megatron_mmlu.py` around lines 159 - 166, The batch loop keeps large tensors (logits and padded) alive causing OOM; after using logits to compute answer_logits and updating predictions in the megatron_prefill flow, explicitly delete the large tensors (del logits, del padded) and, if needed, call torch.cuda.empty_cache() to free GPU memory before the next batch; ensure you do this right after the loop that consumes logits (the section using logits[i, seq_len - 1, choice_ids] and updating predictions) so references are dropped and memory is released.
154-158: Padding with zeros may not match the tokenizer's pad token.

Right-padding with torch.zeros assumes pad_token_id is 0, but many tokenizers use different pad token IDs (e.g., Llama uses a special token). While the causal mask typically prevents padding from affecting the output at the last real token position, using the actual pad token is more robust.

♻️ Proposed fix

```diff
-padded = torch.zeros(len(batch_enc), max_len, dtype=torch.long)
+pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0
+padded = torch.full((len(batch_enc), max_len), pad_id, dtype=torch.long)
```

🤖 Prompt for AI Agents
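As a quick sanity check of the pad-token approach in this comment (illustrative snippet outside the PR; the pad id and sequences are made up):

```python
import torch

# Stand-in for tokenizer.pad_token_id: many real tokenizers (e.g. Llama)
# use a nonzero pad id, which is why hard-coded zeros can be wrong.
pad_id = 2
batch_enc = [[7, 8, 9], [4, 5]]  # two token sequences of unequal length
max_len = max(len(s) for s in batch_enc)

# Fill the whole buffer with the pad id, then copy each sequence in.
padded = torch.full((len(batch_enc), max_len), pad_id, dtype=torch.long)
for i, seq in enumerate(batch_enc):
    padded[i, : len(seq)] = torch.tensor(seq, dtype=torch.long)
```

Each row ends up right-padded with the pad id rather than zero, while the real tokens are untouched.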
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/plugins/megatron_mmlu.py` around lines 154 - 158: The padding uses hard-coded zeros which assumes pad_token_id==0; change the creation of `padded` to fill with the tokenizer's pad token id (e.g., obtain `pad_id = tokenizer.pad_token_id if hasattr(tokenizer, "pad_token_id") else 0`) and initialize `padded = torch.full((len(batch_enc), max_len), fill_value=pad_id, dtype=torch.long)` before copying sequences (symbols to locate: `padded`, `batch_enc`, `batch_len`, `max_len` in `megatron_mmlu.py`); ensure the dtype and fill value are compatible (cast pad_id to int) so padding uses the correct pad token for the model.

tests/gpu_megatron/torch/utils/plugins/test_utils_megatron.py (1)
45-45: Widen assertion bounds to account for GPU floating-point precision variance.

The 0.01 range (0.37 < x < 0.38) is vulnerable to differences in floating-point arithmetic across GPU architectures and CUDA versions. Even with deterministic prefill scoring and explicit seeding, accumulated FP32 precision differences over ~500 examples can shift the accuracy metric. Widening to 0.35–0.40 provides reasonable tolerance while maintaining test integrity.

Suggested fix

```diff
-assert 0.37 < megatron_mmlu(model, tokenizer, fraction=0.1, batch_size=16) < 0.38
+assert 0.35 < megatron_mmlu(model, tokenizer, fraction=0.1, batch_size=16) < 0.40
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gpu_megatron/torch/utils/plugins/test_utils_megatron.py` at line 45, The assertion on megatron_mmlu is too tight for GPU FP variability; update the test assertion that calls megatron_mmlu(model, tokenizer, fraction=0.1, batch_size=16) to use wider bounds (e.g., assert 0.35 < megatron_mmlu(...) < 0.40) to tolerate FP32 differences across GPUs/CUDA while keeping fraction and batch_size the same; locate the call to megatron_mmlu in the test file and adjust the numeric bounds accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/utils/plugins/megatron_mmlu.py`:
- Around line 177-178: The "evaluation started..." message is printed after the
evaluation loop, causing confusion; move the dist.is_master() print statement to
just before the batch loop (before pbar = tqdm(...) / before iterating over
dataloader) so the message is shown when evaluation actually begins, or
alternatively change the message text at the current location to "evaluation
completed" to reflect its current timing; update references in this file to
dist.is_master(), the print call, and the pbar / batch loop in the function
around the MMLU evaluation block.
- Around line 173-175: The computation of avg using avg = sum(all_correct) /
n_total can raise ZeroDivisionError when n_total == 0 (e.g., empty dataset or
fraction==0); update the logic around variables all_correct, n_total, and avg to
guard against zero by checking if n_total == 0 and handling it explicitly (for
example set avg = 0.0 or raise a clear ValueError) before performing the
division, and optionally log or surface a warning so callers know no samples
were evaluated.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 91e73dfb-34ef-49fd-bb35-47fc41d3a912
📒 Files selected for processing (4)
- examples/megatron_bridge/prune_minitron.py
- examples/pruning/README.md
- modelopt/torch/utils/plugins/megatron_mmlu.py
- tests/gpu_megatron/torch/utils/plugins/test_utils_megatron.py
Codecov Report

❌ Patch coverage is … Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1280      +/-   ##
==========================================
+ Coverage   72.74%   76.54%   +3.80%
==========================================
  Files         459      459
  Lines       48611    48649      +38
==========================================
+ Hits        35361    37238    +1877
+ Misses      13250    11411    -1839
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Force-pushed f6a92b3 to c11b8bf
🧹 Nitpick comments (2)
modelopt/torch/utils/plugins/megatron_generate.py (1)
111-122: Redundant pp_dtype computation.

pp_dtype is computed identically at lines 73-75, 113-115, and 148-150. Consider removing the redundant assignments at lines 113-115 and 148-150 since pp_dtype is already computed earlier in the function.

♻️ Suggested cleanup

```diff
 # For PP, receive activations from the previous stage before calling forward.
 if is_pp and not pp_first:
-    pp_dtype = model.config.pipeline_dtype or (
-        torch.bfloat16 if model.config.bf16 else torch.float32
-    )
     recv_buffer = torch.empty(
         (padded_seq_len, batch_size, model.config.hidden_size),
         dtype=pp_dtype,
```

```diff
 # For PP non-last stages, forward activations to the next stage and return early.
 if is_pp and not pp_last:
-    pp_dtype = model.config.pipeline_dtype or (
-        torch.bfloat16 if model.config.bf16 else torch.float32
-    )
     send_to_next_pipeline_rank(output.to(dtype=pp_dtype))
```

Also applies to: 146-151
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/plugins/megatron_generate.py` around lines 111 - 122, pp_dtype is being recomputed multiple times; compute it once and reuse it instead of repeating the same expression. Move the pp_dtype determination (model.config.pipeline_dtype or torch.bfloat16 if model.config.bf16 else torch.float32) to a single place earlier in the function (so it is available to the PP receive branch that calls recv_from_prev_pipeline_rank_ and set_input_tensor via get_attr_wrapped_model(model, "set_input_tensor")), then remove the duplicate pp_dtype assignments at the duplicated locations so the recv_buffer creation and any other PP-related code reuse the single pp_dtype variable.

tests/gpu_megatron/torch/utils/plugins/test_utils_megatron.py (1)
53-53: Consider documenting why the tight bounds are safe, or widen slightly for hardware variation tolerance.

This is a regression test (marked "TODO: move to regression test folder") with intentionally tight bounds (0.36, 0.39) to catch model accuracy changes. However, even with SEED=1234 and eval mode, GPU/CUDA numerical precision varies across hardware architectures. Either:

- Add an inline comment confirming the bounds are empirically validated across your CI GPU hardware, or
- Widen to (0.34, 0.41) to tolerate expected hardware-induced variance while still catching real regressions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gpu_megatron/torch/utils/plugins/test_utils_megatron.py` at line 53, The assertion using megatron_mmlu(model, tokenizer, fraction=0.1, batch_size=16) has overly tight bounds (0.36, 0.39) that can fail across different GPU/CUDA hardware; either expand the range to assert 0.34 < megatron_mmlu(...) < 0.41 to tolerate expected numerical variance, or add an inline comment next to the assertion referencing SEED=1234 and confirming these tight bounds were empirically validated across your CI GPU hardware (and keep the original bounds) so future readers know the rationale.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 5ff5b50c-1f64-492a-a5e6-7b7aa7dad7eb
📒 Files selected for processing (4)
- CHANGELOG.rst
- modelopt/torch/export/plugins/megatron_importer.py
- modelopt/torch/utils/plugins/megatron_generate.py
- tests/gpu_megatron/torch/utils/plugins/test_utils_megatron.py
✅ Files skipped from review due to trivial changes (1)
- CHANGELOG.rst
♻️ Duplicate comments (1)
modelopt/torch/utils/plugins/megatron_mmlu.py (1)
178-180: ⚠️ Potential issue | 🟡 Minor

Potential division by zero if no examples are evaluated.

If n_total is 0 (e.g., fraction=0 or all subject samples fall below the cutoff threshold), the division will raise ZeroDivisionError.

🛡️ Proposed fix

```diff
 all_correct = [pred == label for pred, label in zip(predictions, all_labels)]
 n_total = len(all_correct)
+if n_total == 0:
+    print_rank_0("No examples to evaluate. Returning 0.0 accuracy.")
+    return 0.0
 avg = sum(all_correct) / n_total
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/plugins/megatron_mmlu.py` around lines 178 - 180, Guard against division by zero when computing avg from all_correct: after computing n_total = len(all_correct) check if n_total == 0 and handle it (e.g., set avg = 0.0 or float('nan') or return/skip as appropriate) instead of performing sum(all_correct) / n_total; update the code around the variables all_correct, n_total, and avg in megatron_mmlu.py to perform this conditional check and compute avg only when n_total > 0.
🧹 Nitpick comments (1)
modelopt/torch/utils/plugins/megatron_mmlu.py (1)
164-171: Consider explicitly freeing large tensors to avoid OOM.

The PR objectives mention deleting logits and padded tensors to avoid OOM, but these are not explicitly freed in the loop. While Python's reference counting may handle this, explicit cleanup can help with GPU memory pressure during long evaluations.

♻️ Proposed fix to explicitly free tensors

```diff
 for i, seq_len in enumerate(batch_len):
     answer_logits = logits[i, seq_len - 1, choice_ids]
     predictions[order[batch_start + i]] = _CHOICES[answer_logits.argmax().item()]
 examples_done = min(batch_start + batch_size, len(sorted_encoded))
 pbar.set_postfix(examples=f"{examples_done}/{len(sorted_encoded)}")
+
+del padded, logits
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/plugins/megatron_mmlu.py` around lines 164 - 171, The logits and padded tensors returned/used by megatron_prefill and padded.cuda() should be explicitly freed after each batch to reduce GPU memory pressure: after you finish extracting answer_logits and updating predictions (i.e., after the for i, seq_len loop and before advancing batch_start), call del logits and del padded and then torch.cuda.empty_cache() (optionally torch.cuda.synchronize()) to release GPU memory; ensure you reference the existing symbols logits, padded, megatron_prefill, and predictions when applying this change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 37dc9398-a492-4acc-b2ba-b5455a41bff8
📒 Files selected for processing (5)
- CHANGELOG.rst
- modelopt/torch/export/plugins/megatron_importer.py
- modelopt/torch/utils/plugins/megatron_generate.py
- modelopt/torch/utils/plugins/megatron_mmlu.py
- tests/gpu_megatron/torch/utils/plugins/test_utils_megatron.py
✅ Files skipped from review due to trivial changes (2)
- CHANGELOG.rst
- modelopt/torch/utils/plugins/megatron_generate.py
🚧 Files skipped from review as they are similar to previous changes (1)
- modelopt/torch/export/plugins/megatron_importer.py
Replace autoregressive generation with a single prefill forward pass per batch and argmax over the four answer-choice token logits. This matches the log-likelihood approach used by lm-evaluation-harness and avoids the autoregressive decode loop entirely.

Additional improvements:

- Load dataset once with the "all" config (2 calls) instead of once per subject (114 calls), eliminating the main CPU overhead bottleneck
- Batch globally across all subjects sorted by descending sequence length to minimise padding waste and fail-fast on OOM
- Skip dev dataset load when few_shots=0
- Rename percentage -> fraction for clearer semantics
- Fix few-shot answer formatting (was emitting integer index, now letter)
- Fix off-by-one: idx > threshold -> idx >= threshold
- Fix avg_correct reset bug inside subject loop
- Add tqdm progress bar (rank-0 only)
- Explicitly del logits/padded after each batch to avoid tensor lifetime overlap that caused OOM on long-sequence runs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
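The "batch globally across all subjects sorted by descending sequence length" idea in the commit message can be sketched like this. This is a hypothetical helper, not the PR's code; the `order`/`sorted_encoded`/`batch_size` names echo the review excerpts in this thread.

```python
import torch

def make_length_sorted_batches(encoded: list[list[int]], batch_size: int):
    """Sort sequences by descending length so each batch pads to a similar
    max length, minimizing wasted compute on padding. `order` maps positions
    in the sorted stream back to the original example indices so predictions
    can be scattered back later."""
    order = sorted(range(len(encoded)), key=lambda i: len(encoded[i]), reverse=True)
    sorted_encoded = [encoded[i] for i in order]
    batches = []
    for start in range(0, len(sorted_encoded), batch_size):
        batch = sorted_encoded[start : start + batch_size]
        max_len = len(batch[0])  # longest sequence comes first after sorting
        padded = torch.zeros(len(batch), max_len, dtype=torch.long)
        for i, seq in enumerate(batch):
            padded[i, : len(seq)] = torch.tensor(seq, dtype=torch.long)
        batches.append((padded, [len(s) for s in batch]))
    return order, batches
```

Sorting descending also means the first (largest) batch is the most memory-hungry, which gives the fail-fast-on-OOM behavior the commit mentions.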
Force-pushed c11b8bf to 9dfb516
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/utils/plugins/megatron_generate.py (1)
348-349: ⚠️ Potential issue | 🟡 Minor

eager_ids.item() will fail for batch_size > 1.

eager_ids has shape [max_batch_size, 1] (from line 337, where max_batch_size = input_ids.shape[0]). Calling .item() on a multi-element tensor raises RuntimeError. The function does not enforce or check for batch_size == 1, so generation will fail if called with batch_size > 1. Consider either:

- Adding a batch_size check at the start of the function to enforce batch_size == 1, or
- Handling batch EOS detection properly (e.g., per-sample tracking of which sequences have ended)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/plugins/megatron_generate.py` around lines 348 - 349, The current code calls eager_ids.item() which fails when batch_size > 1; update the function containing eager_ids, max_batch_size and input_ids (the megatron generation routine) to handle multi-example batches: either add an explicit check at the start (verify input_ids.shape[0] == 1 and raise/return if not) or implement per-sample EOS detection by comparing eager_ids.squeeze(-1) (or eager_ids[:,0]) against eos_token_id(s), maintain a boolean finished mask for each batch element, and break only when all entries in the finished mask are true; ensure references to eager_ids, eos_token_id, max_batch_size and input_ids are used to locate and change the logic accordingly.
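A per-sample EOS mask along the lines this comment suggests could look like the following. It is a hedged sketch, not the repository's code: `eos_token_id` and the commented decode loop are stand-ins for the real generation routine.

```python
import torch

def update_finished(finished: torch.Tensor, eager_ids: torch.Tensor,
                    eos_token_id: int) -> torch.Tensor:
    """Update a boolean [batch] mask instead of calling .item() on a
    multi-element tensor. eager_ids has shape [batch, 1]; squeeze the
    trailing dim before comparing against the EOS token id."""
    return finished | (eager_ids.squeeze(-1) == eos_token_id)

# Illustrative use inside a decode loop:
# finished = torch.zeros(batch, dtype=torch.bool)
# for step in range(max_new_tokens):
#     ...
#     finished = update_finished(finished, eager_ids, eos_token_id)
#     if finished.all():  # stop only when every sequence has emitted EOS
#         break
```

The key point is that the loop exits only when `finished.all()` holds, rather than breaking on the first (and only) sequence as the `.item()` version implicitly assumes.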
🧹 Nitpick comments (2)
modelopt/torch/utils/plugins/megatron_generate.py (2)
146-151: Same redundant pp_dtype recomputation.

This is the third occurrence of the same pp_dtype assignment. Remove it to use the value computed at lines 73-75.

♻️ Remove duplicate pp_dtype computation

```diff
 # For PP non-last stages, forward activations to the next stage and return early.
 if is_pp and not pp_last:
-    pp_dtype = model.config.pipeline_dtype or (
-        torch.bfloat16 if model.config.bf16 else torch.float32
-    )
     send_to_next_pipeline_rank(output.to(dtype=pp_dtype))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/plugins/megatron_generate.py` around lines 146 - 151, Remove the redundant recomputation of pp_dtype in the PP non-last stage block; instead reuse the pp_dtype previously computed earlier in the same scope and keep the call to send_to_next_pipeline_rank(output.to(dtype=pp_dtype)). Specifically, delete the duplicate assignment to pp_dtype in the if is_pp and not pp_last branch and ensure the branch uses the existing pp_dtype variable when calling send_to_next_pipeline_rank so functions like send_to_next_pipeline_rank and the output.to(dtype=pp_dtype) call remain unchanged.
111-122: Redundant pp_dtype computation.

pp_dtype is already computed at lines 73-75 and is used unchanged here and again at lines 148-150. Consider removing the duplicate assignments.

♻️ Remove duplicate pp_dtype computation

```diff
 # For PP, receive activations from the previous stage before calling forward.
 if is_pp and not pp_first:
-    pp_dtype = model.config.pipeline_dtype or (
-        torch.bfloat16 if model.config.bf16 else torch.float32
-    )
     recv_buffer = torch.empty(
         (padded_seq_len, batch_size, model.config.hidden_size),
         dtype=pp_dtype,
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/plugins/megatron_generate.py` around lines 111 - 122, The pp_dtype computation inside the PP receive block is redundant because pp_dtype is already computed earlier; remove the duplicate assignment in the if is_pp and not pp_first block and reuse the previously computed pp_dtype (derived from model.config.pipeline_dtype / model.config.bf16) when creating recv_buffer and calling recv_from_prev_pipeline_rank_ / get_attr_wrapped_model(model, "set_input_tensor")(recv_buffer); keep the rest of the block (recv_buffer creation, recv_from_prev_pipeline_rank_, set_input_tensor call) unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: fb6a7b1e-a076-4b58-a41d-19842543c8e9
📒 Files selected for processing (7)
- CHANGELOG.rst
- examples/megatron_bridge/prune_minitron.py
- examples/pruning/README.md
- modelopt/torch/export/plugins/megatron_importer.py
- modelopt/torch/utils/plugins/megatron_generate.py
- modelopt/torch/utils/plugins/megatron_mmlu.py
- tests/gpu_megatron/torch/utils/plugins/test_utils_megatron.py
✅ Files skipped from review due to trivial changes (4)
- CHANGELOG.rst
- modelopt/torch/export/plugins/megatron_importer.py
- examples/pruning/README.md
- modelopt/torch/utils/plugins/megatron_mmlu.py
Force-pushed 9dfb516 to 0b2aabc
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/gpu_megatron/torch/utils/plugins/test_utils_megatron.py (1)
51-51: Limit debug printing to rank 0 to reduce distributed log noise.

Line 51 currently prints on every rank; gating to rank 0 keeps CI logs readable.

Proposed tweak

```diff
-print(rank, output_text)
+if rank == 0:
+    print(rank, output_text)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gpu_megatron/torch/utils/plugins/test_utils_megatron.py` at line 51, The test currently prints on every distributed rank using print(rank, output_text); change this to only emit output on rank 0 by guarding the print with a rank check (e.g., if rank == 0) so that only the primary process logs the message (keep the same printed values rank and output_text). Ensure the conditional wraps the existing print call (reference: print(rank, output_text)).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/gpu_megatron/torch/utils/plugins/test_utils_megatron.py`:
- Line 25: The MMLU accuracy check in the test calls megatron_mmlu(...,
fraction=0.1, batch_size=16) which is regression-scale and too slow for the fast
GPU test path; either move the heavy assertion into a dedicated regression/slow
test suite or keep a tiny smoke check here (e.g., fraction=0.001 or a
single-batch run) and add the full-fraction assertion under a new
slow/regression test. Locate the call to megatron_mmlu in test_utils_megatron.py
and replace the large-fraction assertion with a lightweight smoke-run, or
extract the current assertion into a new test (named to indicate
slow/regression) that will not run in the fast GPU test set.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 8166e3be-85cd-4679-8c90-c18b5de0d7bf
📒 Files selected for processing (6)
- CHANGELOG.rst
- modelopt/torch/export/plugins/megatron_importer.py
- modelopt/torch/utils/logging.py
- modelopt/torch/utils/plugins/megatron_generate.py
- modelopt/torch/utils/plugins/megatron_mmlu.py
- tests/gpu_megatron/torch/utils/plugins/test_utils_megatron.py
✅ Files skipped from review due to trivial changes (2)
- CHANGELOG.rst
- modelopt/torch/utils/plugins/megatron_mmlu.py
🚧 Files skipped from review as they are similar to previous changes (2)
- modelopt/torch/export/plugins/megatron_importer.py
- modelopt/torch/utils/plugins/megatron_generate.py
Two bugs caused pipeline-parallel inference to produce garbage output:

1. megatron_generate/megatron_prefill used get_forward_backward_func() (the training pipeline scheduler), which is not designed for inference. Rewrote both functions to use explicit P2P communication via recv_from_prev_pipeline_rank_ / send_to_next_pipeline_rank, matching the pattern from run_mcore_inference.
2. import_mcore_gpt_from_hf loads HF weights into stage 0's embedding but never updates the output_layer on the last PP stage when share_embeddings_and_output_weights=True. After import, call model.setup_embeddings_and_output_layer() to re-run the all-reduce that syncs the output layer from stage 0 to the last stage.

Also parametrize the megatron_generate test to cover both TP and PP.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
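The recv → forward → send pattern the commit describes can be sketched abstractly. This is only an illustrative single-process shape, not Megatron code: `recv_fn`/`send_fn` stand in for the real point-to-point helpers (recv_from_prev_pipeline_rank_ / send_to_next_pipeline_rank), and the stub functions in the usage below replace actual model chunks.

```python
import torch

def prefill_stage(forward_fn, hidden_or_tokens, *, is_first, is_last,
                  recv_fn=None, send_fn=None):
    """Generic shape of explicit pipeline-parallel prefill:
    non-first stages receive activations from the previous stage,
    every stage runs its local forward, non-last stages send their
    output onward and return nothing; only the last stage keeps logits."""
    if not is_first:
        hidden_or_tokens = recv_fn()       # receive activations from prev stage
    output = forward_fn(hidden_or_tokens)  # local model-chunk forward
    if not is_last:
        send_fn(output)                    # forward activations to next stage
        return None
    return output
```

Simulating two stages in one process with a dict as the "wire": stage 0 doubles its input and sends it; stage 1 receives and adds one, yielding the final output.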
Force-pushed 1cec8ec to 802ce85
Force-pushed 802ce85 to fdfa85f
ChenhanYu
left a comment
Strong PR — fixes PP inference correctness (training scheduler → explicit P2P) and 6-10x MMLU speedup via batched prefill log-likelihood scoring (matching lm-evaluation-harness approach). Embedding re-sync after HF import with PP>1 is a real correctness bug fixed cleanly.
Minor (non-blocking):

- pp_dtype computed redundantly 3x in megatron_prefill and 2x in megatron_generate — first computation suffices
- skip_return_logits path still runs the broadcast (correct for PP sync, but worth a comment explaining why)
- MMLU test bounds (0.36-0.39) are tight for a 0.6B model — may be flaky across hardware. Consider widening or using lower bound only
- CandidateSubnet safe globals and percentage → fraction rename are unrelated — note in PR description
LGTM.
What does this PR do?
Type of change: new feature + bug fix
Two improvements to Megatron inference utilities:
1. Pipeline Parallel (PP) correctness fixes
PP inference was producing garbage output (MMLU ~0.24, random chance). Two root causes:
- megatron_generate/megatron_prefill used get_forward_backward_func() (the training pipeline scheduler), which is not designed for inference. Rewrote both functions to use explicit P2P communication via recv_from_prev_pipeline_rank_ / send_to_next_pipeline_rank, matching the run_mcore_inference pattern.
- import_mcore_gpt_from_hf loads HF weights into stage 0's embedding but never updates the output_layer on the last PP stage when share_embeddings_and_output_weights=True. At model init, setup_embeddings_and_output_layer() all-reduces from stage 0 to sync the output layer; after importing HF weights that all-reduce is stale. Fix: call model.setup_embeddings_and_output_layer() again after import.

2. megatron_mmlu speedup (~6x)

Replaces the megatron_mmlu implementation with a significantly faster approach that matches how lm-evaluation-harness scores multiple-choice questions.

Before: autoregressive generation (megatron_generate, osl=2) per example, 114 separate load_dataset calls, batch_size=1 — 260s for 5% data.

After: single prefill forward pass + argmax over {A,B,C,D} logits, 2 load_dataset calls, configurable batch_size — 18s for 5% data (~6x faster).

Changes
PP fixes:
- megatron_generate/megatron_prefill: replace get_forward_backward_func with explicit P2P (recv_from_prev_pipeline_rank_ / send_to_next_pipeline_rank)
- import_mcore_gpt_from_hf: call model.setup_embeddings_and_output_layer() after HF weight import when PP>1 and share_embeddings_and_output_weights=True
- megatron_prefill: add skip_return_logits param and VLM support (needed for PP non-last stages)

MMLU speedup:
- Replace megatron_generate with megatron_prefill — one forward pass per batch, no autoregressive decode loop
- Process examples in batch_size chunks
- Single load_dataset("cais/mmlu", "all") with per-subject grouping; skip dev load when few_shots=0
- percentage → fraction parameter rename for clarity

Testing
test_megatron_generate_and_mmlu parametrized over tp and pp. Accuracy assertion: 0.36 < score < 0.39. Manually checked generated text is coherent.

Before your PR is "Ready for review"
- percentage parameter renamed to fraction; enable_kv_cache removed from megatron_mmlu
- CONTRIBUTING.md: N/A

🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
New Features
Behavior Changes
Tests
Documentation