Fix masked token loss refactor from NeMo bump. (#855)

cspades · farhadrgh · dorotat-nv · farhadrgh · commit 917c9c4e76bc · 2025-05-05T09:12:36.000-07:00
### Description

- Fixes the error caused by NVIDIA-NeMo/NeMo#12459, which refactored `masked_token_loss` and `masked_token_loss_context_parallel` into a single function with a `cp_size` argument. The merged function no longer divides the loss by the number of "valid" (i.e. non-masked) tokens, so it returns a CP-reduced loss sum.
- Specifically, this breaks one of our golden value tests in `bionemo-llm`: `sub-packages/bionemo-llm/tests/bionemo/llm/model/test_loss.py::test_loss_equivalency_bionemo_vs_pytorch`. This PR fixes it with no behavior change to the LLM model `forward()`, i.e. we now perform the normalization over valid tokens on our side.

### Details

- Bump NeMo to a version greater than NVIDIA-NeMo/NeMo#12856 or matching #798.
- Update: Need to migrate to `inference_context` in NeMo: https://github.com/NVIDIA/NeMo/tree/cye/hyena-gpt-infer-context
- Bump Megatron to support new imports in the NeMo bump. Found a commit that bisects the new Megatron inference engine and the new NeMo imports to prevent breakage of our inference tests.
- Use a backend version of RoPE for the Amplify Megatron vs. PyTorch/HF parity test to avoid the CP process group requirement.
- `MaskedTokenLossReduction.forward()` return API changed.
- Added commentary for future devs to understand the code.

#### Appendix

- NeMo Fork Hotfix Patch: Safe import of a future module in Megatron to avoid upgrading.

```
get_gpt_heterogeneous_layer_spec, HAVE_GPT_HETEROGENEOUS = safe_import("megatron.core.models.gpt.heterogeneous.heterogeneous_layer_specs")
```

### Type of changes

- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [ ] Documentation update
- [ ] Other (please describe):

### Usage / Testing

- Tested against the commit specified in this PR: #798

```bash
cd 3rdparty/NeMo
git checkout c998e273f9cd23e36d7348fa27d0c2692efd87c8
pytest -s sub-packages/bionemo-llm/tests/bionemo/llm/model/test_loss.py::test_loss_equivalency_bionemo_vs_pytorch
```

---------

Signed-off-by: Farhad Ramezanghorbani <farhadr@nvidia.com>
Signed-off-by: Cory Ye <cye@nvidia.com>
Signed-off-by: cspades <cory0ye@gmail.com>
Signed-off-by: Timur Rvachov <trvachov@nvidia.com>
Signed-off-by: Danny <dreidenbach@nvidia.com>
Signed-off-by: Cory Ye <44509866+cspades@users.noreply.github.com>
Signed-off-by: nvdreidenbach <97637601+nvdreidenbach@users.noreply.github.com>
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
Signed-off-by: polinabinder1 <pbinder@nvidia.com>
Signed-off-by: dorotat <dorotat@nvidia.com>
Signed-off-by: Truong Nguyen <tgnguyen@nvidia.com>
Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>
Signed-off-by: Timur Rvachov <120140748+trvachov@users.noreply.github.com>
Signed-off-by: Steven <skothenhill@nvidia.com>
Co-authored-by: Farhad Ramezanghorbani <farhadr@nvidia.com>
Co-authored-by: Farhad Ramezanghorbani <farhadrgh@users.noreply.github.com>
Co-authored-by: Dorota Toczydlowska <115542912+dorotat-nv@users.noreply.github.com>
Co-authored-by: Timur Rvachov <120140748+trvachov@users.noreply.github.com>
Co-authored-by: nvdreidenbach <97637601+nvdreidenbach@users.noreply.github.com>
Co-authored-by: Steven Kothen-Hill <148821680+skothenhill-nv@users.noreply.github.com>
Co-authored-by: Peter St. John <pstjohn@nvidia.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: polinabinder1 <pbinder@nvidia.com>
Co-authored-by: Truong Nguyen <tgnguyen@nvidia.com>
Co-authored-by: jomitchellnv <148147880+jomitchellnv@users.noreply.github.com>
Co-authored-by: lvojtku <lvojtku@nvidia.com>
Signed-off-by: Farhad Ramezanghorbani <farhadr@nvidia.com>
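The normalization described above can be illustrated with a minimal pure-Python sketch. It is not the NeMo or BioNeMo implementation: there is no context parallelism or `torch.distributed` here, plain lists stand in for tensors, and the function names `masked_loss_sum`, `mean_loss_per_valid_token`, and `aggregate_across_ranks` are hypothetical. It only shows why a loss sum has to be divided by the number of valid tokens, and why the validation path reduces the pair (loss sum, token count) rather than averaging per-rank means.

```python
import math


def masked_loss_sum(token_losses, loss_mask):
    """Sum the per-token losses over valid positions (mask == 1), unnormalized."""
    return sum(loss * mask for loss, mask in zip(token_losses, loss_mask))


def mean_loss_per_valid_token(token_losses, loss_mask):
    """Divide the loss sum by the number of valid tokens.

    Returns NaN when there are no valid tokens, mirroring the divide-by-zero
    case that the diff treats as expected for empty microbatches.
    """
    num_valid = sum(loss_mask)
    if num_valid == 0:
        return float("nan")
    return masked_loss_sum(token_losses, loss_mask) / num_valid


def aggregate_across_ranks(microbatches):
    """Simulate a SUM reduction of (loss_sum, num_valid_tokens) across ranks.

    `microbatches` is a list of (token_losses, loss_mask) pairs, one per
    simulated data-parallel rank (an assumption made for illustration).
    """
    total_loss = sum(masked_loss_sum(losses, mask) for losses, mask in microbatches)
    total_tokens = sum(sum(mask) for _, mask in microbatches)
    return total_loss / total_tokens


# One microbatch: three of the four tokens are valid.
losses, mask = [2.0, 4.0, 6.0, 8.0], [1, 0, 1, 1]
loss_sum = masked_loss_sum(losses, mask)              # 2 + 6 + 8
mean_loss = mean_loss_per_valid_token(losses, mask)   # loss_sum / 3

# Two simulated ranks with different numbers of valid tokens.
ranks = [([2.0, 4.0], [1, 1]), ([10.0, 1.0], [1, 0])]
token_weighted_mean = aggregate_across_ranks(ranks)
mean_of_rank_means = sum(mean_loss_per_valid_token(l, m) for l, m in ranks) / len(ranks)
```

In the two-rank example the token-weighted mean and the mean of per-rank means differ, because the ranks hold different numbers of valid tokens. That is the reason the diff keeps the loss sum and the valid-token count together when the last partial validation batch is not dropped.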
diff --git a/Dockerfile b/Dockerfile
@@ -195,7 +195,7 @@ uv pip install --no-build-isolation \
 -r /requirements-test.txt
 
 # Install back ngcsdk, as a WAR for the protobuf version conflict with nemo_toolkit.
-uv pip install ngcsdk==3.63.0  # Remove when https://nvidia.slack.com/archives/CEX3JC6SF/p1744898511311379 is fixed.
+uv pip install ngcsdk==3.64.3  # Temporary fix for changed filename, see https://nvidia.slack.com/archives/C074Z808N05/p1746231345981209
 
 # Install nvidia-pytriton which seems to cause a conflict with pyzmq versions
 uv pip install nvidia-pytriton  # Temporary dependency until this gets added to requirements_nlp.txt in NeMo.
diff --git a/VERSION b/VERSION
@@ -1 +1 @@
-2.5
+2.6
diff --git a/docs/docs/models/geneformer.md b/docs/docs/models/geneformer.md
@@ -207,4 +207,4 @@ The 106M parameter variant of Geneformer achieves over 50 TFLOPS per GPU during
 
 ![GPU Performance (TFLOPS) Comparison Between Geneformer Model Variants on A100 GPUs](../assets/images/geneformer/model_tflops_per_gpu_chart_geneformer.png)
 
-    Performance will increase if the `num_dataset_workers` and the `micro_batch_size` are set appropriately. For the above metrics, we set `num_dataset_workers=8`. For the 10m model, set `micro_batch_size=120` and for the 106m model set the `micro_batch_size=16`. This will enable you to achieve similar performance results.
+Performance will increase if the `num_dataset_workers` and the `micro_batch_size` are set appropriately. For the above metrics, we set `num_dataset_workers=8`. For the 10m model, set `micro_batch_size=120` and for the 106m model set the `micro_batch_size=16`. This will enable you to achieve similar performance results.
diff --git a/docs/docs/user-guide/appendix/releasenotes-fw.md b/docs/docs/user-guide/appendix/releasenotes-fw.md
@@ -1,5 +1,17 @@
 # Release Notes
 
+## BioNeMo Framework v2.6
+
+### New Features
+
+* Adds support for AMPLIFY [doi:10.1101/2024.09.23.614603](https://doi.org/10.1101/2024.09.23.614603) pre-training and inference, offering a 70% speedup over the xformers-based attention backend with similar final perplexity values at 1M pre-training steps. (4.23 for 120M, 3.05 for 350M). The model is fully compatible with existing weights on HuggingFace.
+* Adds alpha support for [LoRA fine-tuning to for ESM2 models](https://nvidia.github.io/bionemo-framework/models/ESM-2/#lora-fine-tuning-performace). Inference and fine-tuning are enabled along with resumption from a checkpoint.
+
+### Updates & Improvements
+
+* Blackwell support, tested on B200 systems.
+* Fixed Grace CPU support, released ARM compatible container.
+
 ## BioNeMo Framework v2.5
 
 ### New Features
@@ -12,6 +24,9 @@
 * Upgrade bionemo-moco to v0.0.2
 * Brev.dev launchable tutorials
 
+#### Known Issues
+* Partial test failures on ARM CPUs.
+
 ## BioNeMo Framework v2.4.1
 
 ### Updates & Improvements
@@ -23,6 +38,9 @@
 * Draft implementation of Evo2 with support for Hyena operators
 * bionemo-moco v0.0.1 released for building diffusion-like generative models.
 
+### Known Issues
+* Partial test failures on ARM CPUs.
+
 ### Updates & Improvements
 
 * ESM2 fine-tuning script with CLI (finetune_esm2) that supports sequence-level/token-level classification/regression using a CSV dataset.
diff --git a/sub-packages/bionemo-amplify/tests/bionemo/amplify/test_hf_rotary.py b/sub-packages/bionemo-amplify/tests/bionemo/amplify/test_hf_rotary.py
@@ -14,7 +14,7 @@
 # limitations under the License.
 
 import torch
-from megatron.core.models.common.embeddings.rope_utils import apply_rotary_pos_emb
+from megatron.core.models.common.embeddings.rope_utils import _apply_rotary_pos_emb_bshd
 from megatron.core.models.common.embeddings.rotary_pos_embedding import RotaryEmbedding
 from transformers import AutoConfig
 
@@ -47,8 +47,20 @@ def test_rope_embeddings():
         seq_len_interpolation_factor=nemo_config.seq_len_interpolation_factor,
     )
     rotary_pos_emb = rotary_pos_layer(q.shape[1])
-    q_post_nemo = apply_rotary_pos_emb(q.transpose(0, 1).cuda(), rotary_pos_emb.cuda(), config=nemo_config).cpu()
-    k_post_nemo = apply_rotary_pos_emb(k.transpose(0, 1).cuda(), rotary_pos_emb.cuda(), config=nemo_config).cpu()
+    # Note: Use the backend implementation of the RoPE to avoid
+    # getting or instantiating a CP process group.
+    q_post_nemo = _apply_rotary_pos_emb_bshd(
+        q.transpose(0, 1).cuda(),
+        rotary_pos_emb.cuda(),
+        rotary_interleaved=nemo_config.rotary_interleaved,
+        multi_latent_attention=nemo_config.multi_latent_attention,
+    ).cpu()
+    k_post_nemo = _apply_rotary_pos_emb_bshd(
+        k.transpose(0, 1).cuda(),
+        rotary_pos_emb.cuda(),
+        rotary_interleaved=nemo_config.rotary_interleaved,
+        multi_latent_attention=nemo_config.multi_latent_attention,
+    ).cpu()
 
     torch.testing.assert_close(q_post, q_post_nemo.transpose(0, 1))
     torch.testing.assert_close(k_post, k_post_nemo.transpose(0, 1))
diff --git a/sub-packages/bionemo-esm2/tests/bionemo/esm2/scripts/test_train_esm2.py b/sub-packages/bionemo-esm2/tests/bionemo/esm2/scripts/test_train_esm2.py
@@ -327,7 +327,7 @@ def test_main_runs(tmp_path, dummy_protein_dataset, dummy_parquet_train_val_inpu
     event_files = list(log_dir.rglob("events.out.tfevents*"))
     assert event_files, f"No TensorBoard event files found under {log_dir}"
     assert "val_ppl" in trainer.logged_metrics  # validation logging on by default
-    assert "tflops_per_sec_per_gpu" in trainer.logged_metrics  # ensuring that tflops logger can be added
+    assert "TFLOPS_per_GPU" in trainer.logged_metrics  # ensuring that tflops logger can be added
     assert "train_step_timing in s" in trainer.logged_metrics
 
 
diff --git a/sub-packages/bionemo-evo2/src/bionemo/evo2/run/predict.py b/sub-packages/bionemo-evo2/src/bionemo/evo2/run/predict.py
@@ -157,7 +157,9 @@ def predict_step(self, batch, batch_idx: Optional[int] = None) -> Tensor:
             return forward_out
         # Reminder: the model's predictions for input i land at output i+1. To get everything to align, we prepend the
         # EOS token to the input sequences and take the outputs for all but the first token.
-        forward_out_tp_gathered = _gather_along_last_dim(forward_out)
+        forward_out_tp_gathered = _gather_along_last_dim(
+            forward_out, group=parallel_state.get_tensor_model_parallel_group()
+        )
         # else:
         #     forward_out_tp_gathered = _collect_into_dim(forward_out, dim=-1)
         forward_out_gathered = _gather_along_cp_dim(forward_out_tp_gathered)
diff --git a/sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py b/sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py
@@ -146,7 +146,7 @@ def test_train_evo2_stops(tmp_path):
     )
 
     assert "reduced_train_loss" in trainer.logged_metrics  # validation logging on by default
-    assert "tflops_per_sec_per_gpu" in trainer.logged_metrics  # ensuring that tflops logger can be added
+    assert "TFLOPS_per_GPU" in trainer.logged_metrics  # ensuring that tflops logger can be added
     assert "train_step_timing in s" in trainer.logged_metrics
 
 
diff --git a/sub-packages/bionemo-llm/src/bionemo/llm/model/loss.py b/sub-packages/bionemo-llm/src/bionemo/llm/model/loss.py
@@ -19,7 +19,10 @@
 from megatron.core import parallel_state, tensor_parallel
 from megatron.core.fusions.fused_cross_entropy import fused_vocab_parallel_cross_entropy
 from nemo.collections.nlp.modules.common.megatron.utils import average_losses_across_data_parallel_group
-from nemo.lightning.megatron_parallel import MegatronLossReduction, masked_token_loss
+from nemo.lightning.megatron_parallel import (
+    MegatronLossReduction,
+    masked_token_loss,
+)
 from torch import Tensor
 
 
@@ -175,14 +178,17 @@ def forward(
 
         # TODO(@jstjohn) also handle different output keys, like the sequence loss.
 
-        # compute loss
+        # Compute loss over "valid" tokens in the microbatch, i.e. the non-masked tokens.
+        # The loss is not normalized, only potentially reduced via torch.distributed.ReduceOp.SUM
+        # across the context parallel process group, so you need to divide by the number
+        # of non-masked tokens (loss_mask.sum()) to compute the mean reduced loss per token.
         cp_size = parallel_state.get_context_parallel_world_size()
-        loss_for_microbatch = masked_token_loss(unreduced_token_loss, batch["loss_mask"], cp_size)
+        loss_for_microbatch = masked_token_loss(unreduced_token_loss, batch["loss_mask"], cp_size=cp_size)
+        num_valid_tokens_in_microbatch = batch["loss_mask"].sum()
 
         # If we do not drop the last partial batch of validation, we need to do fancy reduction handling to support
         #  reducing the loss across the data parallel group.
         if self.validation_step and not self.val_drop_last:
-            num_valid_tokens_in_microbatch = batch["loss_mask"].sum()
             if loss_for_microbatch.isnan():
                 # TODO(@jomitchell): Add a unit test for this. This is the case where there are no valid tokens in the microbatch for the loss
                 #  to be computed over, so we expect a NaN loss (divide by zero for a mean) but we make this an expected and non-breaking case,
@@ -191,9 +197,8 @@ def forward(
                     raise ValueError("Got NaN loss with non-empty input")
                 loss_sum_for_microbatch = torch.zeros_like(num_valid_tokens_in_microbatch)
             else:
-                loss_sum_for_microbatch = (
-                    num_valid_tokens_in_microbatch * loss_for_microbatch
-                )  # sum over all valid tokens
+                # The reduced loss is already the sum of all losses from masked_token_loss().
+                loss_sum_for_microbatch = loss_for_microbatch
 
             # In this case we need to store the loss sum as well as the number of valid tokens in the microbatch.
             loss_sum_and_microbatch_size_all_gpu = torch.cat(
@@ -202,17 +207,28 @@ def forward(
                     Tensor([num_valid_tokens_in_microbatch]).cuda().clone().detach(),
                 ]
             )
+
+            # Reduce the loss sum across the data parallel group to get the total loss
+            # for all data parallel / distributed microbatches.
             torch.distributed.all_reduce(
                 loss_sum_and_microbatch_size_all_gpu,
                 group=parallel_state.get_data_parallel_group(),
                 op=torch.distributed.ReduceOp.SUM,
             )
+
+            # Return the loss tensor multiplied by the context parallel size,
+            # and the data & context parallel reduced loss sum.
             return loss_for_microbatch * cp_size, {
                 "loss_sum_and_microbatch_size": loss_sum_and_microbatch_size_all_gpu
             }
 
-        # average the losses across the data parallel group, but also return the unreduced loss
-        reduced_loss = average_losses_across_data_parallel_group([loss_for_microbatch])
+        # Return the loss tensor multiplied by the context parallel size, as well as
+        # the data-parallel averaged loss, i.e. the loss divided by the DP size.
+        # Normalize the loss by the number of "valid" tokens, because masked_token_loss
+        # no longer does this normalization, and BioNeMo losses expect this normalization.
+        reduced_loss = (
+            average_losses_across_data_parallel_group([loss_for_microbatch]) / num_valid_tokens_in_microbatch
+        )
         return loss_for_microbatch * cp_size, {"avg": reduced_loss}
 
 
diff --git a/sub-packages/bionemo-llm/src/bionemo/llm/utils/callbacks.py b/sub-packages/bionemo-llm/src/bionemo/llm/utils/callbacks.py
@@ -29,7 +29,12 @@
 
 
 class PredictionWriter(BasePredictionWriter, pl.Callback):
-    """A callback that writes predictions to disk at specified intervals during training."""
+    """A callback that writes predictions to disk at specified intervals during training.
+
+    Logits, Embeddings, Hiddens, Input IDs, and Labels may all be saved to the disk depending on trainer configuration.
+    Batch Idxs are provided for each prediction in the same dictionary. These must be used to maintain order between
+    multi device predictions and single device predictions.
+    """
 
     def __init__(
         self,
@@ -42,15 +47,28 @@ def __init__(
 
         Args:
             output_dir: The directory where predictions will be written.
-            write_interval: The interval at which predictions will be written. (batch, epoch)
+            write_interval: The interval at which predictions will be written (batch, epoch). Epoch may not be used with multi-device trainers.
             batch_dim_key_defaults: The default batch dimension for each key, if different from the standard 0.
             seq_dim_key_defaults: The default sequence dimension for each key, if different from the standard 1.
         """
         super().__init__(write_interval)
+        self.write_interval = write_interval
         self.output_dir = str(output_dir)
         self.batch_dim_key_defaults = batch_dim_key_defaults
         self.seq_dim_key_defaults = seq_dim_key_defaults
 
+    def setup(self, trainer: pl.Trainer, pl_module: pl.LightningModule, *args, **kwargs) -> None:  # noqa: D417
+        """Invoked with Trainer.fit, validate, test, and predict are called. Will immediately fail when 'write_interval' is 'epoch' and 'trainer.num_devices' > 1.
+
+        Args:
+            trainer: The Trainer instance.
+            pl_module: The LightningModule instance.
+        """
+        if trainer.num_devices > 1 and self.write_interval == "epoch":
+            raise ValueError(
+                "Multi-GPU predictions are not permitted as outputs are not ordered and batch indices are lost."
+            )
+
     def write_on_batch_end(
         self,
         trainer: pl.Trainer,
@@ -63,6 +81,9 @@ def write_on_batch_end(
     ) -> None:
         """Writes predictions to disk at the end of each batch.
 
+        Predictions files follow the naming pattern, where rank is the active GPU in which the predictions were made.
+        predictions__rank_{rank}__batch_{batch_idx}.pt
+
         Args:
             trainer: The Trainer instance.
             pl_module: The LightningModule instance.
@@ -77,7 +98,12 @@ def write_on_batch_end(
         result_path = os.path.join(self.output_dir, f"predictions__rank_{trainer.global_rank}__batch_{batch_idx}.pt")
 
         # batch_indices is not captured due to a lightning bug when return_predictions = False
-        # we use input IDs in the prediction to map the result to input
+        # we use input IDs in the prediction to map the result to input.
+
+        # NOTE store the batch_idx so we do not need to rely on filenames for reconstruction of inputs. This is wrapped
+        # in a tensor and list container to ensure compatibility with batch_collator.
+        prediction["batch_idx"] = torch.tensor([batch_idx], dtype=torch.int64)
+
         torch.save(prediction, result_path)
         logging.info(f"Inference predictions are stored in {result_path}\n{prediction.keys()}")
 
@@ -90,14 +116,23 @@ def write_on_epoch_end(
     ) -> None:
         """Writes predictions to disk at the end of each epoch.
 
+        Writing all predictions on epoch end is memory intensive. It is recommended to use the batch writer instead for
+        large predictions.
+
+        Multi-device predictions will likely yield predictions in an order that is inconsistent with single device predictions and the input data.
+
         Args:
             trainer: The Trainer instance.
             pl_module: The LightningModule instance.
             predictions: The predictions made by the model.
             batch_indices: The indices of the batch.
+
+        Raises:
+            Multi-GPU predictions are output in an inconsistent order with multiple devices.
         """
         # this will create N (num processes) files in `output_dir` each containing
         # the predictions of it's respective rank
+
         result_path = os.path.join(self.output_dir, f"predictions__rank_{trainer.global_rank}.pt")
 
         # collate multiple batches / ignore empty ones
@@ -106,13 +141,14 @@ def write_on_epoch_end(
             collate_kwargs["batch_dim_key_defaults"] = self.batch_dim_key_defaults
         if self.seq_dim_key_defaults is not None:
             collate_kwargs["seq_dim_key_defaults"] = self.seq_dim_key_defaults
+
         prediction = batch_collator([item for item in predictions if item is not None], **collate_kwargs)
 
         # batch_indices is not captured due to a lightning bug when return_predictions = False
         # we use input IDs in the prediction to map the result to input
-        torch.save(prediction, result_path)
         if isinstance(prediction, dict):
             keys = prediction.keys()
         else:
             keys = "tensor"
+        torch.save(prediction, result_path)
         logging.info(f"Inference predictions are stored in {result_path}\n{keys}")
diff --git a/sub-packages/bionemo-llm/tests/bionemo/llm/model/test_loss.py b/sub-packages/bionemo-llm/tests/bionemo/llm/model/test_loss.py
@@ -75,7 +75,7 @@ def test_loss_equivalency_nemo_vs_pytorch():
             batch=batch_megatron,
             forward_out=unreduced_megatron_loss,  # wants the loss directly
         )
-        final_nemo_loss = nemo_default_loss_fn.reduce([forward_nemo_loss[1]])
+        final_nemo_loss = nemo_default_loss_fn.reduce([forward_nemo_loss[2]])
 
         # First check, nemo+megatron loss
         torch.testing.assert_close(expected_loss, final_nemo_loss)
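The `PredictionWriter` change above stores `batch_idx` inside each saved prediction so that the order of inputs can be reconstructed without parsing filenames. A minimal sketch of that reconstruction for the files of a single rank follows. It assumes the prediction dictionaries have already been loaded (for example with `torch.load`), uses one-element lists in place of the one-element `int64` tensors the callback writes, and the helper name `restore_batch_order` and the `"name"` key are hypothetical. How batches from different ranks interleave depends on the sampler and is not covered here.

```python
def restore_batch_order(predictions):
    """Return per-batch prediction dicts sorted by their stored batch index.

    Each dict is expected to carry a one-element "batch_idx" container,
    matching the shape of what write_on_batch_end stores.
    """
    return sorted(predictions, key=lambda pred: int(pred["batch_idx"][0]))


# Predictions as they might be listed from disk, in arbitrary order.
loaded = [
    {"batch_idx": [2], "name": "third"},
    {"batch_idx": [0], "name": "first"},
    {"batch_idx": [1], "name": "second"},
]
ordered_names = [pred["name"] for pred in restore_batch_order(loaded)]
```

Because `int(...)` also accepts a one-element tensor element, the same key function should work on the saved tensors, though that has not been checked against the actual output files.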
