Fix masked token loss refactor from NeMo bump. (#855)
### Description
- Fixes the error caused by NVIDIA-NeMo/NeMo#12459, which refactored
`masked_token_loss` and `masked_token_loss_context_parallel` into a
single function with a `cp_size` argument. The merged function no longer
divides the loss by the number of "valid" (i.e. non-masked) tokens, so it
returns a CP-reduced loss sum.
- Specifically, the refactor broke one of our golden value tests in
`bionemo-llm`:
`sub-packages/bionemo-llm/tests/bionemo/llm/model/test_loss.py::test_loss_equivalency_bionemo_vs_pytorch`.
This PR fixes it with no behavior change to the LLM model `forward()`,
i.e. we now perform the normalization over valid tokens on our side.
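To illustrate the normalization described above, here is a minimal pure-Python sketch. It is not the NeMo or BioNeMo implementation: the function names `masked_loss_sum` and `normalize_by_valid_tokens` are hypothetical, plain lists stand in for tensors, and clamping the token count to at least 1 is an assumption made to avoid division by zero.

```python
from typing import Sequence, Tuple


def masked_loss_sum(
    per_token_loss: Sequence[float], loss_mask: Sequence[int]
) -> Tuple[float, int]:
    """Hypothetical stand-in for a loss function that returns a sum.

    Returns the loss summed over unmasked tokens together with the
    number of unmasked ("valid") tokens. No division happens here.
    """
    total = sum(loss * mask for loss, mask in zip(per_token_loss, loss_mask))
    num_valid = sum(loss_mask)
    return total, num_valid


def normalize_by_valid_tokens(loss_sum: float, num_valid: int) -> float:
    """Caller-side normalization: mean loss over valid tokens.

    Clamping to 1 is an assumption so a fully masked batch yields 0.0
    instead of raising ZeroDivisionError.
    """
    return loss_sum / max(num_valid, 1)


# Toy batch: tokens 0 and 2 are valid, tokens 1 and 3 are masked out.
per_token_loss = [2.0, 4.0, 6.0, 8.0]
loss_mask = [1, 0, 1, 0]

loss_sum, num_valid = masked_loss_sum(per_token_loss, loss_mask)
mean_loss = normalize_by_valid_tokens(loss_sum, num_valid)
```

Keeping the division on the caller's side means the golden value test can still compare a per-valid-token mean, even though the upstream function only returns the sum.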
### Details
- Bump NeMo to a version greater than:
NVIDIA-NeMo/NeMo#12856 or matching this:
#798
- Update: Need to migrate to `inference_context` in NeMo:
https://github.com/NVIDIA/NeMo/tree/cye/hyena-gpt-infer-context
- Bump Megatron to support new imports in the NeMo bump. Found a commit
that bisects the new Megatron inference engine and the new NeMo imports
to prevent breakage of our inference tests.
- Use a backend version of RoPE for the Amplify Megatron vs. PyTorch/HF
parity test to avoid the CP process group requirement.
- `MaskedTokenLossReduction.forward()` return API changed.
- Added commentary for future devs to understand the code.
#### Appendix
- NeMo Fork Hotfix Patch: Safe import of a future module in Megatron to
avoid upgrading.
```python
get_gpt_heterogeneous_layer_spec, HAVE_GPT_HETEROGENEOUS = safe_import("megatron.core.models.gpt.heterogeneous.heterogeneous_layer_specs")
```
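For readers unfamiliar with the pattern, a guarded import of this kind could be sketched with the standard library alone. The helper below, `try_import`, is hypothetical and is not NeMo's `safe_import`; in particular, returning `None` for a missing module is an assumption of this sketch, and the real helper may return something else.

```python
import importlib
from typing import Any, Tuple


def try_import(module_name: str) -> Tuple[Any, bool]:
    """Hypothetical helper: import a module without failing hard.

    Returns (module, True) when the import succeeds and (None, False)
    when the module, or one of its parent packages, is not installed.
    """
    try:
        return importlib.import_module(module_name), True
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so both are caught.
        return None, False


# A module that exists in every Python installation.
json_mod, have_json = try_import("json")

# A module name that is assumed not to exist anywhere.
missing_mod, have_missing = try_import("a_module_that_does_not_exist_12345")
```

Callers can then branch on the boolean flag, which is how an older Megatron version can stay installed while newer, optional modules are simply reported as unavailable.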
### Type of changes
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [ ] Documentation update
- [ ] Other (please describe):
### Usage / Testing
- Tested against the commit specified in this PR:
#798
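```bash
# Pin the NeMo submodule to the commit tested in this PR.
cd 3rdparty/NeMo
git checkout c998e273f9cd23e36d7348fa27d0c2692efd87c8
# Return to the repository root so the test path below resolves.
cd ../..
pytest -s sub-packages/bionemo-llm/tests/bionemo/llm/model/test_loss.py::test_loss_equivalency_bionemo_vs_pytorch
```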
---------
Signed-off-by: Farhad Ramezanghorbani <farhadr@nvidia.com>
Signed-off-by: Cory Ye <cye@nvidia.com>
Signed-off-by: cspades <cory0ye@gmail.com>
Signed-off-by: Timur Rvachov <trvachov@nvidia.com>
Signed-off-by: Danny <dreidenbach@nvidia.com>
Signed-off-by: Cory Ye <44509866+cspades@users.noreply.github.com>
Signed-off-by: nvdreidenbach <97637601+nvdreidenbach@users.noreply.github.com>
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>
Signed-off-by: polinabinder1 <pbinder@nvidia.com>
Signed-off-by: dorotat <dorotat@nvidia.com>
Signed-off-by: Truong Nguyen <tgnguyen@nvidia.com>
Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>
Signed-off-by: Timur Rvachov <120140748+trvachov@users.noreply.github.com>
Signed-off-by: Steven <skothenhill@nvidia.com>
Co-authored-by: Farhad Ramezanghorbani <farhadr@nvidia.com>
Co-authored-by: Farhad Ramezanghorbani <farhadrgh@users.noreply.github.com>
Co-authored-by: Dorota Toczydlowska <115542912+dorotat-nv@users.noreply.github.com>
Co-authored-by: Timur Rvachov <120140748+trvachov@users.noreply.github.com>
Co-authored-by: nvdreidenbach <97637601+nvdreidenbach@users.noreply.github.com>
Co-authored-by: Steven Kothen-Hill <148821680+skothenhill-nv@users.noreply.github.com>
Co-authored-by: Peter St. John <pstjohn@nvidia.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: polinabinder1 <pbinder@nvidia.com>
Co-authored-by: Truong Nguyen <tgnguyen@nvidia.com>
Co-authored-by: jomitchellnv <148147880+jomitchellnv@users.noreply.github.com>
Co-authored-by: lvojtku <lvojtku@nvidia.com>
Signed-off-by: Farhad Ramezanghorbani <farhadr@nvidia.com>
docs/docs/models/geneformer.md (+1, -1)
@@ -207,4 +207,4 @@ The 106M parameter variant of Geneformer achieves over 50 TFLOPS per GPU during
-Performance will increase if the `num_dataset_workers` and the `micro_batch_size` are set appropriately. For the above metrics, we set `num_dataset_workers=8`. For the 10m model, set `micro_batch_size=120` and for the 106m model set the `micro_batch_size=16`. This will enable you to achieve similar performance results.
+Performance will increase if the `num_dataset_workers` and the `micro_batch_size` are set appropriately. For the above metrics, we set `num_dataset_workers=8`. For the 10m model, set `micro_batch_size=120` and for the 106m model set the `micro_batch_size=16`. This will enable you to achieve similar performance results.
docs/docs/user-guide/appendix/releasenotes-fw.md (+18, -0)
@@ -1,5 +1,17 @@
 # Release Notes
 
+## BioNeMo Framework v2.6
+
+### New Features
+
+* Adds support for AMPLIFY [doi:10.1101/2024.09.23.614603](https://doi.org/10.1101/2024.09.23.614603) pre-training and inference, offering a 70% speedup over the xformers-based attention backend with similar final perplexity values at 1M pre-training steps. (4.23 for 120M, 3.05 for 350M). The model is fully compatible with existing weights on HuggingFace.
+* Adds alpha support for [LoRA fine-tuning to for ESM2 models](https://nvidia.github.io/bionemo-framework/models/ESM-2/#lora-fine-tuning-performace). Inference and fine-tuning are enabled along with resumption from a checkpoint.
+
+### Updates & Improvements
+
+* Blackwell support, tested on B200 systems.
+* Fixed Grace CPU support, released ARM compatible container.
+
 ## BioNeMo Framework v2.5
 
 ### New Features
@@ -12,6 +24,9 @@
 * Upgrade bionemo-moco to v0.0.2
 * Brev.dev launchable tutorials
 
+#### Known Issues
+* Partial test failures on ARM CPUs.
+
 ## BioNeMo Framework v2.4.1
 
 ### Updates & Improvements
@@ -23,6 +38,9 @@
 * Draft implementation of Evo2 with support for Hyena operators
 * bionemo-moco v0.0.1 released for building diffusion-like generative models.
 
+### Known Issues
+* Partial test failures on ARM CPUs.
+
 ### Updates & Improvements
 
 * ESM2 fine-tuning script with CLI (finetune_esm2) that supports sequence-level/token-level classification/regression using a CSV dataset.
"""Invoked with Trainer.fit, validate, test, and predict are called. Will immediately fail when 'write_interval' is 'epoch' and 'trainer.num_devices' > 1.