adds esm2-15b plot to esm2_native_te readme (#1523)

jomitchellnv · web-flow · commit 46112e773414 · 2026-03-16T02:41:22.000Z
### Description

Another plot moved and a description added

#### Usage

```python
TODO: Add code snippet
```

### Type of changes

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [ ] Documentation update
- [ ] Other (please describe):

### CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

- [ciflow:skip](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:skip) - Skip all CI tests for this PR
- [ciflow:notebooks](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:notebooks) - Run Jupyter notebooks execution tests for bionemo2
- [ciflow:slow](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:slow) - Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2
- [ciflow:all](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:all) - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. This label can be used to enforce running tests for all bionemo2.
- [ciflow:all-recipes](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:all-recipes) - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as `@pytest.mark.multi_gpu` or `@pytest.mark.distributed` are not run in the PR pipeline.

For more details, see [CONTRIBUTING](CONTRIBUTING.md)

> [!NOTE]
> By default, only basic unit tests are run. Add appropriate labels to enable an additional test coverage.

#### Authorizing CI Runs

We use [copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation) to manage authorization of CI runs on NVIDIA's compute resources.

- If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
- If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an `/ok to test` comment on the pull request to trigger CI. This will need to be done for each new commit.

#### Triggering Code Rabbit AI Review

To trigger a code review from code rabbit, comment on a pull request with one of these commands:

- @coderabbitai review - Triggers a standard review
- @coderabbitai full review - Triggers a comprehensive review

See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.

### Pre-submit Checklist

- [ ] I have tested these changes locally
- [ ] I have updated the documentation accordingly
- [ ] I have added/updated tests as needed
- [ ] All existing tests pass successfully

Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>
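
The README paragraph added in the diff quotes speedup factors alongside raw throughput numbers. As a quick sanity check, the ratios can be recomputed from the quoted unpadded tokens/s/GPU figures. This is only a minimal sketch: the four numbers come from the README text in this commit, while the dictionary layout and the `speedup` and `summarize` names are hypothetical and not part of the repository.

```python
# Unpadded tokens/s/GPU figures quoted in the README paragraph added by this commit.
# The dictionary keys and function names are illustrative assumptions,
# not identifiers from the bionemo-framework repository.
THROUGHPUT = {
    "H200": {"fa2_baseline": 1630, "te_fp8_block_scaling": 7119},
    "B200": {"fa2_baseline": 2958, "te_nvfp4": 21476},
}


def speedup(baseline: float, optimized: float) -> float:
    """Return the throughput ratio of an optimized run over its baseline."""
    return optimized / baseline


def summarize(throughput: dict) -> dict:
    """Map (gpu, configuration) to its speedup over the FA2 baseline, rounded to 0.1."""
    ratios = {}
    for gpu, runs in throughput.items():
        baseline = runs["fa2_baseline"]
        for name, value in runs.items():
            if name != "fa2_baseline":
                ratios[(gpu, name)] = round(speedup(baseline, value), 1)
    return ratios


if __name__ == "__main__":
    for (gpu, name), ratio in summarize(THROUGHPUT).items():
        print(f"{gpu} {name}: {ratio}x over the FA2 baseline")
```

Rounded to one decimal place, the ratios agree with the 4.4x (H200) and 7.3x (B200) figures stated in the README text.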
diff --git a/bionemo-recipes/recipes/esm2_native_te/README.md b/bionemo-recipes/recipes/esm2_native_te/README.md
@@ -67,12 +67,8 @@ docker run -it --gpus all --network host --ipc=host --rm -v ${PWD}:/workspace/bi
 
 ### Performance Benchmarks
 
-![Performance Benchmarks](../../../docs/docs/assets/images/esm2/esm2_native_te_benchmarks.svg)
-
-Note: "compiled" refers to `torch.compile`. "fa2" is [FlashAttention2](https://github.com/Dao-AILab/flash-attention).
-Recently, we measured 2800 tokens/second/GPU training speed on H100 with HuggingFace Transformers's ESM-2 implementation
-of THD sequence packing, however we have not been able to make this configuration work on Blackwell and this work is
-still in progress.
+![Performance Benchmarks](../../../docs/docs/assets/images/esm2/esm2_low_precision/esm2_15b_grouped_bars.png)
+ESM-2 15B single-node pretraining benchmarks were run on 1 node (8 GPUs) on both H200 (140 GB) and B200 (192 GB) hardware. We evaluate a progression of optimization strategies: PyTorch Flash Attention 2 (with and without `torch.compile`), Transformer Engine with BSHD and THD (sequence packing) layouts, and low-precision training with FP8 Block Scaling on H200 and with MXFP8 and NVFP4 on B200. On H200, moving from baseline FA2 to TE with FP8 Block Scaling yields a 4.4x improvement in unpadded tokens/s/GPU (1,630 → 7,119). On B200, the full optimization stack from FA2 to NVFP4 delivers a 7.3x speedup (2,958 → 21,476), with NVFP4 reaching over 2,000 TFLOPS per GPU. Sequence packing (THD) alone accounts for a 2.5–2.7x gain over padded BSHD on both platforms, while Blackwell-native low-precision formats (MXFP8, NVFP4) unlock an additional 1.3–1.8x on top of that. For more information on how low precision scales with model parameters, see the next section.
 
 ### Low precision performance benchmarks