Commit abd0ecb

Fix evo2 recipe fp8 (#1403)
### Description

Fix FP8 current scaling in evo2. NOTE: this breaks delayed scaling, but delayed scaling is being deprecated.

The following image shows that by around step 5k, the fp8-current scaling experiment (with first/last two layers bf16) converges to the previous bf16 runs.

<img width="860" height="386" alt="FP8-current scaling with first/last two layers in bf16 converges to the full bf16 line, while still being faster." src="https://github.com/user-attachments/assets/e06dfdc3-393a-4463-bf0c-cf7619815699" />

Step timing comparison:

* BF16: 25,100 tokens/sec/gpu
* FP8-delayed (poor accuracy): 29,200 tokens/sec/gpu
* FP8-current first/last 2 layers bf16: 28,600 tokens/sec/gpu

Later we should test fp8-current without keeping the first/last two layers in bf16 and see how well it does. For the purposes of this PR, though, we have a method that works well now.

#### Usage

`--mixed-precision-recipe nemotron_h_bf16_with_fp8_current_scaling_mixed`

### Type of changes

<!-- Mark the relevant option with an [x] -->

- [x] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [ ] Documentation update
- [ ] Other (please describe):

### CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

- [ciflow:skip](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:skip) - Skip all CI tests for this PR
- [ciflow:notebooks](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:notebooks) - Run Jupyter notebooks execution tests for bionemo2
- [ciflow:slow](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:slow) - Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2
- [ciflow:all](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:all) - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. This label can be used to enforce running tests for all bionemo2.
- [ciflow:all-recipes](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:all-recipes) - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as `@pytest.mark.multi_gpu` or `@pytest.mark.distributed` are not run in the PR pipeline.

For more details, see [CONTRIBUTING](CONTRIBUTING.md)

> [!NOTE]
> By default, only basic unit tests are run. Add appropriate labels to enable additional test coverage.

#### Authorizing CI Runs

We use [copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation) to manage authorization of CI runs on NVIDIA's compute resources.

- If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
- If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an `/ok to test` comment on the pull request to trigger CI. This will need to be done for each new commit.

### Pre-submit Checklist

<!--- Ensure all items are completed before submitting -->

- [ ] I have tested these changes locally
- [ ] I have updated the documentation accordingly
- [ ] I have added/updated tests as needed
- [ ] All existing tests pass successfully

---------

Signed-off-by: John St. John <jstjohn@nvidia.com>
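For readers comparing the step timing numbers in the description, here is a minimal sketch that turns them into speedups relative to BF16. Only the three throughput figures come from the PR description; the names `THROUGHPUT`, `SPEEDUPS`, and `relative_speedup` are hypothetical and chosen for illustration.

```python
# Throughput figures (tokens/sec/gpu) copied from the step timing comparison
# in the PR description. Everything else here is illustrative.
THROUGHPUT = {
    "BF16": 25_100,
    "FP8-delayed (poor accuracy)": 29_200,
    "FP8-current first/last 2 layers bf16": 28_600,
}


def relative_speedup(baseline: float, candidate: float) -> float:
    """Return the candidate throughput as a multiple of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return candidate / baseline


# Hypothetical helper structure: speedup of each configuration over BF16.
SPEEDUPS = {
    name: relative_speedup(THROUGHPUT["BF16"], tokens_per_sec)
    for name, tokens_per_sec in THROUGHPUT.items()
}

for name, speedup in SPEEDUPS.items():
    print(f"{name}: {speedup:.3f}x BF16 ({(speedup - 1) * 100:+.1f}%)")
```

By this arithmetic, FP8-current with the first/last two layers in bf16 is roughly 14% faster than BF16, compared with roughly 16% for FP8-delayed, so it keeps most of the FP8 throughput gain.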
1 parent 3d2a96d commit abd0ecb

4 files changed

Lines changed: 315 additions & 80 deletions

bionemo-recipes/recipes/evo2_megatron/.ci_build.sh

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
 # FIXME: Fix for "No such file or directory: /workspace/TransformerEngine"
 # Remove once bug has been addressed in the nvidia/pytorch container.
 rm -f /usr/local/lib/python*/dist-packages/transformer_engine-*.dist-info/direct_url.json
-
+export UV_LOCK_TIMEOUT=900 # increase to 15 minutes (900 seconds), adjust as needed
 export UV_LINK_MODE=copy
 uv venv --system-site-packages
