Commit d91788a
fix memory error on nightly a bnm2712 (#1104)
### Description
- A recent run of the nightly Blossom pipeline failed with the out-of-memory error below.
-
https://prod.blsm.nvidia.com/bionemo-external-bionemo-fw/job/test_pytest/2411/pipeline-console/log?nodeId=90
`[2025-08-29T13:07:20.866Z] E torch.OutOfMemoryError: CUDA out of
memory. Tried to allocate 12.81 GiB. GPU 0 has a total capacity of 39.50
GiB of which 12.39 GiB is free. Process 500601 has 27.10 GiB memory in
use. Of the allocated memory 25.96 GiB is allocated by PyTorch, with
2.00 MiB allocated in private pools (e.g., CUDA Graphs), and 176.45 MiB
is reserved by PyTorch but unallocated. If reserved but unallocated
memory is large try setting
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management
(https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)`
`[2025-08-29T12:42:04.086Z]
sub-packages/bionemo-evo2/tests/bionemo/evo2/test_evo2.py::test_batch_generate[evo2/7b-1m:1.0-get_model_and_tokenizer-expected_matchpercents4]
FAILED [ 51%] `
- The nightly CI pipeline runs on A100-40GB GPUs. See the ticket for more
detail: https://jirasw.nvidia.com/browse/BIONEMO-2712
- This change updates the logic for selecting the memory threshold for the affected tests.
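
For readers unfamiliar with memory-gated tests, the idea can be sketched as follows. This is a minimal illustration, not the code in this commit: the helper names (`free_gpu_memory_gib`, `has_enough_memory`, `skip_if_insufficient_memory`) and the 1 GiB headroom are assumptions made for the example. Only `torch.cuda.is_available`, `torch.cuda.mem_get_info`, and `pytest.skip` are real library calls.

```python
GIB = 1024**3


def free_gpu_memory_gib(device: int = 0) -> float:
    """Return free memory on `device` in GiB, or 0.0 when no CUDA device is usable."""
    try:
        import torch
    except ImportError:
        return 0.0
    if not torch.cuda.is_available():
        return 0.0
    # mem_get_info returns (free_bytes, total_bytes) for the device.
    free_bytes, _total_bytes = torch.cuda.mem_get_info(device)
    return free_bytes / GIB


def has_enough_memory(required_gib: float, free_gib: float, headroom_gib: float = 1.0) -> bool:
    """Pure check: is there room for the requirement plus some headroom (hypothetical default)?"""
    return free_gib >= required_gib + headroom_gib


def skip_if_insufficient_memory(required_gib: float, device: int = 0) -> None:
    """Skip the calling pytest test when the device cannot satisfy `required_gib`."""
    import pytest  # imported lazily so the pure helpers work without pytest installed

    free_gib = free_gpu_memory_gib(device)
    if not has_enough_memory(required_gib, free_gib):
        pytest.skip(f"needs {required_gib:.1f} GiB free, only {free_gib:.1f} GiB available")
```

With the numbers from the log above (a 12.81 GiB allocation against 12.39 GiB free), `has_enough_memory(12.81, 12.39)` is false, so a test gated this way would be skipped instead of failing with an out-of-memory error.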
### Type of changes
### Local pytest results
**1. sub-packages/bionemo-evo2/tests/bionemo/evo2/test_evo2.py - single
H100 80GB device**
-
[pytests_nightly_ci_memory_errors_80gb-device_sub-packages-bionemo-evo2-tests-bionemo-evo2-test_evo2_notslow_br_bnm2712_memory_error_on_nightly_a_20250903T2236_5f4ce5d5.log](https://github.com/user-attachments/files/22128577/pytests_nightly_ci_memory_errors_80gb-device_sub-packages-bionemo-evo2-tests-bionemo-evo2-test_evo2_notslow_br_bnm2712_memory_error_on_nightly_a_20250903T2236_5f4ce5d5.log)
- 18 passed, 2 skipped, 6 deselected, 1655 warnings in 1736.85s
(0:28:56)
- max memory reserved for tensors etc: 41.426 GB
**2. sub-packages/bionemo-evo2/tests/bionemo/evo2/test_evo2.py - single
H100 80GB device, restricted to 40GB**
running
[pytests_nightly_ci_memory_errors_40gb-device_sub-packages-bionemo-evo2-tests-bionemo-evo2-test_evo2_notslow_br_bnm2712_memory_error_on_nightly_a_20250903T2343_5f4ce5d5.log](https://github.com/user-attachments/files/22128575/pytests_nightly_ci_memory_errors_40gb-device_sub-packages-bionemo-evo2-tests-bionemo-evo2-test_evo2_notslow_br_bnm2712_memory_error_on_nightly_a_20250903T2343_5f4ce5d5.log)
<img width="669" height="57" alt="image"
src="https://github.com/user-attachments/assets/1c31d587-0c88-4e59-92e3-da542f1248fb"
/>
**3. sub-packages/bionemo-evo2/tests/bionemo/evo2/test_evo2.py - single
H100 80GB device, restricted to 20GB**
[pytests_nightly_ci_memory_errors_80gb-device_sub-packages-bionemo-evo2-tests-bionemo-evo2-test_evo2_notslow_br_bnm2712_memory_error_on_nightly_a_20250903T2236_5f4ce5d5.log](https://github.com/user-attachments/files/22129350/pytests_nightly_ci_memory_errors_80gb-device_sub-packages-bionemo-evo2-tests-bionemo-evo2-test_evo2_notslow_br_bnm2712_memory_error_on_nightly_a_20250903T2236_5f4ce5d5.log)
<img width="645" height="65" alt="image"
src="https://github.com/user-attachments/assets/bbfb7659-5abb-4528-920f-da85c34e3a7c"
/>
<img width="635" height="155" alt="image"
src="https://github.com/user-attachments/assets/890ffd2c-61a2-4303-a065-e0c833336f8a"
/>
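
The description does not say how the 80GB device was restricted to 40GB and 20GB. One possible way to emulate a smaller device, shown here only as a sketch under that assumption, is PyTorch's `torch.cuda.set_per_process_memory_fraction`. The helper names `memory_fraction` and `restrict_gpu_memory` are hypothetical.

```python
def memory_fraction(limit_gib: float, total_gib: float) -> float:
    """Fraction of the device that corresponds to `limit_gib`, capped at 1.0."""
    if limit_gib <= 0 or total_gib <= 0:
        raise ValueError("limit_gib and total_gib must be positive")
    return min(1.0, limit_gib / total_gib)


def restrict_gpu_memory(limit_gib: float, device: int = 0) -> float:
    """Cap this process's PyTorch allocations on `device` to roughly `limit_gib`."""
    import torch  # imported lazily; this function requires a CUDA-capable PyTorch build

    total_gib = torch.cuda.get_device_properties(device).total_memory / 1024**3
    fraction = memory_fraction(limit_gib, total_gib)
    # Limits the caching allocator for this process; exceeding it raises an OOM error.
    torch.cuda.set_per_process_memory_fraction(fraction, device)
    return fraction
```

Note that this cap applies to PyTorch's caching allocator in the current process only. The driver still reports the full device capacity, so a check based on total device memory would not see the restriction.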
<!-- Mark the relevant option with an [x] -->
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [ ] Documentation update
- [ ] Other (please describe):
### CI Pipeline Configuration
Configure CI behavior by applying the relevant labels:
-
[SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci)
- Skip all continuous integration tests
-
[INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests)
- Execute notebook validation tests in pytest
-
[INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests)
- Execute tests labelled as slow in pytest for extensive testing
> [!NOTE]
> By default, the notebook validation tests are skipped unless
> explicitly enabled.
#### Authorizing CI Runs
We use
[copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation)
to manage authorization of CI
runs on NVIDIA's compute resources.
- If a pull request is opened by a trusted user and contains only
trusted changes, the pull request's code will
automatically be copied to a pull-request/ prefixed branch in the source
repository (e.g. pull-request/123)
- If a pull request is opened by an untrusted user or contains untrusted
changes, an NVIDIA org member must leave an
`/ok to test` comment on the pull request to trigger CI. This will need
to be done for each new commit.
### Usage
<!--- How does a user interact with the changed code -->
```python
# TODO: Add code snippet
```
### Pre-submit Checklist
<!--- Ensure all items are completed before submitting -->
- [wip] I have tested these changes locally
- [na] I have updated the documentation accordingly
- [na] I have added/updated tests as needed
- [wip] All existing tests pass successfully
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
  * Configurable maximum sequence length when loading models/tokenizers,
    applied to inference limits and thresholds.
  * Additional inference options enabled for improved runtime performance,
    including faster decode paths and decode-time optimizations.
* **Tests**
  * Per-test GPU memory gating to skip runs when resources are
    insufficient, driven by test-aware memory requirements.
  * Tests compute per-test sequence-length caps, enable fast decode paths
    where applicable, and record tokens/sec for performance tracking.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
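
As an illustration of the tokens/sec bookkeeping mentioned in the summary, a throughput measurement could look like the sketch below. The function name `tokens_per_second` and the shape of the `generate` callable are assumptions for the example, not the interface used in the tests.

```python
import time
from typing import Callable, Sequence


def tokens_per_second(
    generate: Callable[[], Sequence[Sequence[int]]],
    timer: Callable[[], float] = time.perf_counter,
) -> float:
    """Run `generate` once and return generated tokens per wall-clock second.

    `generate` is a hypothetical callable returning one token-id sequence per prompt.
    """
    start = timer()
    outputs = generate()
    elapsed = timer() - start
    total_tokens = sum(len(seq) for seq in outputs)
    # Guard against a zero-length interval on very coarse timers.
    return total_tokens / elapsed if elapsed > 0 else float("inf")
```

When timing GPU work, the callable should finish with `torch.cuda.synchronize()` so that asynchronous kernels have completed before the clock stops.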
---------
Signed-off-by: Brian Roland <broland@nvidia.com>

1 parent 0284fdd commit d91788a
1 file changed
Lines changed: 118 additions & 37 deletions