
Commit d91788a

fix memory error on nightly a bnm2712 (#1104)
### Description

- A recent run of the nightly Blossom pipeline showed the error below.
  - https://prod.blsm.nvidia.com/bionemo-external-bionemo-fw/job/test_pytest/2411/pipeline-console/log?nodeId=90
  - `[2025-08-29T13:07:20.866Z] E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.81 GiB. GPU 0 has a total capacity of 39.50 GiB of which 12.39 GiB is free. Process 500601 has 27.10 GiB memory in use. Of the allocated memory 25.96 GiB is allocated by PyTorch, with 2.00 MiB allocated in private pools (e.g., CUDA Graphs), and 176.45 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)`
  - `[2025-08-29T12:42:04.086Z] sub-packages/bionemo-evo2/tests/bionemo/evo2/test_evo2.py::test_batch_generate[evo2/7b-1m:1.0-get_model_and_tokenizer-expected_matchpercents4] FAILED [ 51%]`
- The nightly CI pipeline runs on an A100-40GB. See the ticket for more detail: https://jirasw.nvidia.com/browse/BIONEMO-2712
- We update the logic for selecting the memory threshold for the affected tests.

### Local pytest results

**1. sub-packages/bionemo-evo2/tests/bionemo/evo2/test_evo2.py - single H100 80GB device**

- [pytests_nightly_ci_memory_errors_80gb-device_sub-packages-bionemo-evo2-tests-bionemo-evo2-test_evo2_notslow_br_bnm2712_memory_error_on_nightly_a_20250903T2236_5f4ce5d5.log](https://github.com/user-attachments/files/22128577/pytests_nightly_ci_memory_errors_80gb-device_sub-packages-bionemo-evo2-tests-bionemo-evo2-test_evo2_notslow_br_bnm2712_memory_error_on_nightly_a_20250903T2236_5f4ce5d5.log)
- 18 passed, 2 skipped, 6 deselected, 1655 warnings in 1736.85s (0:28:56)
- max memory reserved for tensors etc: 41.426 GB

**2. sub-packages/bionemo-evo2/tests/bionemo/evo2/test_evo2.py - single H100 80GB device, restricted to 40GB**

running [pytests_nightly_ci_memory_errors_40gb-device_sub-packages-bionemo-evo2-tests-bionemo-evo2-test_evo2_notslow_br_bnm2712_memory_error_on_nightly_a_20250903T2343_5f4ce5d5.log](https://github.com/user-attachments/files/22128575/pytests_nightly_ci_memory_errors_40gb-device_sub-packages-bionemo-evo2-tests-bionemo-evo2-test_evo2_notslow_br_bnm2712_memory_error_on_nightly_a_20250903T2343_5f4ce5d5.log)

<img width="669" height="57" alt="image" src="https://github.com/user-attachments/assets/1c31d587-0c88-4e59-92e3-da542f1248fb" />

**3. sub-packages/bionemo-evo2/tests/bionemo/evo2/test_evo2.py - single H100 80GB device, restricted to 20GB**

[pytests_nightly_ci_memory_errors_80gb-device_sub-packages-bionemo-evo2-tests-bionemo-evo2-test_evo2_notslow_br_bnm2712_memory_error_on_nightly_a_20250903T2236_5f4ce5d5.log](https://github.com/user-attachments/files/22129350/pytests_nightly_ci_memory_errors_80gb-device_sub-packages-bionemo-evo2-tests-bionemo-evo2-test_evo2_notslow_br_bnm2712_memory_error_on_nightly_a_20250903T2236_5f4ce5d5.log)

<img width="645" height="65" alt="image" src="https://github.com/user-attachments/assets/bbfb7659-5abb-4528-920f-da85c34e3a7c" />
<img width="635" height="155" alt="image" src="https://github.com/user-attachments/assets/890ffd2c-61a2-4303-a065-e0c833336f8a" />

### Type of changes

<!-- Mark the relevant option with an [x] -->

- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [ ] Documentation update
- [ ] Other (please describe):

### CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

- [SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci) - Skip all continuous integration tests
- [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest
- [INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests) - Execute tests labelled as slow in pytest for extensive testing

> [!NOTE]
> By default, the notebooks validation tests are skipped unless explicitly enabled.

#### Authorizing CI Runs

We use [copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation) to manage authorization of CI runs on NVIDIA's compute resources.

- If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
- If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an `/ok to test` comment on the pull request to trigger CI. This will need to be done for each new commit.

### Usage

<!--- How does a user interact with the changed code -->

```python
# TODO: Add code snippet
```

### Pre-submit Checklist

<!--- Ensure all items are completed before submitting -->

- [wip] I have tested these changes locally
- [na] I have updated the documentation accordingly
- [na] I have added/updated tests as needed
- [wip] All existing tests pass successfully

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Configurable maximum sequence length when loading models/tokenizers, applied to inference limits and thresholds.
  * Additional inference options enabled for improved runtime performance, including faster decode paths and decode-time optimizations.
* **Tests**
  * Per-test GPU memory gating to skip runs when resources are insufficient, driven by test-aware memory requirements.
  * Tests compute per-test sequence-length caps, enable fast decode paths where applicable, and record tokens/sec for performance tracking.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Brian Roland <broland@nvidia.com>
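The per-test memory gating described above can be sketched without pandas or a GPU. The snippet below is a minimal illustration, not the code in this commit: `MEMORY_TABLE`, `model_size_from_ckpt`, `lookup_requirement`, and `should_skip` are hypothetical names, the table holds only a subset of the values from the diff, and the free-memory figure is passed in as a plain number rather than read from `torch.cuda.mem_get_info()`.

```python
# Hypothetical sketch of per-test GPU memory gating; all names are illustrative.
# Values are a subset copied from the table added in this commit:
# (test_name, model_size) -> (seq_len_cap, memory_needed_gb).
MEMORY_TABLE = {
    ("test_forward", "1b"): (6000, 18),
    ("test_forward", "7b"): (4000, 33),
    ("test_batch_generate", "1b"): (-1, 16),
    ("test_batch_generate", "7b"): (-1, 43),
}


def model_size_from_ckpt(ckpt_name: str) -> str:
    """Derive the model size from a substring of the checkpoint name."""
    for size in ("1b", "7b"):
        if size in ckpt_name:
            return size
    raise ValueError(f"{ckpt_name=} is not supported for testing")


def lookup_requirement(ckpt_name: str, test_name: str) -> tuple[int, int]:
    """Return (seq_len_cap, memory_needed_gb) for a checkpoint/test pair."""
    return MEMORY_TABLE[(test_name, model_size_from_ckpt(ckpt_name))]


def should_skip(ckpt_name: str, test_name: str, gb_available: float) -> bool:
    """Return True when free GPU memory is below the recorded requirement."""
    _, memory_needed_gb = lookup_requirement(ckpt_name, test_name)
    return gb_available < memory_needed_gb
```

With roughly 39.5 GiB free, as on the A100-40GB from the error log, `should_skip("evo2/7b-1m:1.0", "test_batch_generate", 39.5)` is true because that row records 43 GB, while the 1b row for the same test (16 GB) still runs. A missing `(test_name, model_size)` pair raises `KeyError` in this sketch, so an unregistered test fails loudly instead of being silently skipped.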
1 parent 0284fdd commit d91788a

1 file changed

Lines changed: 118 additions & 37 deletions

sub-packages/bionemo-evo2/tests/bionemo/evo2/test_evo2.py

@@ -16,15 +16,18 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

+import inspect
 import logging
 import os
 import time
 from pathlib import Path
 from typing import Any, Callable, Literal

 import numpy as np
+import pandas as pd
 import pytest
 import torch
+from megatron.core.inference.common_inference_params import CommonInferenceParams
 from megatron.core.transformer.enums import AttnBackend
 from megatron.core.transformer.module import Float16Module
 from nemo.collections import llm
@@ -48,32 +51,103 @@
 logger.setLevel(logging.DEBUG)  # Capture all levels in the logger itself


-def determine_memory_requirement_and_skip_if_not_met(ckpt_name: str, flash_decode: bool | None = None) -> int:
-    """Determine the memory requirement for a given checkpoint and flash decode condition.
-    ckpt_name : str
-        the name of the checkpoint to test
-    flash_decode: bool | None
-        whether to test with flash decode
+def determine_memory_requirement_and_skip_if_not_met(ckpt_name: str, test_name: str | None = None) -> int:
+    """Determine the memory requirement for a given checkpoint and test_name.
+
+    The memory requirement recorded is not discriminated for flash_decode True or False. The memory requirement
+    recorded depend on checkpoint name only through model size.
+
+    Args:
+        ckpt_name: str
+            the name of the checkpoint to test
+        test_name: str | None
+            the name of the test that is to be run.
     Returns:
         The input sequence length cap, for the model sin the checkpoint, given certain memory requirements.
         If the memory requirement is not met, the test is skipped.
     """

+    # memory_needed_by_test: max reserved rounded up + 1, for stand-alone test
+    memory_needed_df = pd.DataFrame(
+        [
+            {
+                "test_name": "test_forward",
+                "model_size": "1b",
+                "seq_len_cap": 6000,
+                "memory_needed_by_test": 18,
+            },  # checked both variants in isolation
+            {
+                "test_name": "test_forward",
+                "model_size": "7b",
+                "seq_len_cap": 4000,
+                "memory_needed_by_test": 33,
+            },  # checked both variants in isolation
+            {
+                "test_name": "test_forward_manual",
+                "model_size": "1b",
+                "seq_len_cap": 6000,
+                "memory_needed_by_test": 18,
+            },  # checked both variants in isolation
+            {
+                "test_name": "test_forward_manual",
+                "model_size": "7b",
+                "seq_len_cap": 4000,
+                "memory_needed_by_test": 21,
+            },  # checked both variants in isolation
+            {
+                "test_name": "test_batch_generate",
+                "model_size": "1b",
+                "seq_len_cap": -1,
+                "memory_needed_by_test": 16,
+            },  # checked both variants in isolation
+            {
+                "test_name": "test_batch_generate",
+                "model_size": "7b",
+                "seq_len_cap": -1,
+                "memory_needed_by_test": 43,
+            },  # checked both variants in isolation
+            {
+                "test_name": "test_batch_generate_coding_sequences",
+                "model_size": "1b",
+                "seq_len_cap": -1,
+                "memory_needed_by_test": 6,
+            },  # checked both variants in isolation
+            {
+                "test_name": "test_batch_generate_coding_sequences",
+                "model_size": "7b",
+                "seq_len_cap": -1,
+                "memory_needed_by_test": 21,
+            },  # checked both variants in isolation
+            {
+                "test_name": "test_generate_speed",
+                "model_size": "1b",
+                "seq_len_cap": -1,
+                "memory_needed_by_test": -1,
+            },  # skipped for now until Anton's changes
+            {
+                "test_name": "test_generate_speed",
+                "model_size": "7b",
+                "seq_len_cap": -1,
+                "memory_needed_by_test": -1,
+            },  # skipped for now until Anton's changes
+        ],
+        columns=["test_name", "model_size", "seq_len_cap", "memory_needed_by_test"],
+    )
+    memory_needed_df_wi_index = memory_needed_df.set_index(["test_name", "model_size"])
+
     if "1b" in ckpt_name:
         model_size = "1b"
-        seq_len_cap = 6000
-        memory_needed_by_test = 17  # max reserved rounded up, for stand-alone test
     elif "7b" in ckpt_name:
         model_size = "7b"
-        seq_len_cap = 4000
-        memory_needed_by_test = 32  # max reserved rounded up, for stand-alone test
     else:
         raise ValueError(f"{ckpt_name=} is not supported for testing")

-    skip_condition_flash = flash_decode is None or flash_decode
-    gb_available = torch.cuda.mem_get_info()[0] / 1024**3
-    skip_condition = gb_available < memory_needed_by_test and skip_condition_flash
+    seq_len_cap = memory_needed_df_wi_index.loc[(test_name, model_size), "seq_len_cap"]
+    memory_needed_by_test = memory_needed_df_wi_index.loc[(test_name, model_size), "memory_needed_by_test"]

+    # skip_condition_flash = flash_decode is None or flash_decode
+    gb_available = torch.cuda.mem_get_info()[0] / 1024**3
+    skip_condition = gb_available < memory_needed_by_test
     if skip_condition:
         pytest.skip(
             ", ".join(
@@ -328,7 +402,8 @@ def get_trainer(pipeline_parallel=1):
     )


-def get_model_and_tokenizer_raw(ckpt_dir_or_name: Path | str, **kwargs):
+# here: pass arg through to inference_batch_times_seqlen_threshold and inference_max_seq_length
+def get_model_and_tokenizer_raw(ckpt_dir_or_name: Path | str, seq_len_max: int = 8192, **kwargs):
     """
     Load a model and tokenizer from a checkpoint directory or name. If you supply a Path argument then we assume that
     the path is already a checkpoint directory, otherwise we load the checkpoint from NGC or PBSS depending on
@@ -347,8 +422,8 @@ def get_model_and_tokenizer_raw(ckpt_dir_or_name: Path | str, **kwargs):
         path=ckpt_dir,
         trainer=trainer,
         params_dtype=torch.bfloat16,
-        inference_batch_times_seqlen_threshold=8192,  # TODO
-        inference_max_seq_length=8192,  # TODO
+        inference_batch_times_seqlen_threshold=seq_len_max,
+        inference_max_seq_length=seq_len_max,
         recompute_granularity=None,
         recompute_num_layers=None,
         recompute_method=None,
@@ -357,13 +432,13 @@ def get_model_and_tokenizer_raw(ckpt_dir_or_name: Path | str, **kwargs):
     return inference_wrapped_model, mcore_tokenizer


-def get_model_and_tokenizer(ckpt_name, vortex_style_fp8=False, **kwargs):
-    return get_model_and_tokenizer_raw(ckpt_name, vortex_style_fp8=vortex_style_fp8, **kwargs)
+def get_model_and_tokenizer(ckpt_name, vortex_style_fp8=False, seq_len_max: int = 8192, **kwargs):
+    return get_model_and_tokenizer_raw(ckpt_name, vortex_style_fp8=vortex_style_fp8, seq_len_max=seq_len_max, **kwargs)


-def get_model_and_tokenizer_ignore_vortex(ckpt_name, vortex_style_fp8=False, **kwargs):
+def get_model_and_tokenizer_ignore_vortex(ckpt_name, vortex_style_fp8=False, seq_len_max: int = 8192, **kwargs):
     # Capture and remove the vortex_style_fp8 argument for mamba models.
-    return get_model_and_tokenizer_raw(ckpt_name, **kwargs)
+    return get_model_and_tokenizer_raw(ckpt_name, seq_len_max=seq_len_max, **kwargs)


 def calc_matchrate(*, tokenizer, in_seq, logits):
@@ -404,7 +479,9 @@ def check_matchrate(*, ckpt_name, matchrate, assert_matchrate=True):
 )
 def test_forward(sequences: list[str], ckpt_name: str, expected_matchpercents: list[float]):
     assert len(sequences) > 0
-    seq_len_cap = determine_memory_requirement_and_skip_if_not_met(ckpt_name)
+    seq_len_cap = determine_memory_requirement_and_skip_if_not_met(
+        ckpt_name, test_name=inspect.currentframe().f_code.co_name
+    )

     is_fp8_supported, compute_capability, device_info = check_fp8_support(torch.cuda.current_device())
     skip = "evo2/1b-8k:" in ckpt_name and not is_fp8_supported
@@ -463,7 +540,9 @@ def test_forward(sequences: list[str], ckpt_name: str, expected_matchpercents: l
 )
 def test_forward_manual(sequences: list[str], ckpt_name: str, expected_matchpercents: list[float], flash_decode: bool):
     assert len(sequences) > 0
-    seq_len_cap = determine_memory_requirement_and_skip_if_not_met(ckpt_name, flash_decode)
+    seq_len_cap = determine_memory_requirement_and_skip_if_not_met(
+        ckpt_name, test_name=inspect.currentframe().f_code.co_name
+    )

     is_fp8_supported, compute_capability, device_info = check_fp8_support(torch.cuda.current_device())
     skip = "evo2/1b-8k:" in ckpt_name and not is_fp8_supported
@@ -572,14 +651,14 @@ def calculate_sequence_identity(seq1: str, seq2: str) -> float | None:
         ("evo2/1b-8k:1.0", get_model_and_tokenizer, [96.8, 29.7, 76.6, 71.6]),
         ("evo2_mamba/7b-8k:0.1", get_model_and_tokenizer_ignore_vortex, [99.2, 51.0, 73.0, 82.6]),
         ("evo2/7b-8k:1.0", get_model_and_tokenizer, [97.60, 89.63, 80.03, 84.57]),
-        # ("evo2/7b-1m:1.0", get_model_and_tokenizer, [97.60, 89.63, 80.03, 84.57]),
+        ("evo2/7b-1m:1.0", get_model_and_tokenizer, [97.60, 89.63, 80.03, 84.57]),
     ],
 )
 def test_batch_generate(
     sequences: list[str], ckpt_name: str, model_tokenizer_provider: Callable, expected_matchpercents: list[float]
 ):
     assert len(sequences) > 0
-    determine_memory_requirement_and_skip_if_not_met(ckpt_name)
+    _ = determine_memory_requirement_and_skip_if_not_met(ckpt_name, test_name=inspect.currentframe().f_code.co_name)

     is_fp8_supported, compute_capability, device_info = check_fp8_support(torch.cuda.current_device())
     skip = "evo2/1b-8k:" in ckpt_name and not is_fp8_supported
@@ -591,12 +670,15 @@ def test_batch_generate(
         pytest.skip(f"Skipping {ckpt_name} because it is not on NGC yet. Run with `BIONEMO_DATA_SOURCE=pbss`.")
     # only use vortex_style_fp8 for non-bf16 checkpoints with fp8 support
     vortex_style_fp8 = is_fp8_supported and "bf16" not in ckpt_name
-    inference_wrapped_model, mcore_tokenizer = model_tokenizer_provider(ckpt_name, vortex_style_fp8=vortex_style_fp8)

-    match_percents = []
     num_tokens = 500
     seq_prompts = [mid_point_split(seq=seq, num_tokens=num_tokens) for seq in sequences]
-    from megatron.core.inference.common_inference_params import CommonInferenceParams
+    seq_len_max = num_tokens + max([len(sq[0]) for sq in seq_prompts])
+    inference_wrapped_model, mcore_tokenizer = model_tokenizer_provider(
+        ckpt_name,
+        vortex_style_fp8=vortex_style_fp8,
+        seq_len_max=seq_len_max,
+    )

     results = generate(
         model=inference_wrapped_model,
@@ -613,6 +695,7 @@ def test_batch_generate(
         ),
     )

+    match_percents = []
     for i, (result, (prompt, target)) in enumerate(zip(results, seq_prompts)):
         gen_seq = result.generated_text
         logging.info(f"{ckpt_name} {torch.distributed.get_rank()=} {gen_seq=}")
@@ -638,7 +721,7 @@ def test_batch_generate(
         ("evo2/1b-8k:1.0", get_model_and_tokenizer, [86.4, 78.8, 87.6]),
         ("evo2_mamba/7b-8k:0.1", get_model_and_tokenizer_ignore_vortex, [86.5, 88.4, 88.2]),
         ("evo2/7b-8k:1.0", get_model_and_tokenizer, [88.8, 88.5, 82.2]),
-        # ("evo2/7b-1m:1.0", get_model_and_tokenizer, [88.8, 88.5, 82.2]),
+        ("evo2/7b-1m:1.0", get_model_and_tokenizer, [88.8, 88.5, 82.2]),
     ],
 )
 def test_batch_generate_coding_sequences(
@@ -648,7 +731,7 @@ def test_batch_generate_coding_sequences(
     expected_matchpercents: list[float],
 ):
     assert len(coding_sequences) > 0
-    determine_memory_requirement_and_skip_if_not_met(ckpt_name)
+    determine_memory_requirement_and_skip_if_not_met(ckpt_name, test_name=inspect.currentframe().f_code.co_name)

     is_fp8_supported, compute_capability, device_info = check_fp8_support(torch.cuda.current_device())
     skip = "evo2/1b-8k:" in ckpt_name and not is_fp8_supported
@@ -660,16 +743,16 @@ def test_batch_generate_coding_sequences(
         pytest.skip(f"Skipping {ckpt_name} because it is not on NGC yet. Run with `BIONEMO_DATA_SOURCE=pbss`.")
     # only use vortex_style_fp8 for non-bf16 checkpoints with fp8 support
     vortex_style_fp8 = is_fp8_supported and "bf16" not in ckpt_name
-    inference_wrapped_model, mcore_tokenizer = model_tokenizer_provider(
-        ckpt_name, vortex_style_fp8=vortex_style_fp8, enable_flash_decode=True, flash_decode=True
-    )

     match_percents: list[float] = []
     cds_lengths: list[int | None] = []
     original_cds_lengths: list[int] = [len(seq) for seq in coding_sequences]
     seq_prompts = [mid_point_split(seq=seq, num_tokens=None, fraction=0.3) for seq in coding_sequences]
     num_tokens = max(len(sq[1]) for sq in seq_prompts) + 15
-    from megatron.core.inference.common_inference_params import CommonInferenceParams
+
+    inference_wrapped_model, mcore_tokenizer = model_tokenizer_provider(
+        ckpt_name, vortex_style_fp8=vortex_style_fp8, enable_flash_decode=True, flash_decode=True
+    )

     _ = generate(
         model=inference_wrapped_model,
@@ -748,7 +831,7 @@ def test_batch_generate_coding_sequences(
         ("evo2/1b-8k:1.0", get_model_and_tokenizer, 41.0),
         ("evo2_mamba/7b-8k:0.1", get_model_and_tokenizer_ignore_vortex, 39.73),
         ("evo2/7b-8k:1.0", get_model_and_tokenizer, 32.0),
-        # ("evo2/7b-1m:1.0", get_model_and_tokenizer, 32.0),
+        ("evo2/7b-1m:1.0", get_model_and_tokenizer, 32.0),
     ],
 )
 def test_generate_speed(
@@ -757,7 +840,7 @@ def test_generate_speed(
     expected_tokens_sec: float,
 ):
     is_fp8_supported, compute_capability, device_info = check_fp8_support(torch.cuda.current_device())
-    determine_memory_requirement_and_skip_if_not_met(ckpt_name)
+    determine_memory_requirement_and_skip_if_not_met(ckpt_name, test_name=inspect.currentframe().f_code.co_name)

     skip = "evo2/1b-8k:" in ckpt_name and not is_fp8_supported
     if skip:
@@ -776,8 +859,6 @@ def test_generate_speed(
         flash_decode=True,
     )

-    from megatron.core.inference.common_inference_params import CommonInferenceParams
-
     # warm up the model with a single call before timing. This should take care of compilation etc.
     _ = generate(
         model=inference_wrapped_model,
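
Each test in the diff passes its own name to the helper via `inspect.currentframe().f_code.co_name`. A minimal sketch of that idiom follows; `current_function_name` and `test_example` are hypothetical names used only for illustration and do not appear in the commit.

```python
import inspect


def current_function_name() -> str:
    """Return the name of the function that calls this helper."""
    frame = inspect.currentframe()
    # currentframe() can return None on interpreters without stack frame support.
    if frame is None or frame.f_back is None:
        raise RuntimeError("stack frames are not available")
    # f_back is the caller's frame; its code object carries the function name.
    return frame.f_back.f_code.co_name


def test_example() -> str:
    # Same idiom as in the diff: the running frame's code object knows its name.
    inline_name = inspect.currentframe().f_code.co_name
    # Equivalent lookup through the helper, which walks one frame back.
    helper_name = current_function_name()
    assert inline_name == helper_name
    return inline_name
```

The name obtained this way is the plain function name without any pytest parametrization suffix, so every parametrized case of a test resolves to the same `test_name` key, which fits a lookup keyed by test name and model size.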
