fix memory error on nightly a bnm2712 (#1104)

broland-hat · web-flow · commit d91788a700a9 · 2025-09-04T20:13:40.000Z
### Description - A recent run of the nightly blossum pipeline showed the error below. - https://prod.blsm.nvidia.com/bionemo-external-bionemo-fw/job/test_pytest/2411/pipeline-console/log?nodeId=90 `[2025-08-29T13:07:20.866Z] E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.81 GiB. GPU 0 has a total capacity of 39.50 GiB of which 12.39 GiB is free. Process 500601 has 27.10 GiB memory in use. Of the allocated memory 25.96 GiB is allocated by PyTorch, with 2.00 MiB allocated in private pools (e.g., CUDA Graphs), and 176.45 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)` `[2025-08-29T12:42:04.086Z] sub-packages/bionemo-evo2/tests/bionemo/evo2/test_evo2.py::test_batch_generate[evo2/7b-1m:1.0-get_model_and_tokenizer-expected_matchpercents4] FAILED [ 51%] ` - The nightline CI pipeline is run on A100-40GB. See ticket for more detail https://jirasw.nvidia.com/browse/BIONEMO-2712, - We update the logic for selecting memory threshold for affected tests. ### Type of changes ### Local pytest results **1. sub-packages/bionemo-evo2/tests/bionemo/evo2/test_evo2.py - single H100 80GB device** - [pytests_nightly_ci_memory_errors_80gb-device_sub-packages-bionemo-evo2-tests-bionemo-evo2-test_evo2_notslow_br_bnm2712_memory_error_on_nightly_a_20250903T2236_5f4ce5d5.log](https://github.com/user-attachments/files/22128577/pytests_nightly_ci_memory_errors_80gb-device_sub-packages-bionemo-evo2-tests-bionemo-evo2-test_evo2_notslow_br_bnm2712_memory_error_on_nightly_a_20250903T2236_5f4ce5d5.log) - 18 passed, 2 skipped, 6 deselected, 1655 warnings in 1736.85s (0:28:56) - max memory reserved for tensors etc: 41.426 GB **2. sub-packages/bionemo-evo2/tests/bionemo/evo2/test_evo2.py - single H100 80GB device, restricted to 40GB** running [pytests_nightly_ci_memory_errors_40gb-device_sub-packages-bionemo-evo2-tests-bionemo-evo2-test_evo2_notslow_br_bnm2712_memory_error_on_nightly_a_20250903T2343_5f4ce5d5.log](https://github.com/user-attachments/files/22128575/pytests_nightly_ci_memory_errors_40gb-device_sub-packages-bionemo-evo2-tests-bionemo-evo2-test_evo2_notslow_br_bnm2712_memory_error_on_nightly_a_20250903T2343_5f4ce5d5.log) <img width="669" height="57" alt="image" src="https://github.com/user-attachments/assets/1c31d587-0c88-4e59-92e3-da542f1248fb" /> **3. sub-packages/bionemo-evo2/tests/bionemo/evo2/test_evo2.py - single H100 80GB device, restricted to 20GB** [pytests_nightly_ci_memory_errors_80gb-device_sub-packages-bionemo-evo2-tests-bionemo-evo2-test_evo2_notslow_br_bnm2712_memory_error_on_nightly_a_20250903T2236_5f4ce5d5.log](https://github.com/user-attachments/files/22129350/pytests_nightly_ci_memory_errors_80gb-device_sub-packages-bionemo-evo2-tests-bionemo-evo2-test_evo2_notslow_br_bnm2712_memory_error_on_nightly_a_20250903T2236_5f4ce5d5.log) <img width="645" height="65" alt="image" src="https://github.com/user-attachments/assets/bbfb7659-5abb-4528-920f-da85c34e3a7c" /> <img width="635" height="155" alt="image" src="https://github.com/user-attachments/assets/890ffd2c-61a2-4303-a065-e0c833336f8a" />  - [x] Bug fix (non-breaking change which fixes an issue) - [ ] New feature (non-breaking change which adds functionality) - [ ] Refactor - [ ] Documentation update - [ ] Other (please describe): ### CI Pipeline Configuration Configure CI behavior by applying the relevant labels: - [SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci) - Skip all continuous integration tests - [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest - [INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests) - Execute tests labelled as slow in pytest for extensive testing > [!NOTE] > By default, the notebooks validation tests are skipped unless explicitly enabled. #### Authorizing CI Runs We use [copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation) to manage authorization of CI runs on NVIDIA's compute resources. - If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123) - If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an `/ok to test` comment on the pull request to trigger CI. This will need to be done for each new commit. ### Usage  ```python # TODO: Add code snippet ``` ### Pre-submit Checklist  - [wip] I have tested these changes locally - [na] I have updated the documentation accordingly - [na] I have added/updated tests as needed - [wip] All existing tests pass successfully  ## Summary by CodeRabbit * **New Features** * Configurable maximum sequence length when loading models/tokenizers, applied to inference limits and thresholds. * Additional inference options enabled for improved runtime performance, including faster decode paths and decode-time optimizations. * **Tests** * Per-test GPU memory gating to skip runs when resources are insufficient, driven by test-aware memory requirements. * Tests compute per-test sequence-length caps, enable fast decode paths where applicable, and record tokens/sec for performance tracking.  --------- Signed-off-by: Brian Roland <broland@nvidia.com>
diff --git a/sub-packages/bionemo-evo2/tests/bionemo/evo2/test_evo2.py b/sub-packages/bionemo-evo2/tests/bionemo/evo2/test_evo2.py
@@ -16,15 +16,18 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+import inspect
 import logging
 import os
 import time
 from pathlib import Path
 from typing import Any, Callable, Literal
 
 import numpy as np
+import pandas as pd
 import pytest
 import torch
+from megatron.core.inference.common_inference_params import CommonInferenceParams
 from megatron.core.transformer.enums import AttnBackend
 from megatron.core.transformer.module import Float16Module
 from nemo.collections import llm
@@ -48,32 +51,103 @@
 logger.setLevel(logging.DEBUG)  # Capture all levels in the logger itself
 
 
-def determine_memory_requirement_and_skip_if_not_met(ckpt_name: str, flash_decode: bool | None = None) -> int:
-    """Determine the memory requirement for a given checkpoint and flash decode condition.
-    ckpt_name : str
-        the name of the checkpoint to test
-    flash_decode: bool | None
-        whether to test with flash decode
+def determine_memory_requirement_and_skip_if_not_met(ckpt_name: str, test_name: str | None = None) -> int:
+    """Determine the memory requirement for a given checkpoint and test_name.
+
+    The memory requirement recorded is not discriminated for flash_decode True or False.  The memory requirement
+    recorded depend on checkpoint name only through model size.
+
+    Args:
+        ckpt_name: str
+            the name of the checkpoint to test
+        test_name: str | None
+            the name of the test that is to be run.
     Returns:
         The input sequence length cap, for the model sin the checkpoint, given certain memory requirements.
         If the memory requirement is not met, the test is skipped.
     """
 
+    # memory_needed_by_test: max reserved rounded up + 1, for stand-alone test
+    memory_needed_df = pd.DataFrame(
+        [
+            {
+                "test_name": "test_forward",
+                "model_size": "1b",
+                "seq_len_cap": 6000,
+                "memory_needed_by_test": 18,
+            },  # checked both variants in isolation
+            {
+                "test_name": "test_forward",
+                "model_size": "7b",
+                "seq_len_cap": 4000,
+                "memory_needed_by_test": 33,
+            },  # checked both variants in isolation
+            {
+                "test_name": "test_forward_manual",
+                "model_size": "1b",
+                "seq_len_cap": 6000,
+                "memory_needed_by_test": 18,
+            },  # checked both variants in isolation
+            {
+                "test_name": "test_forward_manual",
+                "model_size": "7b",
+                "seq_len_cap": 4000,
+                "memory_needed_by_test": 21,
+            },  # checked both variants in isolation
+            {
+                "test_name": "test_batch_generate",
+                "model_size": "1b",
+                "seq_len_cap": -1,
+                "memory_needed_by_test": 16,
+            },  # checked both variants in isolation
+            {
+                "test_name": "test_batch_generate",
+                "model_size": "7b",
+                "seq_len_cap": -1,
+                "memory_needed_by_test": 43,
+            },  # checked both variants in isolation
+            {
+                "test_name": "test_batch_generate_coding_sequences",
+                "model_size": "1b",
+                "seq_len_cap": -1,
+                "memory_needed_by_test": 6,
+            },  # checked both variants in isolation
+            {
+                "test_name": "test_batch_generate_coding_sequences",
+                "model_size": "7b",
+                "seq_len_cap": -1,
+                "memory_needed_by_test": 21,
+            },  # checked both variants in isolation
+            {
+                "test_name": "test_generate_speed",
+                "model_size": "1b",
+                "seq_len_cap": -1,
+                "memory_needed_by_test": -1,
+            },  # skipped for now until Anton's changes
+            {
+                "test_name": "test_generate_speed",
+                "model_size": "7b",
+                "seq_len_cap": -1,
+                "memory_needed_by_test": -1,
+            },  # skipped for now until Anton's changes
+        ],
+        columns=["test_name", "model_size", "seq_len_cap", "memory_needed_by_test"],
+    )
+    memory_needed_df_wi_index = memory_needed_df.set_index(["test_name", "model_size"])
+
     if "1b" in ckpt_name:
         model_size = "1b"
-        seq_len_cap = 6000
-        memory_needed_by_test = 17  # max reserved rounded up, for stand-alone test
     elif "7b" in ckpt_name:
         model_size = "7b"
-        seq_len_cap = 4000
-        memory_needed_by_test = 32  # max reserved rounded up, for stand-alone test
     else:
         raise ValueError(f"{ckpt_name=} is not supported for testing")
 
-    skip_condition_flash = flash_decode is None or flash_decode
-    gb_available = torch.cuda.mem_get_info()[0] / 1024**3
-    skip_condition = gb_available < memory_needed_by_test and skip_condition_flash
+    seq_len_cap = memory_needed_df_wi_index.loc[(test_name, model_size), "seq_len_cap"]
+    memory_needed_by_test = memory_needed_df_wi_index.loc[(test_name, model_size), "memory_needed_by_test"]
 
+    # skip_condition_flash = flash_decode is None or flash_decode
+    gb_available = torch.cuda.mem_get_info()[0] / 1024**3
+    skip_condition = gb_available < memory_needed_by_test
     if skip_condition:
         pytest.skip(
             ", ".join(
@@ -328,7 +402,8 @@ def get_trainer(pipeline_parallel=1):
     )
 
 
-def get_model_and_tokenizer_raw(ckpt_dir_or_name: Path | str, **kwargs):
+# here: pass arg through to inference_batch_times_seqlen_threshold and inference_max_seq_length
+def get_model_and_tokenizer_raw(ckpt_dir_or_name: Path | str, seq_len_max: int = 8192, **kwargs):
     """
     Load a model and tokenizer from a checkpoint directory or name. If you supply a Path argument then we assume that
     the path is already a checkpoint directory, otherwise we load the checkpoint from NGC or PBSS depending on
@@ -347,8 +422,8 @@ def get_model_and_tokenizer_raw(ckpt_dir_or_name: Path | str, **kwargs):
         path=ckpt_dir,
         trainer=trainer,
         params_dtype=torch.bfloat16,
-        inference_batch_times_seqlen_threshold=8192,  # TODO
-        inference_max_seq_length=8192,  # TODO
+        inference_batch_times_seqlen_threshold=seq_len_max,
+        inference_max_seq_length=seq_len_max,
         recompute_granularity=None,
         recompute_num_layers=None,
         recompute_method=None,
@@ -357,13 +432,13 @@ def get_model_and_tokenizer_raw(ckpt_dir_or_name: Path | str, **kwargs):
     return inference_wrapped_model, mcore_tokenizer
 
 
-def get_model_and_tokenizer(ckpt_name, vortex_style_fp8=False, **kwargs):
-    return get_model_and_tokenizer_raw(ckpt_name, vortex_style_fp8=vortex_style_fp8, **kwargs)
+def get_model_and_tokenizer(ckpt_name, vortex_style_fp8=False, seq_len_max: int = 8192, **kwargs):
+    return get_model_and_tokenizer_raw(ckpt_name, vortex_style_fp8=vortex_style_fp8, seq_len_max=seq_len_max, **kwargs)
 
 
-def get_model_and_tokenizer_ignore_vortex(ckpt_name, vortex_style_fp8=False, **kwargs):
+def get_model_and_tokenizer_ignore_vortex(ckpt_name, vortex_style_fp8=False, seq_len_max: int = 8192, **kwargs):
     # Capture and remove the vortex_style_fp8 argument for mamba models.
-    return get_model_and_tokenizer_raw(ckpt_name, **kwargs)
+    return get_model_and_tokenizer_raw(ckpt_name, seq_len_max=seq_len_max, **kwargs)
 
 
 def calc_matchrate(*, tokenizer, in_seq, logits):
@@ -404,7 +479,9 @@ def check_matchrate(*, ckpt_name, matchrate, assert_matchrate=True):
 )
 def test_forward(sequences: list[str], ckpt_name: str, expected_matchpercents: list[float]):
     assert len(sequences) > 0
-    seq_len_cap = determine_memory_requirement_and_skip_if_not_met(ckpt_name)
+    seq_len_cap = determine_memory_requirement_and_skip_if_not_met(
+        ckpt_name, test_name=inspect.currentframe().f_code.co_name
+    )
 
     is_fp8_supported, compute_capability, device_info = check_fp8_support(torch.cuda.current_device())
     skip = "evo2/1b-8k:" in ckpt_name and not is_fp8_supported
@@ -463,7 +540,9 @@ def test_forward(sequences: list[str], ckpt_name: str, expected_matchpercents: l
 )
 def test_forward_manual(sequences: list[str], ckpt_name: str, expected_matchpercents: list[float], flash_decode: bool):
     assert len(sequences) > 0
-    seq_len_cap = determine_memory_requirement_and_skip_if_not_met(ckpt_name, flash_decode)
+    seq_len_cap = determine_memory_requirement_and_skip_if_not_met(
+        ckpt_name, test_name=inspect.currentframe().f_code.co_name
+    )
 
     is_fp8_supported, compute_capability, device_info = check_fp8_support(torch.cuda.current_device())
     skip = "evo2/1b-8k:" in ckpt_name and not is_fp8_supported
@@ -572,14 +651,14 @@ def calculate_sequence_identity(seq1: str, seq2: str) -> float | None:
         ("evo2/1b-8k:1.0", get_model_and_tokenizer, [96.8, 29.7, 76.6, 71.6]),
         ("evo2_mamba/7b-8k:0.1", get_model_and_tokenizer_ignore_vortex, [99.2, 51.0, 73.0, 82.6]),
         ("evo2/7b-8k:1.0", get_model_and_tokenizer, [97.60, 89.63, 80.03, 84.57]),
-        # ("evo2/7b-1m:1.0", get_model_and_tokenizer, [97.60, 89.63, 80.03, 84.57]),
+        ("evo2/7b-1m:1.0", get_model_and_tokenizer, [97.60, 89.63, 80.03, 84.57]),
     ],
 )
 def test_batch_generate(
     sequences: list[str], ckpt_name: str, model_tokenizer_provider: Callable, expected_matchpercents: list[float]
 ):
     assert len(sequences) > 0
-    determine_memory_requirement_and_skip_if_not_met(ckpt_name)
+    _ = determine_memory_requirement_and_skip_if_not_met(ckpt_name, test_name=inspect.currentframe().f_code.co_name)
 
     is_fp8_supported, compute_capability, device_info = check_fp8_support(torch.cuda.current_device())
     skip = "evo2/1b-8k:" in ckpt_name and not is_fp8_supported
@@ -591,12 +670,15 @@ def test_batch_generate(
         pytest.skip(f"Skipping {ckpt_name} because it is not on NGC yet. Run with `BIONEMO_DATA_SOURCE=pbss`.")
     # only use vortex_style_fp8 for non-bf16 checkpoints with fp8 support
     vortex_style_fp8 = is_fp8_supported and "bf16" not in ckpt_name
-    inference_wrapped_model, mcore_tokenizer = model_tokenizer_provider(ckpt_name, vortex_style_fp8=vortex_style_fp8)
 
-    match_percents = []
     num_tokens = 500
     seq_prompts = [mid_point_split(seq=seq, num_tokens=num_tokens) for seq in sequences]
-    from megatron.core.inference.common_inference_params import CommonInferenceParams
+    seq_len_max = num_tokens + max([len(sq[0]) for sq in seq_prompts])
+    inference_wrapped_model, mcore_tokenizer = model_tokenizer_provider(
+        ckpt_name,
+        vortex_style_fp8=vortex_style_fp8,
+        seq_len_max=seq_len_max,
+    )
 
     results = generate(
         model=inference_wrapped_model,
@@ -613,6 +695,7 @@ def test_batch_generate(
         ),
     )
 
+    match_percents = []
     for i, (result, (prompt, target)) in enumerate(zip(results, seq_prompts)):
         gen_seq = result.generated_text
         logging.info(f"{ckpt_name} {torch.distributed.get_rank()=} {gen_seq=}")
@@ -638,7 +721,7 @@ def test_batch_generate(
         ("evo2/1b-8k:1.0", get_model_and_tokenizer, [86.4, 78.8, 87.6]),
         ("evo2_mamba/7b-8k:0.1", get_model_and_tokenizer_ignore_vortex, [86.5, 88.4, 88.2]),
         ("evo2/7b-8k:1.0", get_model_and_tokenizer, [88.8, 88.5, 82.2]),
-        # ("evo2/7b-1m:1.0", get_model_and_tokenizer, [88.8, 88.5, 82.2]),
+        ("evo2/7b-1m:1.0", get_model_and_tokenizer, [88.8, 88.5, 82.2]),
     ],
 )
 def test_batch_generate_coding_sequences(
@@ -648,7 +731,7 @@ def test_batch_generate_coding_sequences(
     expected_matchpercents: list[float],
 ):
     assert len(coding_sequences) > 0
-    determine_memory_requirement_and_skip_if_not_met(ckpt_name)
+    determine_memory_requirement_and_skip_if_not_met(ckpt_name, test_name=inspect.currentframe().f_code.co_name)
 
     is_fp8_supported, compute_capability, device_info = check_fp8_support(torch.cuda.current_device())
     skip = "evo2/1b-8k:" in ckpt_name and not is_fp8_supported
@@ -660,16 +743,16 @@ def test_batch_generate_coding_sequences(
         pytest.skip(f"Skipping {ckpt_name} because it is not on NGC yet. Run with `BIONEMO_DATA_SOURCE=pbss`.")
     # only use vortex_style_fp8 for non-bf16 checkpoints with fp8 support
     vortex_style_fp8 = is_fp8_supported and "bf16" not in ckpt_name
-    inference_wrapped_model, mcore_tokenizer = model_tokenizer_provider(
-        ckpt_name, vortex_style_fp8=vortex_style_fp8, enable_flash_decode=True, flash_decode=True
-    )
 
     match_percents: list[float] = []
     cds_lengths: list[int | None] = []
     original_cds_lengths: list[int] = [len(seq) for seq in coding_sequences]
     seq_prompts = [mid_point_split(seq=seq, num_tokens=None, fraction=0.3) for seq in coding_sequences]
     num_tokens = max(len(sq[1]) for sq in seq_prompts) + 15
-    from megatron.core.inference.common_inference_params import CommonInferenceParams
+
+    inference_wrapped_model, mcore_tokenizer = model_tokenizer_provider(
+        ckpt_name, vortex_style_fp8=vortex_style_fp8, enable_flash_decode=True, flash_decode=True
+    )
 
     _ = generate(
         model=inference_wrapped_model,
@@ -748,7 +831,7 @@ def test_batch_generate_coding_sequences(
         ("evo2/1b-8k:1.0", get_model_and_tokenizer, 41.0),
         ("evo2_mamba/7b-8k:0.1", get_model_and_tokenizer_ignore_vortex, 39.73),
         ("evo2/7b-8k:1.0", get_model_and_tokenizer, 32.0),
-        # ("evo2/7b-1m:1.0", get_model_and_tokenizer, 32.0),
+        ("evo2/7b-1m:1.0", get_model_and_tokenizer, 32.0),
     ],
 )
 def test_generate_speed(
@@ -757,7 +840,7 @@ def test_generate_speed(
     expected_tokens_sec: float,
 ):
     is_fp8_supported, compute_capability, device_info = check_fp8_support(torch.cuda.current_device())
-    determine_memory_requirement_and_skip_if_not_met(ckpt_name)
+    determine_memory_requirement_and_skip_if_not_met(ckpt_name, test_name=inspect.currentframe().f_code.co_name)
 
     skip = "evo2/1b-8k:" in ckpt_name and not is_fp8_supported
     if skip:
@@ -776,8 +859,6 @@ def test_generate_speed(
         flash_decode=True,
     )
 
-    from megatron.core.inference.common_inference_params import CommonInferenceParams
-
     # warm up the model with a single call before timing. This should take care of compilation etc.
     _ = generate(
         model=inference_wrapped_model,