fix(pt): Treat cuBLAS allocation failures as PyTorch OOM during auto bs by OutisLi · Pull Request #5440 · deepmodeling/deepmd-kit

OutisLi · 2026-05-11T01:25:37Z

Treat cuBLAS allocation failures as PyTorch OOM during auto batch sizing

PyTorch inference can raise a cuBLAS allocation failure (`CUBLAS_STATUS_ALLOC_FAILED`) after an oversized batch attempt. Previously this was treated as a generic RuntimeError, so auto batch sizing stopped after the first batch size reduction instead of continuing to shrink the inference batch.

Add this cuBLAS allocation failure to the PyTorch auto-batch OOM markers and cover it with a unit test, allowing auto batch sizing to continue retrying with smaller batch sizes.
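A minimal sketch of the marker-based classification described above may help. It is not the code from `deepmd/pt/utils/auto_batch_size.py`: the `OOM_MARKERS` tuple, the free-function form of `is_oom_error`, and the injected `empty_cache` callback are assumptions made so the example runs without PyTorch. Only the `CUBLAS_STATUS_ALLOC_FAILED` marker is confirmed by this PR.

```python
# Hypothetical marker tuple. Only "CUBLAS_STATUS_ALLOC_FAILED" is confirmed
# by this PR; the other entry is an illustrative placeholder.
OOM_MARKERS = (
    "CUDA out of memory",
    "CUBLAS_STATUS_ALLOC_FAILED",
)


def is_oom_error(exc: Exception, empty_cache=lambda: None) -> bool:
    """Return True if exc looks like a GPU out-of-memory failure."""
    if not isinstance(exc, RuntimeError):
        return False
    message = str(exc)
    if any(marker in message for marker in OOM_MARKERS):
        # Stand-in for torch.cuda.empty_cache(), injected so that the
        # sketch has no PyTorch dependency.
        empty_cache()
        return True
    return False
```

With this shape, supporting another failure mode only requires appending a substring to the tuple, which matches the scope of the change described here.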

Summary by CodeRabbit

Bug Fixes
- Enhanced GPU memory error detection to properly recognize and handle CUDA allocation failures, enabling automatic memory recovery and improving reliability during compute-intensive operations.

Copilot

Pull request overview

Treats cuBLAS allocation failures (CUBLAS_STATUS_ALLOC_FAILED) as GPU OOM signals in the PyTorch auto-batch-sizing path, so inference can continue shrinking the batch size instead of bailing out on a generic RuntimeError.
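To illustrate why the classification matters, here is a hedged sketch of a shrink-and-retry loop. The names `run_with_shrinking_batch` and `fake_inference`, the halving factor, and the memory threshold of 16 are all hypothetical and are not taken from deepmd-kit; the point is only that an error not recognized as OOM is re-raised immediately, while a recognized one triggers another attempt with a smaller batch.

```python
def run_with_shrinking_batch(fn, batch_size, is_oom, factor=2):
    """Call fn(batch_size), shrinking batch_size after each OOM-like error."""
    while True:
        try:
            return fn(batch_size), batch_size
        except RuntimeError as exc:
            # Errors not classified as OOM, or OOM at the minimum batch
            # size, are propagated to the caller unchanged.
            if not is_oom(exc) or batch_size <= 1:
                raise
            batch_size = max(1, batch_size // factor)


def fake_inference(batch_size):
    """Hypothetical workload that only fits in memory for batch_size <= 16."""
    if batch_size > 16:
        raise RuntimeError("CUDA error: CUBLAS_STATUS_ALLOC_FAILED")
    return f"ok@{batch_size}"


result, final_batch_size = run_with_shrinking_batch(
    fake_inference,
    128,
    lambda exc: "CUBLAS_STATUS_ALLOC_FAILED" in str(exc),
)
# The loop tries 128, 64, and 32 before succeeding at 16.
```

If the predicate did not recognize the cuBLAS message, the first failure would propagate instead of leading to a retry.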

Changes:

- Add `CUBLAS_STATUS_ALLOC_FAILED` to the PyTorch OOM marker substrings used by `AutoBatchSize.is_oom_error`.
- Add a unit test asserting that this error string is classified as OOM and triggers `torch.cuda.empty_cache()`.
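The test described in the second item could look roughly like the sketch below. To keep it runnable without PyTorch or a GPU, it patches a hypothetical `FakeCuda.empty_cache` rather than `torch.cuda.empty_cache`, and it uses a simplified stand-in for `is_oom_error`; the real test in `source/tests/pt/test_auto_batch_size.py` may be structured differently.

```python
import unittest
from unittest import mock


class FakeCuda:
    """Hypothetical stand-in for torch.cuda so the sketch needs no PyTorch."""

    @staticmethod
    def empty_cache():
        pass


def is_oom_error(exc):
    """Simplified stand-in for AutoBatchSize.is_oom_error (an assumption)."""
    if isinstance(exc, RuntimeError) and "CUBLAS_STATUS_ALLOC_FAILED" in str(exc):
        FakeCuda.empty_cache()
        return True
    return False


class TestIsOomError(unittest.TestCase):
    def test_is_oom_error_cublas_alloc_failed(self):
        # Replace the cache-clearing call with a mock so the call is recorded.
        with mock.patch.object(FakeCuda, "empty_cache") as empty_cache:
            err = RuntimeError(
                "CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`"
            )
            self.assertTrue(is_oom_error(err))
        empty_cache.assert_called_once()
```

Patching the cache-clearing call lets the test check both the classification result and the side effect without touching any GPU state.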

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
`deepmd/pt/utils/auto_batch_size.py`	Extends the RuntimeError message markers considered OOM to include cuBLAS allocation failures.
`source/tests/pt/test_auto_batch_size.py`	Adds coverage to ensure the new cuBLAS allocation failure marker is treated as OOM and clears CUDA cache.

coderabbitai · 2026-05-11T01:27:45Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: fe2fcd59-601d-4da7-9a22-1ac85e4b787e

📥 Commits

Reviewing files that changed from the base of the PR and between 57f870f and 7ccb11d.

📒 Files selected for processing (2)

deepmd/pt/utils/auto_batch_size.py
source/tests/pt/test_auto_batch_size.py

Walkthrough

This PR expands CUDA out-of-memory error detection to recognize cuBLAS allocation failures. The AutoBatchSize.is_oom_error method now treats "CUBLAS_STATUS_ALLOC_FAILED" as an OOM condition, with a corresponding unit test validating the detection logic.

Changes

CUBLAS OOM Detection

Layer / File(s)	Summary
OOM Error Detection `deepmd/pt/utils/auto_batch_size.py`	Plain-text CUDA OOM marker tuple is extended to include `"CUBLAS_STATUS_ALLOC_FAILED"`, triggering `torch.cuda.empty_cache()` and returning `True` when this error is encountered.
Test Coverage `source/tests/pt/test_auto_batch_size.py`	New test `test_is_oom_error_cublas_alloc_failed` patches `torch.cuda.empty_cache` and asserts that cublas allocation failure messages are correctly identified as OOM errors.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

deepmodeling/deepmd-kit#5418: Both PRs expand OOM detection in AutoBatchSize.is_oom_error—this PR adds CUBLAS_STATUS_ALLOC_FAILED marker recognition while the related PR adds cause-chain traversal and AOTInductor wrapper detection.
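For context, cause-chain traversal of the kind attributed to the related PR can be sketched as follows. This is a generic illustration built on Python's standard `__cause__` and `__context__` attributes, not the implementation from #5418; `iter_exception_chain` and `chain_has_marker` are hypothetical names.

```python
def iter_exception_chain(exc):
    """Yield exc and every exception linked through __cause__ or __context__."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))  # guard against cyclic chains
        yield exc
        exc = exc.__cause__ or exc.__context__


def chain_has_marker(exc, marker="CUBLAS_STATUS_ALLOC_FAILED"):
    """Return True if any exception in the chain mentions the marker."""
    return any(marker in str(e) for e in iter_exception_chain(exc))
```

A wrapper exception raised with `raise ... from inner` would then still be classified by the message of the underlying cuBLAS failure.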

Suggested labels

bug, Python

Suggested reviewers

njzjz

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely describes the main change: treating cuBLAS allocation failures as OOM errors during auto batch sizing, which matches the changeset's core objective.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

codecov · 2026-05-11T02:17:23Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.84%. Comparing base (57f870f) to head (7ccb11d).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5440      +/-   ##
==========================================
+ Coverage   82.50%   82.84%   +0.34%     
==========================================
  Files         826      830       +4     
  Lines       87935    91245    +3310     
  Branches     4206     4376     +170     
==========================================
+ Hits        72547    75591    +3044     
- Misses      14104    14349     +245     
- Partials     1284     1305      +21

Copilot AI review requested due to automatic review settings May 11, 2026 01:25

github-actions Bot added the Python label May 11, 2026

Copilot started reviewing on behalf of OutisLi May 11, 2026 01:26 View session

dosubot Bot added the bug label May 11, 2026

OutisLi requested review from iProzd and njzjz May 11, 2026 01:26

Copilot AI reviewed May 11, 2026

View reviewed changes

njzjz approved these changes May 11, 2026

View reviewed changes

OutisLi closed this by deleting the head repository May 13, 2026

OutisLi wants to merge 1 commit into deepmodeling:master from OutisLi:pr/bs

3 participants
