fix(pt): Treat cuBLAS allocation failures as PyTorch OOM during auto bs#5440

Closed
OutisLi wants to merge 1 commit into deepmodeling:master from OutisLi:pr/bs
Conversation

Collaborator

@OutisLi OutisLi commented May 11, 2026

Treat cuBLAS allocation failures as PyTorch OOM during auto batch sizing

PyTorch inference can raise a cuBLAS allocation failure (CUBLAS_STATUS_ALLOC_FAILED) after an oversized batch attempt. Previously this was treated as a generic RuntimeError, so auto batch sizing stopped after the first batch size reduction instead of continuing to shrink the inference batch.

Add this cuBLAS allocation failure to the PyTorch auto-batch OOM markers and cover it with a unit test, allowing inference to continue retrying with smaller batch sizes.
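
To illustrate why the classification matters, here is a minimal sketch of a shrink-and-retry loop. It is not the project's implementation: `run_with_shrinking_batch`, `fake_inference`, and `recognizes_cublas` are hypothetical names, and the halving policy is an assumption made for the example.

```python
from typing import Callable, Tuple


def run_with_shrinking_batch(
    fn: Callable[[int], int],
    batch_size: int,
    is_oom: Callable[[Exception], bool],
) -> Tuple[int, int]:
    """Call fn(batch_size), halving the batch after each OOM-like failure.

    Returns (result, batch size that finally succeeded).
    """
    while True:
        try:
            return fn(batch_size), batch_size
        except RuntimeError as exc:
            # Unrecognized errors, or an OOM at batch size 1, are re-raised.
            if not is_oom(exc) or batch_size == 1:
                raise
            batch_size //= 2


def fake_inference(batch_size: int) -> int:
    """Hypothetical workload that only fits when batch_size <= 4."""
    if batch_size > 4:
        raise RuntimeError("CUDA error: CUBLAS_STATUS_ALLOC_FAILED")
    return batch_size


def recognizes_cublas(exc: Exception) -> bool:
    """Predicate that treats the cuBLAS allocation failure as OOM."""
    return "CUBLAS_STATUS_ALLOC_FAILED" in str(exc)
```

With `recognizes_cublas` as the predicate, the loop keeps halving until the workload fits. With a predicate that does not recognize the message, the first cuBLAS failure propagates as a plain RuntimeError, which is the behavior this PR is meant to avoid.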

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced GPU memory error detection to properly recognize and handle CUDA allocation failures, enabling automatic memory recovery and improving reliability during compute-intensive operations.

Copilot AI review requested due to automatic review settings May 11, 2026 01:25
@dosubot dosubot Bot added the bug label May 11, 2026
@OutisLi OutisLi requested review from iProzd and njzjz May 11, 2026 01:26
Contributor

Copilot AI left a comment

Pull request overview

Treats cuBLAS allocation failures (CUBLAS_STATUS_ALLOC_FAILED) as GPU OOM signals in the PyTorch auto-batch-sizing path, so inference can continue shrinking the batch size instead of bailing out on a generic RuntimeError.

Changes:

  • Add CUBLAS_STATUS_ALLOC_FAILED to the PyTorch OOM marker substrings used by AutoBatchSize.is_oom_error.
  • Add a unit test asserting that this error string is classified as OOM and triggers torch.cuda.empty_cache().
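
The marker-based check described above could look roughly like the following sketch. It is an approximation, not the code in deepmd/pt/utils/auto_batch_size.py: only the CUBLAS_STATUS_ALLOC_FAILED marker comes from this PR, the "CUDA out of memory" entry is an illustrative assumption, and the `empty_cache` parameter is a hypothetical hook standing in for torch.cuda.empty_cache() so the sketch runs without PyTorch.

```python
from typing import Callable

# Substrings treated as GPU out-of-memory signals. Only the cuBLAS marker
# is taken from this PR; the first entry is an illustrative assumption.
OOM_MARKERS = (
    "CUDA out of memory",
    "CUBLAS_STATUS_ALLOC_FAILED",
)


def is_oom_error(
    exc: Exception,
    empty_cache: Callable[[], None] = lambda: None,
) -> bool:
    """Return True if exc is a RuntimeError whose message has an OOM marker.

    On a match, empty_cache() is called before returning, mirroring the
    cache clearing that the review summary describes.
    """
    if isinstance(exc, RuntimeError) and any(
        marker in str(exc) for marker in OOM_MARKERS
    ):
        empty_cache()
        return True
    return False
```

Substring matching is used here because the failure arrives as a RuntimeError with a message string; in real use the hook would presumably be torch.cuda.empty_cache.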

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| deepmd/pt/utils/auto_batch_size.py | Extends the RuntimeError message markers considered OOM to include cuBLAS allocation failures. |
| source/tests/pt/test_auto_batch_size.py | Adds coverage to ensure the new cuBLAS allocation failure marker is treated as OOM and clears CUDA cache. |

@coderabbitai
Contributor

coderabbitai Bot commented May 11, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: fe2fcd59-601d-4da7-9a22-1ac85e4b787e

📥 Commits

Reviewing files that changed from the base of the PR and between 57f870f and 7ccb11d.

📒 Files selected for processing (2)
  • deepmd/pt/utils/auto_batch_size.py
  • source/tests/pt/test_auto_batch_size.py

📝 Walkthrough

This PR expands CUDA out-of-memory error detection to recognize cuBLAS allocation failures. The AutoBatchSize.is_oom_error method now treats "CUBLAS_STATUS_ALLOC_FAILED" as an OOM condition, with a corresponding unit test validating the detection logic.

Changes

CUBLAS OOM Detection

| Layer / File(s) | Summary |
| --- | --- |
| OOM Error Detection: deepmd/pt/utils/auto_batch_size.py | Plain-text CUDA OOM marker tuple is extended to include "CUBLAS_STATUS_ALLOC_FAILED", triggering torch.cuda.empty_cache() and returning True when this error is encountered. |
| Test Coverage: source/tests/pt/test_auto_batch_size.py | New test test_is_oom_error_cublas_alloc_failed patches torch.cuda.empty_cache and asserts that cuBLAS allocation failure messages are correctly identified as OOM errors. |
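
A test of the shape described in the table might be sketched as below. This is not the test added in source/tests/pt/test_auto_batch_size.py: `Cache` and `AutoBatchSizeSketch` are hypothetical stand-ins for torch.cuda and the real AutoBatchSize, used so the example runs without PyTorch or a GPU, and the error message is only an example string.

```python
import unittest
from unittest import mock


class Cache:
    """Hypothetical stand-in for torch.cuda."""

    @staticmethod
    def empty_cache() -> None:
        pass


class AutoBatchSizeSketch:
    """Hypothetical stand-in for the class under test."""

    def is_oom_error(self, exc: Exception) -> bool:
        if isinstance(exc, RuntimeError) and "CUBLAS_STATUS_ALLOC_FAILED" in str(exc):
            Cache.empty_cache()
            return True
        return False


class TestIsOomError(unittest.TestCase):
    def test_is_oom_error_cublas_alloc_failed(self) -> None:
        err = RuntimeError("CUDA error: CUBLAS_STATUS_ALLOC_FAILED")
        # Replace the cache-clearing call with a mock to observe it.
        with mock.patch.object(Cache, "empty_cache") as empty_cache:
            self.assertTrue(AutoBatchSizeSketch().is_oom_error(err))
        empty_cache.assert_called_once()
```

Patching the cache-clearing call keeps the test independent of GPU availability while still checking both effects: the True return value and the single cache clear.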

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

  • deepmodeling/deepmd-kit#5418: Both PRs expand OOM detection in AutoBatchSize.is_oom_error—this PR adds CUBLAS_STATUS_ALLOC_FAILED marker recognition while the related PR adds cause-chain traversal and AOTInductor wrapper detection.

Suggested labels

bug, Python

Suggested reviewers

  • njzjz
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: treating cuBLAS allocation failures as OOM errors during auto batch sizing, which matches the changeset's core objective. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@codecov

codecov Bot commented May 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.84%. Comparing base (57f870f) to head (7ccb11d).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5440      +/-   ##
==========================================
+ Coverage   82.50%   82.84%   +0.34%     
==========================================
  Files         826      830       +4     
  Lines       87935    91245    +3310     
  Branches     4206     4376     +170     
==========================================
+ Hits        72547    75591    +3044     
- Misses      14104    14349     +245     
- Partials     1284     1305      +21     

@OutisLi OutisLi closed this by deleting the head repository May 13, 2026