feat: add Blackwell GPU arch mismatch diagnostic #653

Open
neuron-tech-ai wants to merge 1 commit into jamiepine:main from neuron-tech-ai:feat/blackwell-diagnostic
neuron-tech-ai commented May 14, 2026

RTX 50-series (Blackwell, sm_120) GPUs produce a CUDA arch mismatch error during model load that surfaces as a generic crash, which is confusing for users who just got a new GPU and expect it to work.

This adds detection for the specific error pattern and shows a clear user-facing message: the installed CUDA libraries were compiled for an older architecture and need to be updated. It points to the resolution path instead of leaving the user with a raw stack trace.

Affected: Windows and Linux users with RTX 5080/5090 (Blackwell, compute capability 12.0).

Summary by CodeRabbit

  • New Features

    • Enhanced GPU architecture compatibility detection with customized warning messages based on GPU type
    • Blackwell-class GPU users receive specific instructions to download the correct CUDA binary from Settings
  • Bug Fixes

    • Improved CUDA compatibility checking with clearer guidance when GPU architecture doesn't match installed CUDA binary
    • More informative error messages for unsupported GPU configurations

Add get_cuda_arch() to platform_detect.py. Improve check_cuda_compatibility()
warning to give Blackwell (sm_120+) users a clear re-download path via
Settings → Server → GPU Acceleration. Surface cuda_arch_warning on every
ModelStatus entry in /models/status so the frontend can highlight it.
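
The architecture probe described in this commit could look roughly like the sketch below. It is not the PR's actual code: `format_sm_arch` is a hypothetical helper introduced here for illustration, and the body of `get_cuda_arch()` is an assumption based only on the description (query the primary device, return an SM string, return None on any failure).

```python
from typing import Optional


def format_sm_arch(major: int, minor: int) -> str:
    """Turn a (major, minor) compute capability into an SM string, e.g. (9, 0) -> 'sm_90'."""
    return f"sm_{major}{minor}"


def get_cuda_arch() -> Optional[str]:
    """Return the primary GPU's SM architecture string, or None if it cannot be determined."""
    try:
        # Imported lazily so the helper still works (returning None) without PyTorch.
        import torch

        if not torch.cuda.is_available():
            return None
        # Device 0 is assumed to be the "primary" GPU in this sketch.
        major, minor = torch.cuda.get_device_capability(0)
        return format_sm_arch(major, minor)
    except Exception:
        # Missing torch, driver errors, etc. are all reported as "unknown".
        return None
```

On a machine without CUDA (or without PyTorch) the call simply returns None, so callers can treat "no architecture" and "no GPU" the same way.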

coderabbitai Bot commented May 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 188f1030-5f76-4341-a2d1-8ad5a225bda9

📥 Commits

Reviewing files that changed from the base of the PR and between b35b909 and aa95089.

📒 Files selected for processing (4)
  • backend/backends/base.py
  • backend/models.py
  • backend/routes/models.py
  • backend/utils/platform_detect.py

📝 Walkthrough

This PR adds GPU CUDA architecture compatibility detection and reporting. It introduces runtime GPU architecture probing via torch, enhances CUDA compatibility checking with GPU-specific warning messages (especially for Blackwell architectures), and integrates the warnings into the model-status API response.

Changes

GPU CUDA Compatibility Warning System

| Layer | File(s) | Summary |
|---|---|---|
| GPU Architecture Detection | `backend/utils/platform_detect.py` | New `get_cuda_arch()` function detects primary GPU compute capability at runtime and returns an SM architecture string (e.g., `sm_90`) or `None` when CUDA is unavailable or an error occurs. |
| CUDA Compatibility Warning Logic | `backend/backends/base.py` | `check_cuda_compatibility()` now constructs targeted warning messages: Blackwell GPUs (major >= 12) receive a "re-download from Settings → Server → GPU Acceleration" message; other unsupported GPUs receive a "download compatible CUDA binary" message. |
| Model Status API Response Integration | `backend/models.py`, `backend/routes/models.py` | `ModelStatus` adds optional `cuda_arch_warning` field. `get_model_status()` computes the warning by checking CUDA availability and calling `check_cuda_compatibility()`, then includes it in both the primary success path and exception fallback path. |
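
To make the warning branch concrete, here is a minimal sketch of how the two messages might be selected. Everything in it is an assumption for illustration: `build_cuda_arch_warning` and `BLACKWELL_MIN_MAJOR` are hypothetical names, the message wording is paraphrased from the summary above, and the exact-match test against a list of supported SM strings is a simplification (in a PyTorch build that list might come from `torch.cuda.get_arch_list()`, and real compatibility rules are more nuanced).

```python
from typing import List, Optional, Tuple

# Threshold taken from the summary above ("major >= 12"); the name is hypothetical.
BLACKWELL_MIN_MAJOR = 12


def build_cuda_arch_warning(
    capability: Tuple[int, int], supported_archs: List[str]
) -> Tuple[bool, Optional[str]]:
    """Return (compatible, warning) for a GPU compute capability.

    supported_archs is the list of SM strings the installed binary was built for.
    """
    major, minor = capability
    arch = f"sm_{major}{minor}"
    if arch in supported_archs:
        return True, None
    if major >= BLACKWELL_MIN_MAJOR:
        # Blackwell-class GPU: point the user at the re-download path.
        return False, (
            f"Your GPU ({arch}) is newer than the installed CUDA binary supports. "
            "Re-download it from Settings → Server → GPU Acceleration."
        )
    # Any other unsupported architecture gets the generic message.
    return False, (
        f"Your GPU ({arch}) is not supported by the installed CUDA binary. "
        "Download a compatible CUDA binary."
    )
```

For example, `build_cuda_arch_warning((12, 0), ["sm_80", "sm_90"])` takes the Blackwell branch, while `(9, 0)` against the same list is reported as compatible with no warning.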

Sequence Diagram

sequenceDiagram
  participant Client
  participant get_model_status
  participant check_cuda_compatibility
  participant get_cuda_arch
  participant CUDA_Device
  
  Client->>get_model_status: Request model status
  get_model_status->>get_cuda_arch: Detect GPU architecture
  get_cuda_arch->>CUDA_Device: torch.cuda.get_device_capability()
  CUDA_Device-->>get_cuda_arch: (major, minor) compute capability
  get_cuda_arch-->>get_model_status: sm_90 (or None if unavailable)
  get_model_status->>check_cuda_compatibility: Check PyTorch build support
  check_cuda_compatibility-->>get_model_status: (compatible: bool, warning: Optional[str])
  get_model_status-->>Client: ModelStatus with cuda_arch_warning field
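
The last step of the flow above, attaching the warning to every status entry on both the success and fallback paths, can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `ModelStatus` is modeled as a plain dataclass with hypothetical `model_name` and `loaded` fields (only `cuda_arch_warning` comes from the PR), and the `is_loaded` and `check` callables stand in for the real model lookup and `check_cuda_compatibility()`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ModelStatus:
    model_name: str  # hypothetical field, for illustration only
    loaded: bool  # hypothetical field, for illustration only
    cuda_arch_warning: Optional[str] = None  # the field this PR adds


def compute_cuda_arch_warning(
    cuda_available: bool,
    check: Callable[[], Tuple[bool, Optional[str]]],
) -> Optional[str]:
    """Return a warning string only when CUDA is present and the check fails."""
    if not cuda_available:
        return None
    try:
        compatible, warning = check()
    except Exception:
        # A failing diagnostic should never break the status endpoint.
        return None
    return None if compatible else warning


def get_model_status(
    names: List[str],
    is_loaded: Callable[[str], bool],
    cuda_available: bool,
    check: Callable[[], Tuple[bool, Optional[str]]],
) -> List[ModelStatus]:
    """Build one ModelStatus per model, carrying the same warning on every entry."""
    warning = compute_cuda_arch_warning(cuda_available, check)
    statuses = []
    for name in names:
        try:
            statuses.append(ModelStatus(name, is_loaded(name), warning))
        except Exception:
            # Fallback path: still surface the warning so the frontend can show it.
            statuses.append(ModelStatus(name, False, warning))
    return statuses
```

Computing the warning once and reusing it keeps the per-model loop cheap and guarantees the frontend sees the same message on every entry.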

Estimated Code Review Effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A GPU speaks its secret name,
SM architectures, Blackwell's claim!
PyTorch listens, checks the tune,
Warnings flutter to the moon.
Users download, all is well—
CUDA harmony does swell! 🚀

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: adding a diagnostic for Blackwell GPU architecture mismatch. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
