feat: add Blackwell GPU arch mismatch diagnostic #653
Add get_cuda_arch() to platform_detect.py. Improve check_cuda_compatibility() warning to give Blackwell (sm_120+) users a clear re-download path via Settings → Server → GPU Acceleration. Surface cuda_arch_warning on every ModelStatus entry in /models/status so the frontend can highlight it.
No actionable comments were generated in the recent review.
Walkthrough
This PR adds GPU CUDA architecture compatibility detection and reporting. It introduces runtime GPU architecture probing via …
Changes: GPU CUDA Compatibility Warning System
Sequence Diagram
sequenceDiagram
participant Client
participant get_model_status
participant check_cuda_compatibility
participant get_cuda_arch
participant CUDA_Device
Client->>get_model_status: Request model status
get_model_status->>get_cuda_arch: Detect GPU architecture
get_cuda_arch->>CUDA_Device: torch.cuda.get_device_capability()
CUDA_Device-->>get_cuda_arch: (major, minor) compute capability
get_cuda_arch-->>get_model_status: sm_90 (or None if unavailable)
get_model_status->>check_cuda_compatibility: Check PyTorch build support
check_cuda_compatibility-->>get_model_status: (compatible: bool, warning: Optional[str])
get_model_status-->>Client: ModelStatus with cuda_arch_warning field
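The flow in the diagram can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function names `get_cuda_arch` and `check_cuda_compatibility` and the `cuda_arch_warning` field come from the description, but the signatures, the `format_arch` and `build_model_status` helpers, the `ModelStatus` fields other than `cuda_arch_warning`, and the warning wording are assumptions. The membership test against the build's arch list is also a simplification, since it ignores PTX forward compatibility.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


def format_arch(major: int, minor: int) -> str:
    """Turn a (major, minor) compute capability into an sm_XY string."""
    return f"sm_{major}{minor}"


def get_cuda_arch() -> Optional[str]:
    """Return the current GPU's arch (e.g. 'sm_90'), or None if unavailable."""
    try:
        import torch  # imported lazily so CPU-only installs still work
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability()
    return format_arch(major, minor)


def check_cuda_compatibility(
    arch: Optional[str], supported_archs: Sequence[str]
) -> Tuple[bool, Optional[str]]:
    """Assumed signature: compare the GPU arch with the build's arch list.

    In a real setup, supported_archs could come from torch.cuda.get_arch_list().
    """
    if arch is None or arch in supported_archs:
        return True, None
    warning = (
        f"GPU architecture {arch} is not supported by the installed CUDA build. "
        "Re-download it via Settings → Server → GPU Acceleration."
    )
    return False, warning


@dataclass
class ModelStatus:
    name: str  # hypothetical field; only cuda_arch_warning is named in the PR
    cuda_arch_warning: Optional[str] = None


def build_model_status(
    names: Sequence[str], arch: Optional[str], supported_archs: Sequence[str]
) -> List[ModelStatus]:
    """Attach the same warning (if any) to every status entry."""
    _, warning = check_cuda_compatibility(arch, supported_archs)
    return [ModelStatus(name=n, cuda_arch_warning=warning) for n in names]
```

Passing the arch and the supported list in as arguments keeps the check testable on machines without a GPU; the real implementation may query PyTorch directly instead.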
Estimated Code Review Effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
RTX 50-series (Blackwell, sm_120) GPUs produce a CUDA arch mismatch error during model load that surfaces as a generic crash, which is confusing for users who just got a new GPU and expect it to work.
This adds detection for the specific error pattern and shows a clear user-facing message: the installed CUDA libraries were compiled for an older architecture and need to be updated. It points to the resolution path instead of leaving the user with a raw stack trace.
Affected: Windows and Linux users with RTX 5080/5090 (Blackwell, compute capability 12.0).
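One way the error-pattern detection could look is sketched below. The PR text does not say which message it matches, so the pattern here is an assumption: it uses the CUDA runtime's "no kernel image is available for execution on the device" message, which is what an arch mismatch commonly produces. The function name `diagnose_cuda_arch_error` and the message wording are hypothetical.

```python
import re
from typing import Optional

# Assumed pattern: the CUDA runtime message typically seen when the installed
# binaries contain no kernels for the GPU's architecture.
_ARCH_MISMATCH_PATTERNS = (
    re.compile(r"no kernel image is available for execution on the device", re.I),
)


def diagnose_cuda_arch_error(exc: BaseException) -> Optional[str]:
    """Return a user-facing message if exc looks like an arch mismatch, else None."""
    text = str(exc)
    if any(p.search(text) for p in _ARCH_MISMATCH_PATTERNS):
        return (
            "The installed CUDA libraries were compiled for an older GPU "
            "architecture and need to be updated. Re-download them via "
            "Settings → Server → GPU Acceleration."
        )
    return None
```

A model loader could call this in its `except RuntimeError` handler and fall back to re-raising the original exception when it returns `None`.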