The GPU module in RustWhy provides comprehensive diagnostics across all major GPU vendors:
- NVIDIA (GeForce, Quadro, Tesla)
- AMD (Radeon, FirePro, Instinct)
- Intel (Integrated and Arc GPUs)
The module automatically detects all GPUs in the system and uses vendor-specific tools when available, falling back to generic sysfs reading for basic information.
# Analyze all GPUs
rustwhy gpu
# Verbose output with additional details
rustwhy gpu --verbose
# JSON output for scripting
rustwhy gpu --json

# Continuous GPU monitoring (updates every 2 seconds)
rustwhy gpu --watch
# Custom update interval (5 seconds)
rustwhy gpu --watch --interval 5

The module attempts to collect the following metrics for each GPU:
| Metric | Description | Unit | Vendors |
|---|---|---|---|
| Name | GPU model name | - | All |
| Utilization | GPU usage percentage | % | NVIDIA, AMD, Intel |
| Memory Used | VRAM usage | MiB | NVIDIA, AMD |
| Memory Total | Total VRAM | MiB | NVIDIA, AMD |
| Temperature | GPU core temperature | °C | All |
| Power Draw | Current power consumption | W | NVIDIA, AMD, Intel |
| Fan Speed | Cooling fan RPM | RPM | NVIDIA, AMD |
| Clock Speed | GPU core clock frequency | MHz | NVIDIA, Intel |
Note: Not all metrics are available for all vendors/models. The module gracefully handles missing data.
The module scans /sys/class/drm/card* to discover all GPUs in the system.
For each device, it reads:
- Vendor ID from `/sys/class/drm/cardX/device/vendor`
- PCI address for device identification
- Device path for further queries
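The discovery step above could be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the helper names `is_card_node`, `vendor_path`, and `discover_cards` are hypothetical, and the filter assumes that only entries of the form `cardN` are GPUs, while connector entries such as `card0-HDMI-A-1` and render nodes such as `renderD128` are skipped.

```rust
use std::fs;

/// Returns true for primary DRM card nodes such as "card0" or "card12".
/// Connector entries ("card0-HDMI-A-1") and render nodes ("renderD128")
/// are rejected. (Hypothetical helper, not part of RustWhy's API.)
fn is_card_node(name: &str) -> bool {
    match name.strip_prefix("card") {
        Some(rest) => !rest.is_empty() && rest.bytes().all(|b| b.is_ascii_digit()),
        None => false,
    }
}

/// Builds the path of the vendor ID file for a card node name.
fn vendor_path(card: &str) -> String {
    format!("/sys/class/drm/{}/device/vendor", card)
}

/// Lists card node names under `root` (normally "/sys/class/drm").
/// An unreadable or missing directory yields an empty list.
fn discover_cards(root: &str) -> Vec<String> {
    let mut cards: Vec<String> = fs::read_dir(root)
        .map(|entries| {
            entries
                .filter_map(|entry| entry.ok())
                .filter_map(|entry| entry.file_name().into_string().ok())
                .filter(|name| is_card_node(name))
                .collect()
        })
        .unwrap_or_default();
    cards.sort();
    cards
}
```

Returning an empty list instead of an error keeps the "no GPUs detected" case a normal result that the caller can report.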
Vendor detection is based on PCI vendor IDs:
| Vendor | PCI ID | Detection |
|---|---|---|
| NVIDIA | 0x10de | Automatic |
| AMD | 0x1002 | Automatic |
| Intel | 0x8086 | Automatic |
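
As an illustration of the table above, the vendor file content (for example `0x10de` followed by a newline) could be mapped to a vendor like this. The `GpuVendor` enum and `parse_vendor` function are hypothetical names used for this sketch only.

```rust
/// GPU vendors distinguished by PCI vendor ID (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum GpuVendor {
    Nvidia,
    Amd,
    Intel,
    Unknown,
}

/// Parses the content of a sysfs `vendor` file, e.g. "0x10de\n".
/// Anything that is not a known ID, or not valid hex, maps to `Unknown`.
fn parse_vendor(raw: &str) -> GpuVendor {
    let hex = raw
        .trim()
        .trim_start_matches("0x")
        .trim_start_matches("0X");
    match u16::from_str_radix(hex, 16) {
        Ok(0x10de) => GpuVendor::Nvidia,
        Ok(0x1002) => GpuVendor::Amd,
        Ok(0x8086) => GpuVendor::Intel,
        _ => GpuVendor::Unknown,
    }
}
```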
Based on the detected vendor, the module uses different backends:
Primary Method: nvidia-smi
nvidia-smi --query-gpu=name,utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw,fan.speed,clocks.gr --format=csv,noheader,nounits

Fallback: sysfs reading from /sys/class/drm/cardX/device/
Requirements:
- NVIDIA driver installed
- `nvidia-smi` in PATH (usually included with drivers)
Optional: NVML library support via --features nvidia for programmatic access
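
One line of output from the `nvidia-smi` query shown above could be parsed along these lines. This is a sketch under stated assumptions: the `NvidiaSample` struct and `parse_nvidia_smi_line` function are hypothetical, the GPU name is assumed to contain no commas, and fields that `nvidia-smi` cannot report (printed as text such as `[N/A]`) become `None`.

```rust
/// One row of the CSV query, in the same field order as the command.
/// (Hypothetical struct for this sketch.)
#[derive(Debug, PartialEq)]
struct NvidiaSample {
    name: String,
    utilization_pct: Option<f64>,
    memory_used_mib: Option<f64>,
    memory_total_mib: Option<f64>,
    temperature_c: Option<f64>,
    power_draw_w: Option<f64>,
    /// nvidia-smi documents `fan.speed` as a percentage of maximum speed.
    fan_speed: Option<f64>,
    clock_mhz: Option<f64>,
}

/// Parses one `--format=csv,noheader,nounits` line.
/// Returns `None` unless the line has exactly eight comma-separated fields.
fn parse_nvidia_smi_line(line: &str) -> Option<NvidiaSample> {
    let fields: Vec<&str> = line.split(',').map(str::trim).collect();
    if fields.len() != 8 {
        return None;
    }
    // Non-numeric placeholders such as "[N/A]" fail to parse and become None.
    let num = |i: usize| fields[i].parse::<f64>().ok();
    Some(NvidiaSample {
        name: fields[0].to_string(),
        utilization_pct: num(1),
        memory_used_mib: num(2),
        memory_total_mib: num(3),
        temperature_c: num(4),
        power_draw_w: num(5),
        fan_speed: num(6),
        clock_mhz: num(7),
    })
}
```

Keeping every metric optional matches the note that not all metrics are available for all models.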
Primary Methods (in order of preference):
- rocm-smi (for ROCm-enabled GPUs)
  rocm-smi --showuse --showmeminfo vram --showtemp
- radeontop (for monitoring utilization)
  radeontop -d 1 -l 1
- sysfs (amdgpu driver)
  - Memory: `/sys/class/drm/cardX/device/mem_info_vram_{total,used}`
  - Temperature: `/sys/class/drm/cardX/device/hwmon/hwmon*/temp1_input`
  - Power: `/sys/class/drm/cardX/device/hwmon/hwmon*/power1_average`
  - Fan: `/sys/class/drm/cardX/device/hwmon/hwmon*/fan1_input`
Requirements:
- amdgpu driver (kernel 4.2+)
- Optional: `rocm-smi` or `radeontop` for detailed stats
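
The preference order for AMD backends described above could be expressed as a small selection function. This is only a sketch: `AmdBackend`, `pick_amd_backend`, and `find_in_path` are hypothetical names, and how the module actually probes for tools is not stated in this document.

```rust
/// Data sources for AMD GPUs, in the preference order listed above.
/// (Hypothetical type for this sketch.)
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum AmdBackend {
    RocmSmi,
    Radeontop,
    Sysfs,
}

/// Returns true if `tool` exists as a file in one of the directories of
/// a PATH-style string such as "/usr/bin:/opt/rocm/bin".
fn find_in_path(tool: &str, path_var: &str) -> bool {
    std::env::split_paths(path_var).any(|dir| dir.join(tool).is_file())
}

/// Picks the first available backend; sysfs is the final fallback.
/// `has_tool` is injected so the choice can be tested without real binaries.
fn pick_amd_backend(has_tool: impl Fn(&str) -> bool) -> AmdBackend {
    if has_tool("rocm-smi") {
        AmdBackend::RocmSmi
    } else if has_tool("radeontop") {
        AmdBackend::Radeontop
    } else {
        AmdBackend::Sysfs
    }
}
```

In real use the closure might wrap `find_in_path` with the value of the `PATH` environment variable.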
Primary Method: intel_gpu_top
intel_gpu_top -J -s 1000 # JSON output, sampled every 1000 ms

Fallback: sysfs reading (i915 driver)
- Temperature: `/sys/class/drm/cardX/device/hwmon/hwmon*/temp1_input`
- Power: `/sys/class/drm/cardX/device/hwmon/hwmon*/power1_average`
Requirements:
- i915 driver (integrated) or xe driver (Arc)
- Optional: `intel-gpu-tools` package for `intel_gpu_top`
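
The sysfs fallbacks listed for AMD and Intel return raw integers that need scaling before display. The kernel's hwmon interface reports `temp1_input` in millidegrees Celsius and `power1_average` in microwatts, and amdgpu's `mem_info_vram_*` files are in bytes. A minimal conversion sketch follows; the helper names are hypothetical.

```rust
use std::fs;
use std::path::Path;

/// Parses the text of a sysfs attribute such as "52000\n".
fn parse_raw(raw: &str) -> Option<i64> {
    raw.trim().parse().ok()
}

/// Reads and parses an integer sysfs attribute; a missing or unreadable
/// file yields `None`, so the metric can simply be omitted.
fn read_attr(path: &Path) -> Option<i64> {
    fs::read_to_string(path).ok().and_then(|s| parse_raw(&s))
}

/// hwmon temperatures are in millidegrees Celsius.
fn millidegrees_to_celsius(raw: i64) -> f64 {
    raw as f64 / 1000.0
}

/// hwmon power readings are in microwatts.
fn microwatts_to_watts(raw: i64) -> f64 {
    raw as f64 / 1_000_000.0
}

/// amdgpu VRAM counters are in bytes; the metrics table uses MiB.
fn bytes_to_mib(raw: i64) -> f64 {
    raw as f64 / (1024.0 * 1024.0)
}
```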
Ubuntu/Debian:
sudo apt-get install nvidia-utils

Fedora/RHEL:
sudo dnf install nvidia-driver-utils

Arch Linux:
sudo pacman -S nvidia-utils

Ubuntu/Debian:
# radeontop (simpler tool)
sudo apt-get install radeontop
# ROCm (full AMD GPU computing stack)
# See: https://rocm.docs.amd.com/

Fedora/RHEL:
sudo dnf install radeontop

Arch Linux:
sudo pacman -S radeontop
# AUR: rocm-smi

Ubuntu/Debian:
sudo apt-get install intel-gpu-tools

Fedora/RHEL:
sudo dnf install intel-gpu-tools

Arch Linux:
sudo pacman -S intel-gpu-tools

GPU DIAGNOSTICS
════════════════════════════════════════════════════════════
Overall Status: ⚠️ WARNING - 1 GPU(s) detected with issues
GPU Devices Detected: 1
NVIDIA GPU 0 - Name: NVIDIA GeForce RTX 3080
NVIDIA GPU 0 - Utilization: 92.0%
NVIDIA GPU 0 - Memory Used: 8924 MiB / 10240 MiB (87.1%)
NVIDIA GPU 0 - Temperature: 78°C
NVIDIA GPU 0 - Power Draw: 320.5W
NVIDIA GPU 0 - Fan Speed: 2340RPM
NVIDIA GPU 0 - Clock Speed: 1890MHz
💡 WHY is this happening?
┌─ Finding: NVIDIA GPU 0 is under high load (92.0%)
│ → GPU is near maximum utilization. This may cause performance bottlenecks.
└─ ⚠️ WARNING
┌─ Finding: NVIDIA GPU 0 temperature is elevated (78°C)
│ → Consider improving case airflow or cleaning dust filters.
└─ ⚠️ WARNING
📋 RECOMMENDATIONS:
1. [HIGH] Identify GPU-intensive processes
$ nvidia-smi pmon -c 1
→ Monitor which processes are using the GPU.
2. [MEDIUM] Improve GPU cooling immediately
→ High GPU temperatures can cause throttling or hardware damage.
GPU DIAGNOSTICS
════════════════════════════════════════════════════════════
Overall Status: ✅ OK - 2 GPU(s) detected and operating normally
GPU Devices Detected: 2
NVIDIA GPU 0 - Name: NVIDIA GeForce RTX 3060
NVIDIA GPU 0 - Utilization: 12.0%
NVIDIA GPU 0 - Memory Used: 1024 MiB / 12288 MiB (8.3%)
NVIDIA GPU 0 - Temperature: 45°C
Intel GPU 1 - Name: Intel UHD Graphics 770
Intel GPU 1 - Utilization: 3.5%
Intel GPU 1 - Temperature: 42°C
GPU DIAGNOSTICS
════════════════════════════════════════════════════════════
Overall Status: ✅ OK - 1 GPU(s) detected and operating normally
GPU Devices Detected: 1
AMD GPU 0 - Name: AMD Radeon RX 6800 XT
AMD GPU 0 - Memory Used: 2048 MiB / 16384 MiB (12.5%)
AMD GPU 0 - Temperature: 52°C
AMD GPU 0 - Power Draw: 85.0W
AMD GPU 0 - Fan Speed: 1200RPM
📋 RECOMMENDATIONS:
1. [LOW] Install AMD GPU monitoring tools
$ # Install: apt-get install radeontop
→ Vendor tools provide the most detailed GPU metrics.
Utilization:
- Normal: 0-80%
- Warning: 80-95%
- Info: > 95% (reported if high load is normal for workload)

Memory usage:
- Normal: 0-90%
- Warning: > 90% (near capacity)

Temperature:
- Normal: < 75°C
- Warning: 75-84°C
- Critical: ≥ 85°C (thermal throttling likely)
Note: Thresholds may vary by GPU model. Laptop GPUs typically run hotter.
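
The threshold lists above could be encoded as simple classifiers. This sketch reads the three groups as utilization, memory usage, and temperature, which is an interpretation. It also assumes that exactly 80% utilization counts as a warning, since the listed ranges overlap at that value. The `Level` enum and function names are hypothetical.

```rust
/// Severity levels used by the threshold lists (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Level {
    Normal,
    Info,
    Warning,
    Critical,
}

/// Utilization: warning from 80% to 95%, info above 95%.
fn classify_utilization(pct: f64) -> Level {
    if pct > 95.0 {
        Level::Info
    } else if pct >= 80.0 {
        Level::Warning
    } else {
        Level::Normal
    }
}

/// Memory usage: warning above 90% of total VRAM.
fn classify_memory(pct: f64) -> Level {
    if pct > 90.0 {
        Level::Warning
    } else {
        Level::Normal
    }
}

/// Temperature: warning from 75 °C, critical from 85 °C.
fn classify_temperature(celsius: f64) -> Level {
    if celsius >= 85.0 {
        Level::Critical
    } else if celsius >= 75.0 {
        Level::Warning
    } else {
        Level::Normal
    }
}

/// Percentage of VRAM in use; `None` when the total is unknown or zero.
fn memory_percent(used_mib: f64, total_mib: f64) -> Option<f64> {
    if total_mib > 0.0 {
        Some(used_mib / total_mib * 100.0)
    } else {
        None
    }
}
```

Applied to the first sample report, 92.0% utilization and 78°C both classify as warnings, while 8924 MiB of 10240 MiB (about 87.1%) stays within the normal memory range, which matches the two findings shown there.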
Cause: No devices found in /sys/class/drm
Solutions:
- Check if GPU is physically installed: `lspci | grep -i vga`
- Verify drivers are loaded: `lsmod | grep -E "nvidia|amdgpu|i915"`
- Install appropriate GPU drivers
- Reboot after driver installation
Cause: nvidia-smi not found or driver issue
Solutions:
- Install NVIDIA drivers and utils: `sudo apt install nvidia-driver-XXX nvidia-utils`
- Verify driver is loaded: `nvidia-smi` (should work in terminal)
- Check if GPU is recognized: `lspci -k | grep -A 3 VGA`
Cause: AMD tools not installed or driver issue
Solutions:
- Install `radeontop`: `sudo apt install radeontop`
- For ROCm GPUs, install ROCm suite
- Check if amdgpu driver is loaded: `lsmod | grep amdgpu`
- Some stats require root access; try with `sudo`
Cause: Uncommon or old GPU not recognized
Solutions:
- Check vendor manually: `lspci -nn | grep VGA`
- The module will still show basic detection info
- Install generic tools like `glxinfo` or `vulkaninfo` for additional info
Some metrics may not be available depending on:
- Driver version: Older drivers may lack certain sysfs entries
- GPU model: Budget models may not report all metrics
- Tool availability: Install vendor-specific tools for complete data
- Permissions: Some stats require root access
# Only show NVIDIA GPUs (if multiple vendors present)
rustwhy gpu --nvidia
# Only show AMD GPUs
rustwhy gpu --amd
# Only show Intel GPUs
rustwhy gpu --intel

Note: Filtering flags are parsed, but all detected GPUs are currently shown. Full filtering is planned for a future release.
# Show which processes are using the GPU
rustwhy gpu --processes

Requirements:
- NVIDIA: `nvidia-smi` (automatically included)
- AMD: ROCm or manual `/proc` inspection
- Intel: Limited support
# Verbose mode shows additional information
rustwhy gpu --verbose

Verbose mode includes:
- Idle GPUs (utilization < 5%)
- Additional device information
- Fallback method notices
- Tool availability warnings
rustwhy gpu --json

Example output:
{
  "module": "gpu",
  "timestamp": "2024-01-15T10:30:45Z",
  "overall_severity": "Ok",
  "summary": "1 GPU(s) detected and operating normally",
  "findings": [],
  "recommendations": [
    {
      "priority": 3,
      "action": "Monitor AMD GPU in real-time",
      "command": "radeontop",
      "explanation": "Use vendor-specific tools for detailed live monitoring."
    }
  ],
  "metrics": [
    {
      "name": "GPU Devices Detected",
      "value": 1,
      "unit": null,
      "threshold": null
    },
    {
      "name": "AMD GPU 0 - Temperature",
      "value": 52,
      "unit": "°C",
      "threshold": {
        "warning": 75.0,
        "critical": 85.0
      }
    }
  ]
}

- First run: May take 1-2 seconds to collect all metrics
- Watch mode: Updates every 2 seconds by default (configurable)
- Vendor tools: Add 0.5-1s overhead per invocation
- Sysfs only: Nearly instantaneous (< 100ms)
- Multi-GPU: Limited support for GPU-specific process attribution
- Laptop GPUs: Battery-related GPU metrics not yet implemented
- Vulkan/OpenGL: No API-level profiling (only driver stats)
- GPU Memory Details: No per-process memory breakdown yet
- Historical Data: No trend analysis (planned for v0.2.0)
- Linux: Full support (primary target)
- Windows: Not supported
- macOS: Not supported (different GPU architecture)
Planned for future releases:
- Per-process GPU memory usage
- GPU encoder/decoder utilization
- PCIe bandwidth monitoring
- Multi-GPU compute distribution analysis
- GPU power limit recommendations
- Thermal throttling detection and alerts
- Integration with `glxinfo` and `vulkaninfo`
- Support for compute/CUDA/ROCm workload analysis
Want to improve GPU support? We welcome contributions!
Areas for improvement:
- Better vendor tool parsing
- Support for additional GPU vendors (Matrox, VIA, etc.)
- Enhanced Intel Arc support
- GPU compute workload detection
- Better multi-GPU handling
See CONTRIBUTING.md for guidelines.