AI-Inference-Managed-by-AI
本文档描述 AIMA 的硬件抽象层(HAL),包括硬件检测和实时监控。
| 工具 | JSON-RPC 方法 | 功能 |
|---|---|---|
hardware.detect |
hal.detect |
检测 GPU/CPU/RAM,返回能力向量 |
hardware.metrics |
hal.metrics |
实时资源利用率 + 功耗 + 温度 |
type HardwareInfo struct {
GPU *GPUInfo `json:"gpu"`
CPU *CPUInfo `json:"cpu"`
RAM *RAMInfo `json:"ram"`
}
type GPUInfo struct {
Vendor string `json:"vendor"` // nvidia | amd | intel | huawei | mthreads | hygon
Name string `json:"name"` // NVIDIA GeForce RTX 4090 | AMD Radeon Graphics | ...
Arch string `json:"arch"` // Blackwell | Ada | RDNA3.5 | Apple | ...
VRAMMiB int `json:"vram_mib"`
ComputeID string `json:"compute_id"` // "8.9" (NV) | "gfx1151" (AMD) | "metal3" (Apple)
ComputeUnits int `json:"compute_units"` // CUDA cores | stream processors | GPU cores
DriverVersion string `json:"driver_version"` // "566.36" (NV) | "6.14.0-1020-oem" (AMD kernel)
SDKVersion string `json:"sdk_version"` // "CUDA 12.7" | "ROCm 7.9.0"
UnifiedMemory bool `json:"unified_memory"` // true for APU / shared memory devices
Count int `json:"count"` // number of GPUs
}
type CPUInfo struct {
Arch string `json:"arch"` // x86_64 | arm64 | arm
Cores int `json:"cores"`
FreqGHz float64 `json:"freq_ghz"`
Model string `json:"model"`
}
type RAMInfo struct {
TotalMiB int `json:"total_mib"`
AvailableMiB int `json:"available_mib"`
}
type StorageInfo struct {
DataDirPath string `json:"data_dir_path"`
FreeMiB int64 `json:"free_mib"`
TotalMiB int64 `json:"total_mib"`
Volumes []VolumeInfo `json:"volumes,omitempty"`
}
type VolumeInfo struct {
MountPoint string `json:"mount_point"`
Device string `json:"device,omitempty"`
TotalMiB int64 `json:"total_mib"`
FreeMiB int64 `json:"free_mib"`
}type Metrics struct {
GPU *GPUMetrics `json:"gpu,omitempty"`
CPU *CPUMetrics `json:"cpu,omitempty"`
RAM *RAMMetrics `json:"ram,omitempty"`
Power *PowerMetrics `json:"power,omitempty"`
}
type GPUMetrics struct {
UtilizationPercent float64 `json:"utilization_percent"`
MemoryUsedMiB int `json:"memory_used_mib"`
MemoryTotalMiB int `json:"memory_total_mib"`
TemperatureC float64 `json:"temperature_c"`
PowerDrawWatts float64 `json:"power_draw_watts"`
}
type CPUMetrics struct {
UtilizationPercent float64 `json:"utilization_percent"`
}
type RAMMetrics struct {
UsedMiB int `json:"used_mib"`
TotalMiB int `json:"total_mib"`
}
type PowerMetrics struct {
CurrentWatts float64 `json:"current_watts"`
PowerMode string `json:"power_mode"` // 15W | 30W | 60W | 100W
}Detect()
│
├── 1. CPU 检测 (runtime.GOARCH + /proc/cpuinfo 或 sysctl)
│
├── 2. RAM 检测 (/proc/meminfo 或 sysctl)
│
├── 3. GPU 检测 (probe chain, 按优先级尝试)
│ ├── Hygon DCU (sysfs: /opt/hyhal + DRM uevent DRIVER=hycu) ← 必须在 AMD 之前
│ ├── nvidia-smi (NVIDIA) → enrichNvidiaGPU
│ ├── rocm-smi (AMD) → enrichAMDGPU
│ ├── xpu-smi (Intel)
│ ├── npu-smi (Huawei)
│ ├── mthreads-gmi (MThreads)
│ └── 无 GPU
│
├── 4. 存储检测
│ ├── AIMA 数据目录磁盘空间 (free + total)
│ └── 所有挂载卷列表 (listVolumes)
│ ├── Linux: /proc/mounts 解析
│ ├── macOS: /Volumes 目录扫描
│ └── Windows: GetLogicalDrives API
│
└── 5. OS 检测 (版本、内核、容器运行时)
主探测只获取基础信息(名称、VRAM、温度等),enrichment 补全 SDK 版本和驱动版本:
| Vendor | SDKVersion 来源 | DriverVersion 来源 |
|---|---|---|
| NVIDIA | nvidia-smi 输出解析 → "CUDA 12.7" |
nvidia-smi --query-gpu=driver_version |
| AMD | cat /opt/rocm/.info/version → "ROCm 7.9.0" |
modinfo -F version amdgpu,空则 fallback uname -r |
| Huawei | cat .../ascend-toolkit/latest/version.cfg → "CANN 8.3.RC1" |
cat /usr/local/Ascend/driver/version.info → "25.3.rc1" |
| Hygon | (sysfs 无 SDK 信息) | (sysfs 无驱动版本) |
| Intel/MThreads | (未实现) | (未实现) |
npu-smi info 输出有两种格式,AIMA 支持双路解析:
- JSON 格式 (
-jflag):新版 npu-smi 支持,直接json.Unmarshal - 文本表格 (plain
info):旧版 npu-smi(如 25.3.rc1)不支持-j,AIMA 解析表格结构
表格结构(每 NPU 两行):
| 0 910B1 | OK | 99.3 50 0 / 0 |
| 0 | 0000:C1:00.0 | 0 0 / 0 3453 / 65536 |
- Row 1: Cell 0 = 设备号+芯片名, Cell 2 = Power(W)+Temp(C)
- Row 2: Cell 2 = AICore%+Memory-Usage (HBM used / total)
Enrichment: 从系统文件读取驱动版本和 CANN SDK 版本:
/usr/local/Ascend/driver/version.info→Version=25.3.rc1/usr/local/Ascend/ascend-toolkit/latest/version.cfg→version=8.3.RC1
宿主机无 GPU CLI 工具(rocm-smi 仅在容器内),使用 Linux sysfs 检测:
- 哨兵检查:
/opt/hyhal目录存在 → 确认 Hygon 平台 - DRM 枚举: 遍历
/sys/class/drm/card*/device/uevent→ 找DRIVER=hycu的卡 - VRAM 读取:
mem_info_vram_total(bytes → MiB) - 设备映射:
PCI_ID=1D94:6320→ Name="BW150", Arch="DCU", ComputeID="DCU-C3000" - 指标采集:
mem_info_vram_used+mem_info_vram_total(无 GPU 利用率/温度,sysfs 不提供)
必须在 AMD probe 之前运行:DCU 也暴露 /dev/kfd,AMD probe 会误匹配。
| Vendor | 资源名 |
|---|---|
| NVIDIA | nvidia.com/gpu |
| AMD | amd.com/gpu |
| Intel | gpu.intel.com/i915 |
| Huawei Ascend | (无,通过 hostPath 设备透传 + --runtime ascend) |
| Hygon DCU | (无,通过 hostPath 设备透传) |
| 无 GPU | (空字符串) |
资源名用于 Pod YAML 生成时的 GPU 资源声明,支持多厂商 GPU。
Hygon DCU 通过 Hardware Profile YAML 的 container.devices 字段透传 /dev/kfd, /dev/mkfd, /dev/dri。
Huawei Ascend 通过 Hardware Profile YAML 的 container.devices 透传 /dev/davinci* 设备,
并通过 container.docker_runtime: ascend 使用华为 Ascend Docker Runtime。
HAL 检测的数据通过 buildHardwareInfo() 映射到 knowledge.HardwareInfo,用于配置解析:
hal.Detect() knowledge.HardwareInfo
GPU.Arch ──→ GPUArch
GPU.VRAMMiB ──→ GPUVRAMMiB (静态层: VRAM 过滤)
GPU.Count ──→ GPUCount
GPU.UnifiedMemory ──→ UnifiedMemory (静态层: 统一显存过滤)
CPU.Arch ──→ CPUArch
CPU.Cores ──→ CPUCores
RAM.TotalMiB ──→ RAMTotalMiB
RAM.AvailableMiB ──→ RAMAvailMiB
hal.CollectMetrics()
GPU.MemoryUsedMiB ──→ GPUMemUsedMiB (动态层: CheckFit)
GPU.MemoryTotalMiB - MemoryUsedMiB ──→ GPUMemFreeMiB (动态层: CheckFit)
所有新字段零值 = "未知" = 跳过检查(graceful degradation)。
hal.Detect() 或 hal.CollectMetrics() 失败时不阻止部署,仅降级到 gpu_arch-only 匹配。
func queryGPUMetrics() (*GPUMetrics, error) {
out, err := exec.Command("nvidia-smi",
"--query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw",
"--format=csv,noheader,nounits").Output()
// 解析 CSV 行: "65, 8192, 16384, 72, 72.5"
// ...
}支持的功耗模式从 Hardware Profile 读取:
constraints:
power_modes: [15W, 30W, 60W, 100W]./aima hal detect
# NVIDIA 输出示例 (dev-win)
{
"gpu": {
"vendor": "nvidia",
"name": "NVIDIA GeForce RTX 4060 Laptop GPU",
"arch": "Ada",
"vram_mib": 8188,
"compute_id": "8.9",
"driver_version": "566.36",
"sdk_version": "CUDA 12.7",
"power_draw_watts": 17.41,
"power_limit_watts": 75,
"temperature_celsius": 57,
"unified_memory": false,
"count": 1
},
"cpu": { "arch": "amd64", "model": "13th Gen Intel(R) Core(TM) i9-13980HX", "cores": 24 },
"ram": { "total_mib": 32388, "available_mib": 10054 }
}
# AMD 输出示例 (amd395)
{
"gpu": {
"vendor": "amd",
"name": "AMD Radeon Graphics",
"arch": "RDNA3.5",
"vram_mib": 65536,
"compute_id": "gfx1151",
"driver_version": "6.14.0-1020-oem",
"sdk_version": "ROCm 7.9.0",
"power_draw_watts": 13.03,
"temperature_celsius": 51,
"unified_memory": true,
"count": 1
},
"cpu": { "arch": "amd64", "model": "AMD RYZEN AI MAX+ 395 w/ Radeon 8060S", "cores": 16 },
"ram": { "total_mib": 63937, "available_mib": 56926 }
}
# Hygon DCU 输出示例 (hygon)
{
"gpu": {
"vendor": "hygon",
"name": "BW150",
"arch": "DCU",
"vram_mib": 65520,
"compute_id": "DCU-C3000",
"unified_memory": false,
"count": 8
},
"cpu": { "arch": "amd64", "model": "Hygon C86-4G (OPN:7470)", "cores": 48 },
"ram": { "total_mib": 769657, "available_mib": 740926 }
}
# Huawei Ascend 输出示例 (qjq2)
{
"gpu": {
"vendor": "huawei",
"name": "910B1",
"arch": "Ascend910B",
"vram_mib": 65536,
"driver_version": "25.3.rc1",
"sdk_version": "CANN 8.3.RC1",
"power_draw_watts": 99.3,
"temperature_celsius": 50,
"unified_memory": false,
"count": 8
},
"cpu": { "arch": "arm64", "model": "Kunpeng-920", "cores": 192 },
"ram": { "total_mib": 1572864, "available_mib": 1520000 }
}./aima hal metrics
# 输出示例
{
"gpu": {
"utilization_percent": 65.2,
"memory_used_mib": 8192,
"memory_total_mib": 24576,
"temperature_c": 72.0,
"power_draw_watts": 350.5
},
"cpu": {
"utilization_percent": 45.0
},
"ram": {
"used_mib": 32768,
"total_mib": 131072
},
"power": {
"current_watts": 450.0,
"power_mode": "100W"
}
}internal/hal/detect.go- 硬件检测实现internal/hal/hal.go- 数据结构定义(HardwareInfo, StorageInfo, VolumeInfo 等)internal/hal/gpu.go- 多厂商 GPU 探测链 + enrichment(NVIDIA/AMD/Intel/Huawei/MThreads)internal/hal/disk_linux.go- Linux 磁盘/卷检测internal/hal/disk_darwin.go- macOS 磁盘/卷检测internal/hal/disk_windows.go- Windows 磁盘/卷检测internal/hal/metrics.go- 实时监控实现
execRunner.Run() 为每个外部命令(nvidia-smi, rocm-smi, npu-smi 等)施加 10 秒超时,
防止单个工具挂起阻塞整个检测流程。
最后更新:2026-03-04 (检测命令 10s 防御性超时)