Skip to content

Latest commit

 

History

History
375 lines (303 loc) · 11.5 KB

File metadata and controls

375 lines (303 loc) · 11.5 KB

HAL Domain Documentation

AI-Inference-Managed-by-AI

本文档描述 AIMA 的硬件抽象层(HAL),包括硬件检测和实时监控。

接口定义

MCP 工具

工具 JSON-RPC 方法 功能
hardware.detect hal.detect 检测 GPU/CPU/RAM,返回能力向量
hardware.metrics hal.metrics 实时资源利用率 + 功耗 + 温度

数据结构

HardwareInfo (internal/hal/detect.go)

type HardwareInfo struct {
    GPU *GPUInfo `json:"gpu"`
    CPU *CPUInfo `json:"cpu"`
    RAM *RAMInfo `json:"ram"`
}

type GPUInfo struct {
    Vendor       string  `json:"vendor"`         // nvidia | amd | intel | huawei | mthreads | hygon
    Name         string  `json:"name"`           // NVIDIA GeForce RTX 4090 | AMD Radeon Graphics | ...
    Arch         string  `json:"arch"`           // Blackwell | Ada | RDNA3.5 | Apple | ...
    VRAMMiB      int     `json:"vram_mib"`
    ComputeID    string  `json:"compute_id"`     // "8.9" (NV) | "gfx1151" (AMD) | "metal3" (Apple)
    ComputeUnits int     `json:"compute_units"`  // CUDA cores | stream processors | GPU cores
    DriverVersion string `json:"driver_version"` // "566.36" (NV) | "6.14.0-1020-oem" (AMD kernel)
    SDKVersion   string  `json:"sdk_version"`    // "CUDA 12.7" | "ROCm 7.9.0"
    UnifiedMemory bool   `json:"unified_memory"` // true for APU / shared memory devices
    Count        int     `json:"count"`          // number of GPUs
}

type CPUInfo struct {
    Arch   string  `json:"arch"`   // x86_64 | arm64 | arm
    Cores  int     `json:"cores"`
    FreqGHz float64 `json:"freq_ghz"`
    Model  string  `json:"model"`
}

type RAMInfo struct {
    TotalMiB     int `json:"total_mib"`
    AvailableMiB int `json:"available_mib"`
}

type StorageInfo struct {
    DataDirPath string       `json:"data_dir_path"`
    FreeMiB     int64        `json:"free_mib"`
    TotalMiB    int64        `json:"total_mib"`
    Volumes     []VolumeInfo `json:"volumes,omitempty"`
}

type VolumeInfo struct {
    MountPoint string `json:"mount_point"`
    Device     string `json:"device,omitempty"`
    TotalMiB   int64  `json:"total_mib"`
    FreeMiB    int64  `json:"free_mib"`
}

Metrics (internal/hal/metrics.go)

type Metrics struct {
    GPU *GPUMetrics `json:"gpu,omitempty"`
    CPU *CPUMetrics `json:"cpu,omitempty"`
    RAM *RAMMetrics `json:"ram,omitempty"`
    Power *PowerMetrics `json:"power,omitempty"`
}

type GPUMetrics struct {
    UtilizationPercent float64 `json:"utilization_percent"`
    MemoryUsedMiB     int     `json:"memory_used_mib"`
    MemoryTotalMiB    int     `json:"memory_total_mib"`
    TemperatureC      float64 `json:"temperature_c"`
    PowerDrawWatts    float64 `json:"power_draw_watts"`
}

type CPUMetrics struct {
    UtilizationPercent float64 `json:"utilization_percent"`
}

type RAMMetrics struct {
    UsedMiB  int     `json:"used_mib"`
    TotalMiB int     `json:"total_mib"`
}

type PowerMetrics struct {
    CurrentWatts float64 `json:"current_watts"`
    PowerMode   string  `json:"power_mode"` // 15W | 30W | 60W | 100W
}

硬件检测

检测流程

Detect()
  │
  ├── 1. CPU 检测 (runtime.GOARCH + /proc/cpuinfo 或 sysctl)
  │
  ├── 2. RAM 检测 (/proc/meminfo 或 sysctl)
  │
  ├── 3. GPU 检测 (probe chain, 按优先级尝试)
  │      ├── Hygon DCU (sysfs: /opt/hyhal + DRM uevent DRIVER=hycu) ← 必须在 AMD 之前
  │      ├── nvidia-smi (NVIDIA) → enrichNvidiaGPU
  │      ├── rocm-smi (AMD) → enrichAMDGPU
  │      ├── xpu-smi (Intel)
  │      ├── npu-smi (Huawei)
  │      ├── mthreads-gmi (MThreads)
  │      └── 无 GPU
  │
  ├── 4. 存储检测
  │      ├── AIMA 数据目录磁盘空间 (free + total)
  │      └── 所有挂载卷列表 (listVolumes)
  │          ├── Linux: /proc/mounts 解析
  │          ├── macOS: /Volumes 目录扫描
  │          └── Windows: GetLogicalDrives API
  │
  └── 5. OS 检测 (版本、内核、容器运行时)

GPU Enrichment(二次探测)

主探测只获取基础信息(名称、VRAM、温度等),enrichment 补全 SDK 版本和驱动版本:

Vendor SDKVersion 来源 DriverVersion 来源
NVIDIA nvidia-smi 输出解析 → "CUDA 12.7" nvidia-smi --query-gpu=driver_version
AMD cat /opt/rocm/.info/version"ROCm 7.9.0" modinfo -F version amdgpu,空则 fallback uname -r
Huawei cat .../ascend-toolkit/latest/version.cfg"CANN 8.3.RC1" cat /usr/local/Ascend/driver/version.info"25.3.rc1"
Hygon (sysfs 无 SDK 信息) (sysfs 无驱动版本)
Intel/MThreads (未实现) (未实现)

Huawei Ascend NPU 检测

npu-smi info 输出有两种格式,AIMA 支持双路解析:

  1. JSON 格式 (-j flag):新版 npu-smi 支持,直接 json.Unmarshal
  2. 文本表格 (plain info):旧版 npu-smi(如 25.3.rc1)不支持 -j,AIMA 解析表格结构

表格结构(每 NPU 两行):

| 0       910B1     | OK              | 99.3        50                0    / 0                       |
| 0                 | 0000:C1:00.0    | 0           0    / 0          3453 / 65536                   |
  • Row 1: Cell 0 = 设备号+芯片名, Cell 2 = Power(W)+Temp(C)
  • Row 2: Cell 2 = AICore%+Memory-Usage (HBM used / total)

Enrichment: 从系统文件读取驱动版本和 CANN SDK 版本:

  • /usr/local/Ascend/driver/version.infoVersion=25.3.rc1
  • /usr/local/Ascend/ascend-toolkit/latest/version.cfgversion=8.3.RC1

Hygon DCU 检测(sysfs)

宿主机无 GPU CLI 工具(rocm-smi 仅在容器内),使用 Linux sysfs 检测:

  1. 哨兵检查: /opt/hyhal 目录存在 → 确认 Hygon 平台
  2. DRM 枚举: 遍历 /sys/class/drm/card*/device/uevent → 找 DRIVER=hycu 的卡
  3. VRAM 读取: mem_info_vram_total (bytes → MiB)
  4. 设备映射: PCI_ID=1D94:6320 → Name="BW150", Arch="DCU", ComputeID="DCU-C3000"
  5. 指标采集: mem_info_vram_used + mem_info_vram_total(无 GPU 利用率/温度,sysfs 不提供)

必须在 AMD probe 之前运行:DCU 也暴露 /dev/kfd,AMD probe 会误匹配。

GPU 资源名映射

Vendor 资源名
NVIDIA nvidia.com/gpu
AMD amd.com/gpu
Intel gpu.intel.com/i915
Huawei Ascend (无,通过 hostPath 设备透传 + --runtime ascend)
Hygon DCU (无,通过 hostPath 设备透传)
无 GPU (空字符串)

资源名用于 Pod YAML 生成时的 GPU 资源声明,支持多厂商 GPU。 Hygon DCU 通过 Hardware Profile YAML 的 container.devices 字段透传 /dev/kfd, /dev/mkfd, /dev/dri。 Huawei Ascend 通过 Hardware Profile YAML 的 container.devices 透传 /dev/davinci* 设备, 并通过 container.docker_runtime: ascend 使用华为 Ascend Docker Runtime。


配置解析桥接

HAL 检测的数据通过 buildHardwareInfo() 映射到 knowledge.HardwareInfo,用于配置解析:

hal.Detect()                             knowledge.HardwareInfo
  GPU.Arch          ──→                    GPUArch
  GPU.VRAMMiB       ──→                    GPUVRAMMiB        (静态层: VRAM 过滤)
  GPU.Count         ──→                    GPUCount
  GPU.UnifiedMemory ──→                    UnifiedMemory     (静态层: 统一显存过滤)
  CPU.Arch          ──→                    CPUArch
  CPU.Cores         ──→                    CPUCores
  RAM.TotalMiB      ──→                    RAMTotalMiB
  RAM.AvailableMiB  ──→                    RAMAvailMiB

hal.CollectMetrics()
  GPU.MemoryUsedMiB ──→                    GPUMemUsedMiB     (动态层: CheckFit)
  GPU.MemoryTotalMiB - MemoryUsedMiB ──→   GPUMemFreeMiB     (动态层: CheckFit)

所有新字段零值 = "未知" = 跳过检查(graceful degradation)。 hal.Detect()hal.CollectMetrics() 失败时不阻止部署,仅降级到 gpu_arch-only 匹配。


实时监控

nvidia-smi 解析

func queryGPUMetrics() (*GPUMetrics, error) {
    out, err := exec.Command("nvidia-smi",
        "--query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw",
        "--format=csv,noheader,nounits").Output()
    // 解析 CSV 行: "65, 8192, 16384, 72, 72.5"
    // ...
}

功耗模式

支持的功耗模式从 Hardware Profile 读取:

constraints:
  power_modes: [15W, 30W, 60W, 100W]

使用示例

检测硬件

./aima hal detect

# NVIDIA 输出示例 (dev-win)
{
  "gpu": {
    "vendor": "nvidia",
    "name": "NVIDIA GeForce RTX 4060 Laptop GPU",
    "arch": "Ada",
    "vram_mib": 8188,
    "compute_id": "8.9",
    "driver_version": "566.36",
    "sdk_version": "CUDA 12.7",
    "power_draw_watts": 17.41,
    "power_limit_watts": 75,
    "temperature_celsius": 57,
    "unified_memory": false,
    "count": 1
  },
  "cpu": { "arch": "amd64", "model": "13th Gen Intel(R) Core(TM) i9-13980HX", "cores": 24 },
  "ram": { "total_mib": 32388, "available_mib": 10054 }
}

# AMD 输出示例 (amd395)
{
  "gpu": {
    "vendor": "amd",
    "name": "AMD Radeon Graphics",
    "arch": "RDNA3.5",
    "vram_mib": 65536,
    "compute_id": "gfx1151",
    "driver_version": "6.14.0-1020-oem",
    "sdk_version": "ROCm 7.9.0",
    "power_draw_watts": 13.03,
    "temperature_celsius": 51,
    "unified_memory": true,
    "count": 1
  },
  "cpu": { "arch": "amd64", "model": "AMD RYZEN AI MAX+ 395 w/ Radeon 8060S", "cores": 16 },
  "ram": { "total_mib": 63937, "available_mib": 56926 }
}

# Hygon DCU 输出示例 (hygon)
{
  "gpu": {
    "vendor": "hygon",
    "name": "BW150",
    "arch": "DCU",
    "vram_mib": 65520,
    "compute_id": "DCU-C3000",
    "unified_memory": false,
    "count": 8
  },
  "cpu": { "arch": "amd64", "model": "Hygon C86-4G (OPN:7470)", "cores": 48 },
  "ram": { "total_mib": 769657, "available_mib": 740926 }
}

# Huawei Ascend 输出示例 (qjq2)
{
  "gpu": {
    "vendor": "huawei",
    "name": "910B1",
    "arch": "Ascend910B",
    "vram_mib": 65536,
    "driver_version": "25.3.rc1",
    "sdk_version": "CANN 8.3.RC1",
    "power_draw_watts": 99.3,
    "temperature_celsius": 50,
    "unified_memory": false,
    "count": 8
  },
  "cpu": { "arch": "arm64", "model": "Kunpeng-920", "cores": 192 },
  "ram": { "total_mib": 1572864, "available_mib": 1520000 }
}

查询实时指标

./aima hal metrics

# 输出示例
{
  "gpu": {
    "utilization_percent": 65.2,
    "memory_used_mib": 8192,
    "memory_total_mib": 24576,
    "temperature_c": 72.0,
    "power_draw_watts": 350.5
  },
  "cpu": {
    "utilization_percent": 45.0
  },
  "ram": {
    "used_mib": 32768,
    "total_mib": 131072
  },
  "power": {
    "current_watts": 450.0,
    "power_mode": "100W"
  }
}

相关文件

  • internal/hal/detect.go - 硬件检测实现
  • internal/hal/hal.go - 数据结构定义(HardwareInfo, StorageInfo, VolumeInfo 等)
  • internal/hal/gpu.go - 多厂商 GPU 探测链 + enrichment(NVIDIA/AMD/Intel/Huawei/MThreads)
  • internal/hal/disk_linux.go - Linux 磁盘/卷检测
  • internal/hal/disk_darwin.go - macOS 磁盘/卷检测
  • internal/hal/disk_windows.go - Windows 磁盘/卷检测
  • internal/hal/metrics.go - 实时监控实现

防御性超时

execRunner.Run() 为每个外部命令(nvidia-smi, rocm-smi, npu-smi 等)施加 10 秒超时, 防止单个工具挂起阻塞整个检测流程。


最后更新:2026-03-04 (检测命令 10s 防御性超时)