HAL Domain Documentation

AI-Inference-Managed-by-AI

本文档描述 AIMA 的硬件抽象层（HAL），包括硬件检测和实时监控。

接口定义

MCP 工具

工具	JSON-RPC 方法	功能
`hardware.detect`	`hal.detect`	检测 GPU/CPU/RAM，返回能力向量
`hardware.metrics`	`hal.metrics`	实时资源利用率 + 功耗 + 温度

数据结构

HardwareInfo (internal/hal/detect.go)

type HardwareInfo struct {
    GPU *GPUInfo `json:"gpu"`
    CPU *CPUInfo `json:"cpu"`
    RAM *RAMInfo `json:"ram"`
}

type GPUInfo struct {
    Vendor       string  `json:"vendor"`         // nvidia | amd | intel | huawei | mthreads | hygon
    Name         string  `json:"name"`           // NVIDIA GeForce RTX 4090 | AMD Radeon Graphics | ...
    Arch         string  `json:"arch"`           // Blackwell | Ada | RDNA3.5 | Apple | ...
    VRAMMiB      int     `json:"vram_mib"`
    ComputeID    string  `json:"compute_id"`     // "8.9" (NV) | "gfx1151" (AMD) | "metal3" (Apple)
    ComputeUnits int     `json:"compute_units"`  // CUDA cores | stream processors | GPU cores
    DriverVersion string `json:"driver_version"` // "566.36" (NV) | "6.14.0-1020-oem" (AMD kernel)
    SDKVersion   string  `json:"sdk_version"`    // "CUDA 12.7" | "ROCm 7.9.0"
    UnifiedMemory bool   `json:"unified_memory"` // true for APU / shared memory devices
    Count        int     `json:"count"`          // number of GPUs
}

type CPUInfo struct {
    Arch   string  `json:"arch"`   // x86_64 | arm64 | arm
    Cores  int     `json:"cores"`
    FreqGHz float64 `json:"freq_ghz"`
    Model  string  `json:"model"`
}

type RAMInfo struct {
    TotalMiB     int `json:"total_mib"`
    AvailableMiB int `json:"available_mib"`
}

type StorageInfo struct {
    DataDirPath string       `json:"data_dir_path"`
    FreeMiB     int64        `json:"free_mib"`
    TotalMiB    int64        `json:"total_mib"`
    Volumes     []VolumeInfo `json:"volumes,omitempty"`
}

type VolumeInfo struct {
    MountPoint string `json:"mount_point"`
    Device     string `json:"device,omitempty"`
    TotalMiB   int64  `json:"total_mib"`
    FreeMiB    int64  `json:"free_mib"`
}

Metrics (internal/hal/metrics.go)

type Metrics struct {
    GPU *GPUMetrics `json:"gpu,omitempty"`
    CPU *CPUMetrics `json:"cpu,omitempty"`
    RAM *RAMMetrics `json:"ram,omitempty"`
    Power *PowerMetrics `json:"power,omitempty"`
}

type GPUMetrics struct {
    UtilizationPercent float64 `json:"utilization_percent"`
    MemoryUsedMiB     int     `json:"memory_used_mib"`
    MemoryTotalMiB    int     `json:"memory_total_mib"`
    TemperatureC      float64 `json:"temperature_c"`
    PowerDrawWatts    float64 `json:"power_draw_watts"`
}

type CPUMetrics struct {
    UtilizationPercent float64 `json:"utilization_percent"`
}

type RAMMetrics struct {
    UsedMiB  int     `json:"used_mib"`
    TotalMiB int     `json:"total_mib"`
}

type PowerMetrics struct {
    CurrentWatts float64 `json:"current_watts"`
    PowerMode   string  `json:"power_mode"` // 15W | 30W | 60W | 100W
}

硬件检测

检测流程

Detect()
  │
  ├── 1. CPU 检测 (runtime.GOARCH + /proc/cpuinfo 或 sysctl)
  │
  ├── 2. RAM 检测 (/proc/meminfo 或 sysctl)
  │
  ├── 3. GPU 检测 (probe chain, 按优先级尝试)
  │      ├── Hygon DCU (sysfs: /opt/hyhal + DRM uevent DRIVER=hycu) ← 必须在 AMD 之前
  │      ├── nvidia-smi (NVIDIA) → enrichNvidiaGPU
  │      ├── rocm-smi (AMD) → enrichAMDGPU
  │      ├── xpu-smi (Intel)
  │      ├── npu-smi (Huawei)
  │      ├── mthreads-gmi (MThreads)
  │      └── 无 GPU
  │
  ├── 4. 存储检测
  │      ├── AIMA 数据目录磁盘空间 (free + total)
  │      └── 所有挂载卷列表 (listVolumes)
  │          ├── Linux: /proc/mounts 解析
  │          ├── macOS: /Volumes 目录扫描
  │          └── Windows: GetLogicalDrives API
  │
  └── 5. OS 检测 (版本、内核、容器运行时)

GPU Enrichment（二次探测）

主探测只获取基础信息（名称、VRAM、温度等），enrichment 补全 SDK 版本和驱动版本：

Vendor	SDKVersion 来源	DriverVersion 来源
NVIDIA	`nvidia-smi` 输出解析 → `"CUDA 12.7"`	`nvidia-smi --query-gpu=driver_version`
AMD	`cat /opt/rocm/.info/version` → `"ROCm 7.9.0"`	`modinfo -F version amdgpu`，空则 fallback `uname -r`
Huawei	`cat .../ascend-toolkit/latest/version.cfg` → `"CANN 8.3.RC1"`	`cat /usr/local/Ascend/driver/version.info` → `"25.3.rc1"`
Hygon	(sysfs 无 SDK 信息)	(sysfs 无驱动版本)
Intel/MThreads	(未实现)	(未实现)

Huawei Ascend NPU 检测

npu-smi info 输出有两种格式，AIMA 支持双路解析：

JSON 格式 (-j flag)：新版 npu-smi 支持，直接 json.Unmarshal
文本表格 (plain info)：旧版 npu-smi（如 25.3.rc1）不支持 -j，AIMA 解析表格结构

表格结构（每 NPU 两行）：

| 0       910B1     | OK              | 99.3        50                0    / 0                       |
| 0                 | 0000:C1:00.0    | 0           0    / 0          3453 / 65536                   |

Row 1: Cell 0 = 设备号+芯片名, Cell 2 = Power(W)+Temp(C)
Row 2: Cell 2 = AICore%+Memory-Usage (HBM used / total)

Enrichment: 从系统文件读取驱动版本和 CANN SDK 版本：

/usr/local/Ascend/driver/version.info → Version=25.3.rc1
/usr/local/Ascend/ascend-toolkit/latest/version.cfg → version=8.3.RC1

Hygon DCU 检测（sysfs）

宿主机无 GPU CLI 工具（rocm-smi 仅在容器内），使用 Linux sysfs 检测：

哨兵检查: /opt/hyhal 目录存在 → 确认 Hygon 平台
DRM 枚举: 遍历 /sys/class/drm/card*/device/uevent → 找 DRIVER=hycu 的卡
VRAM 读取: mem_info_vram_total (bytes → MiB)
设备映射: PCI_ID=1D94:6320 → Name="BW150", Arch="DCU", ComputeID="DCU-C3000"
指标采集: mem_info_vram_used + mem_info_vram_total（无 GPU 利用率/温度，sysfs 不提供）

必须在 AMD probe 之前运行：DCU 也暴露 /dev/kfd，AMD probe 会误匹配。

GPU 资源名映射

Vendor	资源名
NVIDIA	`nvidia.com/gpu`
AMD	`amd.com/gpu`
Intel	`gpu.intel.com/i915`
Huawei Ascend	(无，通过 hostPath 设备透传 + `--runtime ascend`)
Hygon DCU	(无，通过 hostPath 设备透传)
无 GPU	(空字符串)

资源名用于 Pod YAML 生成时的 GPU 资源声明，支持多厂商 GPU。 Hygon DCU 通过 Hardware Profile YAML 的 container.devices 字段透传 /dev/kfd, /dev/mkfd, /dev/dri。 Huawei Ascend 通过 Hardware Profile YAML 的 container.devices 透传 /dev/davinci* 设备，并通过 container.docker_runtime: ascend 使用华为 Ascend Docker Runtime。

配置解析桥接

HAL 检测的数据通过 buildHardwareInfo() 映射到 knowledge.HardwareInfo，用于配置解析：

hal.Detect()                             knowledge.HardwareInfo
  GPU.Arch          ──→                    GPUArch
  GPU.VRAMMiB       ──→                    GPUVRAMMiB        (静态层: VRAM 过滤)
  GPU.Count         ──→                    GPUCount
  GPU.UnifiedMemory ──→                    UnifiedMemory     (静态层: 统一显存过滤)
  CPU.Arch          ──→                    CPUArch
  CPU.Cores         ──→                    CPUCores
  RAM.TotalMiB      ──→                    RAMTotalMiB
  RAM.AvailableMiB  ──→                    RAMAvailMiB

hal.CollectMetrics()
  GPU.MemoryUsedMiB ──→                    GPUMemUsedMiB     (动态层: CheckFit)
  GPU.MemoryTotalMiB - MemoryUsedMiB ──→   GPUMemFreeMiB     (动态层: CheckFit)

所有新字段零值 = "未知" = 跳过检查（graceful degradation）。 hal.Detect() 或 hal.CollectMetrics() 失败时不阻止部署，仅降级到 gpu_arch-only 匹配。

实时监控

nvidia-smi 解析

func queryGPUMetrics() (*GPUMetrics, error) {
    out, err := exec.Command("nvidia-smi",
        "--query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw",
        "--format=csv,noheader,nounits").Output()
    // 解析 CSV 行: "65, 8192, 16384, 72, 72.5"
    // ...
}

功耗模式

支持的功耗模式从 Hardware Profile 读取：

constraints:
  power_modes: [15W, 30W, 60W, 100W]

使用示例

检测硬件

./aima hal detect

# NVIDIA 输出示例 (dev-win)
{
  "gpu": {
    "vendor": "nvidia",
    "name": "NVIDIA GeForce RTX 4060 Laptop GPU",
    "arch": "Ada",
    "vram_mib": 8188,
    "compute_id": "8.9",
    "driver_version": "566.36",
    "sdk_version": "CUDA 12.7",
    "power_draw_watts": 17.41,
    "power_limit_watts": 75,
    "temperature_celsius": 57,
    "unified_memory": false,
    "count": 1
  },
  "cpu": { "arch": "amd64", "model": "13th Gen Intel(R) Core(TM) i9-13980HX", "cores": 24 },
  "ram": { "total_mib": 32388, "available_mib": 10054 }
}

# AMD 输出示例 (amd395)
{
  "gpu": {
    "vendor": "amd",
    "name": "AMD Radeon Graphics",
    "arch": "RDNA3.5",
    "vram_mib": 65536,
    "compute_id": "gfx1151",
    "driver_version": "6.14.0-1020-oem",
    "sdk_version": "ROCm 7.9.0",
    "power_draw_watts": 13.03,
    "temperature_celsius": 51,
    "unified_memory": true,
    "count": 1
  },
  "cpu": { "arch": "amd64", "model": "AMD RYZEN AI MAX+ 395 w/ Radeon 8060S", "cores": 16 },
  "ram": { "total_mib": 63937, "available_mib": 56926 }
}

# Hygon DCU 输出示例 (hygon)
{
  "gpu": {
    "vendor": "hygon",
    "name": "BW150",
    "arch": "DCU",
    "vram_mib": 65520,
    "compute_id": "DCU-C3000",
    "unified_memory": false,
    "count": 8
  },
  "cpu": { "arch": "amd64", "model": "Hygon C86-4G (OPN:7470)", "cores": 48 },
  "ram": { "total_mib": 769657, "available_mib": 740926 }
}

# Huawei Ascend 输出示例 (qjq2)
{
  "gpu": {
    "vendor": "huawei",
    "name": "910B1",
    "arch": "Ascend910B",
    "vram_mib": 65536,
    "driver_version": "25.3.rc1",
    "sdk_version": "CANN 8.3.RC1",
    "power_draw_watts": 99.3,
    "temperature_celsius": 50,
    "unified_memory": false,
    "count": 8
  },
  "cpu": { "arch": "arm64", "model": "Kunpeng-920", "cores": 192 },
  "ram": { "total_mib": 1572864, "available_mib": 1520000 }
}

查询实时指标

./aima hal metrics

# 输出示例
{
  "gpu": {
    "utilization_percent": 65.2,
    "memory_used_mib": 8192,
    "memory_total_mib": 24576,
    "temperature_c": 72.0,
    "power_draw_watts": 350.5
  },
  "cpu": {
    "utilization_percent": 45.0
  },
  "ram": {
    "used_mib": 32768,
    "total_mib": 131072
  },
  "power": {
    "current_watts": 450.0,
    "power_mode": "100W"
  }
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HAL Domain Documentation

接口定义

MCP 工具

数据结构

HardwareInfo (internal/hal/detect.go)

Metrics (internal/hal/metrics.go)

硬件检测

检测流程

GPU Enrichment（二次探测）

Huawei Ascend NPU 检测

Hygon DCU 检测（sysfs）

GPU 资源名映射

配置解析桥接

实时监控

nvidia-smi 解析

功耗模式

使用示例

检测硬件

查询实时指标

相关文件

防御性超时

FilesExpand file tree

hal.md

Latest commit

History

hal.md

File metadata and controls

HAL Domain Documentation

接口定义

MCP 工具

数据结构

HardwareInfo (internal/hal/detect.go)

Metrics (internal/hal/metrics.go)

硬件检测

检测流程

GPU Enrichment（二次探测）

Huawei Ascend NPU 检测

Hygon DCU 检测（sysfs）

GPU 资源名映射

配置解析桥接

实时监控

nvidia-smi 解析

功耗模式

使用示例

检测硬件

查询实时指标

相关文件

防御性超时