gpustack
diff --git a/.claude/skills/gpustack-operator-xbuild-and-verify/SKILL.md b/.claude/skills/gpustack-operator-xbuild-and-verify/SKILL.md
Lines changed: 121 additions & 0 deletions
diff --git a/.claude/skills/gpustack-operator-xbuild-and-verify/cases/ascend-case-1.sh b/.claude/skills/gpustack-operator-xbuild-and-verify/cases/ascend-case-1.sh
Lines changed: 71 additions & 0 deletions
diff --git a/.claude/skills/gpustack-operator-xbuild-and-verify/cases/ascend-case-2.sh b/.claude/skills/gpustack-operator-xbuild-and-verify/cases/ascend-case-2.sh
Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,121 @@
+---
+name: gpustack-operator-xbuild-and-verify
+description: "Build and verify the GPUStack Operator's accelerator soft-slicing **builder stages** (`xbuild-ascend-cann-*` and `xbuild-nvidia-cuda-*` in `pack/gpustack-operator/Dockerfile`) end to end, either on the local docker host or on a remote accelerator host over ssh. Builds one stage via buildx `--target`, then runs numbered cases against the produced runtime. SCOPE — two backends: **Ascend (vcann-rt: `libvruntime.so` + `enpu-monitor`)** and **NVIDIA (HAMi-core: `libvgpu.so`)**. Ascend cases: (1) artifacts+linking [no NPU], (2) inject + `enpu-monitor`, (3) memory-quota enforcement. NVIDIA cases: (1) artifacts+linking [no GPU], (2) single-card inject + `nvidia-smi`/SM-limit, (3) multi-card per-device limits. The hardware cases need a real accelerator. Proactively offer this whenever a branch changes the Docker build flow — `pack/gpustack-operator/Dockerfile` or `pack/gpustack-operator/external/(ascend|nvidia)/**`. Examples: \"verify my Dockerfile build-stage change\", \"did the vcann-rt / HAMi-core build still link\", \"test the soft-slicing build on the 910B / 4090 host\", \"does enpu-monitor still work in a container\", \"does nvidia-smi show the sliced memory\", \"prove the memory slice is enforced on real hardware\"."
+allowed-tools: "Read, AskUserQuestion, Bash(bash .claude/skills/gpustack-operator-xbuild-and-verify/scripts/preflight.sh*), Bash(bash .claude/skills/gpustack-operator-xbuild-and-verify/scripts/build.sh*), Bash(bash .claude/skills/gpustack-operator-xbuild-and-verify/cases/ascend-case-1.sh*), Bash(bash .claude/skills/gpustack-operator-xbuild-and-verify/cases/ascend-case-2.sh*), Bash(bash .claude/skills/gpustack-operator-xbuild-and-verify/cases/ascend-case-3.sh*), Bash(bash .claude/skills/gpustack-operator-xbuild-and-verify/cases/nvidia-case-1.sh*), Bash(bash .claude/skills/gpustack-operator-xbuild-and-verify/cases/nvidia-case-2.sh*), Bash(bash .claude/skills/gpustack-operator-xbuild-and-verify/cases/nvidia-case-3.sh*), Bash(grep*), Bash(git diff*), Bash(git rev-parse*), Bash(ssh*), Bash(docker buildx*), Bash(docker images*), Bash(docker info*), Bash(command -v*)"
+model: sonnet
+---
+
+# GPUStack Operator — accelerator xbuild & verify
+
+Build one soft-slicing builder stage from `pack/gpustack-operator/Dockerfile` and verify the runtime it
+produces, on the local docker host or on a remote accelerator host over ssh. Two backends:
+
+- **Ascend (vcann-rt).** `xbuild-ascend-cann-*` → `libvruntime.so` + `enpu-monitor`. Verifies artifacts/
+  linking, the `npu_info.config` injection, and **real memory-quota enforcement** on a real NPU.
+- **NVIDIA (HAMi-core).** `xbuild-nvidia-cuda-*` → `libvgpu.so`. Verifies artifacts/linking, single-card
+  injection (`nvidia-smi` shows the sliced VRAM + the SM/compute limit is applied), and **multi-card
+  per-device limits** on real GPUs.
+
+It is the build+runtime-contract counterpart to the cluster-level `gpustack-operator-e2e` (scheduling
+chain) and `gpustack-operator-chart-e2e` (chart). This is an evolving, e2e-style skill: extend the cases as
+the build flow grows.
+
+## When to offer it
+Proactively suggest this skill when a branch changes the Docker build flow:
+```bash
+git diff --name-only origin/main...HEAD | grep -E 'pack/gpustack-operator/(Dockerfile|external/(ascend|nvidia)/)'
+```
+
+## Runner model (local or remote)
+All scripts source `scripts/lib.sh` and run through one runner, selected by env:
+- `XB_MODE=local` — build & verify on this host.
+- `XB_MODE=ssh XB_HOST=user@host` — build & verify on a remote host. Files move via base64-over-ssh
+  (never scp, because a login banner corrupts the transfer); the remote login banner is also filtered from output.
+
+The remote host is **never hardcoded** — always ask the user for it.
+
+## Hard rules
+- **Never push images** — builds use `buildx --load` into the local/remote docker store only.
+- **Confirm before any remote build or container run** (they consume the host's accelerator/driver).
+  Preflight and the build-artifact case (ASCEND-CASE 1 / NVIDIA-CASE 1) are safe once the user names the target.
+- Touch only what the skill creates (the `vcann-build:*` / `vgpu-build:*` image, `${XB_STAGE}` artifacts,
+  `${XB_STAGE}/test` config/preload, the remote build context). Never modify the user's other resources.
+- The hardware cases require a **real accelerator** (local or the ssh host): ASCEND-CASE 2/3 need an NPU;
+  NVIDIA-CASE 2 needs a GPU, NVIDIA-CASE 3 needs **≥ 2** GPUs. The two CASE 1 checks need only docker+buildx.
+
+## Flow
+
+1. **Discover targets.** List the builder stages and ask which to verify (multi-select):
+   ```bash
+   grep -nE 'AS xbuild-(ascend-cann|nvidia-cuda)-' pack/gpustack-operator/Dockerfile
+   ```
+   Ascend: `xbuild-ascend-cann-8-910b`, `-8-910c`, `-9-910b`, `-9-910c`, `-9-950`.
+   NVIDIA: `xbuild-nvidia-cuda-12`, `-13`.
+
+2. **Pick connection (AskUserQuestion).** Local, or ssh — and if ssh, the host. Set `XB_MODE`/`XB_HOST`.
+
+3. **Preflight (read-only, confirm target first).**
+   ```bash
+   XB_MODE=… XB_HOST=… bash .claude/skills/gpustack-operator-xbuild-and-verify/scripts/preflight.sh
+   ```
+   docker+buildx must PASS to build. The hardware rows (`npu-smi`/ascend-runtime/`/dev/davinci*` and
+   `nvidia-smi`/nvidia-runtime/`/dev/nvidia*`) WARN when absent — the matching hardware cases are then
+   unavailable. If buildx is missing, the table prints the install one-liner.
+
+4. **Build the chosen target (confirm).** `build.sh` infers the backend from the target prefix. Native on a
+   matching-arch host (fast); cross-arch uses qemu.
+   ```bash
+   XB_MODE=… XB_HOST=… bash .claude/skills/gpustack-operator-xbuild-and-verify/scripts/build.sh xbuild-nvidia-cuda-13
+   ```
+   Produces `XB_IMAGE` (Ascend `vcann-build:<suffix>` / NVIDIA `vgpu-build:<suffix>`) and stages the
+   artifacts under `XB_STAGE` (Ascend `/opt/enpu/vcann-rt`, NVIDIA `/opt/vgpu`). The built image is
+   CANN/CUDA-based and doubles as the workload image for the hardware cases (`XB_WORKLOAD_IMAGE` defaults
+   to it).
+
+5. **Run cases.** Pass the same target; read each PASS/FAIL table — don't re-derive from raw logs.
+   ```bash
+   # Ascend
+   XB_MODE=… XB_HOST=… bash .../cases/ascend-case-1.sh xbuild-ascend-cann-8-910b
+   XB_MODE=… XB_HOST=… XB_NPU=0 bash .../cases/ascend-case-2.sh xbuild-ascend-cann-8-910b
+   XB_MODE=… XB_HOST=… XB_NPU=0 XB_MEM=1024 bash .../cases/ascend-case-3.sh xbuild-ascend-cann-8-910b
+   # NVIDIA
+   XB_MODE=… XB_HOST=… bash .../cases/nvidia-case-1.sh xbuild-nvidia-cuda-13
+   XB_MODE=… XB_HOST=… XB_GPU=0  XB_MEM=4096 XB_SM=50 bash .../cases/nvidia-case-2.sh xbuild-nvidia-cuda-13
+   XB_MODE=… XB_HOST=… XB_GPUS=0,1 XB_MEM=4096       bash .../cases/nvidia-case-3.sh xbuild-nvidia-cuda-13
+   ```
+
+## Cases (locked titles)
+
+### Ascend (`xbuild-ascend-cann-*`)
+| Case | Title | Needs NPU | Asserts |
+|---|---|---|---|
+| 1 | Build artifacts + linking | no | `libvruntime.so` (0644) + `enpu-monitor` (0755) exist; ELF arch == build platform; the build linked (via the `--allow-shlib-undefined` path); both artifacts list `libc_sec.so` and `libascendcl.so` as `NEEDED`; reports the count of weak UND `dcmi_*` symbols in `enpu-monitor` (informational) |
+| 2 | Inject + enpu-monitor | yes | VDie-ID→`shm-id`; render `npu_info.config`; preload (libdcmi×2 + libvruntime); container `enpu-monitor` loads all 6 fields, initializes, and prints `Aicore Limit Quota`/`Memory Limit quota` matching the config |
+| 3 | Memory-quota enforcement | yes | injected HBM alloc capped at `memory-quota` (the `Out of memory! … quota:<bytes>` log); baseline (no inject) exceeds it |
+
+### NVIDIA (`xbuild-nvidia-cuda-*`)
+| Case | Title | Needs GPU | Asserts |
+|---|---|---|---|
+| 1 | Build artifacts + linking | no | `libvgpu.so` (0644) exists; ELF arch == build platform; `NEEDED` `libcuda.so.1`+`libnvidia-ml.so.1` (hard deps the NVIDIA runtime injects — no weak-UND preload, contrast Ascend) |
+| 2 | Single-card inject + nvidia-smi | yes (1 GPU) | preload `libvgpu.so`; with `CUDA_DEVICE_MEMORY_LIMIT_0`+`CUDA_DEVICE_SM_LIMIT`, `nvidia-smi memory.total` == the limit (NVML hook); a CUDA probe logs `core utilization limit = <SM>` and `cuMemGetInfo` total == the limit (real CUDA-level enforcement) |
+| 3 | Multi-card per-device limits | yes (≥2 GPU) | each exposed card gets a **distinct** `CUDA_DEVICE_MEMORY_LIMIT_<i>` and the container's `nvidia-smi` reports each card's `memory.total` at its own limit (skips with WARN if too few GPUs) |
+
+## Env knobs
+`XB_MODE`/`XB_HOST` (runner); `XB_PLATFORM` (default from target arch); `XB_IMAGE`/`XB_WORKLOAD_IMAGE`;
+`XB_STAGE` (Ascend `/opt/enpu/vcann-rt` | NVIDIA `/opt/vgpu`); `XB_REMOTE_CTX` (remote build-context dir).
+- Ascend: `XB_NPU`/`XB_CHIP` (card/chip, default 0); `XB_VNPU` (0); `XB_AICORE` (20); `XB_MEM` (1024 MB).
+- NVIDIA: `XB_GPU` (single-card index, 0); `XB_GPUS` (multi-card csv, `0,1`); `XB_MEM` (MiB, 4096);
+  `XB_SM` (compute %, 50 / 30).
+
+## References
+- `references/ascend-npu-info-config.md` — Ascend: the 6 config fields, VDie-ID→shm-id, allocator mapping.
+- `references/ascend-ld-preload-and-libdcmi.md` — Ascend activation via `/etc/ld.so.preload`; **why libdcmi must
+  be preloaded** (weak dcmi syms); the `libc_sec`/CANN-image requirement.
+- `references/ascend-npu-smi-and-aicore.md` — Ascend: `npu-smi` shows the physical card; AICore-quota mechanism,
+  the benign CANN-8.5.0 warnings, the unverified-throttle gap.
+- `references/nvidia-hami-core-vgpu.md` — NVIDIA: what `libvgpu.so` is, the env+mount injection contract, the
+  one-CUDA-major-per-container rule, HAMi-core knobs.
+- `references/nvidia-smi-and-sm-limit.md` — NVIDIA: memory limit is directly visible in `nvidia-smi` (NVML
+  hook); the SM/compute limit is a time-slice throttle (HAMi log / under-load only); CUDA-13 probe gotchas.
+- `references/troubleshooting.md` — both backends: scp banner, buildx-missing, Ascend link/segfault/hgemm,
+  NVIDIA runtime/preload/stale-cache/cuCtxCreate-v4/stub-lib/SM-visibility.
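The target-to-image naming that the Flow section describes can be sketched as a pair of shell functions. This is a minimal sketch, not the contents of `build.sh`: the function names `xb_backend_for` and `xb_image_for` are hypothetical, the Ascend tag follows the `XB_IMAGE` default in `ascend-case-1.sh`, and the NVIDIA tag is assumed to strip its prefix the same way.

```bash
# Hypothetical helpers (not taken from build.sh): map a builder-stage target
# to its backend and to the default image tag described in SKILL.md.

# Print "ascend" or "nvidia" for a known target prefix; fail otherwise.
xb_backend_for() {
  case "$1" in
    xbuild-ascend-cann-*) echo ascend ;;
    xbuild-nvidia-cuda-*) echo nvidia ;;
    *) echo "unknown target: $1" >&2; return 1 ;;
  esac
}

# Print the default image tag: the target with its backend prefix stripped.
# The NVIDIA branch is an assumption made by analogy with the Ascend default.
xb_image_for() {
  case "$1" in
    xbuild-ascend-cann-*) echo "vcann-build:${1#xbuild-ascend-cann-}" ;;
    xbuild-nvidia-cuda-*) echo "vgpu-build:${1#xbuild-nvidia-cuda-}" ;;
    *) echo "unknown target: $1" >&2; return 1 ;;
  esac
}
```

For example, `xb_image_for xbuild-ascend-cann-8-910b` prints `vcann-build:8-910b`, and an unrecognized target returns non-zero instead of silently producing a tag.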
@@ -0,0 +1,71 @@
+#!/usr/bin/env bash
+#
+# ASCEND-CASE 1 — Build artifacts + linking   (no NPU; runs on any docker host)
+#
+#   ascend-case-1.sh <TARGET>
+#
+# Assumes build.sh already built XB_IMAGE for <TARGET>. A successful build IS the
+# link test: the enpu-monitor executable only links because build-libvnpu.sh passes
+# -Wl,--allow-shlib-undefined (the vendor .so cross-refs — HAL drv*, ErrorManager::* —
+# resolve at runtime, not link time). This case then inspects the produced artifacts:
+#   - /out/lib/libvruntime.so   exists, mode 0644
+#   - /out/tools/enpu-monitor   exists, mode 0755
+#   - both ELF machine == the build platform's arch
+#   - both NEEDED libc_sec.so + libascendcl.so (the runtime deps; libc_sec is the
+#     securec/libboundscheck library — see references/ascend-ld-preload-and-libdcmi.md)
+#   - enpu-monitor carries WEAK UND dcmi_* symbols (informational: that is why
+#     libdcmi must be preloaded at runtime — ASCEND-CASE 2 covers it)
+#
+# Prints a STATUS | CHECK | DETAIL table; exits non-zero on any FAIL.
+set -uo pipefail
+HERE="$(cd "$(dirname "$0")" && pwd)"
+# shellcheck source=../scripts/lib.sh
+. "${HERE}/../scripts/lib.sh"
+
+TARGET="${1:?usage: ascend-case-1.sh <TARGET>}"
+XB_IMAGE="${XB_IMAGE:-vcann-build:${TARGET#xbuild-ascend-cann-}}"
+XB_PLATFORM="${XB_PLATFORM:-}"
+
+# Expected ELF machine from the platform; if unset, derive from the target host arch.
+if [ -z "${XB_PLATFORM}" ]; then
+  a="$(xrun 'uname -m' | tr -d '[:space:]')"
+  case "${a}" in x86_64) XB_PLATFORM=linux/amd64 ;; aarch64) XB_PLATFORM=linux/arm64 ;; esac
+fi
+case "${XB_PLATFORM}" in
+  linux/amd64) EXPECT_MACHINE="X86-64" ;;
+  linux/arm64) EXPECT_MACHINE="AArch64" ;;
+  *) EXPECT_MACHINE="" ;;
+esac
+
+echo "# ASCEND-CASE 1 — ${XB_IMAGE} (expect ${EXPECT_MACHINE:-any}) on $(xtarget_desc)"
+
+out="$(xsh XB_IMAGE="${XB_IMAGE}" EXPECT_MACHINE="${EXPECT_MACHINE}" <<'PAYLOAD'
+set -u
+docker run --rm -i "${XB_IMAGE}" bash -s "${EXPECT_MACHINE}" <<'INNER'
+set -u
+EXPECT="${1:-}"
+row(){ printf '%s | %s | %s\n' "$1" "$2" "$3"; }
+fails=0
+LV=/out/lib/libvruntime.so
+EM=/out/tools/enpu-monitor
+chk_mode(){ local f="$1" want="$2" name="$3" m; if [ -f "$f" ]; then m=$(stat -c '%a' "$f"); [ "$m" = "$want" ] && row PASS "$name mode $want" "$m" || { row FAIL "$name mode $want" "$m"; fails=$((fails+1)); }; else row FAIL "$name exists" missing; fails=$((fails+1)); fi; }
+chk_mode "$LV" 644 "libvruntime.so"
+chk_mode "$EM" 755 "enpu-monitor"
+for f in "$LV" "$EM"; do
+  mach=$(readelf -h "$f" 2>/dev/null | awk -F: '/Machine/{print $2}' | xargs)
+  if [ -n "$mach" ] && { [ -z "$EXPECT" ] || echo "$mach" | grep -q "$EXPECT"; }; then row PASS "arch $(basename "$f")" "$mach"; else row FAIL "arch $(basename "$f")==${EXPECT:-any}" "${mach:-unreadable}"; fails=$((fails+1)); fi
+done
+for f in "$LV" "$EM"; do
+  nd=$(readelf -d "$f" 2>/dev/null | grep NEEDED)
+  for need in libc_sec.so libascendcl.so; do
+    echo "$nd" | grep -q "$need" && row PASS "NEEDED $need ($(basename "$f"))" ok || { row FAIL "NEEDED $need ($(basename "$f"))" absent; fails=$((fails+1)); }
+  done
+done
+w=$(readelf -sW "$EM" 2>/dev/null | grep -cE "WEAK +DEFAULT +UND +dcmi" || true)
+row INFO "enpu-monitor weak UND dcmi syms" "${w} (=> libdcmi must be preloaded at runtime, see ASCEND-CASE 2)"
+echo "FAILS=${fails}"
+INNER
+PAYLOAD
+)"
+echo "${out}"
+echo "${out}" | grep -q 'FAILS=0' && { echo "ASCEND-CASE 1: PASS"; exit 0; } || { echo "ASCEND-CASE 1: FAIL"; exit 1; }
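The `STATUS | CHECK | DETAIL` rows this case prints can also be tallied per status rather than only checked for `FAILS=0`. The following is a minimal sketch, assuming the ` | ` separator used by `row()` in the script; `xb_tally` is a hypothetical helper and not part of `scripts/lib.sh`.

```bash
# Hypothetical helper: read a case's table on stdin and print one summary line
# with the number of rows per status. Lines that are not table rows (the
# "# ASCEND-CASE ..." banner, "FAILS=<n>") are ignored.
xb_tally() {
  awk -F' [|] ' '
    $1 ~ /^(PASS|FAIL|WARN|INFO)$/ { n[$1]++ }
    END {
      printf "PASS=%d FAIL=%d WARN=%d INFO=%d\n",
             n["PASS"], n["FAIL"], n["WARN"], n["INFO"]
    }
  '
}
```

Piping a case's captured output through it, for example `echo "${out}" | xb_tally`, gives a one-line summary that is easier to compare across targets than the raw table.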
@@ -0,0 +1,80 @@
+#!/usr/bin/env bash
+#
+# ASCEND-CASE 2 — Inject + enpu-monitor   (needs a real Ascend NPU host)
+#
+#   ascend-case-2.sh [TARGET]
+#
+# Reproduces the GPUStack soft-slicing injection by hand and confirms vcann-rt
+# initializes and reports the configured quota inside a container:
+#   1. read the card VDie-ID (npu-smi info -t board) -> shm-id (spaces -> '-')
+#   2. render npu_info.config (0644) — the same 6 fields the Ascend allocator emits
+#      (renderNPUInfoConfig in pkg/devicemanager/allocator/ascend/deviceplugin.go)
+#   3. write ld.so.preload listing the two host-injected libdcmi paths BEFORE
+#      libvruntime.so (libdcmi must be loaded or the weak dcmi_* symbols segfault —
+#      see references/ascend-ld-preload-and-libdcmi.md)
+#   4. docker run (ascend runtime, ASCEND_VISIBLE_DEVICES) the staged enpu-monitor
+#   5. assert: all 6 config fields loaded, "Successfully to initialize", and the
+#      enpu-monitor Aicore/Memory quota lines match the config
+#
+# Env: XB_WORKLOAD_IMAGE (default XB_IMAGE; must be CANN-based), XB_STAGE
+#      (/opt/enpu/vcann-rt), XB_NPU (0), XB_CHIP (0), XB_VNPU (0), XB_AICORE (20),
+#      XB_MEM (1024 MB). Prints a STATUS|CHECK|DETAIL table; non-zero on FAIL.
+set -uo pipefail
+HERE="$(cd "$(dirname "$0")" && pwd)"
+# shellcheck source=../scripts/lib.sh
+. "${HERE}/../scripts/lib.sh"
+
+TARGET="${1:-}"
+IMG="${XB_WORKLOAD_IMAGE:-${XB_IMAGE:-}}"
+[ -z "${IMG}" ] && [ -n "${TARGET}" ] && IMG="vcann-build:${TARGET#xbuild-ascend-cann-}"
+[ -n "${IMG}" ] || { echo "ascend-case-2: pass a TARGET (e.g. xbuild-ascend-cann-8-910b) or set XB_WORKLOAD_IMAGE" >&2; exit 2; }
+
+echo "# ASCEND-CASE 2 — inject + enpu-monitor (image ${IMG}, npu ${XB_NPU:-0}) on $(xtarget_desc)"
+
+out="$(xsh \
+  IMG="${IMG}" STAGE="${XB_STAGE:-/opt/enpu/vcann-rt}" \
+  NPU="${XB_NPU:-0}" CHIP="${XB_CHIP:-0}" VNPU="${XB_VNPU:-0}" \
+  AICORE="${XB_AICORE:-20}" MEM="${XB_MEM:-1024}" <<'PAYLOAD'
+set -u
+row(){ printf '%s | %s | %s\n' "$1" "$2" "$3"; }
+fails=0
+
+vdie="$(npu-smi info -t board -i "${NPU}" -c "${CHIP}" 2>/dev/null | awk -F: '/VDie ID/{print $2}' | xargs)"
+if [ -n "${vdie}" ]; then shmid="$(echo "${vdie}" | tr ' ' '-')"; row PASS "VDie-ID -> shm-id" "${shmid}";
+else shmid="UNKNOWN-VDIE"; row WARN "VDie-ID" "not found via npu-smi; using ${shmid}"; fi
+
+mkdir -p "${STAGE}/test"
+cfg="${STAGE}/test/npu_info.config"
+printf 'physical-npu-id=%s\nvirtual-npu-id=%s\naicore-quota=%s\nmemory-quota=%s\nshm-id=%s\nscheduling-policy=2\n' \
+  "${NPU}" "${VNPU}" "${AICORE}" "${MEM}" "${shmid}" > "${cfg}"
+chmod 0644 "${cfg}"
+[ "$(stat -c '%a' "${cfg}")" = 644 ] && row PASS "npu_info.config mode 0644" ok || { row FAIL "npu_info.config mode 0644" "$(stat -c '%a' "${cfg}")"; fails=$((fails+1)); }
+
+pre="${STAGE}/test/ld.so.preload"
+printf '/usr/local/dcmi/libdcmi.so\n/usr/local/Ascend/driver/lib64/driver/libdcmi.so\n/opt/enpu/vcann-rt/lib/libvruntime.so\n' > "${pre}"
+chmod 0644 "${pre}"
+
+log="$(docker run --rm --runtime=ascend -e ASCEND_VISIBLE_DEVICES="${NPU}" -e ENPU_LOG_LEVEL=3 \
+  -v "${STAGE}/lib/libvruntime.so:/opt/enpu/vcann-rt/lib/libvruntime.so:ro" \
+  -v "${STAGE}/tools/enpu-monitor:/opt/enpu/vcann-rt/tools/enpu-monitor:ro" \
+  -v "${cfg}:/etc/enpu/vcann-rt/npu_info.config:ro" \
+  -v "${pre}:/etc/ld.so.preload:ro" \
+  -v /dev/shm:/dev/shm \
+  "${IMG}" /opt/enpu/vcann-rt/tools/enpu-monitor 2>&1)"
+
+n="$(echo "${log}" | grep -c 'Success to load config')"
+[ "${n}" -ge 6 ] && row PASS "config fields loaded" "${n}/6" || { row FAIL "config fields loaded" "${n}/6"; fails=$((fails+1)); }
+echo "${log}" | grep -q 'Successfully to initialize' && row PASS "vcann-rt initialize" ok || { row FAIL "vcann-rt initialize" "no init log"; fails=$((fails+1)); }
+echo "${log}" | grep -q 'Successfully to initialize all module' && row PASS "all-module init" ok || row WARN "all-module init" "only vnpu-device init seen (enpu-monitor standalone)"
+ac="$(echo "${log}" | awk -F: '/Aicore Limit Quota/{gsub(/ /,"",$2);print $2}')"
+[ "${ac}" = "${AICORE}" ] && row PASS "Aicore Limit Quota" "${ac}" || { row FAIL "Aicore Limit Quota == ${AICORE}" "${ac:-none}"; fails=$((fails+1)); }
+mq="$(echo "${log}" | awk -F: '/Memory Limit quota/{gsub(/ /,"",$2);print $2}')"
+[ "${mq}" = "${MEM}" ] && row PASS "Memory Limit quota(MB)" "${mq}" || { row FAIL "Memory Limit quota == ${MEM}" "${mq:-none}"; fails=$((fails+1)); }
+
+echo "--- enpu-monitor output ---"
+echo "${log}" | grep -iE 'Quota|Usage|initialize all module' || true
+echo "FAILS=${fails}"
+PAYLOAD
+)"
+echo "${out}"
+echo "${out}" | grep -q 'FAILS=0' && { echo "ASCEND-CASE 2: PASS"; exit 0; } || { echo "ASCEND-CASE 2: FAIL"; exit 1; }
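Steps 1 and 2 of this case (VDie-ID to `shm-id`, then the six-field `npu_info.config`) can be rehearsed without an NPU. The following is a minimal sketch with hypothetical function names and a made-up VDie-ID string in the usage line; the field names and the fixed `scheduling-policy=2` are copied from the `printf` in the script.

```bash
# Hypothetical helpers mirroring steps 1-2 of ascend-case-2.sh.

# Convert a space-separated VDie-ID into a shm-id by replacing spaces with '-'.
vdie_to_shmid() {
  printf '%s\n' "$1" | tr ' ' '-'
}

# Render the six npu_info.config fields to stdout.
# Args: physical-npu-id virtual-npu-id aicore-quota memory-quota shm-id
render_npu_info_config() {
  printf 'physical-npu-id=%s\nvirtual-npu-id=%s\naicore-quota=%s\nmemory-quota=%s\nshm-id=%s\nscheduling-policy=2\n' \
    "$1" "$2" "$3" "$4" "$5"
}

# Usage with the script's default values; the VDie-ID below is a placeholder.
render_npu_info_config 0 0 20 1024 "$(vdie_to_shmid 'AAAA BBBB CCCC')"
```

Rendering to stdout keeps the sketch side-effect free; the script itself writes the same content to `${STAGE}/test/npu_info.config` and sets mode 0644.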