feat: accelerator soft-slicing runtime isolation (NVIDIA HAMi-core / Ascend vcann-rt) #5

Merged
thxCode merged 17 commits into main from feat/accelerator-soft-slicing-runtime-isolation-v2 on Jun 26, 2026

thxCode (Collaborator) commented Jun 26, 2026

Summary

Implements the runtime isolation that the shipped accelerator-resource-modes-refactor spec deferred as a
Non-Goal, for the soft-slicing path only (hard slicing, i.e. MIG / Ascend vNPU, stays out of scope). The
DeviceManager allocator's GetContainerAllocateResponse now performs a real injection instead of returning
a bookkeeping-only response: a sliced container starts with a vendor preload library activated via
/etc/ld.so.preload (NVIDIA HAMi-core libvgpu.so; Ascend vcann-rt libvruntime.so + enpu-monitor) and a
per-container VRAM/compute quota derived from its .sliced.units request. The preload libraries are
compiled into the operator image per runtime version and staged onto the host by a device-manager init
container; per-pod working directories are reclaimed by a reconciler-fed GC.

Spec: specs/2026-06-25-accelerator-soft-slicing-runtime-isolation.md.

Packaging — inline git clone (no submodule)

The preload sources are cloned inline at pinned commits by the Docker build stages
(ARG LIB_UBS_VIRT_ENPU_COMMIT / ARG LIB_HAMI_CORE_COMMIT; git fetch --depth 1 <SHA>) — no git
submodule, no .gitmodules. Both paths were validated locally with docker build --target:

  • nvbuild-12 → HAMi-core cloned from GitHub, libvgpu.so produced. ✅
  • cannbuild-8-910b → vcann-rt cloned from gitcode, libvruntime.so + enpu-monitor produced. ✅

Both GitHub and gitcode accept git fetch --depth 1 origin <SHA> directly.

Tasks (linear T1–T10)

  • T1 — copy-dir.sh host-staging helper + GPUSTACK_LIB_DIR
  • T2 — soft-slice quota/version helpers (SliceRatio/FloorPercent; device.RuntimeMajor)
  • T3 — CANN builder stages: clone + build vcann-rt per CANN/family
  • T4 — Ascend ld.so.preload asset
  • T5 — device-manager DaemonSet host staging (init container)
  • T6 — Ascend sliced GetContainerAllocateResponse (npu_info.config + mounts)
  • T7 — per-pod working-dir GC via live pod-UUID notifier
  • T8 — CUDA builder stages: clone + build HAMi-core per CUDA major
  • T9 — NVIDIA ld.so.preload asset
  • T10 — NVIDIA sliced GetContainerAllocateResponse (env + mounts)

Plus: CI runs on self-hosted -x4 runners.

Tests & validation

  • Allocator injection (env / npu_info.config / mounts) covered by table-driven unit tests
    (pkg/devicemanager/allocator/{ascend,nvidia}/deviceplugin_test.go) — green.
  • Both inline-clone Docker build paths validated via --target (above).
  • ⚠️ The full multi-target image build (5 CANN + 2 CUDA stages) needs CI / a disk-rich host.

Deferred to a follow-up spec

In-cluster soft-slicing injection e2e on a GPU-less node: the device-plugin Allocate path is gated on
real hardware detection (the detector probes dcmi/NVML), so a mocked Devices CR can't drive it. A
DeviceManager detector simulation mode (read a fixture instead of probing hardware) is the next-spec
seed (recorded in the spec's Follow-ups).

Copilot AI review requested due to automatic review settings June 26, 2026 05:33
thxCode added a commit that referenced this pull request Jun 26, 2026
Status: Shipped — #5
Signed-off-by: thxCode <thxcode0824@gmail.com>

Copilot AI left a comment

Pull request overview

Implements soft-slicing runtime isolation in the GPUStack Operator’s device-manager allocator path by shipping vendor preload runtimes (NVIDIA HAMi-core, Ascend vcann-rt) inside the operator image, staging them onto the host via a DaemonSet init container, and injecting /etc/ld.so.preload + per-container quota/mounts during device-plugin Allocate. Adds per-pod working-dir GC and supporting helpers/tests.

Changes:

  • Build and package vendor preload artifacts into the operator image (multi-stage Docker builds + pinned inline clones), plus /etc/gpustack/lib/*/ld.so.preload assets and a checksum-guarded host staging helper.
  • Extend device-plugin allocation responders to inject preload/quota/mounts for sliced mode (Ascend + NVIDIA) and add a notifier-fed per-pod working-dir GC.
  • Update Helm chart DaemonSet to stage libraries onto the host and mount host /tmp; update docs/spec/test fixtures and CI runner sizing.

Reviewed changes

Copilot reviewed 37 out of 38 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| testing/sample/devices/ascend-910b.yaml | Adds Ascend 910B sample Devices fixture used for allocator testing/validation. |
| specs/2026-06-25-accelerator-soft-slicing-runtime-isolation.md | Full spec for soft-slicing runtime isolation packaging/injection/GC. |
| pkg/utils/osx/file.go | Adds MkdirAll helper that enforces leaf permissions (umask-independent). |
| pkg/utils/osx/file_test.go | Unit test for new osx.MkdirAll. |
| pkg/system/storage.go | Adjusts host path layout under .../operator/...; adds privileged/container detection helpers. |
| pkg/internalprocesses/kubernetes/embed.go | Uses system.IsRunningAsPrivileged(); updates air-gapped image link comment. |
| pkg/deviceplugin/types.go | Extends responder interface to include the allocating *core.Container. |
| pkg/deviceplugin/server.go | Threads container through Allocate/PreferredAllocation; adds sliced-mode pod-dir GC hook. |
| pkg/deviceplugin/server_test.go | Updates stub responder signature for new interface. |
| pkg/deviceplugin/helper.go | Adds soft-slicing host path helpers + SliceRatio/FloorPercent. |
| pkg/deviceplugin/helper_test.go | Unit tests for SliceRatio and FloorPercent. |
| pkg/deviceplugin/gc.go | Implements per-pod working-dir GC for sliced mode. |
| pkg/deviceplugin/gc_test.go | Unit tests for GC miss-streak behavior and nil/empty live sets. |
| pkg/deviceplugin/controller.go | Notifier payload widened to []string of live pod UIDs; reconciler collects and sends UIDs. |
| pkg/devicemanager/detector/ascend/device.go | Updates Ascend family mapping comments/values (910B/910C/950). |
| pkg/devicemanager/allocator/thead/deviceplugin.go | Signature update for responder interface (container param). |
| pkg/devicemanager/allocator/nvidia/deviceplugin.go | Adds sliced-mode injection (HAMi-core preload + quota + mounts + dirs). |
| pkg/devicemanager/allocator/nvidia/deviceplugin_test.go | Table-driven tests for NVIDIA sliced injection behavior. |
| pkg/devicemanager/allocator/mthreads/deviceplugin.go | Signature update for responder interface (container param). |
| pkg/devicemanager/allocator/metax/deviceplugin.go | Signature update for responder interface (container param). |
| pkg/devicemanager/allocator/iluvatar/deviceplugin.go | Signature update for responder interface (container param). |
| pkg/devicemanager/allocator/hygon/deviceplugin.go | Signature update for responder interface (container param). |
| pkg/devicemanager/allocator/cambricon/deviceplugin.go | Signature update for responder interface (container param). |
| pkg/devicemanager/allocator/ascend/deviceplugin.go | Adds sliced-mode injection (vcann-rt preload + npu_info.config + mounts). |
| pkg/devicemanager/allocator/ascend/deviceplugin_test.go | Table-driven tests for Ascend sliced injection and vNPU ID assignment. |
| pkg/devicemanager/allocator/amd/deviceplugin.go | Signature update for responder interface (container param). |
| pkg/device/helper.go | Adds device.RuntimeMajor helper for runtime-version directory selection. |
| pkg/device/helper_test.go | Unit tests for device.RuntimeMajor. |
| pack/gpustack-operator/rootfs/usr/bin/copy-dir.sh | New checksum-guarded recursive staging script used by init container. |
| pack/gpustack-operator/rootfs/etc/gpustack/lib/nvidia/ld.so.preload | Adds NVIDIA preload asset pointing at /usr/local/vgpu/libvgpu.so. |
| pack/gpustack-operator/rootfs/etc/gpustack/lib/ascend/ld.so.preload | Adds Ascend preload asset pointing at /opt/enpu/vcann-rt/lib/libvruntime.so. |
| pack/gpustack-operator/external/nvidia/build-libvgpu.sh | Build script for HAMi-core libvgpu.so (build-stage only). |
| pack/gpustack-operator/external/ascend/build-libvnpu.sh | Build script for vcann-rt artifacts with dcmi/HAL build-time stubbing. |
| pack/gpustack-operator/Dockerfile | Adds GPUSTACK_LIB_DIR, pinned inline clones, nvbuild/cannbuild stages, and installs assets/scripts. |
| docs/architecture.md | Documents sliced-mode runtime isolation injection + packaging/staging/GC behavior. |
| deploy/gpustack-operator/chart/templates/device-manager/daemonset.yaml | Adds init container staging /etc/gpustack/lib to host; mounts host /tmp. |
| .github/workflows/ci.yml | Moves CI builds to *-x4 runners for larger multi-stage image builds. |
| .claude/skills/gpustack-operator-chart-e2e/SKILL.md | Clarifies chart-e2e skill scope (chart contract vs feature e2e). |

Comment thread pkg/devicemanager/allocator/ascend/deviceplugin.go
Comment thread docs/architecture.md Outdated
Comment thread pkg/system/storage.go
Comment thread pkg/internalprocesses/kubernetes/embed.go
Copilot AI review requested due to automatic review settings June 26, 2026 05:38

Copilot AI left a comment

Pull request overview

Copilot reviewed 37 out of 38 changed files in this pull request and generated 4 comments.

Comment thread pkg/internalprocesses/kubernetes/embed.go
Comment thread docs/architecture.md Outdated
Comment thread pkg/devicemanager/allocator/ascend/deviceplugin.go
Comment thread pkg/devicemanager/allocator/nvidia/deviceplugin.go Outdated
thxCode added 6 commits June 26, 2026 13:53
Signed-off-by: thxCode <thxcode0824@gmail.com>
Signed-off-by: thxCode <thxcode0824@gmail.com>
Signed-off-by: thxCode <thxcode0824@gmail.com>
Signed-off-by: thxCode <thxcode0824@gmail.com>
Signed-off-by: thxCode <thxcode0824@gmail.com>
Add a recursive, sha256-guarded copy-dir.sh into the operator image rootfs
(installed to /usr/bin, 0755) for the device-manager init container to stage
the soft-slicing library tree onto the host idempotently. Declare the new
GPUSTACK_LIB_DIR build arg (= GPUSTACK_CONF_DIR/lib).

Task 1 of accelerator-soft-slicing-runtime-isolation.
Signed-off-by: thxCode <thxcode0824@gmail.com>
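The "sha256-guarded, idempotent" staging described in the Task 1 commit can be illustrated with a small Go sketch. This is not copy-dir.sh itself (which is a shell script whose contents are not shown in this conversation); the function names, the temp-file-then-rename step, and the file modes are assumptions made for illustration. It only shows the general idea: copy a file when the destination is missing or its digest differs, and do nothing otherwise.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// fileDigest returns the SHA-256 digest of the file at path.
func fileDigest(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return nil, err
	}
	return h.Sum(nil), nil
}

// stageFile (hypothetical helper) copies src to dst unless dst already has the
// same digest. It reports whether a copy actually happened.
func stageFile(src, dst string) (bool, error) {
	want, err := fileDigest(src)
	if err != nil {
		return false, err
	}
	if have, err := fileDigest(dst); err == nil && bytes.Equal(have, want) {
		return false, nil // already staged, nothing to do
	}
	data, err := os.ReadFile(src)
	if err != nil {
		return false, err
	}
	if err := os.MkdirAll(filepath.Dir(dst), 0o755); err != nil {
		return false, err
	}
	// Write to a sibling temp file and rename so readers never see a partial file.
	tmp := dst + ".tmp"
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return false, err
	}
	if err := os.Rename(tmp, dst); err != nil {
		return false, err
	}
	return true, nil
}

// stageTwice stages one fake library twice in a temp dir to show idempotence.
func stageTwice() (first, second bool, err error) {
	dir, err := os.MkdirTemp("", "stage")
	if err != nil {
		return
	}
	defer os.RemoveAll(dir)
	src := filepath.Join(dir, "image", "libvgpu.so")
	dst := filepath.Join(dir, "host", "libvgpu.so")
	if err = os.MkdirAll(filepath.Dir(src), 0o755); err != nil {
		return
	}
	if err = os.WriteFile(src, []byte("fake library bytes"), 0o644); err != nil {
		return
	}
	if first, err = stageFile(src, dst); err != nil {
		return
	}
	second, err = stageFile(src, dst)
	return
}

func main() {
	first, second, err := stageTwice()
	fmt.Println(first, second, err) // prints "true false <nil>"
}
```

The first call copies the file; the second finds a matching digest and skips the write, which is what makes re-running an init container on every pod start cheap.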
thxCode added a commit that referenced this pull request Jun 26, 2026
Status: Shipped — #5
Signed-off-by: thxCode <thxcode0824@gmail.com>
@thxCode thxCode force-pushed the feat/accelerator-soft-slicing-runtime-isolation-v2 branch from 97a433c to ed6c627 Compare June 26, 2026 06:08
@thxCode thxCode requested a review from Copilot June 26, 2026 06:11

Copilot AI left a comment

Pull request overview

Copilot reviewed 37 out of 38 changed files in this pull request and generated 2 comments.

Comment thread pkg/deviceplugin/controller.go Outdated
Comment thread pkg/deviceplugin/server.go
@thxCode thxCode force-pushed the feat/accelerator-soft-slicing-runtime-isolation-v2 branch from ed6c627 to c17541b Compare June 26, 2026 06:29
Copilot AI review requested due to automatic review settings June 26, 2026 07:37
@thxCode thxCode force-pushed the feat/accelerator-soft-slicing-runtime-isolation-v2 branch from c17541b to d017f64 Compare June 26, 2026 07:37

Copilot AI left a comment

Pull request overview

Copilot reviewed 38 out of 39 changed files in this pull request and generated 2 comments.

Comment thread .github/workflows/ci-chart.yml Outdated
Comment thread pkg/utils/osx/file.go Outdated
@thxCode thxCode force-pushed the feat/accelerator-soft-slicing-runtime-isolation-v2 branch from d017f64 to 76b62e9 Compare June 26, 2026 11:58
@thxCode thxCode requested a review from Copilot June 26, 2026 12:00

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI left a comment

Pull request overview

Copilot reviewed 54 out of 55 changed files in this pull request and generated 3 comments.

Comment thread .claude/skills/gpustack-operator-xbuild-and-verify/scripts/preflight.sh Outdated
Comment thread specs/2026-06-25-accelerator-soft-slicing-runtime-isolation.md Outdated
Comment thread specs/2026-06-25-accelerator-soft-slicing-runtime-isolation.md Outdated
thxCode added 3 commits June 26, 2026 20:41
Ship /etc/gpustack/lib/ascend/ld.so.preload (0644) pointing at the in-container
vcann-rt mount /opt/enpu/vcann-rt/lib/libvruntime.so, installed in the final
stage alongside the cann-* libs so the device-manager init container stages the
whole /etc/gpustack/lib tree onto the host.

Task 4 of accelerator-soft-slicing-runtime-isolation.
Signed-off-by: thxCode <thxcode0824@gmail.com>
…t container

Add a stage-libs init container that runs copy-dir.sh to idempotently stage the
in-image /etc/gpustack/lib tree onto the host at /var/lib/gpustack/operator/lib,
and mount host /tmp into the device-manager so the allocator can create the
shared /tmp/vgpulock the NVIDIA soft-slicing runtime expects.

Task 5 of accelerator-soft-slicing-runtime-isolation.
Signed-off-by: thxCode <thxcode0824@gmail.com>
…ntainers

Register the Ascend Sliced device-plugin server and rewrite its
GetContainerAllocateResponse to apply real soft-slicing isolation:
- render a per-container npu_info.config (aicore/memory quota from the
  .sliced.units request, shm-id from the accelerator ID, scheduling-policy=2);
- assign the lowest-free virtual-npu-id per physical NPU by scanning on-disk
  configs (level-based, idempotent on re-allocation);
- mount the staged libvruntime.so + enpu-monitor + ld.so.preload + the config
  + /dev/shm into the container.

vcann-rt's npu_info.config is single-NPU, so a sliced Ascend container maps to
one card. Shared OperatorLibDir/OperatorPodsDir/PodWorkDir helpers land in
pkg/deviceplugin for the NVIDIA path to reuse.

Task 6 of accelerator-soft-slicing-runtime-isolation.
Signed-off-by: thxCode <thxcode0824@gmail.com>
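The "lowest-free virtual-npu-id, idempotent on re-allocation" rule from the Task 6 commit can be sketched as follows. This is a minimal illustration, not the allocator's code: the map from a container key to its recorded ID stands in for whatever the real implementation reads from the on-disk configs, and starting IDs at 0 is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// lowestFreeID returns the smallest non-negative integer not present in used.
// The input may be unsorted and may contain duplicates.
func lowestFreeID(used []int) int {
	sorted := append([]int(nil), used...)
	sort.Ints(sorted)
	next := 0
	for _, id := range sorted {
		if id == next {
			next++
		} else if id > next {
			break // found a gap below id
		}
	}
	return next
}

// assignVNPUID (hypothetical helper) reuses the ID already recorded for key,
// which keeps re-allocation idempotent, and otherwise picks the lowest ID not
// held by any other key on the same physical NPU.
func assignVNPUID(onDisk map[string]int, key string) int {
	if id, ok := onDisk[key]; ok {
		return id
	}
	used := make([]int, 0, len(onDisk))
	for _, id := range onDisk {
		used = append(used, id)
	}
	return lowestFreeID(used)
}

func main() {
	onDisk := map[string]int{"pod-a/ctr": 0, "pod-b/ctr": 2}
	fmt.Println(assignVNPUID(onDisk, "pod-a/ctr")) // prints 0 (same container, same ID)
	fmt.Println(assignVNPUID(onDisk, "pod-c/ctr")) // prints 1 (gap between 0 and 2 is reused)
}
```

Because the result depends only on the current on-disk state rather than on a counter, the same inputs always yield the same ID, which is what "level-based" suggests here.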
Copilot AI review requested due to automatic review settings June 26, 2026 12:43
@thxCode thxCode force-pushed the feat/accelerator-soft-slicing-runtime-isolation-v2 branch from 0f4536a to bbb65fb Compare June 26, 2026 12:43

Copilot AI left a comment

Pull request overview

Copilot reviewed 54 out of 55 changed files in this pull request and generated 2 comments.

Comment thread pkg/deviceplugin/controller.go Outdated
Comment thread pkg/deviceplugin/gc.go
@thxCode thxCode force-pushed the feat/accelerator-soft-slicing-runtime-isolation-v2 branch from bbb65fb to 4bb7501 Compare June 26, 2026 13:21
thxCode added 5 commits June 26, 2026 21:33
…ifier

Widen the DevicesReconciler notifier from chan struct{} to chan []string so it
carries the node's live pod-UUID set on each reconcile, and add a level-based
podDirGC that the Sliced server runs in ListAndWatch: it seeds from on-disk
pod directories and removes one only after its UUID has been absent from the
live set for 3 consecutive reconciles. Exclusive/Shared servers keep emitting
the ListAndWatch response and ignore the payload.

Task 7 of accelerator-soft-slicing-runtime-isolation.
Signed-off-by: thxCode <thxcode0824@gmail.com>
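The miss-streak rule in the Task 7 commit (remove a pod directory only after its UUID has been absent for 3 consecutive reconciles) can be sketched like this. The type and function names are hypothetical, the removal step is a callback instead of a real directory delete, and the sketch treats an empty live set as a miss for every tracked pod, which may differ from how podDirGC handles nil/empty sets.

```go
package main

import "fmt"

// missThreshold mirrors the "3 consecutive reconciles" in the commit message.
const missThreshold = 3

// dirGC tracks, per pod UID, how many consecutive reconciles the UID has been
// missing from the live set.
type dirGC struct {
	misses map[string]int
	remove func(uid string) // stand-in for deleting the pod's working directory
}

// newDirGC seeds the tracker from the pod directories found on disk.
func newDirGC(onDisk []string, remove func(string)) *dirGC {
	g := &dirGC{misses: make(map[string]int, len(onDisk)), remove: remove}
	for _, uid := range onDisk {
		g.misses[uid] = 0
	}
	return g
}

// reconcile resets the streak of every live pod and advances the streak of
// every tracked pod that is absent, removing it once the threshold is reached.
func (g *dirGC) reconcile(live []string) {
	liveSet := make(map[string]struct{}, len(live))
	for _, uid := range live {
		liveSet[uid] = struct{}{}
		g.misses[uid] = 0
	}
	for uid, n := range g.misses {
		if _, ok := liveSet[uid]; ok {
			continue
		}
		if n+1 >= missThreshold {
			g.remove(uid)
			delete(g.misses, uid)
			continue
		}
		g.misses[uid] = n + 1
	}
}

func main() {
	var removed []string
	gc := newDirGC([]string{"pod-a", "pod-b"}, func(uid string) {
		removed = append(removed, uid)
	})
	for i := 0; i < 3; i++ {
		gc.reconcile([]string{"pod-a"}) // pod-b is missing each time
	}
	fmt.Println(removed) // prints "[pod-b]"
}
```

Requiring several consecutive misses trades a little reclamation latency for tolerance of a single stale or partial live set, so a directory belonging to a running pod is not deleted on one bad reconcile.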
Add the nvbuild-12 / nvbuild-13 stages that compile HAMi-core (libvgpu.so)
against the nvidia/cuda 12.9.2 / 13.0.3 cudnn-devel-ubi8 images (install cmake,
run HAMi-core build.sh via build-libvgpu.sh) and copy the products into
${GPUSTACK_LIB_DIR}/nvidia/cuda-{12,13} in the final image.

Both stages verified-built locally (libvgpu.so ~700KB each); HAMi-core compiles
cleanly against cuda-13.

Task 8 of accelerator-soft-slicing-runtime-isolation.
Signed-off-by: thxCode <thxcode0824@gmail.com>
Ship /etc/gpustack/lib/nvidia/ld.so.preload (0644) pointing at the in-container
HAMi-core mount /usr/local/vgpu/libvgpu.so, installed in the final stage
alongside the cuda-* libs so the device-manager init container stages it onto
the host.

Task 9 of accelerator-soft-slicing-runtime-isolation.
Signed-off-by: thxCode <thxcode0824@gmail.com>
…ontainers

Rewrite the NVIDIA Sliced GetContainerAllocateResponse to apply real
soft-slicing isolation:
- set CUDA_DEVICE_SM_LIMIT (floor(R*100)) and a per-card CUDA_DEVICE_MEMORY_LIMIT_<i>
  (MiB->KiB, scaled by R) from the .sliced.units request, plus the shared cache env;
- create the per-container /tmp/vgpulock, pods/<X>, and pods/<X>/tmp/vgpu dirs (0777);
- mount the staged libvgpu.so + ld.so.preload + the vgpu cache + /dev/shm, keeping
  the GPU visible via NVIDIA_VISIBLE_DEVICES so HAMi-core enforces the limits.

Completes the soft-slicing runtime-isolation spec (Status: Built).

Task 10 of accelerator-soft-slicing-runtime-isolation.
Signed-off-by: thxCode <thxcode0824@gmail.com>
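The quota arithmetic in the Task 10 commit can be worked through with a short sketch. The environment variable names come from the commit message; everything else is an assumption for illustration: the struct, the 0-based card index, and scaling in MiB before converting to KiB. The exact string syntax HAMi-core expects for the memory value is not shown in this conversation, so the sketch stops at the numbers.

```go
package main

import (
	"fmt"
	"math"
)

// nvidiaQuota holds the numbers described in the commit message before they
// are serialized into environment variables.
type nvidiaQuota struct {
	SMLimitPercent int              // value for CUDA_DEVICE_SM_LIMIT
	MemoryLimitKiB map[string]int64 // keyed by CUDA_DEVICE_MEMORY_LIMIT_<i>
}

// quotaFor (hypothetical helper) derives the compute and memory limits for a
// container that requested the fraction ratio of each card in cardMemoryMiB.
func quotaFor(ratio float64, cardMemoryMiB []int64) nvidiaQuota {
	q := nvidiaQuota{
		SMLimitPercent: int(math.Floor(ratio * 100)),
		MemoryLimitKiB: make(map[string]int64, len(cardMemoryMiB)),
	}
	for i, mib := range cardMemoryMiB {
		scaledMiB := int64(float64(mib) * ratio) // truncates toward zero
		name := fmt.Sprintf("CUDA_DEVICE_MEMORY_LIMIT_%d", i)
		q.MemoryLimitKiB[name] = scaledMiB * 1024 // MiB -> KiB
	}
	return q
}

func main() {
	// Half of one 24 GiB (24576 MiB) card.
	q := quotaFor(0.5, []int64{24576})
	fmt.Println(q.SMLimitPercent, q.MemoryLimitKiB["CUDA_DEVICE_MEMORY_LIMIT_0"])
	// prints "50 12582912"
}
```

For R = 0.5 this gives a 50% SM limit and 12288 MiB (12582912 KiB) of device memory for the card.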
- chart-e2e: clarify it validates the chart contract (install/startup/image/
  version), not feature/behavioral e2e
- architecture.md: note the device-plugin allocator soft-slicing runtime
  isolation in Stage 2
- spec: align task descriptions with the inline-clone packaging (no submodule)
  and record the GPU-less injection-e2e blocker as a next-spec follow-up

The sliced soft-slicing injection e2e (CASE 6) is intentionally NOT added here:
GPU-less simulation can't drive the device-plugin Allocate (detector is
hardware-gated), so it's deferred to a spec that adds a detector simulation mode.

Signed-off-by: thxCode <thxcode0824@gmail.com>
Copilot AI review requested due to automatic review settings June 26, 2026 13:33
@thxCode thxCode force-pushed the feat/accelerator-soft-slicing-runtime-isolation-v2 branch from 4bb7501 to babb2cd Compare June 26, 2026 13:33

Copilot AI left a comment

Pull request overview

Copilot reviewed 54 out of 55 changed files in this pull request and generated 3 comments.

command -v git >/dev/null 2>&1 || { apt-get update -y && apt-get install -y --no-install-recommends git ca-certificates && rm -rf /var/lib/apt/lists/*; }
rm -rf /tmp/ubs-virt && git init -q /tmp/ubs-virt && cd /tmp/ubs-virt
git remote add origin https://gitcode.com/openeuler/ubs-virt.git
git fetch --depth 1 origin "${LIB_UBS_VIRT_ENPU_COMMIT}" || git fetch origin

set -exo pipefail
rm -rf /tmp/hami-core-src && git init -q /tmp/hami-core-src && cd /tmp/hami-core-src
git remote add origin https://github.com/Project-HAMi/HAMi-core.git
git fetch --depth 1 origin "${LIB_HAMI_CORE_COMMIT}" || git fetch origin
Comment on lines +87 to +113
// Sliced containers leave per-pod working directories on the host; reclaim them
// as their pods disappear from the reconciler's live pod-UUID set.
var gc *podDirGC
if s.AllocationMode == workercore.DeviceAllocationModeSliced {
	gc = newPodDirGC(OperatorPodsDir)
}

// Watch for updates and send ListAndWatch response whenever there's a change.
s.Logger.Info("watching for device updates")
for {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case livePodUIDs := <-notifier:
		resp, err := s.getListAndWatchResponse(ctx)
		if err != nil {
			s.Logger.Error(err, "get list and watch response on update")
			return err
		}
		if err = srv.Send(resp); err != nil {
			s.Logger.Error(err, "send list and watch response")
			return err
		}
		s.Logger.Info("sent list and watch response")
		if gc != nil {
			gc.reconcile(livePodUIDs)
		}
@thxCode thxCode force-pushed the feat/accelerator-soft-slicing-runtime-isolation-v2 branch from babb2cd to 9640796 Compare June 26, 2026 16:09
Copilot AI review requested due to automatic review settings June 26, 2026 16:12
@thxCode thxCode force-pushed the feat/accelerator-soft-slicing-runtime-isolation-v2 branch from 9640796 to 9dc17b9 Compare June 26, 2026 16:12
@thxCode thxCode force-pushed the feat/accelerator-soft-slicing-runtime-isolation-v2 branch from 9dc17b9 to 4a8136e Compare June 26, 2026 16:14

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@thxCode thxCode force-pushed the feat/accelerator-soft-slicing-runtime-isolation-v2 branch from 4a8136e to e0e7386 Compare June 26, 2026 16:28
Signed-off-by: thxCode <thxcode0824@gmail.com>
Copilot AI review requested due to automatic review settings June 26, 2026 16:37
@thxCode thxCode force-pushed the feat/accelerator-soft-slicing-runtime-isolation-v2 branch from e0e7386 to 250d2d2 Compare June 26, 2026 16:37
@thxCode thxCode merged commit 5829fd5 into main Jun 26, 2026
7 checks passed

Copilot AI left a comment

Pull request overview

Copilot reviewed 54 out of 55 changed files in this pull request and generated 4 comments.

Comment on lines +261 to +265
- name: gpustack-nvidia-vgpulock-dir
  hostPath:
    path: /tmp/vgpulock
    type: Directory
{{- end }}
Comment on lines +139 to +143
// FloorPercent converts a per-card fraction R into the integer compute percent the
// soft-slicing runtimes expect (HAMi-core CUDA_DEVICE_SM_LIMIT, vcann-rt aicore-quota),
// rounding down: floor(R*100).
func FloorPercent(r float64) int {
	return int(r * 100)
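One thing worth checking with a truncating conversion such as the FloorPercent snippet above is binary floating-point representation: a fraction like 0.29 is stored slightly below 0.29, so multiplying by 100 yields 28.999999999999996 and truncation gives 28 rather than 29. Whether that matters depends on how the ratio is produced, which is not shown here. The sketch below compares the truncating form with a tolerant variant; the function names and the 1e-9 epsilon are assumptions for illustration, not something taken from this PR.

```go
package main

import (
	"fmt"
	"math"
)

// floorPercentTruncating mirrors the int(r * 100) form shown in the snippet.
func floorPercentTruncating(r float64) int {
	return int(r * 100)
}

// floorPercentTolerant (hypothetical variant) adds a small epsilon before
// flooring so that values a hair below an integer percent are not rounded
// down by a full point. The epsilon is an illustrative choice.
func floorPercentTolerant(r float64) int {
	return int(math.Floor(r*100 + 1e-9))
}

func main() {
	for _, r := range []float64{0.29, 0.5, 0.999} {
		fmt.Printf("r=%v truncating=%d tolerant=%d\n",
			r, floorPercentTruncating(r), floorPercentTolerant(r))
	}
}
```

For r = 0.29 the truncating form returns 28 while the tolerant form returns 29; for 0.5 and 0.999 both agree (50 and 99). Deriving the percent from integer units instead of a float ratio would avoid the question entirely, if the request is available in that form.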
Comment on lines +205 to +209
"CUDA_DEVICE_SM_LIMIT": strconv.Itoa(deviceplugin.FloorPercent(ratio)),
"CUDA_DEVICE_MEMORY_SHARED_CACHE": ctrVgpuSharedCache,
}
for i := range accels {
limit := int64(float64(accels[i].group.Memory) * ratio)
Comment on lines +266 to +270
_, _ = fmt.Fprintf(&b, "physical-npu-id=%d\n", npuId)
_, _ = fmt.Fprintf(&b, "virtual-npu-id=%d\n", vnpuId)
_, _ = fmt.Fprintf(&b, "aicore-quota=%d\n", deviceplugin.FloorPercent(ratio))
_, _ = fmt.Fprintf(&b, "memory-quota=%d\n", int64(float64(memoryMiB)*ratio))
_, _ = fmt.Fprintf(&b, "shm-id=%s\n", shmID)