feat: accelerator soft-slicing runtime isolation (NVIDIA HAMi-core / Ascend vcann-rt) #5
Merged
thxCode merged 17 commits on Jun 26, 2026
thxCode added a commit that referenced this pull request on Jun 26, 2026:
Status: Shipped — #5 Signed-off-by: thxCode <thxcode0824@gmail.com>
Pull request overview
Implements soft-slicing runtime isolation in the GPUStack Operator’s device-manager allocator path by shipping vendor preload runtimes (NVIDIA HAMi-core, Ascend vcann-rt) inside the operator image, staging them onto the host via a DaemonSet init container, and injecting /etc/ld.so.preload + per-container quota/mounts during device-plugin Allocate. Adds per-pod working-dir GC and supporting helpers/tests.
Changes:
- Build and package vendor preload artifacts into the operator image (multi-stage Docker builds + pinned inline clones), plus /etc/gpustack/lib/*/ld.so.preload assets and a checksum-guarded host staging helper.
- Extend device-plugin allocation responders to inject preload/quota/mounts for sliced mode (Ascend + NVIDIA) and add a notifier-fed per-pod working-dir GC.
- Update Helm chart DaemonSet to stage libraries onto the host and mount host /tmp; update docs/spec/test fixtures and CI runner sizing.
Reviewed changes
Copilot reviewed 37 out of 38 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| testing/sample/devices/ascend-910b.yaml | Adds Ascend 910B sample Devices fixture used for allocator testing/validation. |
| specs/2026-06-25-accelerator-soft-slicing-runtime-isolation.md | Full spec for soft-slicing runtime isolation packaging/injection/GC. |
| pkg/utils/osx/file.go | Adds MkdirAll helper that enforces leaf permissions (umask-independent). |
| pkg/utils/osx/file_test.go | Unit test for new osx.MkdirAll. |
| pkg/system/storage.go | Adjusts host path layout under .../operator/...; adds privileged/container detection helpers. |
| pkg/internalprocesses/kubernetes/embed.go | Uses system.IsRunningAsPrivileged(); updates air-gapped image link comment. |
| pkg/deviceplugin/types.go | Extends responder interface to include the allocating *core.Container. |
| pkg/deviceplugin/server.go | Threads container through Allocate/PreferredAllocation; adds sliced-mode pod-dir GC hook. |
| pkg/deviceplugin/server_test.go | Updates stub responder signature for new interface. |
| pkg/deviceplugin/helper.go | Adds soft-slicing host path helpers + SliceRatio/FloorPercent. |
| pkg/deviceplugin/helper_test.go | Unit tests for SliceRatio and FloorPercent. |
| pkg/deviceplugin/gc.go | Implements per-pod working-dir GC for sliced mode. |
| pkg/deviceplugin/gc_test.go | Unit tests for GC miss-streak behavior and nil/empty live sets. |
| pkg/deviceplugin/controller.go | Notifier payload widened to []string of live pod UIDs; reconciler collects and sends UIDs. |
| pkg/devicemanager/detector/ascend/device.go | Updates Ascend family mapping comments/values (910B/910C/950). |
| pkg/devicemanager/allocator/thead/deviceplugin.go | Signature update for responder interface (container param). |
| pkg/devicemanager/allocator/nvidia/deviceplugin.go | Adds sliced-mode injection (HAMi-core preload + quota + mounts + dirs). |
| pkg/devicemanager/allocator/nvidia/deviceplugin_test.go | Table-driven tests for NVIDIA sliced injection behavior. |
| pkg/devicemanager/allocator/mthreads/deviceplugin.go | Signature update for responder interface (container param). |
| pkg/devicemanager/allocator/metax/deviceplugin.go | Signature update for responder interface (container param). |
| pkg/devicemanager/allocator/iluvatar/deviceplugin.go | Signature update for responder interface (container param). |
| pkg/devicemanager/allocator/hygon/deviceplugin.go | Signature update for responder interface (container param). |
| pkg/devicemanager/allocator/cambricon/deviceplugin.go | Signature update for responder interface (container param). |
| pkg/devicemanager/allocator/ascend/deviceplugin.go | Adds sliced-mode injection (vcann-rt preload + npu_info.config + mounts). |
| pkg/devicemanager/allocator/ascend/deviceplugin_test.go | Table-driven tests for Ascend sliced injection and vNPU ID assignment. |
| pkg/devicemanager/allocator/amd/deviceplugin.go | Signature update for responder interface (container param). |
| pkg/device/helper.go | Adds device.RuntimeMajor helper for runtime-version directory selection. |
| pkg/device/helper_test.go | Unit tests for device.RuntimeMajor. |
| pack/gpustack-operator/rootfs/usr/bin/copy-dir.sh | New checksum-guarded recursive staging script used by init container. |
| pack/gpustack-operator/rootfs/etc/gpustack/lib/nvidia/ld.so.preload | Adds NVIDIA preload asset pointing at /usr/local/vgpu/libvgpu.so. |
| pack/gpustack-operator/rootfs/etc/gpustack/lib/ascend/ld.so.preload | Adds Ascend preload asset pointing at /opt/enpu/vcann-rt/lib/libvruntime.so. |
| pack/gpustack-operator/external/nvidia/build-libvgpu.sh | Build script for HAMi-core libvgpu.so (build-stage only). |
| pack/gpustack-operator/external/ascend/build-libvnpu.sh | Build script for vcann-rt artifacts with dcmi/HAL build-time stubbing. |
| pack/gpustack-operator/Dockerfile | Adds GPUSTACK_LIB_DIR, pinned inline clones, nvbuild/cannbuild stages, and installs assets/scripts. |
| docs/architecture.md | Documents sliced-mode runtime isolation injection + packaging/staging/GC behavior. |
| deploy/gpustack-operator/chart/templates/device-manager/daemonset.yaml | Adds init container staging /etc/gpustack/lib to host; mounts host /tmp. |
| .github/workflows/ci.yml | Moves CI builds to *-x4 runners for larger multi-stage image builds. |
| .claude/skills/gpustack-operator-chart-e2e/SKILL.md | Clarifies chart-e2e skill scope (chart contract vs feature e2e). |
Signed-off-by: thxCode <thxcode0824@gmail.com>
Add a recursive, sha256-guarded copy-dir.sh into the operator image rootfs (installed to /usr/bin, 0755) for the device-manager init container to stage the soft-slicing library tree onto the host idempotently. Declare the new GPUSTACK_LIB_DIR build arg (= GPUSTACK_CONF_DIR/lib). Task 1 of accelerator-soft-slicing-runtime-isolation. Signed-off-by: thxCode <thxcode0824@gmail.com>
97a433c to ed6c627
ed6c627 to c17541b
c17541b to d017f64
d017f64 to 76b62e9
Ship /etc/gpustack/lib/ascend/ld.so.preload (0644) pointing at the in-container vcann-rt mount /opt/enpu/vcann-rt/lib/libvruntime.so, installed in the final stage alongside the cann-* libs so the device-manager init container stages the whole /etc/gpustack/lib tree onto the host. Task 4 of accelerator-soft-slicing-runtime-isolation. Signed-off-by: thxCode <thxcode0824@gmail.com>
…t container Add a stage-libs init container that runs copy-dir.sh to idempotently stage the in-image /etc/gpustack/lib tree onto the host at /var/lib/gpustack/operator/lib, and mount host /tmp into the device-manager so the allocator can create the shared /tmp/vgpulock the NVIDIA soft-slicing runtime expects. Task 5 of accelerator-soft-slicing-runtime-isolation. Signed-off-by: thxCode <thxcode0824@gmail.com>
…ntainers
Register the Ascend Sliced device-plugin server and rewrite its
GetContainerAllocateResponse to apply real soft-slicing isolation:
- render a per-container npu_info.config (aicore/memory quota from the .sliced.units request, shm-id from the accelerator ID, scheduling-policy=2);
- assign the lowest-free virtual-npu-id per physical NPU by scanning on-disk configs (level-based, idempotent on re-allocation);
- mount the staged libvruntime.so + enpu-monitor + ld.so.preload + the config + /dev/shm into the container.
vcann-rt's npu_info.config is single-NPU, so a sliced Ascend container maps to
one card. Shared OperatorLibDir/OperatorPodsDir/PodWorkDir helpers land in
pkg/deviceplugin for the NVIDIA path to reuse.
Task 6 of accelerator-soft-slicing-runtime-isolation.
Signed-off-by: thxCode <thxcode0824@gmail.com>
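The lowest-free virtual-npu-id assignment can be sketched as follows. This is an illustration of the idea only, not the allocator's code: `assignVNPUID` and `lowestFreeID` are hypothetical names, the in-memory map stands in for scanning on-disk configs, and starting IDs at 0 is an assumption.

```go
package main

import "fmt"

// lowestFreeID returns the smallest non-negative ID not present in used.
// Starting at 0 is an assumption of this sketch.
func lowestFreeID(used map[int]bool) int {
	for id := 0; ; id++ {
		if !used[id] {
			return id
		}
	}
}

// assignVNPUID picks a virtual-npu-id for container on one physical NPU.
// onDisk maps a container name to the ID recorded in its existing config
// (a stand-in for scanning the on-disk npu_info.config files).
func assignVNPUID(onDisk map[string]int, container string) int {
	// Re-allocating a container that already has a config keeps its ID.
	if id, ok := onDisk[container]; ok {
		return id
	}
	used := make(map[int]bool, len(onDisk))
	for _, id := range onDisk {
		used[id] = true
	}
	return lowestFreeID(used)
}

func main() {
	onDisk := map[string]int{"ctr-a": 0, "ctr-b": 2}
	fmt.Println(assignVNPUID(onDisk, "ctr-c")) // 1: fills the gap between 0 and 2
	fmt.Println(assignVNPUID(onDisk, "ctr-b")) // 2: unchanged on re-allocation
}
```

Because the result depends only on the configs currently on disk, calling it again for the same container yields the same ID, which is the level-based, idempotent behavior the commit message describes.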
0f4536a to bbb65fb
bbb65fb to 4bb7501
…ifier
Widen the DevicesReconciler notifier from chan struct{} to chan []string so it
carries the node's live pod-UUID set on each reconcile, and add a level-based
podDirGC that the Sliced server runs in ListAndWatch: it seeds from on-disk
pod directories and removes one only after its UUID has been absent from the
live set for 3 consecutive reconciles. Exclusive/Shared servers keep emitting
the ListAndWatch response and ignore the payload.
Task 7 of accelerator-soft-slicing-runtime-isolation.
Signed-off-by: thxCode <thxcode0824@gmail.com>
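The miss-streak rule (remove a directory only after three consecutive reconciles without its UID) can be modeled in memory as below. This is a simplified sketch, not the podDirGC in pkg/deviceplugin/gc.go: `missStreakGC` and its `remove` callback are hypothetical, directory removal is replaced by a callback, and how the real collector treats nil or empty live sets is not modeled here.

```go
package main

import "fmt"

// missThreshold is the number of consecutive reconciles a UID must be absent
// before its directory is removed (3 in the commit message above).
const missThreshold = 3

// missStreakGC is a hypothetical in-memory model of the collector.
type missStreakGC struct {
	misses map[string]int   // tracked UID -> consecutive misses
	remove func(uid string) // stand-in for deleting the pod directory
}

// newMissStreakGC seeds the tracker, much as the real GC seeds from on-disk directories.
func newMissStreakGC(seed []string, remove func(string)) *missStreakGC {
	g := &missStreakGC{misses: make(map[string]int), remove: remove}
	for _, uid := range seed {
		g.misses[uid] = 0
	}
	return g
}

// reconcile applies one live-set observation.
func (g *missStreakGC) reconcile(live []string) {
	liveSet := make(map[string]bool, len(live))
	for _, uid := range live {
		liveSet[uid] = true
		g.misses[uid] = 0 // seen: start tracking or reset the streak
	}
	for uid, n := range g.misses {
		if liveSet[uid] {
			continue
		}
		if n+1 >= missThreshold {
			g.remove(uid)
			delete(g.misses, uid) // deleting during range is safe in Go
			continue
		}
		g.misses[uid] = n + 1
	}
}

func main() {
	var removed []string
	gc := newMissStreakGC([]string{"pod-a", "pod-b"}, func(uid string) {
		removed = append(removed, uid)
	})
	gc.reconcile([]string{"pod-a"})          // pod-b: miss 1
	gc.reconcile([]string{"pod-a"})          // pod-b: miss 2
	gc.reconcile([]string{"pod-a", "pod-b"}) // pod-b seen again: streak resets
	gc.reconcile([]string{"pod-a"})          // pod-b: miss 1
	gc.reconcile([]string{"pod-a"})          // pod-b: miss 2
	fmt.Println(removed)                     // []
	gc.reconcile([]string{"pod-a"})          // pod-b: miss 3, removed
	fmt.Println(removed)                     // [pod-b]
}
```

Requiring a streak rather than a single miss tolerates a reconcile that briefly observes an incomplete pod list, at the cost of keeping a dead pod's directory for a few extra cycles.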
Add the nvbuild-12 / nvbuild-13 stages that compile HAMi-core (libvgpu.so)
against the nvidia/cuda 12.9.2 / 13.0.3 cudnn-devel-ubi8 images (install cmake,
run HAMi-core build.sh via build-libvgpu.sh) and copy the products into
${GPUSTACK_LIB_DIR}/nvidia/cuda-{12,13} in the final image.
Both stages verified-built locally (libvgpu.so ~700KB each); HAMi-core compiles
cleanly against cuda-13.
Task 8 of accelerator-soft-slicing-runtime-isolation.
Signed-off-by: thxCode <thxcode0824@gmail.com>
Ship /etc/gpustack/lib/nvidia/ld.so.preload (0644) pointing at the in-container HAMi-core mount /usr/local/vgpu/libvgpu.so, installed in the final stage alongside the cuda-* libs so the device-manager init container stages it onto the host. Task 9 of accelerator-soft-slicing-runtime-isolation. Signed-off-by: thxCode <thxcode0824@gmail.com>
…ontainers
Rewrite the NVIDIA Sliced GetContainerAllocateResponse to apply real
soft-slicing isolation:
- set CUDA_DEVICE_SM_LIMIT (floor(R*100)) and a per-card CUDA_DEVICE_MEMORY_LIMIT_<i> (MiB->KiB, scaled by R) from the .sliced.units request, plus the shared cache env;
- create the per-container /tmp/vgpulock, pods/<X>, and pods/<X>/tmp/vgpu dirs (0777);
- mount the staged libvgpu.so + ld.so.preload + the vgpu cache + /dev/shm, keeping the GPU visible via NVIDIA_VISIBLE_DEVICES so HAMi-core enforces the limits.
Completes the soft-slicing runtime-isolation spec (Status: Built).
Task 10 of accelerator-soft-slicing-runtime-isolation.
Signed-off-by: thxCode <thxcode0824@gmail.com>
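As a rough illustration of the quota arithmetic in this commit, the sketch below derives the two limit variables from a per-card fraction R. `slicedEnv` is a hypothetical helper, the shared-cache variable is omitted, and the trailing `k` unit suffix on the memory value is an assumption rather than something the commit message states.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// slicedEnv (hypothetical) builds the quota environment for one container.
// ratio is the per-card fraction R; memMiB holds each allocated card's memory in MiB.
func slicedEnv(ratio float64, memMiB []int64) map[string]string {
	env := map[string]string{
		// Compute share as an integer percent, rounded down.
		"CUDA_DEVICE_SM_LIMIT": strconv.Itoa(int(math.Floor(ratio * 100))),
	}
	for i, total := range memMiB {
		limitMiB := int64(float64(total) * ratio) // scale by R, truncating
		// MiB -> KiB; the "k" suffix is an assumption of this sketch.
		key := "CUDA_DEVICE_MEMORY_LIMIT_" + strconv.Itoa(i)
		env[key] = strconv.FormatInt(limitMiB*1024, 10) + "k"
	}
	return env
}

func main() {
	// A quarter of two cards with 80 GiB and 24 GiB of memory.
	env := slicedEnv(0.25, []int64{81920, 24576})
	fmt.Println(env["CUDA_DEVICE_SM_LIMIT"])       // 25
	fmt.Println(env["CUDA_DEVICE_MEMORY_LIMIT_0"]) // 20971520k
	fmt.Println(env["CUDA_DEVICE_MEMORY_LIMIT_1"]) // 6291456k
}
```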
- chart-e2e: clarify it validates the chart contract (install/startup/image/version), not feature/behavioral e2e
- architecture.md: note the device-plugin allocator soft-slicing runtime isolation in Stage 2
- spec: align task descriptions with the inline-clone packaging (no submodule) and record the GPU-less injection-e2e blocker as a next-spec follow-up
The sliced soft-slicing injection e2e (CASE 6) is intentionally NOT added here:
GPU-less simulation can't drive the device-plugin Allocate (detector is
hardware-gated), so it's deferred to a spec that adds a detector simulation mode.
Signed-off-by: thxCode <thxcode0824@gmail.com>
4bb7501 to babb2cd
    command -v git >/dev/null 2>&1 || { apt-get update -y && apt-get install -y --no-install-recommends git ca-certificates && rm -rf /var/lib/apt/lists/*; }
    rm -rf /tmp/ubs-virt && git init -q /tmp/ubs-virt && cd /tmp/ubs-virt
    git remote add origin https://gitcode.com/openeuler/ubs-virt.git
    git fetch --depth 1 origin "${LIB_UBS_VIRT_ENPU_COMMIT}" || git fetch origin

    set -exo pipefail
    rm -rf /tmp/hami-core-src && git init -q /tmp/hami-core-src && cd /tmp/hami-core-src
    git remote add origin https://github.com/Project-HAMi/HAMi-core.git
    git fetch --depth 1 origin "${LIB_HAMI_CORE_COMMIT}" || git fetch origin
Comment on lines +87 to +113

    + // Sliced containers leave per-pod working directories on the host; reclaim them
    + // as their pods disappear from the reconciler's live pod-UUID set.
    + var gc *podDirGC
    + if s.AllocationMode == workercore.DeviceAllocationModeSliced {
    +     gc = newPodDirGC(OperatorPodsDir)
    + }
    +
      // Watch for updates and send ListAndWatch response whenever there's a change.
      s.Logger.Info("watching for device updates")
      for {
          select {
          case <-ctx.Done():
              return ctx.Err()
    -     case <-notifier:
    +     case livePodUIDs := <-notifier:
              resp, err := s.getListAndWatchResponse(ctx)
              if err != nil {
    -             s.Logger.Error(err, "get list and watch response")
    +             s.Logger.Error(err, "get list and watch response on update")
                  return err
              }
              if err = srv.Send(resp); err != nil {
                  s.Logger.Error(err, "send list and watch response")
                  return err
              }
              s.Logger.Info("sent list and watch response")
    +         if gc != nil {
    +             gc.reconcile(livePodUIDs)
    +         }
babb2cd to 9640796
9640796 to 9dc17b9
9dc17b9 to 4a8136e
4a8136e to e0e7386
Signed-off-by: thxCode <thxcode0824@gmail.com>
e0e7386 to 250d2d2
Comment on lines +261 to +265

    - name: gpustack-nvidia-vgpulock-dir
      hostPath:
        path: /tmp/vgpulock
        type: Directory
    {{- end }}
Comment on lines +139 to +143

    // FloorPercent converts a per-card fraction R into the integer compute percent the
    // soft-slicing runtimes expect (HAMi-core CUDA_DEVICE_SM_LIMIT, vcann-rt aicore-quota),
    // rounding down: floor(R*100).
    func FloorPercent(r float64) int {
        return int(r * 100)
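One thing worth checking in a float-based floor like the snippet above is binary rounding: some fractions that look exact in decimal land just below the intended integer after multiplying by 100. The sketch below demonstrates this with two hypothetical variants; `floorPercentNaive` mirrors the expression shown, while `floorPercentTolerant` and its epsilon are an assumption of this example, not the PR's code.

```go
package main

import (
	"fmt"
	"math"
)

// floorPercentNaive truncates r*100 directly, as in the snippet above.
func floorPercentNaive(r float64) int {
	return int(r * 100)
}

// floorPercentTolerant (hypothetical) adds a small epsilon before flooring so
// that values such as 28.999999999999996 are treated as 29.
func floorPercentTolerant(r float64) int {
	const eps = 1e-9 // assumed tolerance; far smaller than one percent step
	return int(math.Floor(r*100 + eps))
}

func main() {
	r := 0.29
	fmt.Println(floorPercentNaive(r))    // 28, because 0.29*100 is 28.999999999999996 in float64
	fmt.Println(floorPercentTolerant(r)) // 29
	fmt.Println(floorPercentTolerant(0.5))
}
```

If the fraction is originally a ratio of two integers (requested units over total units), computing the percent with integer division before any conversion to float would avoid the issue entirely; whether that fits SliceRatio's actual signature is not shown in this PR excerpt.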
Comment on lines +205 to +209

        "CUDA_DEVICE_SM_LIMIT":            strconv.Itoa(deviceplugin.FloorPercent(ratio)),
        "CUDA_DEVICE_MEMORY_SHARED_CACHE": ctrVgpuSharedCache,
    }
    for i := range accels {
        limit := int64(float64(accels[i].group.Memory) * ratio)
Comment on lines +266 to +270

    _, _ = fmt.Fprintf(&b, "physical-npu-id=%d\n", npuId)
    _, _ = fmt.Fprintf(&b, "virtual-npu-id=%d\n", vnpuId)
    _, _ = fmt.Fprintf(&b, "aicore-quota=%d\n", deviceplugin.FloorPercent(ratio))
    _, _ = fmt.Fprintf(&b, "memory-quota=%d\n", int64(float64(memoryMiB)*ratio))
    _, _ = fmt.Fprintf(&b, "shm-id=%s\n", shmID)
Summary
Implements the deferred runtime isolation Non-Goal of the shipped
accelerator-resource-modes-refactor spec, for the soft-slicing path only (hard slicing, i.e. MIG / Ascend vNPU, stays out of scope). The
DeviceManager allocator's
GetContainerAllocateResponse changes from a bookkeeping-only response into a real injection: a sliced container starts with a vendor preload library activated via
/etc/ld.so.preload (NVIDIA HAMi-core
libvgpu.so; Ascend vcann-rt libvruntime.so + enpu-monitor) and a per-container VRAM/compute quota derived from its
.sliced.units request. The preload libraries are compiled into the operator image per runtime version and staged onto the host by a device-manager init container; per-pod
working directories are reclaimed by a reconciler-fed GC.
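To make the injection concrete, here is a minimal sketch of the bind mounts that activate the NVIDIA preload. The container paths come from this PR (/usr/local/vgpu/libvgpu.so, /etc/ld.so.preload); the host layout is inferred from the staging directory /var/lib/gpustack/operator/lib and the cuda-{12,13} subdirectories, and the `mount` type, the `nvidiaPreloadMounts` helper, and the read-only flag are assumptions. The cache and /dev/shm mounts are left out.

```go
package main

import (
	"fmt"
	"path"
)

// mount is a simplified, hypothetical stand-in for a device-plugin mount entry.
type mount struct {
	HostPath      string
	ContainerPath string
	ReadOnly      bool
}

// hostLibDir is where the init container stages the in-image library tree.
const hostLibDir = "/var/lib/gpustack/operator/lib"

// nvidiaPreloadMounts (hypothetical) returns the two mounts that make the
// dynamic loader pull HAMi-core into every process of the container.
func nvidiaPreloadMounts(cudaMajor string) []mount {
	vendor := path.Join(hostLibDir, "nvidia")
	return []mount{
		{
			// Library built for the matching CUDA major version.
			HostPath:      path.Join(vendor, "cuda-"+cudaMajor, "libvgpu.so"),
			ContainerPath: "/usr/local/vgpu/libvgpu.so",
			ReadOnly:      true, // assumption of this sketch
		},
		{
			// File listing the library path; the loader reads it at process start.
			HostPath:      path.Join(vendor, "ld.so.preload"),
			ContainerPath: "/etc/ld.so.preload",
			ReadOnly:      true, // assumption of this sketch
		},
	}
}

func main() {
	for _, m := range nvidiaPreloadMounts("12") {
		fmt.Printf("%s -> %s (ro=%v)\n", m.HostPath, m.ContainerPath, m.ReadOnly)
	}
}
```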
Spec: specs/2026-06-25-accelerator-soft-slicing-runtime-isolation.md.
Packaging: inline git clone (no submodule)
The preload sources are cloned inline at pinned commits by the Docker build stages
(ARG LIB_UBS_VIRT_ENPU_COMMIT / ARG LIB_HAMI_CORE_COMMIT; git fetch --depth 1 <SHA>), with no git submodule and no
.gitmodules. Both paths were validated locally with docker build --target:
- nvbuild-12 → HAMi-core cloned from GitHub, libvgpu.so produced. ✅
- cannbuild-8-910b → vcann-rt cloned from gitcode, libvruntime.so + enpu-monitor produced. ✅
Both GitHub and gitcode accept git fetch --depth 1 origin <SHA> directly.
Tasks (linear T1–T10)
- copy-dir.sh host-staging helper + GPUSTACK_LIB_DIR
- […] (SliceRatio / FloorPercent; device.RuntimeMajor)
- […] ld.so.preload asset
- […] GetContainerAllocateResponse (npu_info.config + mounts)
- […] ld.so.preload asset
- […] GetContainerAllocateResponse (env + mounts)
Plus: CI runs on self-hosted -x4 runners.
Tests & validation
- […] npu_info.config / mounts) covered by table-driven unit tests (pkg/devicemanager/allocator/{ascend,nvidia}/deviceplugin_test.go): green.
- […] --target (above).
Deferred to a follow-up spec
In-cluster soft-slicing injection e2e on a GPU-less node: the device-plugin
Allocate path is gated on real hardware detection (the detector probes dcmi/NVML), so a mocked
Devices CR can't drive it. A DeviceManager detector simulation mode (read a fixture instead of probing hardware) is the next-spec
seed (recorded in the spec's Follow-ups).