Commit 950f4b4

Add GPU/Device Support and Fix Symlink Deduplication Issues (#176)
* Implement docker device request in CLI

  Now one can pass in the cro runtime and GPU like

      --cro-device-request '{"Count":-1, \
      "Capabilities":[["gpu"]]}' \
      --cro-runtime nvidia \
      ...

* Fix issues with duplicate symlink processing

  Correct issues with duplicate paths from symlink processing, when they both point to the same inode. This fixes a behavior where slimtoolkit would randomly generate 0-byte files for the actual file referenced by the symlink.

* fix: resolve ptrace monitor bugs causing hangs with multiprocess apps

  Infinite loop in getStringParam: when a syscall had a NULL path argument, PtracePeekData returned EIO with count=0. The function silenced EIO, never advanced the pointer, and had no exit condition, burning 100% CPU until exit.

  Fix: return early on NULL pointer and on any PtracePeekData error.

* fix null ref crash when ptrace disabled

* fix: prevent namespace package directories from being excluded by IsSubdir false positive

  OKReturnStatus for checkFile syscalls intentionally accepts ENOENT (-2) to track Python import search paths. When Python imports a namespace package (a directory without __init__.py), it probes for several __init__.* variants -- all returning ENOENT -- before stat'ing the directory itself (success). The radix tree walk in FileActivity() then sees the directory as a prefix of these ghost child paths and marks it IsSubdir=true, excluding it from the slim image. Since the ghost children don't exist on disk, neither the directory nor any files are preserved.

  Add HasSuccessfulAccess to FSActivityInfo, set it only when retVal==0, and require it in the IsSubdir determination so that ENOENT-only child paths cannot cause a parent directory to be dropped.

* example use case with nvidia runtime

* Fix HasSuccessfulAccess for open-type syscalls where success is fd >= 0, not retVal == 0

* Key deduplicateFileMap by (dev, inode) instead of inode alone to avoid false dedup across mounts

* Fail fast on invalid --cro-device-request JSON instead of silently dropping the device config

* Remove unused SIGNAL_PIPE named pipe from test_vllm.sh

* Update pkg/monitor/ptrace/ptrace.go

  Co-authored-by: kilo-code-bot[bot] <240665456+kilo-code-bot[bot]@users.noreply.github.com>

* Update pkg/monitor/ptrace/ptrace.go

  Co-authored-by: kilo-code-bot[bot] <240665456+kilo-code-bot[bot]@users.noreply.github.com>

* patch up kilo-code-bot fix

* correct indentation

* Revert HasSuccessfulAccess to use retVal==0 check instead of OKReturnStatus

  The kilo-code-bot suggested changing the HasSuccessfulAccess condition from `e.retVal == 0 || p.SyscallType() == OpenFileType` to `p.OKReturnStatus(e.retVal)`, arguing that the former incorrectly treated all OpenFileType syscalls as successful. However, the original expression was correct:

  - For OpenFileType (open, openat, readlink): the outer condition on lines 276-278 already filters to successful calls via `p.OKReturnStatus(e.retVal)`, which requires fd >= 0. The `|| p.SyscallType() == OpenFileType` clause is redundant, not wrong -- it simply ensures HasSuccessfulAccess=true for events that already passed the success filter.
  - For CheckFileType (stat, access, lstat): `e.retVal == 0` correctly marks only files that actually exist on disk. The replacement `p.OKReturnStatus(e.retVal)` also accepts ENOENT (-2) and ENOTDIR (-20), causing non-existent "ghost" paths to be marked as HasSuccessfulAccess=true.

  This broke the namespace package fix (da8eb2c): when Python probes for __init__.cpython-311.so, __init__.so, __init__.py in a directory that has none of them (namespace package), those ENOENT stat results created ghost children with HasSuccessfulAccess=true. The FileActivity() radix tree walk then marked the parent directory as IsSubdir and excluded it from the slim image. Since the ghost children don't exist on disk, neither the directory nor its contents were preserved, causing libraries to go missing.

* chore: moved to vLLM v0.17.1-cu130 for testing

* lint: fixed indenting in ptrace.go with gofmt

* add unit tests for bugs related to ghost paths

* Revert OKReturnStatus to success-only and remove HasSuccessfulAccess

  OKReturnStatus for checkFileSyscallProcessor was broadened in 9aca3e9 to accept ENOENT (-2) and ENOTDIR (-20) for Python import tracking. However, ghost paths (non-existent files) that entered fsActivity via this gate were never used by the artifact pipeline: prepareArtifact() discards them at os.Lstat(). Meanwhile they caused three side effects:

  1. Ghost leaf paths leaked into Report.FSActivity and triggered prepareArtifact() calls for non-existent files
  2. Ghost paths were published as MDETypeArtifact events
  3. The permissive OKReturnStatus semantics caused a naming trap that led to a regression (3f8dcb4) where HasSuccessfulAccess was set using OKReturnStatus, breaking namespace package directories

  This reverts OKReturnStatus to the upstream original (retVal == 0), removes the HasSuccessfulAccess field and its plumbing (no longer needed since only real paths enter fsActivity), and simplifies the FileActivity() radix tree walk.

  Tests updated to verify: ENOENT paths are rejected, no ghost paths in reports, namespace package directories are preserved, IsSubdir works correctly for real children.

---------

Co-authored-by: kilo-code-bot[bot] <240665456+kilo-code-bot[bot]@users.noreply.github.com>
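The deduplication change described above (keying by (dev, inode) rather than inode alone) can be illustrated with a minimal sketch. This is not the project's code: `fileID`, `entry`, and `dedupe` are hypothetical names, and the device and inode numbers are hard-coded for illustration rather than read from the filesystem.

```go
package main

import "fmt"

// fileID is a hypothetical map key. An inode number is only unique within
// one filesystem, so the device number is part of the identity.
type fileID struct {
	Dev uint64
	Ino uint64
}

// entry pairs a path with the identity of the file it resolves to.
type entry struct {
	Path string
	ID   fileID
}

// dedupe keeps the first path seen for each (device, inode) pair and
// drops later paths that resolve to the same underlying file.
func dedupe(entries []entry) []string {
	seen := make(map[fileID]struct{})
	var kept []string
	for _, e := range entries {
		if _, dup := seen[e.ID]; dup {
			continue
		}
		seen[e.ID] = struct{}{}
		kept = append(kept, e.Path)
	}
	return kept
}

func main() {
	entries := []entry{
		{"/usr/lib/libfoo.so.1", fileID{Dev: 1, Ino: 42}},
		// A symlink that resolves to the same file: a true duplicate.
		{"/usr/lib/libfoo.so", fileID{Dev: 1, Ino: 42}},
		// Same inode number on a different mount: a different file.
		{"/mnt/data/other", fileID{Dev: 2, Ino: 42}},
	}
	fmt.Println(dedupe(entries))
}
```

With an inode-only key, the third entry would be dropped as a false duplicate of the first; with the composite key it is kept.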
1 parent 5117670 commit 950f4b4

23 files changed

Lines changed: 2778 additions & 36 deletions

examples/nvidia_runtime/.gitignore

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+host-config.json
+vllm_test_results.json
+vllm_test_results_slim.json
+original_log.txt
+slim_log.txt

examples/nvidia_runtime/Dockerfile

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+FROM nvcr.io/nvidia/pytorch:25.04-py3
+
+# Add the tests to the entrypoint set. Docker Slim only traces/monitors the processes started by the entrypoint.
+RUN echo "pytest /opt/pytorch/pytorch/test/test_cuda.py::TestCuda::test_graph_cudnn_dropout" > /opt/nvidia/entrypoint.d/99-trace.sh
+RUN chmod +x /opt/nvidia/entrypoint.d/99-trace.sh
+

examples/nvidia_runtime/README.md

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+As a pre-requisite, install nvidia-container toolkit, including adding the nvidia runtime. Then you should be able to translate runtime and capabilities from a OCI/Docker string like `--runtime=nvidia --gpus all` to `--cro-device-request '{"Count":-1, "Capabilities":[["gpu"]]}' --cro-runtime nvidia`
+
+See the example `test_nvidia_smi.sh`, which slims ubuntu to just the files necessary to run the runtime mounted nvidia-smi. Similarly, see `test_nvidia_pytorch.sh` which minimizes nvidia-pytorch to run a subset of the CUDA tests.
+
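The README maps `--gpus all` to a JSON device request, and the commit message says invalid `--cro-device-request` JSON should now fail fast. The sketch below shows one way such parsing could look. It is an assumption-laden illustration, not the CLI's implementation: `deviceRequest` is a hypothetical local struct that mirrors only the `Driver`, `Count`, and `Capabilities` fields of Docker's device request, and `parseDeviceRequest` is a made-up helper name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// deviceRequest is a hypothetical stand-in for the Docker API's device
// request; only the fields used in the README example are modeled.
type deviceRequest struct {
	Driver       string     `json:"Driver,omitempty"`
	Count        int        `json:"Count"` // -1 requests all available devices
	Capabilities [][]string `json:"Capabilities"`
}

// parseDeviceRequest decodes the flag value and returns an error on
// malformed JSON instead of silently returning an empty request.
func parseDeviceRequest(raw string) (deviceRequest, error) {
	var req deviceRequest
	if err := json.Unmarshal([]byte(raw), &req); err != nil {
		return deviceRequest{}, fmt.Errorf("invalid device request JSON: %w", err)
	}
	return req, nil
}

func main() {
	// The value from the README: the equivalent of `--gpus all`.
	req, err := parseDeviceRequest(`{"Count":-1, "Capabilities":[["gpu"]]}`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("count=%d capabilities=%v\n", req.Count, req.Capabilities)

	// A truncated value is rejected rather than ignored.
	if _, err := parseDeviceRequest(`{"Count":-1,`); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

Returning the error to the caller lets the command abort before a container is started without the requested GPU configuration.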
Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Create host config file with ulimit settings and capabilities
+cat > host-config.json <<'EOF'
+{
+  "IpcMode": "host",
+  "CapAdd": ["SYS_ADMIN"],
+  "Ulimits": [
+    {
+      "Name": "memlock",
+      "Soft": -1,
+      "Hard": -1
+    },
+    {
+      "Name": "stack",
+      "Soft": 67108864,
+      "Hard": 67108864
+    },
+    {
+      "Name": "nofile",
+      "Soft": 1048576,
+      "Hard": 1048576
+    }
+  ]
+}
+EOF
+
+# Build the slim image
+# CAP_SYS_ADMIN is added via host-config.json for fanotify support (required for filesystem monitoring)
+# Build custom image with test in entrypoint first
+echo "Building custom test image with pytest in entrypoint..."
+docker build -t nvcr.io/nvidia/pytorch:25.04-py3-test -f Dockerfile .
+
+echo "Running mint on the test image..."
+mint build \
+    --target nvcr.io/nvidia/pytorch:25.04-py3-test \
+    --tag nvcr.io/nvidia/pytorch:25.04-py3-slim \
+    --cro-host-config-file host-config.json \
+    --cro-shm-size 1200 \
+    --cro-device-request '{"Count":-1, "Capabilities":[["gpu"]]}' \
+    --cro-runtime nvidia \
+    --http-probe=false \
+    --continue-after 10 \
+    --preserve-path /etc/ld.so.conf \
+    --preserve-path /etc/ld.so.conf.d \
+    .
+
+# Get output of original and slim images stored in a log file
+echo "Running original image..."
+docker run --rm --runtime nvidia --gpus all nvcr.io/nvidia/pytorch:25.04-py3-test > original_log.txt 2>&1
+echo "Running slim image..."
+docker run --rm --runtime nvidia --gpus all nvcr.io/nvidia/pytorch:25.04-py3-slim > slim_log.txt 2>&1
+
+# Verify that both logs contain the pytest success message (ignoring timing)
+echo "Checking test results..."
+
+# Look for "X passed" pattern in both logs
+original_passed=$(grep -oE "[0-9]+ passed" original_log.txt | head -1)
+slim_passed=$(grep -oE "[0-9]+ passed" slim_log.txt | head -1)
+
+if [ -z "$original_passed" ]; then
+    echo "Error: Original image test did not pass"
+    echo "Original log tail:"
+    tail -20 original_log.txt
+    exit 1
+fi
+
+if [ -z "$slim_passed" ]; then
+    echo "Error: Slim image test did not pass"
+    echo "Slim log tail:"
+    tail -20 slim_log.txt
+    exit 1
+fi
+
+echo "Original image: $original_passed"
+echo "Slim image: $slim_passed"
+
+if [ "$original_passed" = "$slim_passed" ]; then
+    echo "SUCCESS: Both images passed the same number of tests!"
+else
+    echo "Warning: Different number of tests passed (original: $original_passed, slim: $slim_passed)"
+fi
+
+echo "Successfully minimized nvidia-pytorch to run a subset of the CUDA tests"
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Build the slim image
+mint build --target ubuntu:24.04 --tag ubuntu:24.04-slim --cro-shm-size 1200 --cro-device-request '{"Count":-1, "Capabilities":[["gpu"]]}' --cro-runtime nvidia --http-probe=false --exec "/usr/bin/nvidia-smi" .
+
+# Get output of original and slim images stored in a log file
+docker run --rm --runtime nvidia --gpus all ubuntu:24.04 nvidia-smi > original_log.txt
+docker run --rm --runtime nvidia --gpus all ubuntu:24.04-slim nvidia-smi > slim_log.txt
+
+# verify that both logs include the nvidia-smi output with an assert
+assert_contains() {
+    if ! grep -q "$1" "$2"; then
+        echo "Error: '$1' not found in $2"
+        exit 1
+    fi
+}
+
+# verify that both logs include the nvidia-smi output with an assert
+assert_contains "NVIDIA-SMI" original_log.txt
+assert_contains "NVIDIA-SMI" slim_log.txt
+
