Commit 9b449e0

test: strip GPU monitor sidecar from cluster e2e overlay
The GCE cluster e2e jobs now deploy NPD via test/e2e/manifests/deploy-npd.sh, which kustomize-builds the production deployment/ manifests and waits for the node-problem-detector DaemonSet to roll out.

This PR's deployment/node-problem-detector.yaml wires in the example GPU monitor sidecar (gpu-monitor:latest image, NVIDIA host devices, a gRPC socket) and adds --config.external-monitor on the main container. None of that is available on the GPU-less GCE e2e nodes: the sidecar image cannot be pulled, so the DaemonSet never becomes Ready and deploy-npd.sh's rollout wait fails.

Patch the e2e overlay to drop the GPU sidecar, its NVIDIA volumes, and the --config.external-monitor flag so e2e exercises only the standard system-log monitors. The production deployment manifest is unchanged.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
1 parent 2ae6849 commit 9b449e0

1 file changed

Lines changed: 35 additions & 0 deletions

test/e2e/manifests/kustomization.yaml

@@ -10,3 +10,38 @@ images:
 - name: registry.k8s.io/node-problem-detector/node-problem-detector
   newName: gcr.io/k8s-staging-npd/node-problem-detector
   newTag: master
+
+# The production manifest wires in the example GPU monitor sidecar
+# (examples/external-plugins/gpu-monitor): an image NPD does not build, NVIDIA
+# host devices, and a gRPC socket. None of that exists on the GPU-less GCE e2e
+# nodes, so the DaemonSet would never become Ready. Strip the GPU bits here so
+# e2e exercises just the standard system-log monitors; production is unchanged.
+patches:
+- target:
+    group: apps
+    version: v1
+    kind: DaemonSet
+    name: node-problem-detector
+  patch: |-
+    apiVersion: apps/v1
+    kind: DaemonSet
+    metadata:
+      name: node-problem-detector
+    spec:
+      template:
+        spec:
+          containers:
+          - name: node-problem-detector
+            command:
+            - /node-problem-detector
+            - --logtostderr
+            - --config.system-log-monitor=/config/kernel-monitor.json,/config/readonly-monitor.json,/config/docker-monitor.json
+          - name: gpu-monitor
+            $patch: delete
+          volumes:
+          - name: nvidia-dev
+            $patch: delete
+          - name: nvidia-uvm
+            $patch: delete
+          - name: nvidia-ml
+            $patch: delete
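
One way to sanity-check the overlay locally, before relying on the e2e job, is to render it and confirm that none of the GPU-specific pieces named in the commit message survive. The sketch below is not part of this commit: `check_no_gpu` is a hypothetical helper, and it assumes `kustomize` is on the PATH and that the command is run from the repository root. It is a coarse text match, so it would also flag any other rendered resource (for example a ConfigMap) that happens to mention these names.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repository): reads rendered manifests on
# stdin and fails if any GPU-related name from the commit message remains.
check_no_gpu() {
  # grep -q exits 0 when a match is found, which here means the patch
  # did not strip everything it was meant to strip.
  if grep -Eq -e 'gpu-monitor|nvidia-(dev|uvm|ml)|config\.external-monitor'; then
    echo "GPU-specific configuration still present in rendered manifests" >&2
    return 1
  fi
}

# Assumed usage, from the repository root:
#   kustomize build test/e2e/manifests | check_no_gpu
```

A non-zero exit status suggests that the sidecar, one of the `nvidia-*` volumes, or the `--config.external-monitor` flag is still in the rendered output and that the patch target or container names may not match the production manifest.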
