Add GKE COS NVIDIA library mounts #3213
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #3213      +/-   ##
==========================================
+ Coverage   41.76%   41.79%   +0.02%
==========================================
  Files         336      336
  Lines       28756    28769      +13
==========================================
+ Hits        12011    12024      +13
  Misses      15941    15941
  Partials      804      804
On GKE Container-Optimized OS nodes the NVIDIA driver libraries live under /home/kubernetes/bin/nvidia/lib64 rather than the path the nvidia-container-runtime expects. Mount them into the agent (and system-probe when privileged mode is enabled) via the ProviderAwareFeature capabilities API so the mount is applied only on the gke-cos provider.
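As a rough illustration of the gating described above, the sketch below returns the library mounts only when the provider is gke-cos, and adds system-probe only in privileged mode. It does not use the operator's ProviderAwareFeature API; the `libMount` type, the `nvidiaLibMounts` function, and the container name strings are hypothetical stand-ins. The host path and the in-container path are the ones quoted in this PR.

```go
package main

import "fmt"

const (
	// Provider name and paths as quoted in this PR.
	gkeCOSProvider   = "gke-cos"
	hostLibPath      = "/home/kubernetes/bin/nvidia/lib64"
	containerLibPath = "/host/run/nvidia/driver/usr/lib/x86_64-linux-gnu"
)

// libMount is a hypothetical, simplified stand-in for a volume plus volume mount.
type libMount struct {
	Container string
	HostPath  string
	MountPath string
}

// nvidiaLibMounts returns the mounts to apply for a given provider.
// It returns nil for any provider other than gke-cos.
func nvidiaLibMounts(provider string, privileged bool) []libMount {
	if provider != gkeCOSProvider {
		return nil
	}
	// Container names here are illustrative strings, not the operator's constants.
	containers := []string{"core-agent"}
	if privileged {
		containers = append(containers, "system-probe")
	}
	mounts := make([]libMount, 0, len(containers))
	for _, c := range containers {
		mounts = append(mounts, libMount{Container: c, HostPath: hostLibPath, MountPath: containerLibPath})
	}
	return mounts
}

func main() {
	fmt.Println(nvidiaLibMounts("other-provider", true))
	fmt.Println(nvidiaLibMounts(gkeCOSProvider, true))
}
```

Returning nil for other providers keeps the caller free of provider checks: it can range over the result unconditionally.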
c94495b to f7d6f98
mbertrone left a comment
LGTM! One non-blocking nit inline.
containers := []apicommon.AgentContainerName{apicommon.CoreAgentContainerName}
if f.isPrivilegedModeEnabled {
	containers = append(containers, apicommon.SystemProbeContainerName)
}
nit (non-blocking): this container-list build duplicates the identical block in Configure (feature.go:83-86), where f.isPrivilegedModeEnabled gates appending SystemProbe the same way. No issue today, but the two can silently drift if the privileged-mode container set ever changes. A small func (f *gpuFeature) targetContainers() []apicommon.AgentContainerName shared by both would keep them in lockstep; the current form reads fine if you'd rather leave it.
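A minimal sketch of the shared helper suggested in this comment might look like the following. The `AgentContainerName` type, its two constants, and their string values are local stand-ins for the `apicommon` definitions so the example compiles on its own; only the `targetContainers` name and the `isPrivilegedModeEnabled` field come from the review.

```go
package main

import "fmt"

// AgentContainerName is a local stand-in for apicommon.AgentContainerName.
type AgentContainerName string

const (
	// String values are assumptions for illustration only.
	CoreAgentContainerName   AgentContainerName = "agent"
	SystemProbeContainerName AgentContainerName = "system-probe"
)

// gpuFeature mirrors only the field the helper needs.
type gpuFeature struct {
	isPrivilegedModeEnabled bool
}

// targetContainers returns the containers the GPU feature applies to:
// always the core agent, plus system-probe when privileged mode is enabled.
func (f *gpuFeature) targetContainers() []AgentContainerName {
	containers := []AgentContainerName{CoreAgentContainerName}
	if f.isPrivilegedModeEnabled {
		containers = append(containers, SystemProbeContainerName)
	}
	return containers
}

func main() {
	fmt.Println((&gpuFeature{isPrivilegedModeEnabled: false}).targetContainers())
	fmt.Println((&gpuFeature{isPrivilegedModeEnabled: true}).targetContainers())
}
```

Both call sites would then range over `f.targetContainers()`, so a change to the privileged-mode container set happens in one place.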
/merge
626f773 into main
What does this PR do?
Mounts GKE COS NVIDIA driver libraries into the Agent containers when GPU monitoring runs on GKE COS nodes.
Motivation
GPU monitoring needs these libraries on GKE COS, where NVIDIA driver libraries live under /home/kubernetes/bin/nvidia/lib64.
Additional Notes
Mirrors DataDog/helm-charts#2764 for the Operator.
The mount is applied for the gke-cos provider. The operator auto-detects this from the node label cloud.google.com/gke-os-distribution=cos, but it can also be forced with the datadoghq.com/provider: gke-cos annotation on the DatadogAgent.
Minimum Agent Versions
Describe your test plan
go test ./internal/controller/datadogagent/feature/gpu
Validated end-to-end on a GKE COS test cluster with a Tesla T4 node:
- datadoghq.com/provider: gke-cos annotation
- /home/kubernetes/bin/nvidia/lib64 at /host/run/nvidia/driver/usr/lib/x86_64-linux-gnu on the core-agent and system-probe containers
- With features.gpu.privilegedMode: true, features.gpu.patchCgroupPermissions: true and override.nodeAgent.hostPID: true (which should be there by default with GPU monitoring enabled; this will be addressed in a separate PR), the core-agent gpu check reports [OK] and discovers GPU 0: Tesla T4
Checklist
- bug, enhancement, refactoring, documentation, tooling, and/or dependencies
- qa/skip-qa label