Add GKE COS NVIDIA library mounts #3213

Merged
gh-worker-dd-mergequeue-cf854d[bot] merged 1 commit into main from guillermo.julian/add-gke-cos-gpu-library-mounts on Jul 3, 2026

@gjulianm gjulianm commented Jul 2, 2026

Contributor

What does this PR do?

Mounts GKE COS NVIDIA driver libraries into the Agent containers when GPU monitoring runs on GKE COS nodes.

Motivation

GPU monitoring needs these libraries on GKE COS, where NVIDIA driver libraries live under /home/kubernetes/bin/nvidia/lib64.

Additional Notes

Mirrors DataDog/helm-charts#2764 for the Operator.

The mount is applied only for the gke-cos provider. The operator auto-detects this provider from the node label cloud.google.com/gke-os-distribution=cos, but it can also be forced with the following annotation on the DatadogAgent:

metadata:
  annotations:
    datadoghq.com/provider: gke-cos
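
The provider selection described above can be sketched roughly as follows. This is an illustration only, not the operator's code: `resolveProvider` and the `default` fallback name are hypothetical, and the precedence (annotation first, node label second) is an assumption drawn from the word "forced". Only the label, the annotation key, and the `gke-cos` value come from this description.

```go
package main

import "fmt"

const (
	// Label, annotation key, and provider name as quoted in this PR description.
	gkeOSDistributionLabel = "cloud.google.com/gke-os-distribution"
	providerAnnotation     = "datadoghq.com/provider"
	gkeCOSProvider         = "gke-cos"

	// Hypothetical fallback name, used only for this illustration.
	defaultProvider = "default"
)

// resolveProvider is a hypothetical helper. A non-empty annotation on the
// DatadogAgent wins (assumed precedence); otherwise the node label decides.
func resolveProvider(agentAnnotations, nodeLabels map[string]string) string {
	// Reading from a nil map is safe in Go and yields the zero value.
	if forced := agentAnnotations[providerAnnotation]; forced != "" {
		return forced
	}
	if nodeLabels[gkeOSDistributionLabel] == "cos" {
		return gkeCOSProvider
	}
	return defaultProvider
}

func main() {
	cosNode := map[string]string{gkeOSDistributionLabel: "cos"}
	forced := map[string]string{providerAnnotation: gkeCOSProvider}

	fmt.Println(resolveProvider(nil, cosNode)) // auto-detected from the node label
	fmt.Println(resolveProvider(forced, nil))  // forced through the annotation
	fmt.Println(resolveProvider(nil, nil))     // neither signal present
}
```

Forcing the provider through the annotation is what the end-to-end validation in the test plan used.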

Minimum Agent Versions

  • Agent: N/A
  • Cluster Agent: N/A

Describe your test plan

go test ./internal/controller/datadogagent/feature/gpu

Validated end-to-end on a GKE COS test cluster with a Tesla T4 node:

  • provider forced via the datadoghq.com/provider: gke-cos annotation
  • generated DaemonSet mounts /home/kubernetes/bin/nvidia/lib64 at /host/run/nvidia/driver/usr/lib/x86_64-linux-gnu on the core-agent and system-probe containers
  • system-probe finds the NVML library at the mounted path
  • with features.gpu.privilegedMode: true, features.gpu.patchCgroupPermissions: true, and override.nodeAgent.hostPID: true (which should be set by default when GPU monitoring is enabled; this will be addressed in a separate PR), the core-agent gpu check reports [OK] and discovers GPU 0: Tesla T4
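
For readers who want a concrete picture of the mount validated above, here is a minimal sketch. It uses local stand-in types instead of the Kubernetes API types so it runs with the standard library alone. The two paths come from the test plan; the volume name, the `nvidiaLibraryMount` helper, and the read-only flag are assumptions made for illustration and may differ from what the PR generates.

```go
package main

import "fmt"

// Local stand-ins for the Kubernetes volume types, kept minimal so the
// sketch has no external dependencies.
type hostPathVolume struct {
	Name string
	Path string
}

type volumeMount struct {
	Name      string
	MountPath string
	ReadOnly  bool
}

const (
	// Paths as reported in the test plan.
	cosNvidiaLibHostPath  = "/home/kubernetes/bin/nvidia/lib64"
	cosNvidiaLibMountPath = "/host/run/nvidia/driver/usr/lib/x86_64-linux-gnu"

	// Hypothetical volume name, chosen for this illustration.
	cosNvidiaLibVolumeName = "gke-cos-nvidia-libs"
)

// nvidiaLibraryMount (hypothetical helper) returns the volume/mount pair for
// the gke-cos provider and ok=false for any other provider.
func nvidiaLibraryMount(provider string) (hostPathVolume, volumeMount, bool) {
	if provider != "gke-cos" {
		return hostPathVolume{}, volumeMount{}, false
	}
	vol := hostPathVolume{Name: cosNvidiaLibVolumeName, Path: cosNvidiaLibHostPath}
	mount := volumeMount{
		Name:      cosNvidiaLibVolumeName,
		MountPath: cosNvidiaLibMountPath,
		ReadOnly:  true, // assumption: the Agent only reads the libraries
	}
	return vol, mount, true
}

func main() {
	for _, provider := range []string{"gke-cos", "default"} {
		vol, mount, ok := nvidiaLibraryMount(provider)
		if !ok {
			fmt.Printf("%s: no NVIDIA library mount\n", provider)
			continue
		}
		fmt.Printf("%s: %s -> %s (readOnly=%t)\n", provider, vol.Path, mount.MountPath, mount.ReadOnly)
	}
}
```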

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed (see: signing commits)

@gjulianm gjulianm self-assigned this Jul 2, 2026
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.79%. Comparing base (da5cd49) to head (c94495b).
⚠️ Report is 103 commits behind head on main.

@@            Coverage Diff             @@
##             main    #3213      +/-   ##
==========================================
+ Coverage   41.76%   41.79%   +0.02%     
==========================================
  Files         336      336              
  Lines       28756    28769      +13     
==========================================
+ Hits        12011    12024      +13     
  Misses      15941    15941              
  Partials      804      804              
Flag Coverage Δ
unittests 41.79% <100.00%> (+0.02%) ⬆️

Files with missing lines Coverage Δ
...nal/controller/datadogagent/feature/gpu/feature.go 88.14% <100.00%> (+1.26%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

datadog-datadog-us1-prod Bot commented Jul 2, 2026

Code Coverage

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 45.23% (+0.03%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: f7d6f98

On GKE Container-Optimized OS nodes the NVIDIA driver libraries live under
/home/kubernetes/bin/nvidia/lib64 rather than the path the
nvidia-container-runtime expects. Mount them into the agent (and system-probe
when privileged mode is enabled) via the ProviderAwareFeature capabilities API
so the mount is applied only on the gke-cos provider.
@gjulianm gjulianm force-pushed the guillermo.julian/add-gke-cos-gpu-library-mounts branch from c94495b to f7d6f98 Compare July 3, 2026 08:25
@gjulianm gjulianm marked this pull request as ready for review July 3, 2026 10:31
@gjulianm gjulianm requested a review from a team July 3, 2026 10:31
@gjulianm gjulianm requested a review from a team as a code owner July 3, 2026 10:31

@mbertrone mbertrone left a comment

LGTM! One non-blocking nit inline.

Comment on lines +53 to +56
containers := []apicommon.AgentContainerName{apicommon.CoreAgentContainerName}
if f.isPrivilegedModeEnabled {
	containers = append(containers, apicommon.SystemProbeContainerName)
}

nit (non-blocking): this container-list build duplicates the identical block in Configure (feature.go:83-86), where f.isPrivilegedModeEnabled gates appending SystemProbe the same way. No issue today, but the two can silently drift if the privileged-mode container set ever changes. A small func (f *gpuFeature) targetContainers() []apicommon.AgentContainerName shared by both would keep them in lockstep; the current form reads fine if you'd rather leave it.
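
A minimal sketch of the suggested helper, written so it runs on its own. `AgentContainerName` and the two constants below are local stand-ins for the `apicommon` package, and their string values are assumptions. The `gpuFeature` struct is reduced to the single field the snippet under review relies on.

```go
package main

import "fmt"

// AgentContainerName and the constants below stand in for the apicommon
// package so the sketch compiles alone; the string values are assumptions.
type AgentContainerName string

const (
	CoreAgentContainerName   AgentContainerName = "agent"
	SystemProbeContainerName AgentContainerName = "system-probe"
)

// gpuFeature is trimmed to the one field that gates the container list.
type gpuFeature struct {
	isPrivilegedModeEnabled bool
}

// targetContainers returns the containers that receive GPU-related
// configuration: always the core agent, plus system-probe in privileged mode.
func (f *gpuFeature) targetContainers() []AgentContainerName {
	containers := []AgentContainerName{CoreAgentContainerName}
	if f.isPrivilegedModeEnabled {
		containers = append(containers, SystemProbeContainerName)
	}
	return containers
}

func main() {
	for _, privileged := range []bool{false, true} {
		f := &gpuFeature{isPrivilegedModeEnabled: privileged}
		fmt.Printf("privileged=%t -> %v\n", privileged, f.targetContainers())
	}
}
```

Both call sites would then range over `f.targetContainers()` instead of rebuilding the slice, so a change to the privileged-mode container set happens in one place.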

@tbavelier tbavelier added the enhancement label and removed the qa/skip-qa label Jul 3, 2026
@tbavelier tbavelier added this to the v1.29.0 milestone Jul 3, 2026

gjulianm commented Jul 3, 2026

Contributor Author

/merge

gh-worker-devflow-routing-ef8351 Bot commented Jul 3, 2026

2026-07-03 11:47:29 UTC ℹ️ Start processing command /merge


2026-07-03 11:47:34 UTC ℹ️ MergeQueue: pull request added to the queue

The expected merge time in main is approximately 2h (p90).


2026-07-03 12:56:23 UTC ℹ️ MergeQueue: This merge request was merged

@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d Bot merged commit 626f773 into main Jul 3, 2026
67 of 71 checks passed
@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d Bot deleted the guillermo.julian/add-gke-cos-gpu-library-mounts branch July 3, 2026 12:56