Commit 8fe4112

gpumetrics: collect NVIDIA GPU metrics via NVML with pod enrichment
Add an NVML-based GPU metrics producer (a metricexport.Producer) and wire it into the agent behind --gpu-metrics-enable, exporting over the existing remote-store connection used for profiles.

Collected per device: GPU/memory utilization (sampled, with a current-state fallback), power and the static power limit, graphics/SM/memory/video clocks, temperature, and PCIe throughput; plus per-process GPU/memory utilization. Ported from Polar Signals' standalone gpu-metrics-agent NVML producer.

Per-process metrics are enriched with the same container/pod labels the agent attaches to profiles (namespace, pod, container, ...) via the container metadata provider, so GPU usage and profiles join on identical identity and can be grouped by pod/namespace. This is the per-process attribution dcgm-exporter cannot do. NVML reports host PIDs, so this assumes the agent runs in the host PID namespace.

go-nvml requires cgo and a dynamically linked binary; the producer lives behind the (default) absence of the "nonvml" tag, with a no-op stub (nvidia_stub.go) compiled into the static build. NVML init failure disables the producer rather than failing the agent.
1 parent 378fed7 commit 8fe4112

8 files changed

Lines changed: 948 additions & 0 deletions

flags/flags.go

Lines changed: 11 additions & 0 deletions
@@ -163,9 +163,20 @@ type Flags struct {
 
 	MergeGpuProfiles bool `default:"false" help:"Report GPU kernel timing and GPU PC sampling under a single gpu_time/nanoseconds sample_type, differentiated by a gpu_view label (pc_sample|kernel_time). When false (the default), they are reported as separate sample_types (gpu_kernel_time/nanoseconds and gpu_pcsample/count) with no per-sample labels."`
 
+	GpuMetrics FlagsGpuMetrics `embed:"" prefix:"gpu-metrics-"`
+
 	OTLPLogging bool `default:"false" help:"Forward parca-agent's own logrus output to the remote-store as OTLP log records (in addition to local stderr). Requires a remote-store; ignored in offline mode."`
 }
 
+// FlagsGpuMetrics configures NVML-based GPU metrics collection and OTLP egress
+// (utilization, power, temperature, clocks, PCIe throughput, and per-process
+// GPU utilization). Requires a remote-store and an NVIDIA driver; if NVML can't
+// be initialized the producer is disabled and the agent continues normally.
+type FlagsGpuMetrics struct {
+	Enable         bool          `default:"false" help:"Enable NVML-based GPU metrics collection and OTLP export to the remote-store."`
+	ExportInterval time.Duration `default:"10s" help:"How frequently collected GPU metrics are batched and exported over OTLP."`
+}
+
 type ExitCode int
 
 const (
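
The FlagsGpuMetrics comment says a failed NVML initialization disables the producer while the agent keeps running. The wiring code is not part of this excerpt, so the following is only a minimal sketch of that pattern. Producer, nvmlProducer, newGPUProducer, and errNoDriver are hypothetical names invented for illustration, and the real go-nvml init call is replaced by an injected function.

```go
package main

import (
	"errors"
	"fmt"
)

// Producer is a hypothetical stand-in for metricexport.Producer; the real
// interface is not shown in this diff.
type Producer interface {
	Name() string
}

// nvmlProducer is a placeholder for the NVML-backed producer.
type nvmlProducer struct{}

func (nvmlProducer) Name() string { return "nvml" }

// errNoDriver simulates NVML failing to load on a host without a driver.
var errNoDriver = errors.New("libnvidia-ml not found")

// newGPUProducer calls the supplied init function (an assumption standing in
// for NVML initialization). On failure it logs and reports ok=false so the
// caller can skip GPU metrics instead of aborting startup.
func newGPUProducer(initNVML func() error) (Producer, bool) {
	if err := initNVML(); err != nil {
		fmt.Println("gpu metrics disabled:", err)
		return nil, false
	}
	return nvmlProducer{}, true
}

func main() {
	// Host without a driver: the producer is skipped, the agent goes on.
	if _, ok := newGPUProducer(func() error { return errNoDriver }); !ok {
		fmt.Println("agent continues without GPU metrics")
	}

	// Host with a working driver: the producer is registered.
	if p, ok := newGPUProducer(func() error { return nil }); ok {
		fmt.Println("registered producer:", p.Name())
	}
}
```

Returning a boolean rather than an error keeps the "optional feature" decision at the call site, which is one way to make sure a missing driver never turns into a fatal startup error.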

go.mod

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@ require (
 	buf.build/gen/go/parca-dev/parca/protocolbuffers/go v1.36.11-20260225102827-5fda07223114.1
 	buf.build/gen/go/prometheus/prometheus/protocolbuffers/go v1.36.6-20250320161912-af2aab87b1b3.1
 	github.com/KimMachineGun/automemlimit v0.7.3
+	github.com/NVIDIA/go-nvml v0.12.4-1
 	github.com/alecthomas/kong v1.12.1
 	github.com/alecthomas/kong-yaml v0.2.0
 	github.com/apache/arrow-go/v18 v18.5.2

go.sum

Lines changed: 2 additions & 0 deletions
@@ -24,6 +24,8 @@ github.com/Microsoft/go-winio v0.6.2 h1:F2VQgta7ecxGYO8k3ZZz3RS8fVIXVxONVUPlNERo
 github.com/Microsoft/go-winio v0.6.2/go.mod h1:yd8OoFMLzJbo9gZq8j5qaps8bJ9aShtEA8Ipt1oGCvU=
 github.com/Microsoft/hcsshim v0.12.9 h1:2zJy5KA+l0loz1HzEGqyNnjd3fyZA31ZBCGKacp6lLg=
 github.com/Microsoft/hcsshim v0.12.9/go.mod h1:fJ0gkFAna6ukt0bLdKB8djt4XIJhF/vEPuoIWYVvZ8Y=
+github.com/NVIDIA/go-nvml v0.12.4-1 h1:WKUvqshhWSNTfm47ETRhv0A0zJyr1ncCuHiXwoTrBEc=
+github.com/NVIDIA/go-nvml v0.12.4-1/go.mod h1:8Llmj+1Rr+9VGGwZuRer5N/aCjxGuR5nPb/9ebBiIEQ=
 github.com/alecthomas/assert/v2 v2.11.0 h1:2Q9r3ki8+JYXvGsDyBXwH3LcJ+WK5D0gc5E8vS6K3D0=
 github.com/alecthomas/assert/v2 v2.11.0/go.mod h1:Bze95FyfUr7x34QZrjL+XP+0qgp/zg8yS+TtBj1WA3k=
 github.com/alecthomas/kong v1.12.1 h1:iq6aMJDcFYP9uFrLdsiZQ2ZMmcshduyGv4Pek0MQPW0=

gpumetrics/enrich.go

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
+// Copyright 2026 The Parca Authors
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package gpumetrics
+
+import (
+	"context"
+
+	"github.com/prometheus/prometheus/model/labels"
+	"go.opentelemetry.io/ebpf-profiler/libpf"
+
+	"github.com/parca-dev/parca-agent/reporter/metadata"
+)
+
+// defaultEnrichmentLabels is the curated set of container/pod labels attached to
+// per-process GPU metrics. It is intentionally small: each label combination is
+// a distinct time series, so we keep only stable, low-churn identifiers useful
+// for grouping (namespace, pod, container) and deliberately exclude high-churn
+// or verbose labels (pod_container_image, pod_ip, ...) that would inflate
+// cardinality. These match labels parca-agent already attaches to profiles, so
+// GPU metrics and profiles join on the same pod/container identity.
+var defaultEnrichmentLabels = map[string]struct{}{
+	"namespace":           {},
+	"pod":                 {},
+	"pod_container_name":  {},
+	"pod_container_id":    {},
+	"pod_uid":             {},
+	"pod_controller_kind": {},
+	"pod_controller_name": {},
+}
+
+// ContainerLabelResolver enriches per-process GPU metrics with Kubernetes
+// container/pod labels, using parca-agent's container metadata provider. It
+// implements LabelResolver.
+type ContainerLabelResolver struct {
+	ctx      context.Context
+	provider metadata.MetadataProvider
+	allow    map[string]struct{}
+}
+
+// NewContainerLabelResolver builds a resolver backed by the container metadata
+// provider for the given Kubernetes node. The provider maintains its own caches,
+// so per-PID lookups on the hot path are cheap after the first resolution.
+func NewContainerLabelResolver(ctx context.Context, nodeName string) (*ContainerLabelResolver, error) {
+	provider, err := metadata.NewContainerMetadataProvider(ctx, nodeName)
+	if err != nil {
+		return nil, err
+	}
+	return &ContainerLabelResolver{
+		ctx:      ctx,
+		provider: provider,
+		allow:    defaultEnrichmentLabels,
+	}, nil
+}
+
+// LabelsForPID returns the curated container/pod labels for a host PID. PIDs
+// that don't belong to a container (or that can't be resolved) yield an empty
+// map, leaving the data point with only its pid/comm attributes.
+func (r *ContainerLabelResolver) LabelsForPID(pid uint32) map[string]string {
+	lb := labels.NewBuilder(labels.EmptyLabels())
+	r.provider.AddMetadata(r.ctx, libpf.PID(pid), lb)
+
+	out := make(map[string]string, len(r.allow))
+	lb.Range(func(l labels.Label) {
+		if _, ok := r.allow[l.Name]; ok && l.Value != "" {
+			out[l.Name] = l.Value
+		}
+	})
+	return out
+}
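
The allow-list step in LabelsForPID can be tried in isolation. The sketch below is not the code from this commit: it swaps the Prometheus labels.Builder for a plain map so it runs with the standard library alone, and allowed and filterLabels are hypothetical names using a shortened label set.

```go
package main

import "fmt"

// allowed is a shortened, illustrative version of defaultEnrichmentLabels.
var allowed = map[string]struct{}{
	"namespace":          {},
	"pod":                {},
	"pod_container_name": {},
}

// filterLabels keeps only labels that are on the allow-list and have a
// non-empty value, mirroring the check inside LabelsForPID.
func filterLabels(all map[string]string, allow map[string]struct{}) map[string]string {
	out := make(map[string]string, len(allow))
	for name, value := range all {
		if _, ok := allow[name]; ok && value != "" {
			out[name] = value
		}
	}
	return out
}

func main() {
	all := map[string]string{
		"namespace":          "ml",
		"pod":                "trainer-0",
		"pod_container_name": "",         // allow-listed but empty: dropped
		"pod_ip":             "10.0.0.7", // not allow-listed: dropped
	}
	fmt.Println(filterLabels(all, allowed)) // prints map[namespace:ml pod:trainer-0]
}
```

Dropping empty values as well as unlisted names means a process outside any container ends up with an empty map rather than a set of blank labels.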

gpumetrics/gpumetrics.go

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+// Copyright 2026 The Parca Authors
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// Package gpumetrics collects NVIDIA GPU metrics via NVML and renders them as
+// OTLP metrics for the metricexport egress path. It is a port of Polar Signals'
+// standalone gpu-metrics-agent NVML producer, adapted to run inside parca-agent
+// as a metricexport.Producer.
+//
+// The NVML producer (nvidia.go) requires cgo and a dynamically linked binary
+// so go-nvml can dlopen libnvidia-ml at runtime; it is excluded by the "nonvml"
+// build tag, under which nvidia_stub.go provides a no-op stand-in for the
+// fully-static build. This file holds the declarations shared by both builds.
+package gpumetrics
+
+// ScopeName is the OTLP instrumentation scope GPU metrics are reported under.
+const ScopeName = "parca.nvidia_gpu_metrics"
+
+// LabelResolver resolves additional attributes (e.g. Kubernetes namespace,
+// pod, container) for a process by its host PID. The returned labels are
+// attached to per-process GPU metric data points so they share identity with
+// parca-agent's profiles. NVML reports host PIDs, so the resolver must also
+// operate in the host PID namespace (the standard parca-agent deployment).
+type LabelResolver interface {
+	LabelsForPID(pid uint32) map[string]string
+}
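
Because LabelResolver is a single-method interface, it is easy to substitute in tests. The sketch below shows one possible in-memory implementation and how resolved labels might be merged onto a per-process data point. staticResolver and processAttributes are hypothetical, and the "pid" and "comm" attribute keys are assumptions based on the comment in enrich.go; the producer's actual attribute handling is not shown in this excerpt.

```go
package main

import "fmt"

// LabelResolver matches the interface declared in gpumetrics.go.
type LabelResolver interface {
	LabelsForPID(pid uint32) map[string]string
}

// staticResolver is a hypothetical in-memory resolver keyed by host PID.
type staticResolver map[uint32]map[string]string

// LabelsForPID returns the stored labels, or a nil map for unknown PIDs.
func (s staticResolver) LabelsForPID(pid uint32) map[string]string {
	return s[pid]
}

// processAttributes builds the attribute set for one per-process data point:
// pid/comm first (assumed key names), then any labels the resolver supplies.
func processAttributes(r LabelResolver, pid uint32, comm string) map[string]string {
	attrs := map[string]string{"pid": fmt.Sprint(pid), "comm": comm}
	if r == nil {
		return attrs
	}
	// Ranging over a nil map is a no-op, so unresolved PIDs keep pid/comm only.
	for name, value := range r.LabelsForPID(pid) {
		attrs[name] = value
	}
	return attrs
}

func main() {
	r := staticResolver{
		4242: {"namespace": "ml", "pod": "trainer-0"},
	}
	fmt.Println(processAttributes(r, 4242, "python")) // containerized process
	fmt.Println(processAttributes(r, 7, "bash"))      // host process, no pod labels
}
```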
