docs: Add K8s system telemetry tutorial for CPU/Memory/Network metrics by dzier · Pull Request #913 · ai-dynamo/aiperf

dzier · 2026-05-11T20:56:52Z

Summary

Adds comprehensive documentation for collecting system-level metrics (CPU, memory, network, disk I/O) from Kubernetes clusters during AIPerf benchmarks using Prometheus Node Exporter.

Changes

New tutorial: docs/tutorials/k8s-system-telemetry.md
- Three deployment options: DaemonSet (cluster-wide), sidecar (per-pod), and port forwarding (development)
- Key Node Exporter metrics reference (CPU, memory, network, disk, filesystem)
- Complete YAML examples for Kubernetes deployments
- Query examples using jq, pandas, and DuckDB
- Troubleshooting section and best practices
Updated: docs/index.yml - Added new tutorial to "Metrics & Analysis" section
Updated: README.md - Added tutorial link to "Analysis and Monitoring" section

Motivation

Users requested the ability to collect CPU utilization, memory utilization, network metrics, and other server component metrics during benchmarks, similar to the existing --gpu-telemetry functionality.

Solution

AIPerf's Server Metrics Manager already supports collecting from any Prometheus-compatible endpoint via the --server-metrics flag. Prometheus Node Exporter exposes comprehensive system metrics in Prometheus format. This tutorial documents how to:

Deploy Node Exporter in Kubernetes (DaemonSet, sidecar, or port forwarding)
Configure AIPerf to collect metrics using --server-metrics
Combine with GPU telemetry for complete observability
Query and analyze system metrics alongside inference performance

Testing

✅ YAML syntax validation passed
✅ Documentation index validation passed (tools/check_docs_index.py)
✅ File structure follows existing tutorial patterns
✅ Mermaid diagrams render correctly
✅ Links to related documentation verified

Example Usage

# Deploy Node Exporter as DaemonSet
kubectl apply -f node-exporter-daemonset.yaml

# Collect system metrics during benchmark
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --url http://inference-server:8000 \
    --concurrency 4 \
    --request-count 100 \
    --gpu-telemetry \
    --server-metrics http://node-exporter:9100

Checklist

Documentation follows project style guide
Added to docs/index.yml and README.md
Includes practical examples with working YAML manifests
Covers multiple deployment scenarios
Links to related documentation
Signed-off commits (DCO)

Linear Issue: AIP-898

Summary by CodeRabbit

Release Notes

Documentation
- Added comprehensive guide for collecting Kubernetes system-level metrics (CPU, memory, network, disk, filesystem) using Prometheus Node Exporter during benchmarks.
- Covers three deployment approaches, metric interpretation, data analysis techniques, multi-node configuration, and best practices.

Add comprehensive tutorial showing how to collect system-level metrics (CPU, memory, network, disk) from Kubernetes clusters during AIPerf benchmarks using Prometheus Node Exporter. - Document three deployment options: DaemonSet, sidecar, port forwarding - Explain key Node Exporter metrics for benchmarking - Show how to combine with GPU telemetry for complete observability - Include query examples using jq, pandas, and DuckDB - Add troubleshooting section and best practices Closes AIP-898 Signed-off-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: David Zier <dzier@users.noreply.github.com>

copy-pr-bot · 2026-05-11T20:56:56Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

for more information, see https://pre-commit.ci

github-actions · 2026-05-11T20:57:11Z

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@408e446e1487ead4d70496838c6bf7e061165004

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@408e446e1487ead4d70496838c6bf7e061165004

Last updated for commit: 408e446 • Browse code

github-actions · 2026-05-11T20:57:35Z

Fern Docs Preview: generation failed — see the Actions log for details. This does not block merge; ask a maintainer to retry if needed.

coderabbitai · 2026-05-11T20:58:09Z

Walkthrough

Added a new documentation tutorial page for collecting Kubernetes system-level metrics (CPU, memory, network, disk, filesystem) using Prometheus Node Exporter during AIPerf benchmarks. Updated README and docs navigation to surface the tutorial. Documentation covers architecture, three deployment approaches, metrics reference, output formats, querying workflows, multi-node configuration, GPU telemetry integration, troubleshooting, and a complete observability stack example.

Changes

Kubernetes System Telemetry Tutorial

Layer / File(s)	Summary
Documentation Discovery `README.md`, `docs/index.yml`	Added K8s System Telemetry tutorial entry to README's "Tutorials and Feature Guides" list and navigation index under "Metrics & Analysis".
Tutorial Introduction `docs/tutorials/k8s-system-telemetry.md`	Introduced tutorial scope covering CPU, memory, network, disk, and filesystem metrics collection from Kubernetes using Prometheus Node Exporter.
Architecture & AIPerf Integration `docs/tutorials/k8s-system-telemetry.md`	Described how AIPerf scrapes Node Exporter `/metrics` endpoints to collect system metrics alongside benchmark metrics.
Node Exporter Deployment `docs/tutorials/k8s-system-telemetry.md`	Documented three deployment approaches: cluster-wide DaemonSet with full manifest, per-pod sidecar with example inference-server pod, and port-forwarding for dev/testing, each with verification and example AIPerf runs.
System Metrics Reference `docs/tutorials/k8s-system-telemetry.md`	Listed key Node Exporter metrics across CPU, memory, load, network I/O, disk I/O, and filesystem with example calculations for CPU utilization, memory usage, and network throughput.
AIPerf Export Formats `docs/tutorials/k8s-system-telemetry.md`	Documented how Node Exporter metrics appear in AIPerf server metric exports with concrete JSON and CSV examples showing labeled series, stats, and column layout.
Multi-Node & GPU Integration `docs/tutorials/k8s-system-telemetry.md`	Added guidance for multi-node benchmarking with metric tagging and endpoint aggregation; showed combining system telemetry with GPU telemetry (DCGM Exporter) in a single AIPerf run.
Querying & Analysis `docs/tutorials/k8s-system-telemetry.md`	Provided example workflows for querying collected metrics using `jq` (JSON), pandas (CSV), and DuckDB (Parquet) with filtering and aggregation patterns.
Troubleshooting, Best Practices & Stack Example `docs/tutorials/k8s-system-telemetry.md`	Included troubleshooting table addressing reachability and metric issues, best-practices checklist for production/dev/analysis, related documentation links, and a complete end-to-end observability stack example integrating Node Exporter, DCGM Exporter, and AIPerf.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Poem

🐰 Hop along with Node Exporter's gleam,
CPU, memory, network—a telemetry dream!
AIPerf now gathers what Kubernetes reveals,
From DaemonSet to sidecar, the observability squeals!
System metrics and GPU dance in the logs, 🎯

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title accurately summarizes the main change: adding a comprehensive Kubernetes system telemetry tutorial covering CPU/memory/network metrics collection for AIPerf benchmarks.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/tutorials/k8s-system-telemetry.md`:
- Around line 473-477: The CSV section assignments are reversed: currently
gauges = pd.read_csv(StringIO(sections[0])) and counters =
pd.read_csv(StringIO(sections[1])) but the file example has counters first and
gauges second; swap the assignments so counters =
pd.read_csv(StringIO(sections[0])) and gauges =
pd.read_csv(StringIO(sections[1])) to ensure cpu_stats and counter-derived data
are read from the correct section (refer to the variables sections, gauges,
counters in the snippet).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 14d286db-a406-4447-8d60-da99888129c5

📥 Commits

Reviewing files that changed from the base of the PR and between 79de74e and 408e446.

📒 Files selected for processing (3)

README.md
docs/index.yml
docs/tutorials/k8s-system-telemetry.md

coderabbitai · 2026-05-11T20:59:49Z

+with open("server_metrics_export.csv") as f:
+    sections = f.read().strip().split('\n\n')
+    gauges = pd.read_csv(StringIO(sections[0]))
+    counters = pd.read_csv(StringIO(sections[1]))
+


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Swap CSV section assignment in pandas example

Line 475 and Line 476 appear reversed relative to the CSV example above (counters first, gauges second). As written, cpu_stats may fail or return empty data because counter columns are read from the wrong section.

Proposed fix

with open("server_metrics_export.csv") as f: sections = f.read().strip().split('\n\n') - gauges = pd.read_csv(StringIO(sections[0])) - counters = pd.read_csv(StringIO(sections[1])) + counters = pd.read_csv(StringIO(sections[0])) + gauges = pd.read_csv(StringIO(sections[1]))

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

with open("server_metrics_export.csv") as f:

sections = f.read().strip().split('\n\n')

gauges = pd.read_csv(StringIO(sections[0]))

counters = pd.read_csv(StringIO(sections[1]))

with open("server_metrics_export.csv") as f:

sections = f.read().strip().split('\n\n')

counters = pd.read_csv(StringIO(sections[0]))

gauges = pd.read_csv(StringIO(sections[1]))

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/tutorials/k8s-system-telemetry.md` around lines 473 - 477, The CSV section assignments are reversed: currently gauges = pd.read_csv(StringIO(sections[0])) and counters = pd.read_csv(StringIO(sections[1])) but the file example has counters first and gauges second; swap the assignments so counters = pd.read_csv(StringIO(sections[0])) and gauges = pd.read_csv(StringIO(sections[1])) to ensure cpu_stats and counter-derived data are read from the correct section (refer to the variables sections, gauges, counters in the snippet).

codecov · 2026-05-11T21:07:50Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

dynamo-ops · 2026-05-13T02:40:26Z

+    --server-metrics $NODE_EXPORTERS
+```
+
+### Option 2: Sidecar Container (Per-Pod Monitoring)


This sidecar section presents Node Exporter as per-pod monitoring, but the shown host /proc and /sys mounts expose node-level metrics, so users will attribute whole-node CPU/memory/network to one pod. Fix: relabel this option as node-host telemetry for the pod's node or document a pod-level source such as cAdvisor/kubelet metrics.

dynamo-ops · 2026-05-13T02:40:26Z

+Node Exporter metrics are included in the CSV export with labels expanded into columns:
+
+```csv
+endpoint_url,metric_name,metric_type,cpu,mode,total,rate,rate_avg,rate_min,rate_max,rate_std


The CSV sample uses lowercase endpoint_url/metric_name columns and omits the metadata comment block, but ServerMetricsCsvExporter writes sections with comments and Endpoint,Type,Metric,Unit,..., so the sample and pandas filters are not executable. Fix: update the CSV sample and pandas code to use the real exported headers and comment="#" when reading.

dynamo-ops · 2026-05-13T02:40:26Z

+```
+
+This will produce:
+- `gpu_telemetry_export.json` - GPU metrics (power, utilization, memory)


The listed GPU telemetry artifact has the wrong extension because AIPerf writes gpu_telemetry_export.jsonl, not gpu_telemetry_export.json, so users following this path will look for a non-existent file. Fix: change the filename to gpu_telemetry_export.jsonl.

[pre-commit.ci] auto fixes from pre-commit.com hooks

408e446

for more information, see https://pre-commit.ci

github-actions Bot added the docs label May 11, 2026

coderabbitai Bot reviewed May 11, 2026

View reviewed changes

dynamo-ops reviewed May 13, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs: Add K8s system telemetry tutorial for CPU/Memory/Network metrics#913

docs: Add K8s system telemetry tutorial for CPU/Memory/Network metrics#913
dzier wants to merge 2 commits into
mainfrom
cursor/k8s-system-telemetry-docs-69db

dzier commented May 11, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

copy-pr-bot Bot commented May 11, 2026

Uh oh!

github-actions Bot commented May 11, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 11, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 11, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 11, 2026

Uh oh!

codecov Bot commented May 11, 2026

Uh oh!

dynamo-ops May 13, 2026

Uh oh!

dynamo-ops May 13, 2026

Uh oh!

dynamo-ops May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

dzier commented May 11, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Motivation

Solution

Testing

Example Usage

Related Documentation

Checklist

Summary by CodeRabbit

Release Notes

Uh oh!

copy-pr-bot Bot commented May 11, 2026

Uh oh!

github-actions Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Try out this PR

Uh oh!

github-actions Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 11, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented May 11, 2026

Codecov Report

Uh oh!

dynamo-ops May 13, 2026

Choose a reason for hiding this comment

Uh oh!

dynamo-ops May 13, 2026

Choose a reason for hiding this comment

Uh oh!

dynamo-ops May 13, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dzier commented May 11, 2026 •

edited by coderabbitai Bot

Loading

github-actions Bot commented May 11, 2026 •

edited

Loading

github-actions Bot commented May 11, 2026 •

edited

Loading

coderabbitai Bot commented May 11, 2026 •

edited

Loading