
docs: Add K8s system telemetry tutorial for CPU/Memory/Network metrics#913

Open
dzier wants to merge 2 commits into main from cursor/k8s-system-telemetry-docs-69db

Conversation


@dzier dzier commented May 11, 2026

Summary

Adds comprehensive documentation for collecting system-level metrics (CPU, memory, network, disk I/O) from Kubernetes clusters during AIPerf benchmarks using Prometheus Node Exporter.

Changes

  • New tutorial: docs/tutorials/k8s-system-telemetry.md
    • Three deployment options: DaemonSet (cluster-wide), sidecar (per-pod), and port forwarding (development)
    • Key Node Exporter metrics reference (CPU, memory, network, disk, filesystem)
    • Complete YAML examples for Kubernetes deployments
    • Query examples using jq, pandas, and DuckDB
    • Troubleshooting section and best practices
  • Updated: docs/index.yml - Added new tutorial to "Metrics & Analysis" section
  • Updated: README.md - Added tutorial link to "Analysis and Monitoring" section
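
As an illustration of the kind of calculation the metrics reference covers, the sketch below derives memory utilization and network throughput from raw Node Exporter values. It assumes the standard Node Exporter metric names `node_memory_MemTotal_bytes`, `node_memory_MemAvailable_bytes`, and `node_network_receive_bytes_total`; the helper names and the numbers are hypothetical and are not taken from the tutorial.

```python
def memory_utilization(mem_total_bytes: float, mem_available_bytes: float) -> float:
    """Fraction of memory in use, from the gauges node_memory_MemTotal_bytes
    and node_memory_MemAvailable_bytes."""
    if mem_total_bytes <= 0:
        raise ValueError("mem_total_bytes must be positive")
    return 1.0 - mem_available_bytes / mem_total_bytes


def counter_rate(first: float, last: float, elapsed_seconds: float) -> float:
    """Average per-second rate of a cumulative counter, such as
    node_network_receive_bytes_total, between two scrapes."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if last < first:
        # The counter went backwards, which usually means it was reset
        # (for example, the exporter restarted). Count `last` as the increase.
        return last / elapsed_seconds
    return (last - first) / elapsed_seconds


# Illustrative numbers only, not measurements.
mem_used = memory_utilization(mem_total_bytes=64e9, mem_available_bytes=16e9)
rx_rate = counter_rate(first=1_000_000, last=4_000_000, elapsed_seconds=30)
print(f"memory used: {mem_used:.0%}")
print(f"receive throughput: {rx_rate:.0f} bytes/s")
```

Gauges can be read directly at any instant, while counters only become meaningful as a difference over a time window, which is why the two helpers take different inputs.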

Motivation

Users requested the ability to collect CPU utilization, memory utilization, network metrics, and other server component metrics during benchmarks, similar to the existing --gpu-telemetry functionality.

Solution

AIPerf's Server Metrics Manager already supports collecting from any Prometheus-compatible endpoint via the --server-metrics flag. Prometheus Node Exporter exposes comprehensive system metrics in Prometheus format. This tutorial documents how to:

  1. Deploy Node Exporter in Kubernetes (DaemonSet, sidecar, or port forwarding)
  2. Configure AIPerf to collect metrics using --server-metrics
  3. Combine with GPU telemetry for complete observability
  4. Query and analyze system metrics alongside inference performance
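
To make the "Prometheus-compatible endpoint" idea concrete, here is a minimal sketch of parsing the Prometheus text exposition format that Node Exporter serves on `/metrics`. It is not AIPerf's parser; `parse_exposition` and the sample payload are hypothetical, and the sketch ignores timestamps and leaves escape sequences in label values as-is.

```python
import re

# Matches lines such as: name{label="value",...} 123.4
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Return (metric name, labels, value) for every sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Invented payload in the shape Node Exporter produces.
PAYLOAD = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1200.5
node_cpu_seconds_total{cpu="0",mode="user"} 300.25
# TYPE node_load1 gauge
node_load1 0.75
"""

samples = parse_exposition(PAYLOAD)
for name, labels, value in samples:
    print(name, labels, value)
```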

Testing

  • ✅ YAML syntax validation passed
  • ✅ Documentation index validation passed (tools/check_docs_index.py)
  • ✅ File structure follows existing tutorial patterns
  • ✅ Mermaid diagrams render correctly
  • ✅ Links to related documentation verified

Example Usage

# Deploy Node Exporter as DaemonSet
kubectl apply -f node-exporter-daemonset.yaml

# Collect system metrics during benchmark
aiperf profile \
    --model Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --url http://inference-server:8000 \
    --concurrency 4 \
    --request-count 100 \
    --gpu-telemetry \
    --server-metrics http://node-exporter:9100
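
For repeated runs, the same command can be assembled programmatically. The sketch below is one possible way to sweep concurrency; it uses only the flags shown in the example above, and `build_profile_command` is a hypothetical helper, not part of AIPerf.

```python
import shlex


def build_profile_command(server_metrics_url: str, concurrency: int) -> list[str]:
    """Build the argv list for the example benchmark command."""
    return [
        "aiperf", "profile",
        "--model", "Qwen/Qwen3-0.6B",
        "--endpoint-type", "chat",
        "--url", "http://inference-server:8000",
        "--concurrency", str(concurrency),
        "--request-count", "100",
        "--gpu-telemetry",
        "--server-metrics", server_metrics_url,
    ]


# Print one command line per concurrency level (values chosen for illustration).
for concurrency in (1, 4, 16):
    argv = build_profile_command("http://node-exporter:9100", concurrency)
    print(shlex.join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` to execute the benchmark instead of printing it.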

Related Documentation

Checklist

  • Documentation follows project style guide
  • Added to docs/index.yml and README.md
  • Includes practical examples with working YAML manifests
  • Covers multiple deployment scenarios
  • Links to related documentation
  • Signed-off commits (DCO)

Linear Issue: AIP-898


Summary by CodeRabbit

Release Notes

  • Documentation
    • Added comprehensive guide for collecting Kubernetes system-level metrics (CPU, memory, network, disk, filesystem) using Prometheus Node Exporter during benchmarks.
    • Covers three deployment approaches, metric interpretation, data analysis techniques, multi-node configuration, and best practices.

Review Change Stack

Add comprehensive tutorial showing how to collect system-level metrics
(CPU, memory, network, disk) from Kubernetes clusters during AIPerf
benchmarks using Prometheus Node Exporter.

- Document three deployment options: DaemonSet, sidecar, port forwarding
- Explain key Node Exporter metrics for benchmarking
- Show how to combine with GPU telemetry for complete observability
- Include query examples using jq, pandas, and DuckDB
- Add troubleshooting section and best practices

Closes AIP-898

Signed-off-by: Cursor Agent <cursoragent@cursor.com>

Co-authored-by: David Zier <dzier@users.noreply.github.com>

copy-pr-bot Bot commented May 11, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


github-actions Bot commented May 11, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@408e446e1487ead4d70496838c6bf7e061165004

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@408e446e1487ead4d70496838c6bf7e061165004

Last updated for commit: 408e446

@github-actions github-actions Bot added the docs label May 11, 2026

github-actions Bot commented May 11, 2026

Fern Docs Preview: generation failed — see the Actions log for details. This does not block merge; ask a maintainer to retry if needed.


coderabbitai Bot commented May 11, 2026

Walkthrough

Added a new documentation tutorial page for collecting Kubernetes system-level metrics (CPU, memory, network, disk, filesystem) using Prometheus Node Exporter during AIPerf benchmarks. Updated README and docs navigation to surface the tutorial. Documentation covers architecture, three deployment approaches, metrics reference, output formats, querying workflows, multi-node configuration, GPU telemetry integration, troubleshooting, and a complete observability stack example.

Changes

Kubernetes System Telemetry Tutorial

| Layer | File(s) | Summary |
|---|---|---|
| Documentation Discovery | README.md, docs/index.yml | Added K8s System Telemetry tutorial entry to README's "Tutorials and Feature Guides" list and navigation index under "Metrics & Analysis". |
| Tutorial Introduction | docs/tutorials/k8s-system-telemetry.md | Introduced tutorial scope covering CPU, memory, network, disk, and filesystem metrics collection from Kubernetes using Prometheus Node Exporter. |
| Architecture & AIPerf Integration | docs/tutorials/k8s-system-telemetry.md | Described how AIPerf scrapes Node Exporter /metrics endpoints to collect system metrics alongside benchmark metrics. |
| Node Exporter Deployment | docs/tutorials/k8s-system-telemetry.md | Documented three deployment approaches: cluster-wide DaemonSet with full manifest, per-pod sidecar with example inference-server pod, and port-forwarding for dev/testing, each with verification and example AIPerf runs. |
| System Metrics Reference | docs/tutorials/k8s-system-telemetry.md | Listed key Node Exporter metrics across CPU, memory, load, network I/O, disk I/O, and filesystem with example calculations for CPU utilization, memory usage, and network throughput. |
| AIPerf Export Formats | docs/tutorials/k8s-system-telemetry.md | Documented how Node Exporter metrics appear in AIPerf server metric exports with concrete JSON and CSV examples showing labeled series, stats, and column layout. |
| Multi-Node & GPU Integration | docs/tutorials/k8s-system-telemetry.md | Added guidance for multi-node benchmarking with metric tagging and endpoint aggregation; showed combining system telemetry with GPU telemetry (DCGM Exporter) in a single AIPerf run. |
| Querying & Analysis | docs/tutorials/k8s-system-telemetry.md | Provided example workflows for querying collected metrics using jq (JSON), pandas (CSV), and DuckDB (Parquet) with filtering and aggregation patterns. |
| Troubleshooting, Best Practices & Stack Example | docs/tutorials/k8s-system-telemetry.md | Included troubleshooting table addressing reachability and metric issues, best-practices checklist for production/dev/analysis, related documentation links, and a complete end-to-end observability stack example integrating Node Exporter, DCGM Exporter, and AIPerf. |
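
The CPU utilization calculation mentioned in the System Metrics Reference row can be sketched as follows. This is an illustration under stated assumptions, not the tutorial's own formula: it takes two scrapes of the standard `node_cpu_seconds_total` counter, treats only the `idle` mode as idle time (some setups also count `iowait`), and uses a hypothetical `cpu_utilization` helper with invented sample values.

```python
# Maps (cpu, mode) label pairs to cumulative seconds from node_cpu_seconds_total.
Sample = dict[tuple[str, str], float]


def cpu_utilization(before: Sample, after: Sample) -> float:
    """Fraction of CPU time spent outside the idle mode between two scrapes,
    summed over all CPUs."""
    total = 0.0
    idle = 0.0
    for key, end_value in after.items():
        delta = end_value - before.get(key, 0.0)
        if delta < 0:
            # Counter reset between scrapes: count the new value as the increase.
            delta = end_value
        total += delta
        if key[1] == "idle":
            idle += delta
    if total == 0:
        raise ValueError("no CPU time elapsed between samples")
    return 1.0 - idle / total


# Invented values for a single CPU, for illustration only.
before = {("0", "idle"): 100.0, ("0", "user"): 20.0, ("0", "system"): 5.0}
after = {("0", "idle"): 106.0, ("0", "user"): 23.0, ("0", "system"): 6.0}
utilization = cpu_utilization(before, after)
print(f"CPU utilization: {utilization:.0%}")
```

Working from deltas rather than absolute values matters because the counter accumulates since boot, so a single scrape says nothing about load during the benchmark window.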

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Poem

🐰 Hop along with Node Exporter's gleam,
CPU, memory, network—a telemetry dream!
AIPerf now gathers what Kubernetes reveals,
From DaemonSet to sidecar, the observability squeals!
System metrics and GPU dance in the logs, 🎯

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately summarizes the main change: adding a comprehensive Kubernetes system telemetry tutorial covering CPU/memory/network metrics collection for AIPerf benchmarks. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/tutorials/k8s-system-telemetry.md`:
- Around line 473-477: The CSV section assignments are reversed: currently
gauges = pd.read_csv(StringIO(sections[0])) and counters =
pd.read_csv(StringIO(sections[1])) but the file example has counters first and
gauges second; swap the assignments so counters =
pd.read_csv(StringIO(sections[0])) and gauges =
pd.read_csv(StringIO(sections[1])) to ensure cpu_stats and counter-derived data
are read from the correct section (refer to the variables sections, gauges,
counters in the snippet).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 14d286db-a406-4447-8d60-da99888129c5

📥 Commits

Reviewing files that changed from the base of the PR and between 79de74e and 408e446.

📒 Files selected for processing (3)
  • README.md
  • docs/index.yml
  • docs/tutorials/k8s-system-telemetry.md

Comment on lines +473 to +477
with open("server_metrics_export.csv") as f:
    sections = f.read().strip().split('\n\n')
    gauges = pd.read_csv(StringIO(sections[0]))
    counters = pd.read_csv(StringIO(sections[1]))


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Swap CSV section assignment in pandas example

Line 475 and Line 476 appear reversed relative to the CSV example above (counters first, gauges second). As written, cpu_stats may fail or return empty data because counter columns are read from the wrong section.

Proposed fix
 with open("server_metrics_export.csv") as f:
     sections = f.read().strip().split('\n\n')
-    gauges = pd.read_csv(StringIO(sections[0]))
-    counters = pd.read_csv(StringIO(sections[1]))
+    counters = pd.read_csv(StringIO(sections[0]))
+    gauges = pd.read_csv(StringIO(sections[1]))
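
Indexing sections by position is what makes this kind of mix-up possible. As an alternative illustration, the sketch below keys each section by its content instead. It uses only the standard library, and it assumes every section carries a `metric_type` column as in the tutorial's CSV sample; the real export may use different headers, and both the sample data and the `read_sections` helper are hypothetical.

```python
import csv
from io import StringIO

# Invented sample in the shape of the tutorial's CSV example; the real export
# may use different headers.
SAMPLE = """\
endpoint_url,metric_name,metric_type,cpu,mode,total,rate
http://node-exporter:9100,node_cpu_seconds_total,counter,0,idle,600.0,0.6

endpoint_url,metric_name,metric_type,avg,min,max
http://node-exporter:9100,node_load1,gauge,0.75,0.5,1.0
"""


def read_sections(text: str) -> dict[str, list[dict[str, str]]]:
    """Split a multi-section CSV on blank lines and key each section by the
    metric_type of its first row rather than by its position in the file."""
    sections = {}
    for block in text.strip().split("\n\n"):
        rows = list(csv.DictReader(StringIO(block)))
        if rows:
            sections[rows[0]["metric_type"]] = rows
    return sections


sections = read_sections(SAMPLE)
counters = sections["counter"]
gauges = sections["gauge"]
print(counters[0]["metric_name"], gauges[0]["metric_name"])
```

With this approach the result is the same whichever section comes first in the file.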


codecov Bot commented May 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


--server-metrics $NODE_EXPORTERS
```

### Option 2: Sidecar Container (Per-Pod Monitoring)

This section presents the Node Exporter sidecar as per-pod monitoring, but the host /proc and /sys mounts it shows expose node-level metrics. Users will therefore attribute whole-node CPU, memory, and network usage to a single pod. Fix: relabel this option as host telemetry for the node the pod runs on, or document a pod-level source such as cAdvisor or kubelet metrics.

Node Exporter metrics are included in the CSV export with labels expanded into columns:

```csv
endpoint_url,metric_name,metric_type,cpu,mode,total,rate,rate_avg,rate_min,rate_max,rate_std

The CSV sample uses lowercase endpoint_url and metric_name columns and omits the metadata comment block. ServerMetricsCsvExporter, however, writes sections with comment lines and headers of the form Endpoint,Type,Metric,Unit,..., so neither the sample nor the pandas filters will run against a real export. Fix: update the CSV sample and pandas code to use the real exported headers, and pass comment="#" when reading.

```

This will produce:
- `gpu_telemetry_export.json` - GPU metrics (power, utilization, memory)

The listed GPU telemetry artifact has the wrong extension: AIPerf writes gpu_telemetry_export.jsonl, not gpu_telemetry_export.json, so users following this path will look for a file that does not exist. Fix: change the filename to gpu_telemetry_export.jsonl.
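
The extension matters for how the file is read: a JSON Lines file holds one JSON object per line and cannot be loaded with a single `json.load` call. The sketch below shows one way to read it with the standard library. The field names `gpu_index` and `power_w` are hypothetical placeholders, since the actual record schema is not shown in this PR.

```python
import json
from io import StringIO


def read_jsonl(lines):
    """Yield one parsed object per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical records; the real field names in gpu_telemetry_export.jsonl
# may differ.
sample = StringIO(
    '{"gpu_index": 0, "power_w": 250.0}\n'
    '{"gpu_index": 0, "power_w": 270.0}\n'
)
records = list(read_jsonl(sample))
mean_power = sum(r["power_w"] for r in records) / len(records)
print(f"{len(records)} records, mean power {mean_power:.1f} W")
```

Against a real export, the same generator can be fed an open file handle, for example `with open("gpu_telemetry_export.jsonl") as f: records = list(read_jsonl(f))`.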


3 participants