docs: Add K8s system telemetry tutorial for CPU/Memory/Network metrics #913
dzier wants to merge 2 commits
Add comprehensive tutorial showing how to collect system-level metrics (CPU, memory, network, disk) from Kubernetes clusters during AIPerf benchmarks using Prometheus Node Exporter.
- Document three deployment options: DaemonSet, sidecar, port forwarding
- Explain key Node Exporter metrics for benchmarking
- Show how to combine with GPU telemetry for complete observability
- Include query examples using jq, pandas, and DuckDB
- Add troubleshooting section and best practices
Closes AIP-898
Signed-off-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: David Zier <dzier@users.noreply.github.com>
for more information, see https://pre-commit.ci
Try out this PR
Quick install: `pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@408e446e1487ead4d70496838c6bf7e061165004`
Recommended with virtual environment (using uv):
`uv venv --python 3.12 && source .venv/bin/activate`
`uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@408e446e1487ead4d70496838c6bf7e061165004`
Last updated for commit:
Fern Docs Preview: generation failed; see the Actions log for details. This does not block merge; ask a maintainer to retry if needed.
Walkthrough
Added a new documentation tutorial page for collecting Kubernetes system-level metrics (CPU, memory, network, disk, filesystem) using Prometheus Node Exporter during AIPerf benchmarks. Updated README and docs navigation to surface the tutorial. Documentation covers architecture, three deployment approaches, metrics reference, output formats, querying workflows, multi-node configuration, GPU telemetry integration, troubleshooting, and a complete observability stack example.
Changes
Kubernetes System Telemetry Tutorial
Estimated code review effort
🎯 1 (Trivial) | ⏱️ ~5 minutes
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/tutorials/k8s-system-telemetry.md`:
- Around line 473-477: The CSV section assignments are reversed: currently
gauges = pd.read_csv(StringIO(sections[0])) and counters =
pd.read_csv(StringIO(sections[1])) but the file example has counters first and
gauges second; swap the assignments so counters =
pd.read_csv(StringIO(sections[0])) and gauges =
pd.read_csv(StringIO(sections[1])) to ensure cpu_stats and counter-derived data
are read from the correct section (refer to the variables sections, gauges,
counters in the snippet).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 14d286db-a406-4447-8d60-da99888129c5
📒 Files selected for processing (3)
- README.md
- docs/index.yml
- docs/tutorials/k8s-system-telemetry.md
with open("server_metrics_export.csv") as f:
    sections = f.read().strip().split('\n\n')
    gauges = pd.read_csv(StringIO(sections[0]))
    counters = pd.read_csv(StringIO(sections[1]))
Swap CSV section assignment in pandas example
Line 475 and Line 476 appear reversed relative to the CSV example above (counters first, gauges second). As written, cpu_stats may fail or return empty data because counter columns are read from the wrong section.
Proposed fix
 with open("server_metrics_export.csv") as f:
     sections = f.read().strip().split('\n\n')
-    gauges = pd.read_csv(StringIO(sections[0]))
-    counters = pd.read_csv(StringIO(sections[1]))
+    counters = pd.read_csv(StringIO(sections[0]))
+    gauges = pd.read_csv(StringIO(sections[1]))
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
with open("server_metrics_export.csv") as f:
    sections = f.read().strip().split('\n\n')
    counters = pd.read_csv(StringIO(sections[0]))
    gauges = pd.read_csv(StringIO(sections[1]))
Codecov Report
✅ All modified and coverable lines are covered by tests.
| --server-metrics $NODE_EXPORTERS | ||
| ``` | ||
|
|
||
| ### Option 2: Sidecar Container (Per-Pod Monitoring) |
This sidecar section presents Node Exporter as per-pod monitoring, but the shown host /proc and /sys mounts expose node-level metrics, so users will attribute whole-node CPU/memory/network to one pod. Fix: relabel this option as node-host telemetry for the pod's node or document a pod-level source such as cAdvisor/kubelet metrics.
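To illustrate the scoping concern, here is a minimal sketch that compares the label names on a Node Exporter sample with those on a cAdvisor-style sample. The sample values, the pod name `vllm-0`, and the helper names `label_names` and `is_pod_scoped` are hypothetical; only the metric names and the Prometheus text exposition syntax are taken from the real exporters.

```python
import re

# Hypothetical samples in Prometheus text exposition format.
NODE_EXPORTER_SAMPLE = '''\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 81234.5
node_cpu_seconds_total{cpu="0",mode="user"} 1520.25
'''

# Assumed cAdvisor-style sample; label values are made up for illustration.
CADVISOR_SAMPLE = '''\
container_cpu_usage_seconds_total{namespace="default",pod="vllm-0",container="server"} 42.5
'''

# metric_name{label="value",...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def label_names(exposition_text):
    """Return the set of label names used by any sample line in the text."""
    names = set()
    for line in exposition_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match and match.group("labels"):
            names.update(key for key, _ in LABEL_RE.findall(match.group("labels")))
    return names


def is_pod_scoped(exposition_text):
    """A source can be attributed to a pod only if its samples carry a pod label."""
    return "pod" in label_names(exposition_text)
```

With these samples, the Node Exporter text yields only the `cpu` and `mode` labels, so nothing in it ties a value to a single pod regardless of which pod hosts the sidecar.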
| Node Exporter metrics are included in the CSV export with labels expanded into columns: | ||
|
|
||
| ```csv | ||
| endpoint_url,metric_name,metric_type,cpu,mode,total,rate,rate_avg,rate_min,rate_max,rate_std |
The CSV sample uses lowercase endpoint_url/metric_name columns and omits the metadata comment block, but ServerMetricsCsvExporter writes sections with comments and Endpoint,Type,Metric,Unit,..., so the sample and pandas filters are not executable. Fix: update the CSV sample and pandas code to use the real exported headers and comment="#" when reading.
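A possible shape for the corrected reader, as a standard-library sketch: it drops `#` comment lines and keys each section by its `Type` column instead of by position, which would also sidestep the section-order bug flagged earlier. The sample text is hypothetical; the columns after `Unit`, the comment wording, and the lowercase `counter`/`gauge` values are assumptions, not the exporter's confirmed output.

```python
import csv
from io import StringIO

# Hypothetical export: comment lines and the columns after "Unit" are placeholders.
SAMPLE_EXPORT = """\
# Hypothetical metadata comment block
# section: counters
Endpoint,Type,Metric,Unit,total
http://node-1:9100/metrics,counter,node_cpu_seconds_total,seconds,1520.25

# section: gauges
Endpoint,Type,Metric,Unit,avg
http://node-1:9100/metrics,gauge,node_memory_MemAvailable_bytes,bytes,1048576
"""


def read_sections(text):
    """Split on blank lines, drop '#' comment lines, and group rows by their Type value."""
    sections = {}
    for block in text.strip().split("\n\n"):
        data_lines = [ln for ln in block.splitlines() if ln and not ln.startswith("#")]
        if not data_lines:
            continue  # block contained only comments
        rows = list(csv.DictReader(StringIO("\n".join(data_lines))))
        if rows:
            sections.setdefault(rows[0]["Type"], []).extend(rows)
    return sections


sections = read_sections(SAMPLE_EXPORT)
counters = sections.get("counter", [])
gauges = sections.get("gauge", [])
```

In the tutorial's pandas version, the same effect should be available by passing `comment="#"` to `pd.read_csv` for each section and selecting sections by the `Type` column.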
| ``` | ||
|
|
||
| This will produce: | ||
| - `gpu_telemetry_export.json` - GPU metrics (power, utilization, memory) |
The listed GPU telemetry artifact has the wrong extension because AIPerf writes gpu_telemetry_export.jsonl, not gpu_telemetry_export.json, so users following this path will look for a non-existent file. Fix: change the filename to gpu_telemetry_export.jsonl.
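The extension matters for readers who load the file: JSONL holds one JSON object per line, so `json.load` on the whole file would fail. A minimal reading sketch follows; the record fields `gpu_index` and `power_w` are hypothetical and do not reflect AIPerf's actual schema.

```python
import json
from io import StringIO

# Hypothetical two-record sample; real field names may differ.
SAMPLE_JSONL = '{"gpu_index": 0, "power_w": 250.0}\n{"gpu_index": 1, "power_w": 245.5}\n'


def read_jsonl(stream):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in stream if line.strip()]


records = read_jsonl(StringIO(SAMPLE_JSONL))
```

Against a real run, the same helper would be called as `read_jsonl(open("gpu_telemetry_export.jsonl"))`, ideally inside a `with` block.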
Summary
Adds comprehensive documentation for collecting system-level metrics (CPU, memory, network, disk I/O) from Kubernetes clusters during AIPerf benchmarks using Prometheus Node Exporter.
Changes
- `docs/tutorials/k8s-system-telemetry.md`
- `docs/index.yml` - Added new tutorial to "Metrics & Analysis" section
- `README.md` - Added tutorial link to "Analysis and Monitoring" section
Motivation
Users requested the ability to collect CPU utilization, memory utilization, network metrics, and other server component metrics during benchmarks, similar to the existing `--gpu-telemetry` functionality.
Solution
AIPerf's Server Metrics Manager already supports collecting from any Prometheus-compatible endpoint via the `--server-metrics` flag. Prometheus Node Exporter exposes comprehensive system metrics in Prometheus format. This tutorial documents how to:
`--server-metrics`
Testing
tools/check_docs_index.py)
Example Usage
Related Documentation
Checklist
`docs/index.yml` and `README.md`
Linear Issue: AIP-898
Summary by CodeRabbit
Release Notes