feat(perf-monitor): aggregate CPU/RAM across descendant processes #1519

Open

shm11C3 wants to merge 2 commits into develop from claude/include-child-process-metrics-cVM4D

Conversation

shm11C3 (Owner) commented May 12, 2026

Summary

  • perf-monitor previously sampled only the parent PID, so apps that put most of their CPU/RSS in child processes (e.g. the Tauri WebView2 host) were significantly under-reported. This PR aggregates CPU and RSS across the parent process and all of its descendants.
  • Switched the per-tick refresh_processes from Some(&[pid]) to All so children spawned mid-measurement are also picked up. On each tick we build a parent→children adjacency map and DFS from the root PID to sum metrics.
  • Extracted the tree walk into a pure function sum_subtree(root, children, metrics) with a visited set that guards against cycles caused by PID reuse / inconsistent snapshots.
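
Sketch of the tree walk described above: a minimal, illustrative version of sum_subtree, assuming a simple ProcessMetrics sample type with cpu_percent and rss_bytes fields; the names and exact signature are assumptions, not the PR's actual code.

use std::collections::{HashMap, HashSet};

// Hypothetical per-process sample; field names are illustrative only.
#[derive(Default, Clone, Copy)]
struct ProcessMetrics {
    cpu_percent: f32,
    rss_bytes: u64,
}

// Cycle-safe DFS: sums metrics over `root` and every reachable descendant.
// The visited set guards against loops caused by PID reuse or an
// inconsistent snapshot whose parent chain points back into the subtree.
fn sum_subtree(
    root: u32,
    children: &HashMap<u32, Vec<u32>>,
    metrics: &HashMap<u32, ProcessMetrics>,
) -> ProcessMetrics {
    let mut total = ProcessMetrics::default();
    let mut visited = HashSet::new();
    let mut stack = vec![root];
    while let Some(pid) = stack.pop() {
        if !visited.insert(pid) {
            continue; // already counted; this is what breaks cycles
        }
        if let Some(m) = metrics.get(&pid) {
            total.cpu_percent += m.cpu_percent;
            total.rss_bytes += m.rss_bytes;
        }
        if let Some(kids) = children.get(&pid) {
            stack.extend(kids.iter().copied());
        }
    }
    total
}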

Changes

  • perf-monitor/src/monitor.rs
    • Added aggregate_descendants: builds per-PID metrics and a parent→children map from the System snapshot, then returns a ProcessMetrics summed over the subtree rooted at the target PID.
    • Added sum_subtree: pure, cycle-safe DFS that totals CPU and RSS.
    • Initial refresh, warmup, and measurement loop all use ProcessesToUpdate::All now.
  • perf-monitor/README.md
    • Updated the Metrics section to document the parent-plus-descendants aggregation and explain why it matters for multi-process layouts like the Tauri WebView2 host.
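
For orientation, a rough sketch of how aggregate_descendants could assemble the per-PID metrics and the parent→children map from a sysinfo snapshot and feed them to the sum_subtree sketch above. It assumes a recent sysinfo (0.31+) where Process::parent() returns Option<Pid>, cpu_usage() is a percentage, and memory() is in bytes; the PR's real helper also takes a num_cpus argument (visible in the review's quoted code), which this sketch omits.

use std::collections::HashMap;
use sysinfo::{Pid, System};

fn aggregate_descendants(system: &System, root: Pid) -> Option<ProcessMetrics> {
    // If the root is gone, return None so the caller can report "process disappeared".
    system.process(root)?;

    let mut metrics: HashMap<u32, ProcessMetrics> = HashMap::new();
    let mut children: HashMap<u32, Vec<u32>> = HashMap::new();
    for (pid, process) in system.processes() {
        metrics.insert(
            pid.as_u32(),
            ProcessMetrics {
                cpu_percent: process.cpu_usage(),
                rss_bytes: process.memory(),
            },
        );
        if let Some(parent) = process.parent() {
            children.entry(parent.as_u32()).or_default().push(pid.as_u32());
        }
    }
    Some(sum_subtree(root.as_u32(), &children, &metrics))
}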

Test plan

  • cargo test --manifest-path perf-monitor/Cargo.toml — 13/13 PASS, including 5 new unit tests for the descendant aggregation (root only, multi-level descendants, sibling tree excluded, cyclic parent chain, missing root).
  • cargo clippy --manifest-path perf-monitor/Cargo.toml --all-targets -- -D warnings — clean.
  • cargo build --manifest-path perf-monitor/Cargo.toml — succeeds.
  • Verify the perf-test CI workflow on Windows / Linux / macOS to confirm the new aggregated values do not unexpectedly trip thresholds; adjust perf-thresholds.toml in a follow-up PR if needed.
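
As an illustration of the cyclic-parent-chain case, a test against the hypothetical sum_subtree sketch above (not the PR's actual test) might look like this:

#[test]
fn cyclic_parent_chain_is_counted_once() {
    use std::collections::HashMap;

    // Hypothetical cycle 1 -> 2 -> 1, e.g. from PID reuse between snapshots.
    let mut children = HashMap::new();
    children.insert(1u32, vec![2u32]);
    children.insert(2u32, vec![1u32]);

    let mut metrics = HashMap::new();
    metrics.insert(1u32, ProcessMetrics { cpu_percent: 10.0, rss_bytes: 100 });
    metrics.insert(2u32, ProcessMetrics { cpu_percent: 5.0, rss_bytes: 50 });

    let total = sum_subtree(1, &children, &metrics);
    assert_eq!(total.rss_bytes, 150); // each PID counted exactly once
    assert!((total.cpu_percent - 15.0).abs() < f32::EPSILON);
}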

https://claude.ai/code/session_01XnYxTyEBg9CarpcTbEArjn

Summary by CodeRabbit

  • Documentation

    • Clarified CPU and memory metrics to state aggregation across a process tree.
  • New Features

    • Added per-process breakdowns in text and JSON reports and explicit failure reasons when checks fail.
    • Report output now shows observed process count and a process-tree summary.
  • Bug Fixes

    • Measurement now aggregates CPU and memory across all descendant processes and handles inconsistent parent chains safely.
    • Improved error output on monitoring failure.
  • Tests

    • Expanded unit tests covering subtree traversal and per-process statistics.

Review Change Stack

Apps that spawn helper or renderer processes (e.g. Tauri's WebView2
host) put the bulk of their resource usage in children, so sampling
only the parent PID under-reported CPU and RSS. Refresh all processes
each tick and sum the metrics over the parent and its full descendant
tree, with a visited-set guard so a pathological parent chain can't
double-count or loop.

https://claude.ai/code/session_01XnYxTyEBg9CarpcTbEArjn
Copilot AI review requested due to automatic review settings May 12, 2026 23:33
coderabbitai bot commented May 12, 2026

📝 Walkthrough

Aggregates CPU and RSS metrics across the target process and its descendants each tick, records per-PID samples, computes per-process avg/max/p95 CPU and memory, surfaces per-process breakdowns and failure reasons in text/JSON reports, and updates docs and CLI error text.

Changes

Process Tree Aggregation

  • Types and baseline/warmup setup (perf-monitor/src/monitor.rs): Add ProcessStats and internal snapshot/track types; extend MonitorResult with per_process; refresh CPU deltas for all processes during initial baseline and warmup; initialize per-PID tracks.
  • Measurement loop and subtree capture (perf-monitor/src/monitor.rs): Each tick refreshes all processes; capture_subtree/walk_subtree traverse the root PID subtree with a visited set; subtree CPU/RSS are summed for the tick; per-PID samples are accumulated; and monitoring errors if the root PID disappears.
  • Per-process aggregation (perf-monitor/src/monitor.rs): compute_result accepts per_pid_tracks; compute_per_process_stats computes avg/max/p95 CPU and memory (MB) per PID, filters empty tracks, and sorts root-first then by descending avg CPU.
  • Unit tests (perf-monitor/src/monitor.rs): Add pid()/snap() helpers; add walk_subtree tests (root-only, descendants, sibling exclusion, cycle handling, missing root); extend compute_result tests to validate per_process.
  • Reporting and JSON output (perf-monitor/src/report.rs): Import ProcessStats; add failure_reasons to CheckResult; extend JsonReport and add PerProcessReport; populate per_process and failure_reasons in JSON; print observed process counts, the per-process breakdown, and failure reasons in text output.
  • Docs and CLI error output (perf-monitor/README.md, perf-monitor/src/main.rs): README Metrics section rewritten to document process-tree aggregation semantics; main now emits a standardized multi-line "Performance Test Failed" message with a fixed failure reason and underlying error details on monitoring failures.
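
One plausible shape for the avg/max/p95 computation mentioned above; the PR's actual compute_per_process_stats may round differently or use another percentile rule, so treat this as an assumption.

fn avg_max_p95(samples: &[f64]) -> Option<(f64, f64, f64)> {
    if samples.is_empty() {
        return None; // empty tracks are filtered out, per the walkthrough
    }
    let avg = samples.iter().sum::<f64>() / samples.len() as f64;
    let max = samples.iter().copied().fold(f64::MIN, f64::max);

    // Nearest-rank 95th percentile over a sorted copy.
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let rank = ((0.95 * sorted.len() as f64).ceil() as usize).max(1) - 1;
    let p95 = sorted[rank.min(sorted.len() - 1)];

    Some((avg, max, p95))
}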

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 From root PID down to every sprout,

I add their CPU and memory out.
I guard against cycles, count each leaf,
report their stats and list the grief.
Hooray—per-process tales, summed neat and stout.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 60.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check: ✅ Passed. The title clearly and concisely summarizes the main change: aggregating CPU/RAM metrics across descendant processes in perf-monitor.
  • Description check: ✅ Passed. The PR description provides a comprehensive summary of changes, includes a detailed test plan with results, and covers the motivation and implementation approach.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.


coderabbitai bot left a review comment


Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 7d29325f-1a22-4016-9ffb-4286720278cc

📥 Commits

Reviewing files that changed from the base of the PR and between d4d9d0b and 9d5c6c4.

📒 Files selected for processing (2)
  • perf-monitor/README.md
  • perf-monitor/src/monitor.rs

Comment thread perf-monitor/src/monitor.rs Outdated
Comment on lines 90 to 93
    match aggregate_descendants(&system, pid, num_cpus) {
      Some(metrics) => samples.push(metrics),
      None => return Err("Process disappeared during measurement".into()),
    }

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Preserve exit diagnostics when aggregation returns None.

If the process exits between the earlier try_wait and aggregate_descendants, this path drops exit status/stderr details and returns only a generic message. Re-checking try_wait here and delegating to format_exit_error improves debuggability.

Suggested patch
-    match aggregate_descendants(&system, pid, num_cpus) {
-      Some(metrics) => samples.push(metrics),
-      None => return Err("Process disappeared during measurement".into()),
-    }
+    match aggregate_descendants(&system, pid, num_cpus) {
+      Some(metrics) => samples.push(metrics),
+      None => {
+        if guard.0.try_wait()?.is_some() {
+          return Err(format_exit_error(&mut guard, "measurement"));
+        }
+        return Err("Process disappeared during measurement".into());
+      }
+    }

Copilot AI left a comment

Pull request overview

This change extends perf-monitor's measurement scope from the parent PID alone to the parent plus all descendant processes, so that CPU/RSS is no longer under-reported for multi-process setups such as Tauri (WebView2).

Changes:

  • Switch refresh_processes to ProcessesToUpdate::All so descendant processes spawned during measurement are also observed continuously
  • Add aggregate_descendants / sum_subtree, which build the parent→children map and sum CPU/RSS with a cycle-safe DFS
  • Add unit tests for the descendant aggregation logic and update the Metrics section of the README

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
  • perf-monitor/src/monitor.rs: Adds CPU/RSS aggregation over the whole process tree, switches the refresh to All, and adds unit tests
  • perf-monitor/README.md: Documents that Metrics are now summed over the parent plus its descendant processes, and explains why

Surfaces what's actually consuming CPU/RSS so that GHA logs are
diagnosable at a glance:

- Aggregate tree stats are now followed by a Per-Process Breakdown
  block (one entry per PID observed during the measurement window,
  with sample_count, avg/max/P95 CPU, and avg/max/P95 RSS). The root
  process is listed first and tagged "(root)"; descendants follow in
  descending order of average CPU.
- A Failure Reasons section enumerates exactly which thresholds were
  exceeded (and by how much) when the run fails, instead of leaving
  CI to diff the numbers against the config.
- The JSON output mirrors both: new "per_process" array and
  "failure_reasons" array.
- A pre-sampling abort (binary crashed during warmup, etc.) now
  prints under a "=== Performance Test Failed ===" banner with the
  captured child stderr so the cause is visible at the tail of the
  log.

https://claude.ai/code/session_01XnYxTyEBg9CarpcTbEArjn
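
Sketch of the JSON shape implied by this message: "per_process" and "failure_reasons" are named above, but the concrete fields of each per-process entry are guesses, not the PR's actual serde layout.

use serde::Serialize;

#[derive(Serialize)]
struct PerProcessReport {
    pid: u32,
    is_root: bool,
    sample_count: usize,
    avg_cpu: f64,
    max_cpu: f64,
    p95_cpu: f64,
    avg_memory_mb: f64,
    max_memory_mb: f64,
    p95_memory_mb: f64,
}

#[derive(Serialize)]
struct JsonReport {
    // ...existing aggregate fields...
    per_process: Vec<PerProcessReport>,
    failure_reasons: Vec<String>,
}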
coderabbitai bot left a review comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
perf-monitor/src/monitor.rs (1)

126-129: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve exit diagnostics when subtree capture fails

At Line 128, returning a generic error can drop useful exit status/stderr if the child exited between checks. Re-check try_wait here and delegate to format_exit_error before the fallback message.

Suggested patch
-    let snapshots = match capture_subtree(&system, pid) {
-      Some(s) => s,
-      None => return Err("Process disappeared during measurement".into()),
-    };
+    let snapshots = match capture_subtree(&system, pid) {
+      Some(s) => s,
+      None => {
+        if guard.0.try_wait()?.is_some() {
+          return Err(format_exit_error(&mut guard, "measurement"));
+        }
+        return Err("Process disappeared during measurement".into());
+      }
+    };
🧹 Nitpick comments (1)
perf-monitor/src/main.rs (1)

75-78: ⚡ Quick win

Use project logging macros in the new failure path

These new direct eprintln! calls should use the repository logging macros for consistency with Rust logging policy.

As per coding guidelines, "Use logging macros (log_debug!, log_info!, log_warn!, log_error!) from core/src/utils/logger.rs instead of direct logging calls."
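
If log_error! follows the usual format-macro convention, the swap could look like the lines below; the macro's exact interface in core/src/utils/logger.rs has not been verified here, so this is only a sketch.

// Assumes log_error! accepts format-style arguments like the std log macros.
log_error!("=== Performance Test Failed ===");
log_error!("Failure reason: monitoring aborted before metrics could be collected");
log_error!("Details: {e}");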


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 968b52b3-e435-4ef4-9216-a38e1c05695f

📥 Commits

Reviewing files that changed from the base of the PR and between 9d5c6c4 and 04271a5.

📒 Files selected for processing (4)
  • perf-monitor/README.md
  • perf-monitor/src/main.rs
  • perf-monitor/src/monitor.rs
  • perf-monitor/src/report.rs

Comment thread perf-monitor/src/main.rs
Comment on lines +77 to +78
eprintln!("Failure reason: monitoring aborted before metrics could be collected");
eprintln!("Details: {e}");

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Failure reason text can be misleading for mid-run aborts

At Line 77, the message says metrics could not be collected, but aborts can also happen after some ticks. Use neutral wording (e.g., “monitoring aborted before completion”) to avoid misleading CI triage.

