fix(core): read pod cgroup limits instead of node limits in resource metrics #35622

Merged
FrozenPandaz merged 4 commits into master from nxc-4445 on Jun 2, 2026

leosvelperez (Member) commented May 8, 2026

Current Behavior

When running inside a Linux container or Kubernetes pod, resource metrics report the host node's CPU and memory totals instead of the pod's limits. A pod with constrained CPU/memory shows the underlying node's resources.

The same gap exists for process-isolated Windows containers running inside a Windows Job Object (e.g. Docker --cpus / --memory): metrics report host values rather than the Job's limits.

Expected Behavior

Resource metrics report the effective CPU and memory limits enforced by the kernel for the calling process — derived from the cgroup it belongs to on Linux, or the Job Object on Windows. macOS native processes continue to report host values (no equivalent enforcement primitive exists).

Implementation Details

Linux (cgroup module)

Resolves the calling process's actual cgroup directory by parsing /proc/self/cgroup + /proc/self/mountinfo. Reads cpu.max / memory.max (cgroup v2) or cpu.cfs_{quota,period}_us / memory.limit_in_bytes (cgroup v1, including the systemd cpu,cpuacct co-mount).
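
A minimal sketch of the two parsing steps described above, not the PR's actual code. The function names `cgroup_path_for` and `parse_cpu_max` are hypothetical; the sketch assumes the usual `hierarchy-ID:controller-list:path` layout of /proc/self/cgroup and the `<quota|max> <period>` layout of cpu.max.

```rust
/// Returns the cgroup path for `controller` from the text of /proc/self/cgroup.
/// Pass `None` to select the cgroup v2 entry, whose controller list is empty
/// (`0::/path`). A v1 co-mount such as `cpu,cpuacct` matches either name.
pub fn cgroup_path_for<'a>(proc_self_cgroup: &'a str, controller: Option<&str>) -> Option<&'a str> {
    proc_self_cgroup.lines().find_map(|line| {
        // splitn(3) keeps any ':' inside the path itself intact.
        let mut parts = line.splitn(3, ':');
        let _hierarchy_id = parts.next()?;
        let controllers = parts.next()?;
        let path = parts.next()?;
        let matches = match controller {
            None => controllers.is_empty(),
            Some(name) => controllers.split(',').any(|c| c == name),
        };
        matches.then_some(path)
    })
}

/// Parses cpu.max contents into `(quota_us, period_us)`.
/// A quota of `None` means the file said "max" (no limit at this level).
pub fn parse_cpu_max(contents: &str) -> Option<(Option<u64>, u64)> {
    let mut fields = contents.split_whitespace();
    let quota = fields.next()?;
    let period_us: u64 = fields.next()?.parse().ok()?;
    if period_us == 0 {
        return None; // a zero period would make quota / period meaningless
    }
    let quota_us = if quota == "max" {
        None
    } else {
        Some(quota.parse::<u64>().ok()?)
    };
    Some((quota_us, period_us))
}
```

The returned path is relative to the cgroup mount, so it still has to be joined with the mount point found in /proc/self/mountinfo before any limit file can be read.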

Walks from the leaf up to the mount point and takes the minimum finite limit found at any level — the kernel enforces the tightest ancestor's limit (hierarchical enforcement), so leaf-only reads can overreport when a pod-level cgroup is tighter than the container's (common in K8s VPA in-place resize and Burstable QoS).
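
The ancestor walk can be sketched as follows. This is an illustration under stated assumptions rather than the module's code: `tightest_limit` is a hypothetical name, and the `read_limit` callback stands in for whatever reads and parses a limit file in a given directory (returning `None` for "max" or a missing file).

```rust
use std::path::Path;

/// Walks from `leaf` up to and including `mount_point` and returns the
/// smallest finite limit seen at any level, or `None` if every level is
/// unlimited.
pub fn tightest_limit<F>(mount_point: &Path, leaf: &Path, read_limit: F) -> Option<u64>
where
    F: Fn(&Path) -> Option<u64>,
{
    let mut tightest: Option<u64> = None;
    // `ancestors()` yields the leaf itself first, then each parent in turn.
    for dir in leaf.ancestors() {
        if !dir.starts_with(mount_point) {
            break; // never read above the cgroup mount
        }
        if let Some(limit) = read_limit(dir) {
            tightest = Some(tightest.map_or(limit, |t| t.min(limit)));
        }
        if dir == mount_point {
            break;
        }
    }
    tightest
}
```

With a pod-level limit of 1000 and a container-level limit of 2000, a leaf-only read reports 2000 while the walk reports 1000, which is the value the kernel enforces.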

The quota-derived count is composed with sched_getaffinity, which covers cpuset / taskset restrictions. We deliberately bypass std::thread::available_parallelism() because rust-lang/rust's implementation applies the cgroup quota with floor division internally, which would silently underreport fractional quotas (e.g. 1.5 cores → 1). Instead we compute ceil(quota / period), matching HotSpot JVM, Go 1.25, .NET, and the num_cpus crate.
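
To make the rounding difference concrete, here is a small sketch. Both function names are hypothetical, and combining the quota with the affinity count by taking the minimum is an assumption of this example.

```rust
/// Converts a CFS quota/period pair into whole cores using ceiling division,
/// so a fractional quota such as 1.5 cores rounds up to 2.
pub fn cores_from_quota(quota_us: u64, period_us: u64) -> Option<u64> {
    if quota_us == 0 || period_us == 0 {
        return None; // no usable limit
    }
    Some(quota_us.div_ceil(period_us))
}

/// Combines the quota-derived count with the affinity-derived count.
/// Taking the minimum is an assumption of this sketch.
pub fn effective_cores(quota_cores: Option<u64>, affinity_cores: u64) -> u64 {
    quota_cores
        .map_or(affinity_cores, |q| q.min(affinity_cores))
        .max(1) // never report zero cores
}
```

For a quota of 150000 µs over a 100000 µs period, floor division gives 1 core and ceiling division gives 2.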

mountinfo path fields containing spaces / tabs / newlines / backslashes are kernel-encoded as \040 / \011 / \012 / \134 per man 5 proc_pid_mountinfo; we unescape before joining with the cgroup path. /proc/self/cgroup itself emits paths raw (verified across kernels v4.18 → v6.13), so no unescape is needed there.
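
A possible shape for the unescape step, assuming a single left-to-right pass over the field. `unescape_mountinfo` is a hypothetical name, not necessarily what the module calls it.

```rust
/// Decodes `\NNN` octal escapes (e.g. `\040` for a space) in a mountinfo field.
/// Anything that is not a backslash followed by three octal digits is copied
/// through unchanged.
pub fn unescape_mountinfo(field: &str) -> String {
    let bytes = field.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'\\' && i + 4 <= bytes.len() {
            let digits = &bytes[i + 1..i + 4];
            if digits.iter().all(|d| (b'0'..=b'7').contains(d)) {
                let value = digits
                    .iter()
                    .fold(0u32, |acc, d| acc * 8 + u32::from(d - b'0'));
                if value <= 255 {
                    out.push(value as u8);
                    i += 4;
                    continue;
                }
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    // Paths are bytes on Linux; lossy conversion keeps the sketch simple.
    String::from_utf8_lossy(&out).into_owned()
}
```

Decoding in one pass matters: `\134040` is an escaped backslash followed by the literal characters `040`, and it must not be decoded a second time into a space.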

Replaces the prior leaf-only path lookups (which broke on cgroup v1 co-mount and any non-namespaced container setup). The in-tree module is preferred over sysinfo's cgroup_limits() (now parent-aware in 0.39 — see Additional Changes) because it also covers CPU quota and the v1 co-mount + bind-mount edge cases in a single place we control. 30 unit tests cover cgroup discovery, parsing, ancestor walk, and the v1 / v2 / co-mount / bind-mount / cgroupns=host cases.

Windows (job_object module)

Detects Job Object resource limits via the Win32 API:

  • CPU: three candidates are computed and the minimum is taken. First, ceil(host_cpu_count × CpuRate / 10000) from JobObjectCpuRateControlInformation when HARD_CAP is set (Docker --cpus translates to HARD_CAP). Second, the popcount of the Job's affinity mask when LIMIT_AFFINITY is set. Third, GetProcessAffinityMask (covers Job + manual + system intersections).
  • Memory: minimum of ProcessMemoryLimit / JobMemoryLimit / MaximumWorkingSetSize from JobObjectExtendedLimitInformation, gated on the corresponding LIMIT_* flags. Mirrors HotSpot and .NET.
  • Skipped: WEIGHT_BASED rate control (relative priority, not a hard limit) and soft-cap rate control (kernel allows transient bursts) — neither maps to a defensible "available cores" number.
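
The CPU arithmetic in the first bullet can be sketched in platform-independent Rust. This is not the job_object module itself: the function names are hypothetical, the Win32 calls are omitted, and the inputs are assumed to have already been read from the Job Object and the process.

```rust
/// Converts a HARD_CAP `CpuRate` into whole cores.
/// `CpuRate` is expressed in units of 1/10000 of host CPU (10000 = 100%).
pub fn cores_from_cpu_rate(host_cpus: u32, cpu_rate: u32) -> Option<u32> {
    if host_cpus == 0 || cpu_rate == 0 || cpu_rate > 10_000 {
        return None; // out-of-range input is treated as "no information"
    }
    let scaled = u64::from(host_cpus) * u64::from(cpu_rate);
    Some(scaled.div_ceil(10_000) as u32)
}

/// Takes the minimum of the host count, the affinity popcount, and the
/// rate-derived count. `hard_cap_rate` is `None` when HARD_CAP is not set.
pub fn effective_cores(host_cpus: u32, hard_cap_rate: Option<u32>, affinity_mask: usize) -> u32 {
    let mut cores = host_cpus;
    let affinity = affinity_mask.count_ones();
    if affinity > 0 {
        cores = cores.min(affinity);
    }
    if let Some(rate_cores) = hard_cap_rate.and_then(|r| cores_from_cpu_rate(host_cpus, r)) {
        cores = cores.min(rate_cores);
    }
    cores.max(1) // never report zero cores
}
```

On an 8-core host, a `CpuRate` of 1500 (15%) yields ceil(1.2) = 2 cores, and an affinity mask with three bits set caps a 50% rate (4 cores) at 3.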

Any Win32 failure is treated as "no information"; the caller falls back to host values. No new crate dependency — the existing winapi dep is extended with jobapi, jobapi2, processthreadsapi, winbase, and winnt features.

Known limitation: nested Job hierarchies

QueryInformationJobObject(NULL, ...) returns the immediate Job's settings (per MSDN: "If the job is nested, the immediate job of the calling process is used.") and Win32 exposes no documented API to enumerate parent Jobs.

Per JOBOBJECT_CPU_RATE_CONTROL_INFORMATION Remarks, "the rates set for the job represent its portion of the CPU rate that is allocated to its parent job" — nested rates compose multiplicatively. So with parent HARD_CAP 50% × child HARD_CAP 50%, the effective rate is 25% of host but we read the child's 50% and report ceil(host × 0.5). Per-process / per-job memory has the same shape (kernel enforces min across the chain; we see only the immediate Job).
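
The size of the overreport can be worked through with a short sketch. The helper names are hypothetical and the 16-core host is an assumed figure chosen only for the arithmetic.

```rust
/// Composes HARD_CAP rates (units of 1/10000) along a Job chain by
/// multiplying each level's share into the running total.
/// An empty chain means no cap, i.e. 10000 (100%).
pub fn composed_rate(chain: &[u32]) -> u64 {
    chain
        .iter()
        .fold(10_000u64, |acc, &rate| acc * u64::from(rate) / 10_000)
}

/// Whole cores available at a given rate, rounded up.
pub fn cores_for_rate(host_cpus: u64, rate: u64) -> u64 {
    (host_cpus * rate).div_ceil(10_000)
}
```

With a parent and a child both at 5000 (50%) on a 16-core host, the composed rate is 2500, so the kernel-effective count is 4 cores, while reading only the child's 5000 reports 8.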

Affinity is unaffected — GetProcessAffinityMask returns the kernel-effective mask. HotSpot and .NET CoreCLR exhibit the same memory limitation; HARD_CAP CPU-rate detection goes beyond them for Docker --cpus parity in the common single-silo case, accepting the nested-Job overreport as the documented cost.

Cross-platform

SystemInfo { cpuCores, totalMemory } shape is unchanged — consumers do not need to update. macOS native processes continue to report host values (no container-style enforcement primitive exists; container runtimes on macOS run Linux VMs where the Linux path applies).

The module doc-comments cross-reference Go 1.25 internal/runtime/cgroup, libuv src/unix/linux.c, OpenJDK cgroupSubsystem_linux.cpp + os_windows.cpp, .NET CoreCLR gc/unix/cgroup.cpp + gc/windows/gcenv.windows.cpp, and Rust stdlib library/std/src/sys/thread/unix.rs::cgroups.

Additional Changes

Two infrastructure chores are bundled into this PR as separate commits:

chore(core): bump sysinfo to 0.39.1

Bumps sysinfo from 0.37.2 to 0.39.1. No source changes were required: all signatures we use (System, Process, Pid, Signal, Disks, the *RefreshKind types, UpdateKind, MINIMUM_CPU_UPDATE_INTERVAL) are unchanged. Cargo.lock deltas:

  • New transitive dep objc2-open-directory from sysinfo 0.39's soundness fix for user retrieval on Apple targets.
  • windows family bumped to 0.62.x per sysinfo's new constraint, which lets the graph converge on single versions of windows-core, windows-link, windows-result, and windows-strings — four duplicate windows-* entries are deduplicated as a result.

sysinfo 0.39.0 also added Process::cgroup_limits() and parent-cgroup memory walking inside System::cgroup_limits() — overlapping conceptually with the in-tree cgroup module introduced by this PR. The in-tree module is kept because it also covers CPU quota (sysinfo's helpers are memory-only), the cgroup v1 co-mount, and bind-mount path translation in a single implementation under our control. Future consolidation onto the upstream APIs is possible but explicitly out of scope here.

chore(repo): bump mise rust to 1.95.0 to match rust-toolchain.toml

rust-toolchain.toml was bumped to 1.95.0 in #35665 to unblock the sysinfo upgrade, but mise.toml was left at 1.90.0. CI installs Rust via mise (which exports RUSTUP_TOOLCHAIN, overriding rust-toolchain.toml), so CI continued to run on 1.90.0 and failed the sysinfo 0.39 MSRV check until this commit. The two files now agree.

@leosvelperez leosvelperez self-assigned this May 8, 2026
netlify Bot commented May 8, 2026

Deploy Preview for nx-docs ready!

Name Link
🔨 Latest commit a443689
🔍 Latest deploy log https://app.netlify.com/projects/nx-docs/deploys/6a0ee60ebfe84d0008ca5f0e
😎 Deploy Preview https://deploy-preview-35622--nx-docs.netlify.app

netlify Bot commented May 8, 2026

Deploy Preview for nx-dev ready!

Name Link
🔨 Latest commit a443689
🔍 Latest deploy log https://app.netlify.com/projects/nx-dev/deploys/6a0ee60ee0dabb000812ea15
😎 Deploy Preview https://deploy-preview-35622--nx-dev.netlify.app

nx-cloud Bot (Contributor) commented May 8, 2026

View your CI Pipeline Execution ↗ for commit a443689

| Command | Status | Duration | Result |
| --- | --- | --- | --- |
| nx affected --targets=lint,test,build,e2e,e2e-c... | ✅ Succeeded | 29m 5s | View ↗ |
| nx run-many -t check-imports check-lock-files c... | ✅ Succeeded | 3s | View ↗ |
| nx-cloud record -- pnpm nx-cloud conformance:check | ✅ Succeeded | 9s | View ↗ |
| nx build workspace-plugin | ✅ Succeeded | 4m 2s | View ↗ |
| nx-cloud record -- nx sync:check | ✅ Succeeded | 18s | View ↗ |
| nx-cloud record -- nx format:check | ✅ Succeeded | <1s | View ↗ |

☁️ Nx Cloud last updated this comment at 2026-05-21 11:38:09 UTC

nx-cloud[bot]

This comment was marked as outdated.

@leosvelperez leosvelperez marked this pull request as ready for review May 8, 2026 16:32
@leosvelperez leosvelperez requested a review from a team as a code owner May 8, 2026 16:32
@leosvelperez leosvelperez requested a review from AgentEnder May 8, 2026 16:32
FrozenPandaz added a commit that referenced this pull request May 12, 2026
## Current Behavior

`rust-toolchain.toml` pins Rust to `1.94.0`. This blocks upgrading
`sysinfo` to `0.39.x`, which requires Rust `1.95` and brings upstream
support for several cgroup limit features that nx currently needs to
implement locally.

## Expected Behavior

Toolchain bumped to `1.95.0`. Verified:

- `cargo build -p nx` produces zero new warnings vs. 1.94.
- `cargo build -p nx --all-targets` (compiles tests too) produces zero
new warnings vs. 1.94 — identical warning set.
- Clippy delta: +11 new clippy lints (style suggestions only — not
gating, none correctness-related).

## Related Issue(s)

N/A — maintenance/hygiene change. Unblocks the sysinfo bump that, in
turn, lets PR #35622 drop its memory-side cgroup parsing in favor of
upstream `Process::cgroup_limits()` + parent cgroup memory walking
(sysinfo PRs
[#1643](GuillaumeGomez/sysinfo#1643) and
[#1651](GuillaumeGomez/sysinfo#1651)).

FrozenPandaz (Contributor) left a comment

Heads up: sysinfo v0.39.0 (released 2026-05-11) now does the parent-cgroup memory walking this PR implements — see PR #1651 and the v0.39.0 CHANGELOG. nx is pinned to sysinfo 0.37.2; I've opened #35665 to bump the Rust toolchain to 1.95 (sysinfo 0.39's MSRV) so we can pull that in.

Suggestion: rebase onto sysinfo 0.39 and drop the memory-side cgroup parsing in cgroup.rs, but keep the genuinely-novel parts — the CPU cpu.max / cpu.cfs_* ancestor walk, cgroup v1 co-mount handling, mountinfo octal-escape parsing, and the Windows JobObject reads (no upstream equivalent for any of those; rust-lang/rust#143709 is the open tracking issue for the Windows side).

One concrete bug worth fixing regardless: read_v1_cpu_quota only treats q <= 0 as unlimited, but cgroup v1 emits the same large positive PAGE_COUNTER_MAX-style sentinel for CPU that read_v1_memory_limit already filters at line 346 — feeding that into cores_from_quota will overflow.
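
One way the guard could look, as a sketch only. `cores_from_v1_quota` is a hypothetical name and its signature is assumed; it is not the PR's `read_v1_cpu_quota` or `cores_from_quota`. Rather than hard-coding a sentinel value, the sketch discards any quota that implies more cores than the host has, since such a value carries no limit information, and it uses division only, so a huge input cannot overflow.

```rust
/// Converts a cgroup v1 quota/period pair into whole cores.
/// Returns `None` when the quota is non-positive (unlimited) or when it
/// implies more cores than the host has, which covers very large
/// sentinel-style values without naming a specific constant.
pub fn cores_from_v1_quota(quota_us: i64, period_us: i64, host_cpus: u64) -> Option<u64> {
    if quota_us <= 0 || period_us <= 0 {
        return None;
    }
    // Both values are positive here, so the casts are lossless.
    let cores = (quota_us as u64).div_ceil(period_us as u64);
    if cores > host_cpus {
        None // no tighter than the host: treat as unlimited
    } else {
        Some(cores)
    }
}
```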

polygraph-snapshot-app Bot pushed a commit that referenced this pull request May 13, 2026
@leosvelperez leosvelperez force-pushed the nxc-4445 branch 2 times, most recently from 489d9d0 to 4198ea9 May 15, 2026 11:07
nx-cloud[bot]

This comment was marked as outdated.

leosvelperez (Member, Author) commented:

@FrozenPandaz, the PR has been rebased and now includes the sysinfo update. As discussed offline, we'll keep the custom cgroup implementation due to sysinfo still lacking some functionality.

@leosvelperez leosvelperez requested a review from FrozenPandaz May 15, 2026 13:14
nx-cloud Bot (Contributor) left a comment

Nx Cloud has identified a flaky task in your failed CI:

🔂 Since the failure was identified as flaky, we triggered a CI rerun by adding an empty commit to this branch.

vrxj81 pushed a commit to vrxj81/nx that referenced this pull request May 20, 2026
Completes the toolchain bump from #35665, which updated
rust-toolchain.toml to 1.95.0 but left mise.toml at 1.90.0.
CI installs rust via mise (sets RUSTUP_TOOLCHAIN, overriding
rust-toolchain.toml), so CI was still on 1.90.0 — failing the
sysinfo 0.39.1 MSRV check.
- Read GetProcessAffinityMask unconditionally so manual
  SetProcessAffinityMask is honored whether or not the process is in
  a Job Object. Matches the Linux arm's unconditional sched_getaffinity
  call and the behavior of Go, .NET, libuv, and OpenJDK's no-Job
  branch.
- Drop the redundant JOB_OBJECT_LIMIT_AFFINITY read; the kernel
  intersects Job-imposed affinity into the process mask, so the
  unconditional GetProcessAffinityMask already covers it.
- Extract shared cgroup/Job Object math into a cfg-free metrics_math
  module with cross-OS unit tests; align cgroup v1 and Job Object
  memory filtering via a shared predicate.
- Emit tracing::debug! on Win32 and /proc fallback paths so silent
  failures are diagnosable.
@FrozenPandaz FrozenPandaz merged commit 6c9d5e0 into master Jun 2, 2026
25 checks passed
@FrozenPandaz FrozenPandaz deleted the nxc-4445 branch June 2, 2026 15:26