fix(core): read pod cgroup limits instead of node limits in resource metrics #35622
## Current Behavior

`rust-toolchain.toml` pins Rust to `1.94.0`. This blocks upgrading `sysinfo` to `0.39.x`, which requires Rust `1.95` and brings upstream support for several cgroup limit features that nx currently needs to implement locally.

## Expected Behavior

Toolchain bumped to `1.95.0`. Verified:

- `cargo build -p nx` produces zero new warnings vs. 1.94.
- `cargo build -p nx --all-targets` (compiles tests too) produces zero new warnings vs. 1.94 (identical warning set).
- Clippy delta: +11 new clippy lints (style suggestions only; not gating, none correctness-related).

## Related Issue(s)

N/A (maintenance/hygiene change). Unblocks the sysinfo bump that, in turn, lets PR #35622 drop its memory-side cgroup parsing in favor of upstream `Process::cgroup_limits()` + parent cgroup memory walking (sysinfo PRs [#1643](GuillaumeGomez/sysinfo#1643) and [#1651](GuillaumeGomez/sysinfo#1651)).
Heads up: sysinfo v0.39.0 (released 2026-05-11) now does the parent-cgroup memory walking this PR implements — see PR #1651 and the v0.39.0 CHANGELOG. nx is pinned to sysinfo 0.37.2; I've opened #35665 to bump the Rust toolchain to 1.95 (sysinfo 0.39's MSRV) so we can pull that in.
Suggestion: rebase onto sysinfo 0.39 and drop the memory-side cgroup parsing in cgroup.rs, but keep the genuinely-novel parts — the CPU cpu.max / cpu.cfs_* ancestor walk, cgroup v1 co-mount handling, mountinfo octal-escape parsing, and the Windows JobObject reads (no upstream equivalent for any of those; rust-lang/rust#143709 is the open tracking issue for the Windows side).
One concrete bug worth fixing regardless: read_v1_cpu_quota only treats q <= 0 as unlimited, but cgroup v1 emits the same large positive PAGE_COUNTER_MAX-style sentinel for CPU that read_v1_memory_limit already filters at line 346 — feeding that into cores_from_quota will overflow.
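A minimal sketch of the guard this comment asks for, assuming (as the comment states) that a cgroup v1 quota read can come back as a very large positive value. The name `cores_from_quota_checked`, the `MAX_PLAUSIBLE_CORES` bound, and the signature are hypothetical and are not taken from the PR's `cgroup.rs`; the point is only that an implausibly large quota is treated as "unlimited" before it reaches the rounding step.

```rust
/// Hypothetical upper bound on a believable core count; anything above it
/// is treated as "no limit" rather than as a real quota.
const MAX_PLAUSIBLE_CORES: i64 = 1 << 16;

/// Converts a cgroup v1 quota/period pair (both in microseconds) into a
/// core count, rounding up. Returns `None` when the pair does not describe
/// a finite limit.
fn cores_from_quota_checked(quota_us: i64, period_us: i64) -> Option<usize> {
    // Non-positive quota (for example -1) or an invalid period: no limit.
    if quota_us <= 0 || period_us <= 0 {
        return None;
    }
    // Ceil division written so it cannot overflow for quota >= 1, period >= 1.
    let cores = (quota_us - 1) / period_us + 1;
    // A sentinel-sized quota yields an absurd core count; treat as unlimited.
    if cores > MAX_PLAUSIBLE_CORES {
        return None;
    }
    Some(cores as usize)
}
```

With this shape, a quota of `150000` over a period of `100000` rounds up to 2 cores, while `-1` and `i64::MAX` both fall through to the caller's host-value fallback.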
@FrozenPandaz, the PR has been rebased and now includes the |
Completes the toolchain bump from #35665, which updated rust-toolchain.toml to 1.95.0 but left mise.toml at 1.90.0. CI installs rust via mise (sets RUSTUP_TOOLCHAIN, overriding rust-toolchain.toml), so CI was still on 1.90.0 — failing the sysinfo 0.39.1 MSRV check.
- Read GetProcessAffinityMask unconditionally so manual SetProcessAffinityMask is honored whether or not the process is in a Job Object. Matches the Linux arm's unconditional sched_getaffinity call and the behavior of Go, .NET, libuv, and OpenJDK's no-Job branch.
- Drop the redundant JOB_OBJECT_LIMIT_AFFINITY read; the kernel intersects Job-imposed affinity into the process mask, so the unconditional GetProcessAffinityMask already covers it.
- Extract shared cgroup/Job Object math into a cfg-free metrics_math module with cross-OS unit tests; align cgroup v1 and Job Object memory filtering via a shared predicate.
- Emit tracing::debug! on Win32 and /proc fallback paths so silent failures are diagnosable.
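Keeping the limit math free of `cfg` attributes and OS calls is what lets it be unit-tested on every platform. The sketch below illustrates that idea under assumptions and is not the PR's `metrics_math` module: all three function names are hypothetical, and the hard-cap formula `ceil(host_cpu_count × CpuRate / 10000)` is the one stated in the PR description.

```rust
/// Cores implied by a Job Object hard-cap CPU rate.
/// `cpu_rate` is expressed in cycles per 10,000 (10000 = the whole host).
fn cores_from_cpu_rate(host_cpu_count: u32, cpu_rate: u32) -> Option<u32> {
    if host_cpu_count == 0 || cpu_rate == 0 || cpu_rate > 10_000 {
        return None; // not a usable hard cap
    }
    let scaled = u64::from(host_cpu_count) * u64::from(cpu_rate);
    // Round up so a fractional allowance is not underreported.
    let cores = scaled.div_ceil(10_000).max(1);
    Some(cores as u32)
}

/// Cores permitted by a process affinity mask: one per set bit.
fn cores_from_affinity_mask(mask: u64) -> Option<u32> {
    match mask.count_ones() {
        0 => None, // an empty mask carries no information
        n => Some(n),
    }
}

/// Tightest known limit among the candidates, never above the host count.
/// `None` entries mean "no information" and are skipped.
fn effective_cores(host_cpu_count: u32, candidates: &[Option<u32>]) -> u32 {
    candidates
        .iter()
        .flatten()
        .copied()
        .min()
        .map_or(host_cpu_count, |c| c.min(host_cpu_count))
}
```

The platform-specific callers would only gather raw values (the Job's `CpuRate`, the mask from `GetProcessAffinityMask`) and pass them in, so a failed Win32 call maps naturally to `None` and the host count is used.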
Current Behavior
When running inside a Linux container or Kubernetes pod, resource metrics report the host node's CPU and memory totals instead of the pod's limits. A pod with constrained CPU/memory shows the underlying node's resources.
The same gap exists for process-isolated Windows containers running inside a Windows Job Object (e.g. Docker `--cpus`/`--memory`): metrics report host values rather than the Job's limits.

Expected Behavior
Resource metrics report the effective CPU and memory limits enforced by the kernel for the calling process — derived from the cgroup it belongs to on Linux, or the Job Object on Windows. macOS native processes continue to report host values (no equivalent enforcement primitive exists).
Implementation Details
Linux (`cgroup` module)

Resolves the calling process's actual cgroup directory by parsing `/proc/self/cgroup` + `/proc/self/mountinfo`. Reads `cpu.max` / `memory.max` (cgroup v2) or `cpu.cfs_{quota,period}_us` / `memory.limit_in_bytes` (cgroup v1, including the systemd `cpu,cpuacct` co-mount).

Walks from the leaf up to the mount point and takes the minimum finite limit found at any level. The kernel enforces the tightest ancestor's limit (hierarchical enforcement), so leaf-only reads can overreport when a pod-level cgroup is tighter than the container's (common in K8s VPA in-place resize and Burstable QoS).
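The ancestor walk can be sketched as follows. This is an illustration under assumptions rather than the PR's `cgroup` module: the function names are hypothetical, only the cgroup v2 `cpu.max` file is handled, and a missing or unreadable file at any level is simply skipped.

```rust
use std::fs;
use std::path::Path;

/// Parses a cgroup v2 `cpu.max` line ("<quota> <period>" or "max <period>")
/// into a fractional core limit. `None` means no finite limit at this level.
fn parse_cpu_max(contents: &str) -> Option<f64> {
    let mut fields = contents.split_whitespace();
    let quota = fields.next()?;
    let period: f64 = fields.next()?.parse().ok()?;
    if quota == "max" || period <= 0.0 {
        return None;
    }
    let quota: f64 = quota.parse().ok()?;
    (quota > 0.0).then(|| quota / period)
}

/// Walks from `leaf` up to `mount_point` (inclusive) and returns the
/// smallest finite `cpu.max` limit found at any level.
fn min_cpu_limit(mount_point: &Path, leaf: &Path) -> Option<f64> {
    let mut tightest: Option<f64> = None;
    for dir in leaf.ancestors() {
        if !dir.starts_with(mount_point) {
            break; // never look above the cgroup mount point
        }
        let limit = fs::read_to_string(dir.join("cpu.max"))
            .ok()
            .and_then(|s| parse_cpu_max(&s));
        if let Some(l) = limit {
            // Keep the tightest limit seen so far.
            tightest = Some(tightest.map_or(l, |t| t.min(l)));
        }
    }
    tightest
}
```

With a pod-level `cpu.max` of `150000 100000` and a container-level `cpu.max` of `400000 100000`, the walk yields 1.5 cores (the pod's tighter limit), where a leaf-only read would have reported 4.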
Composition with `sched_getaffinity` covers cpuset / taskset restrictions. We deliberately bypass `std::thread::available_parallelism()` because rust-lang/rust's implementation applies the cgroup quota with floor division internally, which would silently underreport fractional quotas (e.g. 1.5 cores → 1). Instead we ceil quota / period, matching HotSpot JVM, Go 1.25, .NET, and the `num_cpus` crate.

mountinfo path fields containing spaces / tabs / newlines / backslashes are kernel-encoded as `\040` / `\011` / `\012` / `\134` per `man 5 proc_pid_mountinfo`; we unescape before joining with the cgroup path. `/proc/self/cgroup` itself emits paths raw (verified across kernels v4.18 → v6.13), so no unescape is needed there.

Replaces the prior leaf-only path lookups (which broke on cgroup v1 co-mount and any non-namespaced container setup). The in-tree module is preferred over `sysinfo`'s `cgroup_limits()` (now parent-aware in 0.39; see Additional Changes) because it also covers CPU quota and the v1 co-mount + bind-mount edge cases in a single place we control. 30 unit tests cover cgroup discovery, parsing, ancestor walk, and the v1 / v2 / co-mount / bind-mount / cgroupns=host cases.

Windows (`job_object` module)

Detects Job Object resource limits via the Win32 API:
- `ceil(host_cpu_count × CpuRate / 10000)` from `JobObjectCpuRateControlInformation` when `HARD_CAP` is set (Docker `--cpus` translates to HARD_CAP). Plus the popcount of the Job's affinity mask (when `LIMIT_AFFINITY` is set), and `GetProcessAffinityMask` (covers Job + manual + system intersections). Takes the minimum.
- `ProcessMemoryLimit` / `JobMemoryLimit` / `MaximumWorkingSetSize` from `JobObjectExtendedLimitInformation`, gated on the corresponding `LIMIT_*` flags. Mirrors HotSpot and .NET.
- `WEIGHT_BASED` rate control (relative priority, not a hard limit) and soft-cap rate control (kernel allows transient bursts): neither maps to a defensible "available cores" number.

Any Win32 failure is treated as "no information"; the caller falls back to host values. No new crate dependency: the existing `winapi` dep is extended with `jobapi`, `jobapi2`, `processthreadsapi`, `winbase`, and `winnt` features.

Known limitation: nested Job hierarchies
`QueryInformationJobObject(NULL, ...)` returns the immediate Job's settings (per MSDN: "If the job is nested, the immediate job of the calling process is used.") and Win32 exposes no documented API to enumerate parent Jobs.

Per the `JOBOBJECT_CPU_RATE_CONTROL_INFORMATION` Remarks, "the rates set for the job represent its portion of the CPU rate that is allocated to its parent job", so nested rates compose multiplicatively. With parent HARD_CAP 50% × child HARD_CAP 50%, the effective rate is 25% of host, but we read the child's 50% and report `ceil(host × 0.5)`. Per-process / per-job memory has the same shape (kernel enforces min across the chain; we see only the immediate Job).

Affinity is unaffected: `GetProcessAffinityMask` returns the kernel-effective mask. HotSpot and .NET CoreCLR exhibit the same memory limitation; HARD_CAP CPU-rate detection goes beyond them for Docker `--cpus` parity in the common single-silo case, accepting the nested-Job overreport as the documented cost.

Cross-platform
`SystemInfo { cpuCores, totalMemory }` shape is unchanged, so consumers do not need to update. macOS native processes continue to report host values (no container-style enforcement primitive exists; container runtimes on macOS run Linux VMs where the Linux path applies).

The module doc-comments cross-reference Go 1.25 `internal/runtime/cgroup`, libuv `src/unix/linux.c`, OpenJDK `cgroupSubsystem_linux.cpp` + `os_windows.cpp`, .NET CoreCLR `gc/unix/cgroup.cpp` + `gc/windows/gcenv.windows.cpp`, and Rust stdlib `library/std/src/sys/thread/unix.rs::cgroups`.

Additional Changes
Two infrastructure chores are bundled into this PR as separate commits:
`chore(core): bump sysinfo to 0.39.1`

Bumps `sysinfo` from `0.37.2` → `0.39.1`. No source changes were required: all signatures we use (`System`, `Process`, `Pid`, `Signal`, `Disks`, the `*RefreshKind` types, `UpdateKind`, `MINIMUM_CPU_UPDATE_INTERVAL`) are unchanged.

`Cargo.lock` deltas:

- `objc2-open-directory` from sysinfo 0.39's soundness fix for user retrieval on Apple targets.
- `windows` family bumped to `0.62.x` per sysinfo's new constraint, which lets the graph converge on single versions of `windows-core`, `windows-link`, `windows-result`, and `windows-strings`; four duplicate `windows-*` entries are deduplicated as a result.

sysinfo 0.39.0 also added `Process::cgroup_limits()` and parent-cgroup memory walking inside `System::cgroup_limits()`, overlapping conceptually with the in-tree `cgroup` module introduced by this PR. The in-tree module is kept because it also covers CPU quota (sysinfo's helpers are memory-only), the cgroup v1 co-mount, and bind-mount path translation in a single implementation under our control. Future consolidation onto the upstream APIs is possible but explicitly out of scope here.

`chore(repo): bump mise rust to 1.95.0 to match rust-toolchain.toml`

`rust-toolchain.toml` was bumped to `1.95.0` in #35665 to unblock the sysinfo upgrade, but `mise.toml` was left at `1.90.0`. CI installs Rust via mise (which exports `RUSTUP_TOOLCHAIN`, overriding `rust-toolchain.toml`), so CI continued to run on `1.90.0` and failed the sysinfo 0.39 MSRV check until this commit. The two files now agree.