Commit 6c9d5e0

fix(core): read pod cgroup limits instead of node limits in resource metrics (#35622)
## Current Behavior

When running inside a Linux container or Kubernetes pod, resource metrics report the host node's CPU and memory totals instead of the pod's limits. A pod with constrained CPU/memory shows the underlying node's resources.

The same gap exists for process-isolated Windows containers running inside a Windows Job Object (e.g. Docker `--cpus` / `--memory`): metrics report host values rather than the Job's limits.

## Expected Behavior

Resource metrics report the effective CPU and memory limits enforced by the kernel for the calling process, derived from the cgroup it belongs to on Linux or the Job Object on Windows. macOS native processes continue to report host values (no equivalent enforcement primitive exists).

## Implementation Details

### Linux (`cgroup` module)

Resolves the calling process's actual cgroup directory by parsing `/proc/self/cgroup` + `/proc/self/mountinfo`. Reads `cpu.max` / `memory.max` (cgroup v2) or `cpu.cfs_{quota,period}_us` / `memory.limit_in_bytes` (cgroup v1, including the systemd `cpu,cpuacct` co-mount).

Walks from the leaf up to the mount point and takes the minimum finite limit found at any level. The kernel enforces the tightest ancestor's limit (hierarchical enforcement), so leaf-only reads can overreport when a pod-level cgroup is tighter than the container's (common in K8s VPA in-place resize and Burstable QoS).

Composition with `sched_getaffinity` covers cpuset / taskset restrictions. We deliberately bypass `std::thread::available_parallelism()` because rust-lang/rust's implementation applies the cgroup quota with floor division internally, which would silently underreport fractional quotas (e.g. 1.5 cores → 1). Instead we ceil quota / period, matching HotSpot JVM, Go 1.25, .NET, and the `num_cpus` crate.

mountinfo path fields containing spaces / tabs / newlines / backslashes are kernel-encoded as `\040` / `\011` / `\012` / `\134` per `man 5 proc_pid_mountinfo`; we unescape before joining with the cgroup path.
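As a rough illustration of the quota rounding, ancestor walk, and mountinfo unescaping described above, here is a minimal Rust sketch. The function names (`cores_from_cpu_max`, `tightest_limit`, `unescape_mountinfo`) are hypothetical and are not taken from the in-tree `cgroup` module. The sketch operates on strings that have already been read from the cgroup files and performs no I/O.

```rust
/// Parse the contents of a cgroup v2 `cpu.max` file ("<quota> <period>" or
/// "max <period>") into a whole number of cores, rounding up.
/// Returns `None` when there is no finite quota or the input is malformed.
/// (Hypothetical helper, for illustration only.)
fn cores_from_cpu_max(contents: &str) -> Option<u64> {
    let mut fields = contents.split_whitespace();
    let quota = fields.next()?;
    let period: u64 = fields.next()?.parse().ok()?;
    if quota == "max" || period == 0 {
        return None;
    }
    let quota: u64 = quota.parse().ok()?;
    // Ceiling division: 150000 / 100000 (1.5 cores) reports 2, not 1.
    // Clamp to at least one core so a tiny quota never reports zero.
    Some(quota.div_ceil(period).max(1))
}

/// Take the minimum finite limit across the leaf cgroup and its ancestors.
/// `None` entries are levels with no limit ("max").
fn tightest_limit(levels: &[Option<u64>]) -> Option<u64> {
    levels.iter().flatten().copied().min()
}

/// Decode the three-digit octal escapes (`\040`, `\011`, `\012`, `\134`)
/// that appear in mountinfo path fields. Anything that is not a valid
/// escape is copied through unchanged.
fn unescape_mountinfo(field: &str) -> String {
    let bytes = field.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'\\' && i + 3 < bytes.len() {
            let digits = &bytes[i + 1..i + 4];
            if digits.iter().all(|d| (b'0'..=b'7').contains(d)) {
                let value = digits
                    .iter()
                    .fold(0u32, |acc, d| acc * 8 + u32::from(d - b'0'));
                if value <= 0xFF {
                    out.push(value as u8);
                    i += 4;
                    continue;
                }
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    String::from_utf8_lossy(&out).into_owned()
}
```

With these helpers, a `cpu.max` of `150000 100000` yields `Some(2)` where floor division would give 1, and a leaf limit of 4 cores under a pod-level limit of 2 resolves to 2.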
`/proc/self/cgroup` itself emits paths raw (verified across kernels v4.18 → v6.13), so no unescape is needed there.

Replaces the prior leaf-only path lookups (which broke on cgroup v1 co-mount and any non-namespaced container setup). The in-tree module is preferred over `sysinfo`'s `cgroup_limits()` (now parent-aware in 0.39; see *Additional Changes*) because it also covers CPU quota and the v1 co-mount + bind-mount edge cases in a single place we control.

30 unit tests cover cgroup discovery, parsing, ancestor walk, and the v1 / v2 / co-mount / bind-mount / cgroupns=host cases.

### Windows (`job_object` module)

Detects Job Object resource limits via the Win32 API:

- **CPU**: `ceil(host_cpu_count × CpuRate / 10000)` from `JobObjectCpuRateControlInformation` when `HARD_CAP` is set (Docker `--cpus` translates to HARD_CAP). Plus the popcount of the Job's affinity mask (when `LIMIT_AFFINITY` is set), and `GetProcessAffinityMask` (covers Job + manual + system intersections). Takes the minimum.
- **Memory**: minimum of `ProcessMemoryLimit` / `JobMemoryLimit` / `MaximumWorkingSetSize` from `JobObjectExtendedLimitInformation`, gated on the corresponding `LIMIT_*` flags. Mirrors HotSpot and .NET.
- **Skipped**: `WEIGHT_BASED` rate control (relative priority, not a hard limit) and soft-cap rate control (kernel allows transient bursts). Neither maps to a defensible "available cores" number.

Any Win32 failure is treated as "no information"; the caller falls back to host values.

No new crate dependency: the existing `winapi` dep is extended with the `jobapi`, `jobapi2`, `processthreadsapi`, `winbase`, and `winnt` features.
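The CPU arithmetic from the list above can be sketched in plain Rust. This is only an illustration of the formula and the "take the minimum" step: the function names and the `Option`-based inputs are assumptions, and the sketch makes no Win32 calls (the real module obtains these values from `QueryInformationJobObject` and `GetProcessAffinityMask`).

```rust
/// Cores implied by a Job Object hard cap. `cpu_rate` is the `CpuRate`
/// value, expressed in cycles per 10,000 cycles (10_000 = 100% of host).
/// (Hypothetical helper, for illustration only.)
fn cores_from_hard_cap(host_cpu_count: u32, cpu_rate: u32) -> u32 {
    let scaled = u64::from(host_cpu_count) * u64::from(cpu_rate);
    // Round up so a fractional allowance (e.g. 1.5 cores) reports 2.
    let cores = scaled.div_ceil(10_000) as u32;
    cores.max(1)
}

/// Combine the available signals by taking the minimum. Each input is
/// `None` when the corresponding limit is not set or could not be read,
/// in which case it is ignored; with no signals the host count is used.
fn effective_cores(
    host_cpu_count: u32,
    hard_cap_rate: Option<u32>,
    job_affinity_mask: Option<u64>,
    process_affinity_mask: Option<u64>,
) -> u32 {
    let candidates = [
        hard_cap_rate.map(|rate| cores_from_hard_cap(host_cpu_count, rate)),
        // The number of set bits in an affinity mask is the number of
        // logical processors the process may run on.
        job_affinity_mask.map(u64::count_ones),
        process_affinity_mask.map(u64::count_ones),
    ];
    candidates
        .iter()
        .flatten()
        .copied()
        .filter(|&n| n > 0) // an empty mask carries no information
        .min()
        .unwrap_or(host_cpu_count)
        .min(host_cpu_count)
        .max(1)
}
```

On an 8-core host, a hard cap of `5000` (50%) gives 4 cores; combined with a two-bit Job affinity mask the result is 2, since the tighter of the two limits applies.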
#### Known limitation: nested Job hierarchies

[`QueryInformationJobObject(NULL, ...)`](https://learn.microsoft.com/en-us/windows/win32/api/jobapi2/nf-jobapi2-queryinformationjobobject) returns the **immediate** Job's settings (per MSDN: *"If the job is nested, the immediate job of the calling process is used."*) and Win32 exposes no documented API to enumerate parent Jobs.

Per [`JOBOBJECT_CPU_RATE_CONTROL_INFORMATION`](https://learn.microsoft.com/en-us/windows/win32/api/winnt/ns-winnt-jobobject_cpu_rate_control_information) Remarks, *"the rates set for the job represent its portion of the CPU rate that is allocated to its parent job"*, so nested rates compose multiplicatively. With parent HARD_CAP 50% × child HARD_CAP 50%, the effective rate is 25% of host, but we read the child's 50% and report `ceil(host × 0.5)`.

Per-process / per-job memory has the same shape (kernel enforces min across the chain; we see only the immediate Job). Affinity is unaffected: `GetProcessAffinityMask` returns the kernel-effective mask.

HotSpot and .NET CoreCLR exhibit the same memory limitation; HARD_CAP CPU-rate detection goes beyond them for Docker `--cpus` parity in the common single-silo case, accepting the nested-Job overreport as the documented cost.

### Cross-platform

`SystemInfo { cpuCores, totalMemory }` shape is unchanged, so consumers do not need to update. macOS native processes continue to report host values (no container-style enforcement primitive exists; container runtimes on macOS run Linux VMs where the Linux path applies).

The module doc-comments cross-reference Go 1.25 `internal/runtime/cgroup`, libuv `src/unix/linux.c`, OpenJDK `cgroupSubsystem_linux.cpp` + `os_windows.cpp`, .NET CoreCLR `gc/unix/cgroup.cpp` + `gc/windows/gcenv.windows.cpp`, and Rust stdlib `library/std/src/sys/thread/unix.rs::cgroups`.
## Additional Changes

Two infrastructure chores are bundled into this PR as separate commits:

### `chore(core): bump sysinfo to 0.39.1`

Bumps `sysinfo` from `0.37.2` → `0.39.1`. No source changes were required: all signatures we use (`System`, `Process`, `Pid`, `Signal`, `Disks`, the `*RefreshKind` types, `UpdateKind`, `MINIMUM_CPU_UPDATE_INTERVAL`) are unchanged.

`Cargo.lock` deltas:

- New transitive dep `objc2-open-directory` from sysinfo 0.39's soundness fix for user retrieval on Apple targets.
- `windows` family bumped to `0.62.x` per sysinfo's new constraint, which lets the graph converge on single versions of `windows-core`, `windows-link`, `windows-result`, and `windows-strings`. Four duplicate `windows-*` entries are deduplicated as a result.

sysinfo 0.39.0 also added `Process::cgroup_limits()` and parent-cgroup memory walking inside `System::cgroup_limits()`, overlapping conceptually with the in-tree `cgroup` module introduced by this PR. The in-tree module is kept because it also covers CPU quota (sysinfo's helpers are memory-only), the cgroup v1 co-mount, and bind-mount path translation in a single implementation under our control. Future consolidation onto the upstream APIs is possible but explicitly out of scope here.

### `chore(repo): bump mise rust to 1.95.0 to match rust-toolchain.toml`

`rust-toolchain.toml` was bumped to `1.95.0` in #35665 to unblock the sysinfo upgrade, but `mise.toml` was left at `1.90.0`. CI installs Rust via mise (which exports `RUSTUP_TOOLCHAIN`, overriding `rust-toolchain.toml`), so CI continued to run on `1.90.0` and failed the sysinfo 0.39 MSRV check until this commit. The two files now agree.
1 parent 8403bca commit 6c9d5e0

8 files changed

Lines changed: 1334 additions & 78 deletions

Cargo.lock

Lines changed: 48 additions & 73 deletions
Some generated files are not rendered by default.

mise.toml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ bun = "1.3"
 java = "24"
 node = "{{ env['NODE_VERSION'] | default(value='24.11.0') }}"
 maven = "3.9.11"
-rust = "1.90.0"
+rust = "1.95.0"
 vale = "3.13.1"
 
 # Ensure that the packageManager field gets used to resolve the correct version of pnpm

packages/nx/Cargo.toml

Lines changed: 17 additions & 2 deletions
@@ -53,7 +53,7 @@ swc_common = "0.31.16"
 swc_ecma_parser = { version = "0.137.1", features = ["typescript"] }
 swc_ecma_visit = "0.93.0"
 swc_ecma_ast = "0.107.0"
-sysinfo = "0.37.2"
+sysinfo = "0.39.1"
 rand = "0.9.0"
 tar = "0.4.44"
 terminal-colorsaurus = "0.4.0"
@@ -75,7 +75,17 @@ static_assertions = "1.1"
 wrap-ansi = "0.1"
 
 [target.'cfg(windows)'.dependencies]
-winapi = { version = "0.3", features = ["fileapi", "psapi"] }
+winapi = { version = "0.3", features = [
+"fileapi",
+"psapi",
+# Used by the metrics collector to detect Job Object CPU rate, affinity,
+# and memory limits — Windows analog to the Linux cgroup detection.
+"jobapi",
+"jobapi2",
+"processthreadsapi",
+"winbase",
+"winnt",
+] }
 
 [target.'cfg(windows)'.build-dependencies]
 winres = "0.1"
@@ -92,6 +102,11 @@ uuid = "1"
 mio = "1.0"
 nix = { version = "0.30.0", features = ["fs", "process", "signal"] }
 
+# Used by the metrics collector to read sched_getaffinity directly, bypassing
+# std::thread::available_parallelism's floor-rounded cgroup quota handling.
+[target.'cfg(target_os = "linux")'.dependencies]
+libc = "0.2"
+
 [target.'cfg(not(target_arch = "wasm32"))'.dependencies]
 arboard = { version = "3.4.1", features = ["wayland-data-control"] }
 crossterm = { version = "0.29.0", features = ["event-stream", "use-dev-tty"] }
