Commit 6c9d5e0
fix(core): read pod cgroup limits instead of node limits in resource metrics (#35622)
## Current Behavior
When running inside a Linux container or Kubernetes pod, resource
metrics report the host node's CPU and memory totals instead of the
pod's limits. A pod with constrained CPU/memory shows the underlying
node's resources.
The same gap exists for process-isolated Windows containers running
inside a Windows Job Object (e.g. Docker `--cpus` / `--memory`): metrics
report host values rather than the Job's limits.
## Expected Behavior
Resource metrics report the effective CPU and memory limits enforced by
the kernel for the calling process — derived from the cgroup it belongs
to on Linux, or the Job Object on Windows. macOS native processes
continue to report host values (no equivalent enforcement primitive
exists).
## Implementation Details
### Linux (`cgroup` module)
Resolves the calling process's actual cgroup directory by parsing
`/proc/self/cgroup` + `/proc/self/mountinfo`. Reads `cpu.max` /
`memory.max` (cgroup v2) or `cpu.cfs_{quota,period}_us` /
`memory.limit_in_bytes` (cgroup v1, including the systemd `cpu,cpuacct`
co-mount).
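The discovery and parsing steps can be sketched as follows. This is a minimal illustration, not the PR's code: the function names `cgroup_path_for` and `parse_cpu_max` are hypothetical, and the sketch assumes the standard `/proc/self/cgroup` line shape (`id:controllers:path`) and the cgroup v2 `cpu.max` format (`<quota> <period>` or `max <period>`).

```rust
/// Hypothetical helper: find the cgroup path for `controller` in the text of
/// `/proc/self/cgroup`. v1 lines look like `4:cpu,cpuacct:/path`; the v2
/// unified line looks like `0::/path`. A v1 match wins; otherwise fall back
/// to the v2 entry (an assumption made for this sketch).
pub fn cgroup_path_for<'a>(proc_self_cgroup: &'a str, controller: &str) -> Option<&'a str> {
    let mut v2_path = None;
    for line in proc_self_cgroup.lines() {
        let mut parts = line.splitn(3, ':');
        let (Some(_id), Some(controllers), Some(path)) =
            (parts.next(), parts.next(), parts.next())
        else {
            continue; // skip malformed lines
        };
        if controllers.is_empty() {
            v2_path = Some(path);
        } else if controllers.split(',').any(|c| c == controller) {
            return Some(path);
        }
    }
    v2_path
}

/// Hypothetical helper: parse the contents of a cgroup v2 `cpu.max` file.
/// Returns `Some((quota_us, period_us))` for a finite quota, and `None` for
/// `max` (unlimited) or malformed input.
pub fn parse_cpu_max(contents: &str) -> Option<(u64, u64)> {
    let mut fields = contents.split_whitespace();
    let quota = fields.next()?;
    let period = fields.next()?.parse::<u64>().ok()?;
    if quota == "max" || period == 0 {
        return None;
    }
    Some((quota.parse::<u64>().ok()?, period))
}
```

The systemd `cpu,cpuacct` co-mount is why the controller field is split on commas rather than compared as a whole string.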
Walks from the leaf up to the mount point and takes the minimum finite
limit found at any level — the kernel enforces the tightest ancestor's
limit (hierarchical enforcement), so leaf-only reads can overreport when
a pod-level cgroup is tighter than the container's (common in K8s VPA
in-place resize and Burstable QoS).
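A minimal sketch of the ancestor walk, under the assumption that a caller-supplied closure reads the limit for one directory (returning `None` for "unlimited" or "file missing"). The name `tightest_limit` and the closure-based shape are illustrative only and are not taken from the PR.

```rust
use std::path::Path;

/// Hypothetical helper: walk from `leaf` up to `mount_point` (inclusive) and
/// return the smallest finite limit found at any level, or `None` if every
/// level is unlimited.
pub fn tightest_limit<F>(mount_point: &Path, leaf: &Path, read_limit: F) -> Option<u64>
where
    F: Fn(&Path) -> Option<u64>,
{
    let mut tightest: Option<u64> = None;
    let mut current = Some(leaf);
    while let Some(dir) = current {
        // Never walk above the cgroup mount point.
        if !dir.starts_with(mount_point) {
            break;
        }
        if let Some(limit) = read_limit(dir) {
            tightest = Some(tightest.map_or(limit, |t| t.min(limit)));
        }
        if dir == mount_point {
            break;
        }
        current = dir.parent();
    }
    tightest
}
```

With a container cgroup limited to 4 GB nested under a pod cgroup limited to 1 GB, a leaf-only read would report 4 GB, while this walk reports 1 GB.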
The quota-derived count is combined with `sched_getaffinity`, which
covers cpuset / taskset restrictions. We deliberately bypass
`std::thread::available_parallelism()` because rust-lang/rust's
implementation applies the cgroup quota with floor division internally,
which would silently underreport fractional quotas (e.g. 1.5 cores → 1).
Instead we compute `ceil(quota / period)`, matching HotSpot JVM, Go 1.25,
.NET, and the `num_cpus` crate.
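The rounding difference is easy to see in a small sketch. The helper names below are hypothetical, and the assumption that the quota-derived count and the affinity count are combined by taking the minimum is this sketch's reading of the description, not a statement about the PR's code.

```rust
/// Hypothetical helper: convert a CFS quota/period pair (microseconds) into a
/// whole number of cores using ceiling division. Returns `None` when either
/// value is zero, which this sketch treats as "no usable limit".
pub fn cores_from_quota(quota_us: u64, period_us: u64) -> Option<usize> {
    if quota_us == 0 || period_us == 0 {
        return None;
    }
    Some(quota_us.div_ceil(period_us) as usize)
}

/// Hypothetical helper: cap the affinity-derived core count by the
/// quota-derived one, if a quota exists.
pub fn linux_effective_cores(quota_cores: Option<usize>, affinity_cores: usize) -> usize {
    match quota_cores {
        Some(q) => q.min(affinity_cores),
        None => affinity_cores,
    }
}
```

A quota of 150000 µs over a 100000 µs period (1.5 cores) yields 2 here, whereas floor division would yield 1.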
mountinfo path fields containing spaces / tabs / newlines / backslashes
are kernel-encoded as `\040` / `\011` / `\012` / `\134` per `man 5
proc_pid_mountinfo`; we unescape before joining with the cgroup path.
`/proc/self/cgroup` itself emits paths raw (verified across kernels
v4.18 → v6.13), so no unescape is needed there.
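The unescape step could look roughly like this. It is a sketch with a hypothetical function name; it decodes any three-digit octal escape that fits in one byte, which covers the four sequences listed above, and leaves every other byte untouched.

```rust
/// Hypothetical helper: decode `\NNN` octal escapes in a mountinfo path field
/// (for example `\040` for a space). Bytes that are not part of a valid
/// three-digit octal escape are copied through unchanged.
pub fn unescape_mountinfo(field: &str) -> String {
    let bytes = field.as_bytes();
    let mut out: Vec<u8> = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'\\' && i + 3 < bytes.len() {
            let digits = &bytes[i + 1..i + 4];
            if digits.iter().all(|d| (b'0'..=b'7').contains(d)) {
                let value = digits
                    .iter()
                    .fold(0u16, |acc, d| acc * 8 + u16::from(d - b'0'));
                if value <= 0xFF {
                    out.push(value as u8);
                    i += 4;
                    continue;
                }
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    // Paths are not guaranteed to be UTF-8; this sketch replaces invalid
    // sequences rather than failing.
    String::from_utf8_lossy(&out).into_owned()
}
```

Working on bytes rather than `char`s keeps the decoder correct when an escaped byte sits next to multi-byte UTF-8 content.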
Replaces the prior leaf-only path lookups (which broke on cgroup v1
co-mount and any non-namespaced container setup). The in-tree module is
preferred over `sysinfo`'s `cgroup_limits()` (now parent-aware in 0.39 —
see *Additional Changes*) because it also covers CPU quota and the v1
co-mount + bind-mount edge cases in a single place we control. 30 unit
tests cover cgroup discovery, parsing, ancestor walk, and the v1 / v2 /
co-mount / bind-mount / cgroupns=host cases.
### Windows (`job_object` module)
Detects Job Object resource limits via the Win32 API:
- **CPU**: `ceil(host_cpu_count × CpuRate / 10000)` from
`JobObjectCpuRateControlInformation` when `HARD_CAP` is set (Docker
`--cpus` translates to HARD_CAP). Two further candidates are the
popcount of the Job's affinity mask (when `LIMIT_AFFINITY` is set) and
the popcount of the `GetProcessAffinityMask` result (covers Job + manual
+ system intersections). The reported value is the minimum of all
candidates.
- **Memory**: minimum of `ProcessMemoryLimit` / `JobMemoryLimit` /
`MaximumWorkingSetSize` from `JobObjectExtendedLimitInformation`, gated
on the corresponding `LIMIT_*` flags. Mirrors HotSpot and .NET.
- **Skipped**: `WEIGHT_BASED` rate control (relative priority, not a
hard limit) and soft-cap rate control (kernel allows transient bursts) —
neither maps to a defensible "available cores" number.
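The CPU arithmetic from the list above can be sketched without touching Win32 at all. The function names are hypothetical, the inputs stand in for values that would come from `QueryInformationJobObject` and `GetProcessAffinityMask`, and the sketch assumes `CpuRate` is expressed in hundredths of a percent of total host CPU (1 to 10000).

```rust
/// Hypothetical helper: cores implied by a HARD_CAP `CpuRate`, i.e.
/// ceil(host_cpus * cpu_rate / 10000). Returns `None` for out-of-range input.
pub fn cores_from_cpu_rate(host_cpus: u32, cpu_rate: u32) -> Option<u32> {
    if host_cpus == 0 || cpu_rate == 0 || cpu_rate > 10_000 {
        return None;
    }
    let scaled = u64::from(host_cpus) * u64::from(cpu_rate);
    Some(scaled.div_ceil(10_000) as u32)
}

/// Hypothetical helper: cores implied by an affinity mask (its popcount).
/// An all-zero mask is treated as "no information".
pub fn cores_from_affinity(mask: u64) -> Option<u32> {
    if mask == 0 {
        None
    } else {
        Some(mask.count_ones())
    }
}

/// Hypothetical helper: take the minimum of the host count and every
/// candidate that produced a value.
pub fn windows_effective_cores(host_cpus: u32, candidates: &[Option<u32>]) -> u32 {
    candidates
        .iter()
        .flatten()
        .copied()
        .fold(host_cpus, |acc, c| acc.min(c))
}
```

For example, a 1.5-core cap on an 8-core host corresponds to a rate of 1875, which gives `ceil(8 × 1875 / 10000) = 2`.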
Any Win32 failure is treated as "no information"; the caller falls back
to host values. No new crate dependency — the existing `winapi` dep is
extended with `jobapi`, `jobapi2`, `processthreadsapi`, `winbase`, and
`winnt` features.
#### Known limitation: nested Job hierarchies
[`QueryInformationJobObject(NULL,
...)`](https://learn.microsoft.com/en-us/windows/win32/api/jobapi2/nf-jobapi2-queryinformationjobobject)
returns the **immediate** Job's settings (per MSDN: *"If the job is
nested, the immediate job of the calling process is used."*) and Win32
exposes no documented API to enumerate parent Jobs.
Per
[`JOBOBJECT_CPU_RATE_CONTROL_INFORMATION`](https://learn.microsoft.com/en-us/windows/win32/api/winnt/ns-winnt-jobobject_cpu_rate_control_information)
Remarks, *"the rates set for the job represent its portion of the CPU
rate that is allocated to its parent job"* — nested rates compose
multiplicatively. So with parent HARD_CAP 50% × child HARD_CAP 50%, the
effective rate is 25% of host but we read the child's 50% and report
`ceil(host × 0.5)`. Per-process / per-job memory has the same shape
(kernel enforces min across the chain; we see only the immediate Job).
Affinity is unaffected — `GetProcessAffinityMask` returns the
kernel-effective mask. HotSpot and .NET CoreCLR exhibit the same memory
limitation; HARD_CAP CPU-rate detection goes beyond them for Docker
`--cpus` parity in the common single-silo case, accepting the nested-Job
overreport as the documented cost.
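To make the size of the overreport concrete, here is a small arithmetic sketch. The helper names are hypothetical, and the 16-core host is an assumed figure chosen only for illustration.

```rust
/// Hypothetical helper: compose a chain of HARD_CAP rates (outermost Job
/// first), each in hundredths of a percent of its parent's allocation.
pub fn composed_rate(chain: &[u32]) -> u32 {
    chain
        .iter()
        .fold(10_000u64, |acc, &rate| acc * u64::from(rate) / 10_000) as u32
}

/// Hypothetical helper: ceil(host_cpus * rate / 10000).
pub fn cores_for_rate(host_cpus: u32, rate: u32) -> u32 {
    (u64::from(host_cpus) * u64::from(rate)).div_ceil(10_000) as u32
}
```

With a 50% parent and a 50% child, `composed_rate(&[5000, 5000])` is 2500, so a 16-core host effectively offers 4 cores, while reading only the child's 5000 reports 8.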
### Cross-platform
`SystemInfo { cpuCores, totalMemory }` shape is unchanged — consumers do
not need to update. macOS native processes continue to report host
values (no container-style enforcement primitive exists; container
runtimes on macOS run Linux VMs where the Linux path applies).
The module doc-comments cross-reference Go 1.25
`internal/runtime/cgroup`, libuv `src/unix/linux.c`, OpenJDK
`cgroupSubsystem_linux.cpp` + `os_windows.cpp`, .NET CoreCLR
`gc/unix/cgroup.cpp` + `gc/windows/gcenv.windows.cpp`, and Rust stdlib
`library/std/src/sys/thread/unix.rs::cgroups`.
## Additional Changes
Two infrastructure chores are bundled into this PR as separate commits:
### `chore(core): bump sysinfo to 0.39.1`
Bumps `sysinfo` from `0.37.2` → `0.39.1`. No source changes were
required — all signatures we use (`System`, `Process`, `Pid`, `Signal`,
`Disks`, the `*RefreshKind` types, `UpdateKind`,
`MINIMUM_CPU_UPDATE_INTERVAL`) are unchanged. `Cargo.lock` deltas:
- New transitive dep `objc2-open-directory` from sysinfo 0.39's
soundness fix for user retrieval on Apple targets.
- `windows` family bumped to `0.62.x` per sysinfo's new constraint,
which lets the graph converge on single versions of `windows-core`,
`windows-link`, `windows-result`, and `windows-strings` — four duplicate
`windows-*` entries are deduplicated as a result.
sysinfo 0.39.0 also added `Process::cgroup_limits()` and parent-cgroup
memory walking inside `System::cgroup_limits()` — overlapping
conceptually with the in-tree `cgroup` module introduced by this PR. The
in-tree module is kept because it also covers CPU quota (sysinfo's
helpers are memory-only), the cgroup v1 co-mount, and bind-mount path
translation in a single implementation under our control. Future
consolidation onto the upstream APIs is possible but explicitly out of
scope here.
### `chore(repo): bump mise rust to 1.95.0 to match rust-toolchain.toml`
`rust-toolchain.toml` was bumped to `1.95.0` in #35665 to unblock the
sysinfo upgrade, but `mise.toml` was left at `1.90.0`. CI installs Rust
via mise (which exports `RUSTUP_TOOLCHAIN`, overriding
`rust-toolchain.toml`), so CI continued to run on `1.90.0` and failed
the sysinfo 0.39 MSRV check until this commit. The two files now agree.
1 parent 8403bca commit 6c9d5e0
8 files changed
Lines changed: 1334 additions & 78 deletions
File tree
- packages/nx
- src/native/metrics