@gberche-orange @rkoster
Summary
PR #599 (merged Jun 5 2026, fixing #585) introduced two regressions that prevent the BOSH agent from bootstrapping on warden/bosh-lite stemcells (confirmed broken in stemcell 1.1234).
Regression 1 — broken awk pattern (1e4a11482)
The fix changed the cgroup2 mount detection from:
cgroup_mount="$(awk '$3 == "cgroup2" { print $2 }' /proc/self/mounts)"
to:
cgroup_mount="$(awk '$1 == "cgroup2" && $3 == "cgroup2" { print $2 }' /proc/self/mounts)"
The intent was to avoid matching extra cgroup2 mounts added by Cilium workloads (#585). However, this breaks warden/Docker containers where the device field ($1) for the cgroup2 mount is "cgroup", not "cgroup2":
cgroup /sys/fs/cgroup cgroup2 rw,... ← $1 == "cgroup", never "cgroup2"
Result: cgroup_mount is always empty → nb_matching_cgroup_mounts=0 → permit_monit_access returns 1 → both the agent and monit runit run scripts exit immediately (they source this helper with set -e) → BOSH agent never starts on any warden VM.
The correct fix for #585 is to use exit after the first match, which selects the canonical system cgroup2 mount regardless of device name, and naturally ignores any extra Cilium mounts (which appear later in /proc/self/mounts):
cgroup_mount="$(awk '$3 == "cgroup2" { print $2; exit }' /proc/self/mounts)"
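To make the difference between the three patterns concrete, the sketch below runs each of them against a small sample mounts table. The sample content is an assumption for illustration only (a warden-style cgroup2 line with device name `cgroup`, followed by a hypothetical extra cgroup2 mount at `/run/cilium/cgroupv2`); it was not captured from a real stemcell.

```shell
#!/bin/sh
# Hypothetical mounts table; the real helper reads /proc/self/mounts.
mounts_file="$(mktemp)"
cat > "$mounts_file" <<'EOF'
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime 0 0
none /run/cilium/cgroupv2 cgroup2 rw,relatime 0 0
EOF

# Original pattern: prints every cgroup2 mount point (two lines here).
original="$(awk '$3 == "cgroup2" { print $2 }' "$mounts_file")"

# Pattern from 1e4a11482: also requires device name "cgroup2",
# so it matches nothing in this table.
regressed="$(awk '$1 == "cgroup2" && $3 == "cgroup2" { print $2 }' "$mounts_file")"

# Proposed pattern: stops at the first cgroup2 entry.
proposed="$(awk '$3 == "cgroup2" { print $2; exit }' "$mounts_file")"

printf 'original:  [%s]\n' "$(printf '%s' "$original" | tr '\n' ' ')"
printf 'regressed: [%s]\n' "$regressed"
printf 'proposed:  [%s]\n' "$proposed"
rm -f "$mounts_file"
```

With this sample, the original pattern yields two mount points, the regressed pattern yields none, and the proposed pattern yields only `/sys/fs/cgroup`. The proposed pattern relies on the system mount being listed before any extra mounts, as described above.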
Regression 2 — stale function name in restrict-monit-api-access (50cc4d1fc)
Commit 50cc4d1fc renamed monit_using_unified_cgroup_v2 → system_using_unified_cgroup_v2 in monit-access-helper.sh but did not update restrict-monit-api-access, which still calls the old name:
# restrict-monit-api-access (unchanged by PR #599)
source /var/vcap/bosh/etc/monit-access-helper.sh
if monit_using_unified_cgroup_v2; then ← function no longer exists
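Since monit_using_unified_cgroup_v2 is undefined, the call fails with a "command not found" error (exit status 127). Because the failure occurs in an if condition, the script does not abort; the condition simply evaluates to false and execution falls through to the cgroup v1 path, which installs an iptables DROP rule based on net_cls.classid: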
iptables -t mangle -I POSTROUTING -d 127.0.0.1 -p tcp --dport 2822 -m cgroup ! --cgroup 2958295041 -j DROP
On systems running a pure cgroup v2 hierarchy (the default on most current distributions), net_cls is unavailable, every process has classid 0, and 0 != 2958295041 is always true, so all connections to monit's port 2822 are permanently dropped, including those from the BOSH agent itself.
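The arithmetic behind that rule can be sketched as follows. The `rule_drops` function is a hypothetical, simplified model of the `! --cgroup` match, not the kernel's implementation; the only values taken from the issue are the classid in the rule and the classid 0 that processes report without net_cls.

```shell
#!/bin/sh
# classid used in the iptables rule quoted above
classid=2958295041

# net_cls classids have the form 0xAAAABBBB (major:minor handle).
major=$(( classid >> 16 ))
minor=$(( classid & 65535 ))
classid_hex="$(printf '0x%08x' "$classid")"
handle="$(printf '%x:%x' "$major" "$minor")"
echo "$classid_hex $handle"

# Simplified model of "-m cgroup ! --cgroup <classid> -j DROP":
# the packet is dropped when the sender's classid differs.
rule_drops() {
  [ "$1" -ne "$classid" ]
}

# Without net_cls, every sender reports classid 0.
if rule_drops 0; then verdict="DROP"; else verdict="ACCEPT"; fi
echo "sender classid 0 -> $verdict"
```

Under this model, a sender with classid 0 never equals the expected classid, so the verdict is always DROP, which matches the behaviour described above.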
Fix: update restrict-monit-api-access to call system_using_unified_cgroup_v2.
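A minimal sketch of why the stale name falls through, together with one possible guard against the same mistake recurring. The function bodies are placeholders and `require_function` is a hypothetical helper; neither is taken from the stemcell scripts.

```shell
#!/bin/sh
# Stand-in for monit-access-helper.sh after the rename (placeholder body).
system_using_unified_cgroup_v2() { return 0; }

# Hypothetical guard: fail loudly when an expected helper is not defined.
require_function() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "missing helper function: $1" >&2
    return 1
  fi
}

# Old name: undefined, so the call returns 127 and the else branch runs.
if monit_using_unified_cgroup_v2 2>/dev/null; then
  old_name_branch="v2"
else
  old_name_branch="v1"
fi

# New name: defined, so the v2 branch is taken.
if system_using_unified_cgroup_v2; then
  new_name_branch="v2"
else
  new_name_branch="v1"
fi

require_function monit_using_unified_cgroup_v2 2>/dev/null && old_ok=yes || old_ok=no
require_function system_using_unified_cgroup_v2 && new_ok=yes || new_ok=no
echo "old name -> $old_name_branch (defined: $old_ok)"
echo "new name -> $new_name_branch (defined: $new_ok)"
```

Calling something like `require_function` right after sourcing the helper would turn a renamed function into an immediate, visible error rather than a fall-through to the cgroup v1 path. Whether that belongs in the stemcell scripts is a design choice for the maintainers.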
Impact
Both regressions are independently fatal on warden stemcells. The chain of failures observed in stemcell 1.1234 running in a Docker/bosh-lite environment:
- Container boots, delay_monit_start removes /etc/service/monit symlink
- BOSH agent runit script sources monit-access-helper.sh and calls permit_monit_access → fails (Regression 1) → agent never starts
- If Regression 1 is fixed: monit starts, but restrict-monit-api-access installs the wrong DROP rule → all port 2822 connections blocked (Regression 2)
- BOSH director API (port 25555) never comes up → all BOSH operations time out
This was root-caused and verified by manually patching both regressions in a running stemcell 1.1234 container, after which the BOSH agent bootstrapped successfully.
References
1e4a11482, 50cc4d1fc