PR #599 regresses warden/Docker stemcells: broken awk pattern and renamed function break restrict-monit-api-access #637

Description

@colins

@gberche-orange @rkoster

Summary

PR #599 (merged Jun 5 2026, fixing #585) introduced two regressions that prevent the BOSH agent from bootstrapping on warden/bosh-lite stemcells (confirmed broken in stemcell 1.1234).


Regression 1 — broken awk pattern (1e4a11482)

The fix changed the cgroup2 mount detection from:

cgroup_mount="$(awk '$3 == "cgroup2" { print $2 }' /proc/self/mounts)"

to:

cgroup_mount="$(awk '$1 == "cgroup2" && $3 == "cgroup2" { print $2 }' /proc/self/mounts)"

The intent was to avoid matching extra cgroup2 mounts added by Cilium workloads (#585). However, this breaks warden/Docker containers where the device field ($1) for the cgroup2 mount is "cgroup", not "cgroup2":

cgroup  /sys/fs/cgroup  cgroup2  rw,...   ← $1 == "cgroup", never "cgroup2"

Result: cgroup_mount is always empty → nb_matching_cgroup_mounts=0 → permit_monit_access returns 1 → both the agent and monit runit run scripts exit immediately (they source this helper with set -e) → BOSH agent never starts on any warden VM.
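The exit-on-failure step can be reproduced in isolation. This is a minimal sketch, not the real run script: the permit_monit_access body below is a simplified stand-in that only checks for an empty cgroup_mount, and the "agent started" line is a placeholder for the agent launch.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a runit run script; the real scripts and helper differ.
run_script='
set -e
cgroup_mount=""                      # what the PR #599 pattern yields on warden
permit_monit_access() {              # simplified stand-in, not the real helper
  [ -n "$cgroup_mount" ] || return 1
}
permit_monit_access                  # non-zero status under set -e ends the script
echo "agent started"                 # never reached
'

status=0
# Run in a separate bash process so that set -e behaves as it would in a run script.
output="$(bash -c "$run_script")" || status=$?
printf 'exit status: %s, output: "%s"\n' "$status" "$output"
# prints: exit status: 1, output: ""
```

The script is run through bash -c rather than a subshell because set -e is ignored inside a compound command that is itself the left-hand side of ||.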

The correct fix for #585 is to use exit after the first match, which selects the canonical system cgroup2 mount regardless of device name, and naturally ignores any extra Cilium mounts (which appear later in /proc/self/mounts):

cgroup_mount="$(awk '$3 == "cgroup2" { print $2; exit }' /proc/self/mounts)"
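The three patterns can be compared side by side on a small sample. This is a minimal sketch, assuming a hypothetical mounts listing in which the container's cgroup2 mount comes first and one extra cgroup2 mount follows it. The second mount's device name and path are illustrative and not taken from #585.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical sample of /proc/self/mounts inside a warden/Docker container.
mounts_file="$(mktemp)"
cat > "$mounts_file" <<'EOF'
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime 0 0
none /run/cilium/cgroupv2 cgroup2 rw,relatime 0 0
EOF

# Pattern before PR #599: matches every cgroup2 mount.
original="$(awk '$3 == "cgroup2" { print $2 }' "$mounts_file")"
# Pattern from PR #599: requires the device field to be "cgroup2" as well.
pr599="$(awk '$1 == "cgroup2" && $3 == "cgroup2" { print $2 }' "$mounts_file")"
# Proposed pattern: stop after the first cgroup2 mount.
proposed="$(awk '$3 == "cgroup2" { print $2; exit }' "$mounts_file")"

rm -f "$mounts_file"

printf 'original: [%s]\n' "$original"   # two paths, one per line
printf 'pr599:    [%s]\n' "$pr599"      # pr599:    []
printf 'proposed: [%s]\n' "$proposed"   # proposed: [/sys/fs/cgroup]
```

The proposed pattern relies on the system mount being listed before any additional cgroup2 mounts, as described above. On a host where that ordering does not hold, it would select a different path.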

Regression 2 — stale function name in restrict-monit-api-access (50cc4d1fc)

Commit 50cc4d1fc renamed monit_using_unified_cgroup_v2 → system_using_unified_cgroup_v2 in monit-access-helper.sh but did not update restrict-monit-api-access, which still calls the old name:

# restrict-monit-api-access (unchanged by PR #599)
source /var/vcap/bosh/etc/monit-access-helper.sh
if monit_using_unified_cgroup_v2; then   # ← function no longer exists

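Since monit_using_unified_cgroup_v2 is undefined, the call fails with "command not found" (exit status 127). Because the failure happens in an if condition, the script does not abort: the condition simply evaluates as false and the script falls through to the cgroup v1 path, which installs an iptables DROP rule based on net_cls.classid: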

iptables -t mangle -I POSTROUTING -d 127.0.0.1 -p tcp --dport 2822 -m cgroup ! --cgroup 2958295041 -j DROP

On systems running a pure cgroup v2 hierarchy (the default on most current distributions), the v1-only net_cls controller is unavailable, so every process has classid 0. Since 0 != 2958295041 is always true, all connections to monit's port 2822 are permanently dropped, including those from the BOSH agent itself.
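For reference, a net_cls class ID packs a 16-bit major and a 16-bit minor handle into one 32-bit value, so the decimal number in the rule can be decoded as sketched below. The snippet only illustrates the arithmetic and the negated comparison; it does not call iptables, and the verdict variable is a hypothetical name used for illustration.

```shell
#!/usr/bin/env bash
classid=2958295041                      # value from the DROP rule above

# Decode into hex and into major:minor handle form.
printf -v classid_hex '0x%08x' "$classid"
printf -v handle '%x:%x' "$(( classid >> 16 ))" "$(( classid & 0xffff ))"
echo "$classid_hex $handle"             # prints: 0xb0540001 b054:1

# Without net_cls, a process carries classid 0, so the negated match
# ("! --cgroup 2958295041") applies to it and the packet is dropped.
process_classid=0
if [ "$process_classid" -ne "$classid" ]; then
  verdict=DROP
else
  verdict=ACCEPT
fi
echo "classid $process_classid -> $verdict"   # prints: classid 0 -> DROP
```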

Fix: update restrict-monit-api-access to call system_using_unified_cgroup_v2.
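The fall-through can be reproduced with a stub, along with one possible guard against a repeat. This is a sketch under assumptions: the stub stands in for the real helper, require_function is a hypothetical name that exists in neither script, and declare -F is a bash builtin, so the guard only applies if the script runs under bash.

```shell
#!/usr/bin/env bash
set -e

# Stand-in for the helper after the rename: only the new name is defined.
system_using_unified_cgroup_v2() { return 0; }

# Stale call: status 127 inside an if condition does not trigger set -e,
# so execution continues on the else (cgroup v1) branch.
if monit_using_unified_cgroup_v2 2>/dev/null; then
  stale_path="v2"
else
  stale_path="v1"
fi

# Hypothetical guard: fail loudly when an expected helper function is missing.
require_function() {
  declare -F "$1" >/dev/null || { echo "missing helper function: $1" >&2; return 1; }
}

if require_function monit_using_unified_cgroup_v2 2>/dev/null; then
  guard_stale="ok"
else
  guard_stale="missing"
fi

# Fixed call: guard passes, and the v2 branch is taken.
require_function system_using_unified_cgroup_v2
if system_using_unified_cgroup_v2; then
  fixed_path="v2"
else
  fixed_path="v1"
fi

echo "stale=$stale_path guard=$guard_stale fixed=$fixed_path"
# prints: stale=v1 guard=missing fixed=v2
```

Placing such a guard directly after the source line would turn a future rename into an immediate, visible failure instead of a silent switch to the v1 path.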


Impact

Both regressions are independently fatal on warden stemcells. The chain of failures observed in stemcell 1.1234 running in a Docker/bosh-lite environment:

  1. Container boots, delay_monit_start removes /etc/service/monit symlink
  2. BOSH agent runit script sources monit-access-helper.sh and calls permit_monit_access → fails (Regression 1) → agent never starts
  3. If Regression 1 is fixed: monit starts, but restrict-monit-api-access installs the wrong DROP rule → all port 2822 connections blocked (Regression 2)
  4. BOSH director API (port 25555) never comes up → all BOSH operations time out

This was root-caused and verified by manually patching the three issues in a running stemcell 1.1234 container, after which the BOSH agent bootstrapped successfully.

