Discuss: per-agent cgroup v2 sub-scope isolation to prevent one agent starving siblings #192

@SevenX77

Context

When CCB is launched inside a systemd service with TasksMax=N (e.g. a CI runner, a sandbox wrapper, or our own orchestrator tool that spins up per-task scopes), all provider agents share that single cgroup's task budget. One heavy agent, for example a codex agent running pytest test/ with many workers, can exhaust the shared TasksMax and starve its siblings: once the limit is reached, every fork or thread creation in the cgroup fails with EAGAIN. We've seen codex panic with WouldBlock: Resource temporarily unavailable in exactly this scenario.

This is a budgeting/fairness issue inside the keeper's cgroup. It's not addressable by tweaking tmux or provider CLIs individually; it needs per-agent cgroup scoping.

Proposal

Per-agent cgroup v2 sub-directories under the CCB keeper's cgroup. Each provider agent's tmux pane process is migrated into agent-<name>/ with its own pids.max and memory.max.

  • Feature-flag gated: CCB_PER_AGENT_SUBCGROUP=1 (default off, no behavior change for existing users)
  • Graceful degradation: if delegation is unavailable (cgroup v1 host, missing controllers, no write permission), migration is a no-op with a WARNING log
  • Wiring point: a single call site in lib/cli/services/runtime_launch_runtime/tmux_runtime.py::launch_tmux_runtime, right after launch_pane returns; it uses tmux display-message -p '#{pane_pid}' to locate the pane shell PID and writes that PID to cgroup.procs of the sub-cgroup
  • New module lib/provider_core/subcgroup.py (~150 lines), fully unit-tested (40 tests)
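
A minimal sketch of the migration step described in the list above, for discussion only; it is not the code in the fork commit. The function names, the logger name, and the limit parameters are hypothetical. It assumes the keeper's cgroup directory is already known, and it leaves out enabling controllers through the parent's cgroup.subtree_control, which a real host would also need.

```python
import logging
import os
import subprocess
from pathlib import Path

log = logging.getLogger("subcgroup_sketch")  # hypothetical logger name

FLAG = "CCB_PER_AGENT_SUBCGROUP"  # flag name taken from the proposal


def pane_pid(target: str) -> int:
    """Ask tmux for the PID of the shell running in the given pane."""
    out = subprocess.run(
        ["tmux", "display-message", "-p", "-t", target, "#{pane_pid}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return int(out.strip())


def migrate_pane(parent: Path, agent: str, pid: int,
                 pids_max: int, memory_max: str, env=os.environ) -> bool:
    """Move pid into parent/agent-<agent>/ and return True on success.

    Any OSError (missing controller files, no write permission, missing
    parent) turns the call into a no-op with a WARNING log.
    """
    if env.get(FLAG) != "1":
        return False  # flag off: no behavior change
    sub = parent / f"agent-{agent}"
    try:
        sub.mkdir(exist_ok=True)
        # Set the limits before moving the process so it never runs unlimited.
        (sub / "pids.max").write_text(f"{pids_max}\n")
        (sub / "memory.max").write_text(f"{memory_max}\n")
        (sub / "cgroup.procs").write_text(f"{pid}\n")
    except OSError as exc:
        log.warning("sub-cgroup migration skipped for %s: %s", agent, exc)
        return False
    return True
```

Children forked by the pane shell after the write to cgroup.procs inherit the sub-cgroup, so the limits cover everything the agent later spawns. Passing env explicitly keeps the flag check easy to unit-test.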

Full design doc: RFC (it lives in SevenX77's personal notes; the fork commit links to it).

Reference implementation on our fork branch: commit d633ddf on SevenX77/personal.

Prerequisites

For the feature to have an effect, the keeper must be started with cgroup v2 delegation, e.g.:

systemd-run --user -p "Delegate=pids memory cpu" -p TasksMax=<N> --unit ... -- ccb ...

The companion tool we use to wrap CCB into a sibling scope (claude-ccb-orchestrator) already passes this.
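
For illustration, a keeper could check at startup whether the delegation prerequisite holds before attempting any migration. The sketch below is hypothetical (none of these names come from the fork) and assumes the standard cgroup v2 mount at /sys/fs/cgroup; it only inspects the keeper's own cgroup for the pids and memory controllers and for write access.

```python
import os
from pathlib import Path

REQUIRED = ("pids", "memory")  # controllers the proposal sets limits on


def v2_path(proc_cgroup_text):
    """Return the cgroup v2 path from /proc/<pid>/cgroup content, or None."""
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        # The unified (v2) hierarchy is the entry of the form "0::<path>".
        if len(parts) == 3 and parts[0] == "0" and parts[1] == "":
            return parts[2]
    return None


def missing_controllers(controllers_text, required=REQUIRED):
    """List required controllers absent from a cgroup.controllers file."""
    return sorted(set(required) - set(controllers_text.split()))


def delegation_problem(proc_cgroup=Path("/proc/self/cgroup"),
                       mount=Path("/sys/fs/cgroup")):
    """Return a reason why delegation looks unusable, or None if it looks fine."""
    try:
        path = v2_path(Path(proc_cgroup).read_text())
        if path is None:
            return "no cgroup v2 entry (cgroup v1 host?)"
        cgroup_dir = Path(mount) / path.lstrip("/")
        text = (cgroup_dir / "cgroup.controllers").read_text()
    except OSError as exc:
        return f"cannot inspect cgroup: {exc}"
    missing = missing_controllers(text)
    if missing:
        return "missing controllers: " + ", ".join(missing)
    if not os.access(cgroup_dir, os.W_OK):
        return "no write permission on " + str(cgroup_dir)
    return None
```

A non-None result maps onto the "no-op with a WARNING log" path from the proposal; the returned string can be used directly as the log message.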

Questions

  1. Would you be open to a PR for this? (fork-first route OK; we're happy to keep it in our fork if upstream scope is narrower)
  2. If yes, any preferred location for the helper module? lib/provider_core/ felt natural since it's provider-agnostic infrastructure.
  3. Default-on in the future, or permanently behind a flag?

Related

Sibling change already under discussion: #191 (discussion: .ccb/ccb.config.ccb.example)

Other PRs from the same author: #185 (merged), #186 (merged), #188, #189, #190.