docs(acfs): document swarm resource isolation

Dicklesworthstone · Dicklesworthstone · commit cdf5fdc419c9 · 2026-05-08T05:20:16.000-04:00
diff --git a/.beads/issues.jsonl b/.beads/issues.jsonl
@@ -488,7 +488,7 @@
 {"id":"bd-8kk","title":"Replace default Next.js landing + metadata with ACFS wizard entry","description":"apps/web still has the default create-next-app landing page and metadata. Replace with an ACFS landing that links to the wizard (start at /wizard/os-selection) and update app metadata (title/description) to match ACFS.","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-20T20:22:03.455898Z","updated_at":"2025-12-20T20:24:29.617100Z","closed_at":"2025-12-20T20:24:29.617100Z","close_reason":"Replaced create-next-app landing + metadata with ACFS landing that links to /wizard/os-selection; bun lint + type-check clean.","source_repo":".","compaction_level":0,"original_size":0,"labels":["ux","web"]}
 {"id":"bd-8mv","title":"Add --strict flag for legacy checksum behavior","description":"# Task: Add --strict flag for legacy checksum behavior\n\n## Context\nPart of EPIC: Checkpoint-Based Checksum Recovery (agentic_coding_flywheel_setup-tx7)\n\n## What to Do\nAdd --strict flag that restores the original behavior where ANY checksum mismatch aborts installation.\n\n### Behavior\n- Without --strict (default): Use new recovery flow\n- With --strict: All tools treated as critical, any mismatch aborts\n\n### Use Cases\n- Security-conscious users who want no exceptions\n- CI/CD environments where reproducibility matters\n- Auditing/compliance scenarios\n\n## Acceptance Criteria\n- --strict flag parsed in argument handling\n- When set, all tools treated as critical\n- Help text documents the flag\n- Default behavior is recovery flow\n\n## Files to Modify\n- install.sh: Argument parsing, pass flag to security functions","status":"closed","priority":2,"issue_type":"task","created_at":"2025-12-21T17:44:39.830135Z","updated_at":"2025-12-21T20:04:39.956587Z","closed_at":"2025-12-21T20:04:39.956587Z","close_reason":"Added --strict flag to install.sh that sets ACFS_STRICT_MODE=true. When enabled, all tools are treated as CRITICAL and any checksum mismatch aborts installation. Updated header documentation.","source_repo":".","compaction_level":0,"original_size":0,"dependencies":[{"issue_id":"bd-8mv","depends_on_id":"bd-v8a","type":"blocks","created_at":"2025-12-21T17:47:33.579339Z","created_by":"daemon","metadata":"{}","thread_id":""}]}
 {"id":"bd-8pkn","title":"Fix contract tests: stub _acfs_is_interactive","status":"closed","priority":2,"issue_type":"bug","created_at":"2025-12-22T22:03:32.947459Z","updated_at":"2025-12-22T22:04:25.346479Z","closed_at":"2025-12-22T22:04:25.346479Z","close_reason":"Completed: test stubs now include _acfs_is_interactive required by contract","source_repo":".","compaction_level":0,"original_size":0}
-{"id":"bd-8qd3d","title":"Research optional systemd resource isolation profiles for agent CLIs","description":"## Parent program\n`bd-nlb8w`; next-best idea from the idea-wizard expansion; depends on capacity model `bd-e63fl`.\n\n## What\nResearch and design optional systemd/user-slice or shell-wrapper resource isolation profiles for agent CLIs on large ACFS hosts. The output should say whether ACFS should expose recommended CPU/memory weights or limits for Claude, Codex, Gemini, RCH, and support daemons.\n\n## Why\nOn 64+ core hosts, resource contention can still make the system feel broken if one class of process consumes all CPU, RAM, or I/O. Optional isolation can improve responsiveness without reducing user control.\n\n## How\nStudy current ACFS services/user service setup, systemd availability on target Ubuntu, and agent launch aliases. Prefer opt-in profiles and documented recommendations before enforcement.\n\n## Risks\nHard limits can kill expensive agent work or surprise users. This bead is research/design first; implementation should follow only if the design is clearly safe and reversible.\n\n## Success criteria\n- Documents whether systemd slices are appropriate for ACFS agent workflows.\n- Provides proposed resource classes and defaults if useful.\n- Defines tests or manual verification needed before any implementation.","status":"open","priority":2,"issue_type":"task","created_at":"2026-05-08T04:37:41.210895085Z","created_by":"ubuntu","updated_at":"2026-05-08T04:41:20.781111514Z","source_repo":".","compaction_level":0,"original_size":0,"labels":["idea-wizard","performance","safety","swarm"],"dependencies":[{"issue_id":"bd-8qd3d","depends_on_id":"bd-e63fl","type":"blocks","created_at":"2026-05-08T04:41:20.780497665Z","created_by":"ubuntu","metadata":"{}","thread_id":""}]}
+{"id":"bd-8qd3d","title":"Research optional systemd resource isolation profiles for agent CLIs","description":"## Parent program\n`bd-nlb8w`; next-best idea from the idea-wizard expansion; depends on capacity model `bd-e63fl`.\n\n## What\nResearch and design optional systemd/user-slice or shell-wrapper resource isolation profiles for agent CLIs on large ACFS hosts. The output should say whether ACFS should expose recommended CPU/memory weights or limits for Claude, Codex, Gemini, RCH, and support daemons.\n\n## Why\nOn 64+ core hosts, resource contention can still make the system feel broken if one class of process consumes all CPU, RAM, or I/O. Optional isolation can improve responsiveness without reducing user control.\n\n## How\nStudy current ACFS services/user service setup, systemd availability on target Ubuntu, and agent launch aliases. Prefer opt-in profiles and documented recommendations before enforcement.\n\n## Risks\nHard limits can kill expensive agent work or surprise users. This bead is research/design first; implementation should follow only if the design is clearly safe and reversible.\n\n## Success criteria\n- Documents whether systemd slices are appropriate for ACFS agent workflows.\n- Provides proposed resource classes and defaults if useful.\n- Defines tests or manual verification needed before any implementation.","status":"closed","priority":2,"issue_type":"task","created_at":"2026-05-08T04:37:41.210895085Z","created_by":"ubuntu","updated_at":"2026-05-08T09:19:07.265929220Z","closed_at":"2026-05-08T09:19:07.265593692Z","close_reason":"Documented opt-in systemd resource isolation design and verification plan","source_repo":".","compaction_level":0,"original_size":0,"labels":["idea-wizard","performance","safety","swarm"],"dependencies":[{"issue_id":"bd-8qd3d","depends_on_id":"bd-e63fl","type":"blocks","created_at":"2026-05-08T04:41:20.780497665Z","created_by":"ubuntu","metadata":"{}","thread_id":""}]}
 {"id":"bd-8wpc","title":"Add 'Learning Hub' to footer links","description":"Add Learning Hub link to the footer section alongside GitHub, NTM, Agent Mail links.","status":"closed","priority":1,"issue_type":"task","created_at":"2025-12-25T05:16:33.679471Z","updated_at":"2025-12-25T05:24:53.341681Z","closed_at":"2025-12-25T05:24:53.341681Z","close_reason":"Completed as part of parent task agentic_coding_flywheel_setup-umil","source_repo":".","compaction_level":0,"original_size":0,"dependencies":[{"issue_id":"bd-8wpc","depends_on_id":"bd-umil","type":"blocks","created_at":"2025-12-25T05:17:34.226998Z","created_by":"jemanuel","metadata":"{}","thread_id":""}]}
 {"id":"bd-8xx","title":"End-to-end test: acfs-update","description":"## What\nComprehensive testing of acfs-update:\n1. Test on fresh ACFS install (all tools present)\n2. Test on partial install (some tools missing)\n3. Test each category individually\n4. Test --dry-run mode\n5. Test --yes mode (non-interactive)\n6. Test failure handling (simulate failures)\n7. Test logging output\n\n## Test Scenarios\n- Fresh VPS with full ACFS install\n- Existing VPS with previous ACFS version\n- Missing tools (should skip gracefully)\n- Network failures (should report and continue)\n- Permission issues (should report and continue)\n\n## Considerations\n- Need a test VPS or Docker container\n- Some tests may need mocking (network failures)\n- Should test as both root and non-root user\n\n## Success Criteria\n- [ ] All categories update correctly\n- [ ] Dry-run shows accurate preview\n- [ ] Failures handled gracefully\n- [ ] Logs are comprehensive and useful\n- [ ] No regressions in existing functionality","status":"closed","priority":3,"issue_type":"task","created_at":"2025-12-21T18:27:10.046030Z","updated_at":"2025-12-21T20:38:12.190216Z","closed_at":"2025-12-21T20:38:12.190216Z","close_reason":"Created comprehensive Docker-based E2E test (tests/vm/test_acfs_update.sh) covering: --help, --dry-run, --quiet, --yes modes; category filters (--agents-only, --shell-only, --no-apt); log file creation; exit codes; missing tool handling; version display. Test follows same pattern as test_install_ubuntu.sh. Shellcheck clean.","source_repo":".","compaction_level":0,"original_size":0,"dependencies":[{"issue_id":"bd-8xx","depends_on_id":"bd-csv","type":"blocks","created_at":"2025-12-21T18:27:35.230950Z","created_by":"daemon","metadata":"{}","thread_id":""}]}
 {"id":"bd-8z5","title":"Create report_skipped_tools() post-install summary","description":"# Task: Create report_skipped_tools() post-install summary\n\n## Context\nPart of EPIC: Checkpoint-Based Checksum Recovery (agentic_coding_flywheel_setup-tx7)\n\n## What to Do\nAfter installation completes, report any tools that were skipped:\n\n### Output\n```\n⚠ The following tools were skipped due to checksum mismatches:\n    → ntm: https://raw.githubusercontent.com/.../install.sh\n    → bv: https://raw.githubusercontent.com/.../install.sh\n\nYou can install these manually after verifying they are safe.\nOr wait for ACFS to update checksums and run: acfs update --stack\n```\n\n### Storage\n- SKIPPED_TOOLS array populated during install\n- Also saved to state.json for persistence\n\n## Acceptance Criteria\n- Only shown if SKIPPED_TOOLS is non-empty\n- Shows tool name and installer URL\n- Provides manual install command\n- Mentions acfs update as alternative\n\n## Files to Modify\n- install.sh: Add report at end of successful install","status":"closed","priority":1,"issue_type":"task","created_at":"2025-12-21T17:44:38.550459Z","updated_at":"2025-12-21T19:48:42.459331Z","closed_at":"2025-12-21T19:48:42.459331Z","close_reason":"Added comprehensive skipped tools reporting: record_skipped_tool() with reason and URL, report_skipped_tools() showing detailed summary with manual install commands and acfs update alternative, get_skipped_tools_json() for state persistence.","source_repo":".","compaction_level":0,"original_size":0,"dependencies":[{"issue_id":"bd-8z5","depends_on_id":"bd-4jr","type":"blocks","created_at":"2025-12-21T17:47:33.367598Z","created_by":"daemon","metadata":"{}","thread_id":""}]}
diff --git a/docs/operations/swarm-resource-isolation.md b/docs/operations/swarm-resource-isolation.md
@@ -0,0 +1,103 @@
+# Swarm Resource Isolation Research
+
+This note records the `bd-8qd3d` design decision for optional resource isolation on large ACFS hosts.
+
+## Decision
+
+ACFS should not enforce CPU, memory, or I/O limits for agent CLIs by default.
+
+ACFS should expose an opt-in "balanced" resource profile later, implemented with `systemd-run --user --scope` wrappers for interactive agent commands and optional user-service drop-ins for support daemons. The first implementation should use scheduling weights and accounting before hard limits:
+
+- Prefer `CPUWeight=` and `IOWeight=` for agent/build/background classes.
+- Prefer `MemoryHigh=` only after a local capacity check can size it conservatively.
+- Avoid `MemoryMax=` for Claude, Codex, Gemini, and build commands by default.
+- Keep direct `cc`, `cod`, and `gmi` behavior unchanged unless the user opts in.
+
+This matches the current ACFS posture: the capacity model recommends agent counts and RCH offload, the shell aliases in `acfs/zsh/acfs.zshrc` launch the model CLIs directly, and the Agent Mail service is the only always-on user service that ACFS currently owns.
+
+## Why Systemd Is Appropriate
+
+Systemd slices are designed to group services/scopes under a cgroup resource-control tree, and resource settings on a slice apply to the units inside that slice. See the upstream `systemd.slice` manual and `systemd.resource-control` manual:
+
+- https://www.freedesktop.org/software/systemd/man/devel/systemd.slice.html
+- https://www.freedesktop.org/software/systemd/man/devel/systemd.resource-control.html
+
+For user sessions, systemd already defines user-side `app.slice`, `session.slice`, and `background.slice`. The upstream guidance describes `app.slice` for user applications, `session.slice` for latency-sensitive session services, and `background.slice` for non-interactive work:
+
+- https://www.freedesktop.org/software/systemd/man/256/systemd.special.html
+
+For shell-launched commands, `systemd-run --user --scope` is the least invasive approach because it creates a transient scope around a command instead of requiring a persistent service file. It also accepts `--slice=` and `--property=` options, so the scope can be assigned to a slice and given resource-control settings:
+
+- https://www.freedesktop.org/software/systemd/man/devel/systemd-run.html
+
+## Proposed Resource Classes
+
+These are recommendations for a future opt-in profile, not current installer behavior.
+
+| Class | Commands | Proposed controls | Rationale |
+| --- | --- | --- | --- |
+| `acfs-agent.slice` | `claude`, `codex`, `gemini`, `ntm`-spawned interactive agents | `CPUWeight=100`, `IOWeight=100`, `TasksMax=512`, no default `MemoryMax` | Keep agent sessions first-class and avoid killing expensive context-heavy work. |
+| `acfs-background.slice` | CASS indexing, maintenance sweeps, update jobs, local analysis that is not user-interactive | `CPUWeight=40`, `IOWeight=50`, optional `MemoryHigh=` from capacity model | Let interactive shells and support services stay responsive under load. |
+| `acfs-local-build.slice` | Local fallback build/test commands when RCH is unavailable | `CPUWeight=60`, `IOWeight=50`, no default `MemoryMax` | Local builds are expensive but should not freeze the host; RCH remains the preferred path. |
+| `acfs-support.slice` | Agent Mail, local dashboards, support-bundle helpers, lightweight telemetry collectors | `CPUWeight=80`, `IOWeight=100`, `TasksMax=256`, optional `MemoryHigh=1G` | Coordination daemons should remain responsive but are not expected to consume much CPU. |
+| `acfs-rch.slice` | RCH CLI/daemon-side local processes | `CPUWeight=100`, `IOWeight=100` | RCH is already a pressure-relief layer; do not penalize the offload path. |
+
+The first profile should expose these values as documented recommendations and generated examples. It should not automatically rewrite user aliases or service units.
+
+## Example Future Wrapper
+
+This is the shape to test before implementation:
+
+```bash
+systemd-run --user --scope --same-dir --collect \
+  --slice=acfs-agent.slice \
+  --property=CPUAccounting=yes \
+  --property=MemoryAccounting=yes \
+  --property=IOAccounting=yes \
+  --property=TasksAccounting=yes \
+  --property=CPUWeight=100 \
+  --property=IOWeight=100 \
+  --property=TasksMax=512 \
+  claude --dangerously-skip-permissions
+```
+
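+For Codex and Gemini, substitute the command after the properties. A wrapper must preserve the current working directory, environment variables, terminal behavior, exit status, and auth/config lookup paths. Note that `--property=` settings apply to the transient scope that `systemd-run` creates, not to `acfs-agent.slice` itself. Because weights are relative among sibling units under the same parent, weights meant to arbitrate between classes have to be set on the slice units.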
+
+## What To Avoid
+
+- Do not put hard `MemoryMax=` on model CLIs until real high-context sessions are tested.
+- Do not place all user processes under one capped `user-$UID.slice` profile; that risks surprising unrelated shells and editors.
+- Do not make `systemd-run` mandatory. Fresh VPS and container-like environments can have a missing or degraded user manager.
+- Do not wrap RCH-heavy commands in a way that hides the `rch exec --` policy or makes local fallback look like the preferred path.
+
+## Implementation Path
+
+1. Add an `acfs resource-profile` or `acfs capacity --resource-profile` report that prints the proposed classes for the current host.
+2. Add a shell helper such as `acfs_scope <class> -- <command...>` that falls back to direct execution when `systemd-run --user` is unavailable.
+3. Add opt-in aliases, for example `ccs`, `cods`, and `gmis`, instead of changing `cc`, `cod`, or `gmi`.
+4. Add optional drop-ins for ACFS-owned user services only, starting with `agent-mail.service.d/resource-profile.conf`.
+5. Add NTM integration only after direct shell-wrapper tests pass; NTM should receive explicit class hints rather than infer classes from command strings.
+
+## Verification Required Before Implementation
+
+Manual checks:
+
+```bash
+systemctl --user show-environment
+systemd-run --user --scope --same-dir --collect --property=CPUAccounting=yes true
+systemd-run --user --scope --same-dir --collect --slice=acfs-agent.slice --property=CPUWeight=100 claude --help
+systemd-cgls --user
+systemctl --user show acfs-agent.slice -p CPUAccounting -p CPUWeight -p TasksAccounting
+```
+
+Regression coverage:
+
+- Wrapper preserves the exit code of both succeeding and failing commands.
+- Wrapper preserves `PWD` and auth/config environment for Claude, Codex, and Gemini.
+- Wrapper falls back to direct execution when `systemd-run --user` or the user bus is unavailable.
+- Agent Mail drop-in can be enabled, disabled, and reverted without breaking health checks.
+- Capacity output explains that weights are relative and hard memory limits are opt-in.
+
+## Final Recommendation
+
+Systemd slices are appropriate for ACFS, but only as an opt-in profile. The safe first step is a report plus wrappers that apply CPU/I/O weights and accounting. Hard memory limits should wait for measured high-context agent sessions and explicit user consent.
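
The `acfs_scope <class> -- <command...>` helper proposed in step 2 of the implementation path could be sketched roughly as follows. This is a minimal illustration, not the ACFS implementation: the function body, the usage message, and the choice to probe the user manager with `systemctl --user show-environment` are assumptions. The slice naming (`acfs-<class>.slice`) follows the proposed resource classes, and the sketch only enables accounting on the scope, on the assumption that any weights are configured on the slice units themselves.

```bash
# Hypothetical sketch of the proposed acfs_scope helper (not existing ACFS code).
# Usage: acfs_scope <class> -- <command...>
acfs_scope() {
  local class="${1:-}"
  [ "$#" -gt 0 ] && shift
  # Drop the optional "--" separator between the class and the command.
  [ "${1:-}" = "--" ] && shift

  if [ -z "$class" ] || [ "$#" -eq 0 ]; then
    echo "usage: acfs_scope <class> -- <command...>" >&2
    return 2
  fi

  # Fall back to direct execution when systemd-run is missing or the user
  # manager is unreachable (fresh VPS, container-like environments).
  if ! command -v systemd-run >/dev/null 2>&1 ||
     ! systemctl --user show-environment >/dev/null 2>&1; then
    "$@"
    return $?
  fi

  # With --scope, systemd-run executes the command itself, so the
  # terminal, environment, and exit status pass through unchanged.
  systemd-run --user --scope --same-dir --collect --quiet \
    --slice="acfs-${class}.slice" \
    --property=CPUAccounting=yes \
    --property=MemoryAccounting=yes \
    --property=IOAccounting=yes \
    --property=TasksAccounting=yes \
    -- "$@"
}
```

With a helper of this shape, an opt-in alias such as `ccs` could expand to `acfs_scope agent -- claude --dangerously-skip-permissions` while `cc` stays untouched. One gap the sketch does not cover: if `systemd-run` is present but fails to create the scope, the command is not run at all, so a real helper would need to decide whether to retry with direct execution.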