feat: parallel-arm orchestration helpers (#123, Phase A)

sriumcp · claude · sriumcp · commit da89dad337ff · 2026-05-24T08:47:46.000-04:00
Stacks on #133 (which stacks on #121). Phase A ships the orchestration
layer that turns experiment_plan.yaml into a flat list of independent
units, fans them out via an injected runner, and deterministically
merges their results into a findings-shaped dict. The actual SDK
subagent fan-out + worktree isolation per unit (the issue's main
thrust) is Phase B, once #121 + #133 merge.

Why partition first: the 5/18 mech-design-enforcement session ran
8 conditions × 3 seeds = 24 simulations sequentially in one Sonnet
session. That 2.5-hour mega-session is what produced the connection
drops and the race-two-executors bug. Decomposing into small
independent units is the prerequisite to parallel execution; once the
units exist as data, the run path can be sync (Phase A) or concurrent
over SDK subagents (Phase B, via an anyio task group; anyio has no
gather) without touching the partitioner or merge.

Phase A surface:

partition_plan(plan) -> list[ArmUnit]
    Turns experiment_plan.yaml into one ArmUnit per
    (arm × condition × seed). The default seed when none is specified
    is "seed-1"; multi-seed conditions fan out. Conditions with no
    command are skipped. Each unit's relative_results_dir is
    results/<arm>/<seed>, so units that differ in arm or seed never
    write to the same path (the path does not include the condition
    name, so conditions in one arm that share a seed share a
    directory).

run_units(units, *, runner, max_parallel) -> list[ArmUnitResult]
    Runs each unit through the injected runner. Catches runner
    exceptions and converts them to failed ArmUnitResults so a single
    arm crashing doesn't abort the iteration. Returns results in input
    order so callers can pair them deterministically.

merge_unit_results(results, *, plan) -> dict
    Deterministic merge into a findings-shaped structure: arms grouped
    by arm_id (sorted), arm.status="failed" when any unit failed, units
    within an arm sorted by (seed, condition). Byte-equal across
    repeated calls, which is the criterion the issue asks for. The plan
    argument is accepted but not yet read in Phase A.

failed_units(results) -> list[ArmUnit]
    Helper for partial-retry: which units need re-running?

default_max_parallel() -> int
    The min(CPU, 4) default the issue calls out.
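As a rough sketch of how a Phase B run path could keep the run_units contract described above (results in input order, runner exceptions converted to failed results) while running units concurrently: the example below uses the standard library's asyncio rather than anyio so that it has no third-party dependency, and `Unit`, `UnitResult`, `run_units_concurrently`, and `demo_runner` are hypothetical stand-ins, not names from this PR.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:  # hypothetical stand-in for ArmUnit
    arm_id: str
    seed: str


@dataclass
class UnitResult:  # hypothetical stand-in for ArmUnitResult
    unit: Unit
    status: str
    error: str = ""


async def run_units_concurrently(units, runner, max_parallel):
    """Run an async runner over units, at most max_parallel at a time."""
    if max_parallel < 1:
        raise ValueError("max_parallel must be >= 1")
    gate = asyncio.Semaphore(max_parallel)

    async def run_one(unit):
        async with gate:
            try:
                return await runner(unit)
            except Exception as exc:  # same policy as the sync run_units
                return UnitResult(unit, "failed", f"{type(exc).__name__}: {exc}")

    # gather returns results in argument order, regardless of finish order.
    return await asyncio.gather(*(run_one(u) for u in units))


async def demo_runner(unit):
    # Higher-numbered seeds finish sooner, so finish order differs from input order.
    await asyncio.sleep(0.03 - 0.01 * int(unit.seed[-1]))
    if unit.arm_id == "h-ablation":
        raise RuntimeError("boom")
    return UnitResult(unit, "complete")


units = [Unit("h-main", "s1"), Unit("h-main", "s2"), Unit("h-ablation", "s1")]
results = asyncio.run(run_units_concurrently(units, demo_runner, max_parallel=2))
print([(r.unit.arm_id, r.unit.seed, r.status) for r in results])
```

Because the ordering and failure policy live in the wrapper rather than in the runner, a merge step that consumes the result list would not need to know whether the units ran sequentially or concurrently.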
Behavioral tests (14 in tests/test_parallel_arms.py):

partition_plan:
- single arm/condition with default seed
- multi-seed condition fans out
- multiple arms × conditions: 3 units; sorted assertion
- results_dir doesn't overlap across seeds
- arm whose condition has no command is skipped

run_units:
- results in input order (the determinism contract for merge)
- runner exception becomes failed unit, doesn't abort run
- max_parallel < 1 raises ValueError

merge_unit_results:
- arms grouped by arm_id, sorted
- arm.status="failed" when any unit failed
- failed_unit_count + total_unit_count correct
- byte-equal across repeated calls
- units within arm sorted by (seed, condition)

failed_units:
- returns only failed units (the partial-retry contract)

Out of scope (Phase B):
- SDKDispatcher integration: a runner that actually spawns
  Agent(isolation="worktree") per unit
- anyio task group + semaphore for real parallelism
- Wire-up into iteration.py so EXECUTE_ANALYZE picks parallel mode when
  max_parallel_arms > 1
- Wall-clock measurement on a multi-arm campaign (the "significantly
  less wall-clock" criterion)

Test suite (this branch, stacked on #133): 346 + 14 new = 360 passing.

Refs #120, #123. Stacked on #143 (#133) which stacks on #136 (#121).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/orchestrator/parallel_arms.py b/orchestrator/parallel_arms.py
@@ -0,0 +1,198 @@
+"""Parallel-arm execution orchestration (issue #123, Phase A).
+
+After DESIGN produces ``experiment_plan.yaml``, EXECUTE_ANALYZE today
+runs every (arm × seed × condition) tuple sequentially in one Sonnet
+session. That mega-session is what produced the 5/18 connection-drop
+incidents and is the proximate cause of the "race two executors" bug
+that #71/#111 partly fixed at the symptom level.
+
+The fix: partition the plan into independent units, fan them out to
+per-unit subagents (each in its own worktree via #133), wait for all,
+and run the existing deterministic merge into findings.json +
+principle_updates.json.
+
+Phase A scope:
+
+  * partition_plan(plan): turn experiment_plan.yaml into a flat list
+    of ArmUnit descriptors.
+  * run_units(units, *, runner, max_parallel): fan out via an injected
+    runner callable, collect ArmUnitResult records (one per unit).
+  * merge_unit_results(results, *, plan): deterministic merge into a
+    findings-shaped dict (the schema validation step is reused from
+    the existing executor pipeline).
+
+Phase B (lands when #121 + #133 merge):
+
+  * SDKDispatcher integration: the runner spawns
+    ``Agent(isolation="worktree", subagent_type="claude")`` per unit.
+  * A real ``anyio`` task group for actual parallelism with a
+    CPU-bounded semaphore.
+  * Wire-up into iteration.py so EXECUTE_ANALYZE picks parallel mode
+    when ``max_parallel_arms > 1``.
+"""
+from __future__ import annotations
+
+import os
+from dataclasses import dataclass, field
+from typing import Callable
+
+
+@dataclass(frozen=True)
+class ArmUnit:
+    """A single (arm, seed, condition) work item."""
+
+    arm_id: str
+    seed: str
+    condition_name: str
+    command: str
+
+    @property
+    def relative_results_dir(self) -> str:
+        """Where this unit's results land; distinct per (arm_id, seed)."""
+        return f"results/{self.arm_id}/{self.seed}"
+
+
+@dataclass
+class ArmUnitResult:
+    unit: ArmUnit
+    status: str  # "complete" | "failed"
+    duration_ms: int = 0
+    output_files: list[str] = field(default_factory=list)
+    error: str = ""
+
+
+def partition_plan(plan: dict) -> list[ArmUnit]:
+    """Turn an experiment_plan.yaml-shaped dict into a list of ArmUnits.
+
+    Each (arm × condition) becomes one unit. Seed defaults to ``"seed-1"``
+    when the condition doesn't carry an explicit seed list; multi-seed
+    conditions fan out to one unit per seed.
+    """
+    units: list[ArmUnit] = []
+    for arm in plan.get("arms", []) or []:
+        if not isinstance(arm, dict):
+            continue
+        arm_id = str(arm.get("arm_id") or arm.get("type") or "?")
+        for cond in arm.get("conditions", []) or []:
+            if not isinstance(cond, dict):
+                continue
+            command = str(cond.get("command") or cond.get("cmd") or "")
+            if not command:
+                continue
+            cond_name = str(cond.get("name") or cond.get("id") or "default")
+            seeds = cond.get("seeds") or [cond.get("seed") or "seed-1"]
+            if not isinstance(seeds, list):
+                seeds = [str(seeds)]
+            for s in seeds:
+                units.append(ArmUnit(
+                    arm_id=arm_id,
+                    seed=str(s),
+                    condition_name=cond_name,
+                    command=command,
+                ))
+    return units
+
+
+ArmRunner = Callable[[ArmUnit], ArmUnitResult]
+"""Callable that executes one ArmUnit and returns its result.
+
+The Phase B implementation is expected to spawn an SDK subagent with
+``isolation="worktree"`` and the unit's command. Tests inject a
+deterministic fake.
+"""
+
+
+def run_units(
+    units: list[ArmUnit],
+    *,
+    runner: ArmRunner,
+    max_parallel: int | None = None,
+) -> list[ArmUnitResult]:
+    """Fan out units to the runner.
+
+    ``max_parallel`` is an upper bound on simultaneous in-flight runner
+    calls. Phase A calls the runner one unit at a time, so the bound
+    holds trivially and the value is only validated. Phase B replaces
+    this with an ``anyio`` task group + a semaphore for real parallelism.
+
+    Returns results in the same order as ``units`` so callers can pair
+    them deterministically with their inputs (the merge step re-sorts,
+    and its stable sort keeps this order for units with equal keys).
+    """
+    if max_parallel is not None and max_parallel < 1:
+        raise ValueError("max_parallel must be >= 1")
+    results: list[ArmUnitResult] = []
+    for unit in units:
+        try:
+            result = runner(unit)
+        except Exception as exc:  # runner exceptions become failed units
+            result = ArmUnitResult(
+                unit=unit,
+                status="failed",
+                error=f"{type(exc).__name__}: {exc}",
+            )
+        results.append(result)
+    return results
+
+
+def default_max_parallel() -> int:
+    """Issue default: ``min(CPU, 4)``."""
+    cpus = os.cpu_count() or 1
+    return max(1, min(cpus, 4))
+
+
+def merge_unit_results(
+    results: list[ArmUnitResult],
+    *,
+    plan: dict | None = None,
+) -> dict:
+    """Deterministic merge of unit results into a findings-shaped dict.
+
+    Output keys (sorted):
+      - ``arms``: list of ``{arm_id, status, units}`` rows
+      - ``failed_unit_count``: int
+      - ``total_unit_count``: int
+
+    No timestamps, no random ordering. Calling twice on the same input
+    must produce byte-equal output.
+    """
+    by_arm: dict[str, list[ArmUnitResult]] = {}
+    for r in results:
+        by_arm.setdefault(r.unit.arm_id, []).append(r)
+
+    arms_out: list[dict] = []
+    for arm_id in sorted(by_arm):
+        arm_results = by_arm[arm_id]
+        # Arm status: complete only when every unit completed; otherwise
+        # failed. Granular per-unit status is preserved in `units`.
+        any_failed = any(r.status == "failed" for r in arm_results)
+        arms_out.append({
+            "arm_id": arm_id,
+            "status": "failed" if any_failed else "complete",
+            "units": [
+                {
+                    "seed": r.unit.seed,
+                    "condition": r.unit.condition_name,
+                    "status": r.status,
+                    "duration_ms": r.duration_ms,
+                    "output_files": sorted(r.output_files),
+                    "error": r.error,
+                }
+                for r in sorted(
+                    arm_results,
+                    key=lambda x: (x.unit.seed, x.unit.condition_name),
+                )
+            ],
+        })
+
+    failed_count = sum(1 for r in results if r.status == "failed")
+    return {
+        "arms": arms_out,
+        "failed_unit_count": failed_count,
+        "total_unit_count": len(results),
+    }
+
+
+def failed_units(results: list[ArmUnitResult]) -> list[ArmUnit]:
+    """Helper for the partial-retry path: which units need re-running?"""
+    return [r.unit for r in results if r.status == "failed"]
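One property of `relative_results_dir` in the module above is worth checking explicitly: the path is built from `arm_id` and `seed` only, so two conditions in the same arm that fall back to the same seed map to one directory. The sketch below reproduces that layout on a hypothetical stand-in class and compares it with a hypothetical alternative that also includes the condition name; `Unit`, `per_condition_dir`, and `shared_dirs` are illustrative names, not part of this PR.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:  # hypothetical stand-in for ArmUnit
    arm_id: str
    seed: str
    condition_name: str

    @property
    def relative_results_dir(self) -> str:
        # Same layout as the diff: arm and seed only.
        return f"results/{self.arm_id}/{self.seed}"

    @property
    def per_condition_dir(self) -> str:
        # Hypothetical alternative that also includes the condition.
        return f"results/{self.arm_id}/{self.condition_name}/{self.seed}"


def shared_dirs(units, attr):
    """Return the directories that more than one unit maps to."""
    counts = Counter(getattr(u, attr) for u in units)
    return sorted(d for d, n in counts.items() if n > 1)


# Two conditions in one arm, both using the default seed.
units = [
    Unit("h-main", "seed-1", "a"),
    Unit("h-main", "seed-1", "b"),
    Unit("h-ablation", "seed-1", "c"),
]
print(shared_dirs(units, "relative_results_dir"))  # ['results/h-main/seed-1']
print(shared_dirs(units, "per_condition_dir"))     # []
```

A check along these lines could run right after partitioning, before any unit is dispatched, if overlapping output directories turn out to matter for worktree-isolated runs.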
diff --git a/tests/test_parallel_arms.py b/tests/test_parallel_arms.py
@@ -0,0 +1,192 @@
+"""Behavioral tests for the parallel-arm orchestration (#123 Phase A)."""
+from __future__ import annotations
+
+import json
+
+import pytest
+
+from orchestrator.parallel_arms import (
+    ArmUnit,
+    ArmUnitResult,
+    failed_units,
+    merge_unit_results,
+    partition_plan,
+    run_units,
+)
+
+
+# ─── Plan partitioning ─────────────────────────────────────────────────────
+
+class TestPartitionPlan:
+
+    def test_single_arm_single_condition_default_seed(self):
+        plan = {"arms": [{
+            "arm_id": "h-main",
+            "conditions": [{"name": "baseline", "command": "./blis run"}],
+        }]}
+        units = partition_plan(plan)
+        assert len(units) == 1
+        assert units[0].arm_id == "h-main"
+        assert units[0].seed == "seed-1"
+        assert units[0].condition_name == "baseline"
+        assert units[0].command == "./blis run"
+
+    def test_multi_seed_condition_fans_out(self):
+        plan = {"arms": [{
+            "arm_id": "h-main",
+            "conditions": [{
+                "name": "x", "command": "./run",
+                "seeds": ["s1", "s2", "s3"],
+            }],
+        }]}
+        units = partition_plan(plan)
+        assert len(units) == 3
+        assert sorted(u.seed for u in units) == ["s1", "s2", "s3"]
+
+    def test_multiple_arms_and_conditions(self):
+        plan = {"arms": [
+            {"arm_id": "h-main", "conditions": [
+                {"name": "a", "command": "./a"},
+                {"name": "b", "command": "./b"},
+            ]},
+            {"arm_id": "h-ablation", "conditions": [
+                {"name": "c", "command": "./c"},
+            ]},
+        ]}
+        units = partition_plan(plan)
+        assert len(units) == 3
+        ids = sorted((u.arm_id, u.condition_name) for u in units)
+        assert ids == [("h-ablation", "c"), ("h-main", "a"), ("h-main", "b")]
+
+    def test_relative_results_dir_does_not_overlap(self):
+        plan = {"arms": [{
+            "arm_id": "h-main",
+            "conditions": [{
+                "name": "x", "command": "./run", "seeds": ["s1", "s2"],
+            }],
+        }]}
+        units = partition_plan(plan)
+        dirs = {u.relative_results_dir for u in units}
+        assert len(dirs) == 2  # s1 and s2 land in different paths
+
+    def test_skips_arms_without_command(self):
+        plan = {"arms": [{
+            "arm_id": "h-main",
+            "conditions": [{"name": "no-cmd"}],
+        }]}
+        assert partition_plan(plan) == []
+
+
+# ─── Run units ─────────────────────────────────────────────────────────────
+
+class _RecordingRunner:
+    def __init__(self, statuses: dict[str, str] | None = None):
+        self.calls: list[ArmUnit] = []
+        self.statuses = statuses or {}
+
+    def __call__(self, unit: ArmUnit) -> ArmUnitResult:
+        self.calls.append(unit)
+        status = self.statuses.get(unit.arm_id, "complete")
+        return ArmUnitResult(
+            unit=unit, status=status, duration_ms=100,
+            output_files=[f"{unit.relative_results_dir}/out.json"],
+        )
+
+
+class TestRunUnits:
+
+    def test_results_returned_in_input_order(self):
+        units = [
+            ArmUnit("h-main", "s1", "x", "./a"),
+            ArmUnit("h-main", "s2", "x", "./a"),
+            ArmUnit("h-ablation", "s1", "y", "./b"),
+        ]
+        runner = _RecordingRunner()
+        results = run_units(units, runner=runner)
+        assert [r.unit.seed for r in results] == ["s1", "s2", "s1"]
+
+    def test_runner_exception_becomes_failed_unit(self):
+        units = [ArmUnit("h-main", "s1", "x", "./a")]
+
+        def crash(_):
+            raise RuntimeError("boom")
+
+        results = run_units(units, runner=crash)
+        assert results[0].status == "failed"
+        assert "boom" in results[0].error
+        assert "RuntimeError" in results[0].error
+
+    def test_max_parallel_must_be_positive(self):
+        with pytest.raises(ValueError):
+            run_units([], runner=_RecordingRunner(), max_parallel=0)
+
+
+# ─── Merge ─────────────────────────────────────────────────────────────────
+
+class TestMergeUnitResults:
+
+    def _results(self) -> list[ArmUnitResult]:
+        return [
+            ArmUnitResult(
+                unit=ArmUnit("h-main", "s1", "x", "./a"),
+                status="complete", duration_ms=100,
+                output_files=["results/h-main/s1/out.json"],
+            ),
+            ArmUnitResult(
+                unit=ArmUnit("h-main", "s2", "x", "./a"),
+                status="complete", duration_ms=120,
+                output_files=["results/h-main/s2/out.json"],
+            ),
+            ArmUnitResult(
+                unit=ArmUnit("h-ablation", "s1", "y", "./b"),
+                status="failed", error="exit 1",
+            ),
+        ]
+
+    def test_arms_grouped_by_arm_id(self):
+        out = merge_unit_results(self._results())
+        ids = [a["arm_id"] for a in out["arms"]]
+        # Sorted for determinism.
+        assert ids == ["h-ablation", "h-main"]
+
+    def test_arm_status_failed_when_any_unit_failed(self):
+        out = merge_unit_results(self._results())
+        by_id = {a["arm_id"]: a for a in out["arms"]}
+        assert by_id["h-ablation"]["status"] == "failed"
+        assert by_id["h-main"]["status"] == "complete"
+
+    def test_failed_count_correct(self):
+        out = merge_unit_results(self._results())
+        assert out["failed_unit_count"] == 1
+        assert out["total_unit_count"] == 3
+
+    def test_byte_equal_across_repeated_calls(self):
+        a = json.dumps(merge_unit_results(self._results()), sort_keys=True)
+        b = json.dumps(merge_unit_results(self._results()), sort_keys=True)
+        assert a == b
+
+    def test_units_within_arm_sorted_by_seed_and_condition(self):
+        results = [
+            ArmUnitResult(unit=ArmUnit("h-main", "s2", "b", "./x"), status="complete"),
+            ArmUnitResult(unit=ArmUnit("h-main", "s1", "a", "./x"), status="complete"),
+            ArmUnitResult(unit=ArmUnit("h-main", "s1", "b", "./x"), status="complete"),
+        ]
+        out = merge_unit_results(results)
+        seeds = [u["seed"] for u in out["arms"][0]["units"]]
+        conds = [u["condition"] for u in out["arms"][0]["units"]]
+        assert list(zip(seeds, conds)) == [("s1", "a"), ("s1", "b"), ("s2", "b")]
+
+
+# ─── Partial-retry helper ──────────────────────────────────────────────────
+
+class TestFailedUnits:
+
+    def test_returns_only_failed_units(self):
+        results = [
+            ArmUnitResult(unit=ArmUnit("h-main", "s1", "x", "./a"), status="complete"),
+            ArmUnitResult(unit=ArmUnit("h-main", "s2", "x", "./a"), status="failed"),
+            ArmUnitResult(unit=ArmUnit("h-ablation", "s1", "y", "./b"), status="failed"),
+        ]
+        failed = failed_units(results)
+        assert len(failed) == 2
+        assert [(u.arm_id, u.seed) for u in failed] == [("h-main", "s2"), ("h-ablation", "s1")]
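The partial-retry contract exercised by the last test could be used roughly as follows. This is a minimal sketch under stated assumptions: `Unit`, `UnitResult`, `retry_failed`, and `flaky_runner` are hypothetical names, the attempt limit is an arbitrary choice, and units are assumed to be unique and hashable (as a frozen dataclass is).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:  # hypothetical stand-in for ArmUnit
    arm_id: str
    seed: str


@dataclass
class UnitResult:  # hypothetical stand-in for ArmUnitResult
    unit: Unit
    status: str


def retry_failed(units, runner, max_attempts=3):
    """Run every unit once, then re-run only the failed ones."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    latest = {}
    pending = list(units)
    for _ in range(max_attempts):
        if not pending:
            break
        for unit in pending:
            latest[unit] = runner(unit)
        # Equivalent of failed_units(): keep only what still needs a re-run.
        pending = [u for u in pending if latest[u].status == "failed"]
    # Report the latest result per unit, in the original input order.
    return [latest[u] for u in units]


attempts = {}


def flaky_runner(unit):
    attempts[unit] = attempts.get(unit, 0) + 1
    # Hypothetical failure mode: h-ablation fails on its first attempt only.
    failed = unit.arm_id == "h-ablation" and attempts[unit] == 1
    return UnitResult(unit, "failed" if failed else "complete")


units = [Unit("h-main", "s1"), Unit("h-ablation", "s1")]
results = retry_failed(units, flaky_runner)
print([r.status for r in results], list(attempts.values()))
```

Only the unit that failed is run a second time, and the final list keeps the input order, so it can be passed to a merge step unchanged.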