Commit 7fbe06e

KhrulkovV and claude authored
refactor(engine): true JIT-refresh steady-state engine (#227)
* docs(specs): steady-state engine audit + true-JIT-refresh redesign

Captures the current overlapping concepts in gigaevo/evolution/engine/ (epoch vs generation, two flags gating one loop, three drain paths, two ingestion paths, multi-pass refresh) and proposes a redesign where the only post-seed DONE->QUEUED flip happens for the parents picked for a single mutation. Counter consolidates to total_mutants; epoch concept goes away entirely; file split brings each module to ~250 LOC with a single responsibility.

Draft for user review on refactor/steady-state-true-jit-refresh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(specs): refine steady-state redesign - async stream, multi-parent, iteration axis

- Recast §3.2 as a continuous async stream: dispatcher + per-mutant tasks + ingestor (spawn-and-forget), not a sequential loop.
- Generalise refresh path for num_parents > 1 (RandomParentSelector and AllCombinationsParentSelector both take num_parents); per-parent lock to prevent double-flip on overlapping selections.
- Pin Program.iteration semantics as total_mutants_at_production (denser plot axis) and flag *_in_iteration cohort aggregates in collector.py as a plan-level migration item.
- Rename module split to dispatcher.py / mutant_task.py / ingestor.py.
- Add risks for multi-parent backpressure starvation and cohort aggregate collapse.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(plans): true-JIT-refresh steady-state engine - 21-task implementation plan

TDD-sequenced refactor of gigaevo/evolution/engine/steady_state.py per docs/superpowers/specs/2026-05-12-steady-state-engine-audit-and-redesign.md:

- Delete epoch concept, gate, drain barrier
- Single total_mutants counter (rename total_generations)
- Refresh only selected parents JIT, not whole archive
- Continuous async stream: dispatcher + mutant_task + ingestor
- Module split: engine.py / dispatcher.py / mutant_task.py / ingestor.py / refresh.py
- Drop refresh_passes / refresh_order / refresh_pass / epoch_trigger_count
- Keep MaxGenerationsStopper as deprecated alias of MaxMutantsStopper
- Migrate config/evolution/default.yaml to SteadyStateEvolutionEngine

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(plans): paranoia tasks 19A-19F + hard-rename stopper (Option A)

Two clarifications to the steady-state JIT refactor plan:

1. Add Tasks 19A-19F before the smoke + PR tasks:
   - 19A: concurrency stress + load/async simulation suite
   - 19B: cancellation invariants + resume-after-kill
   - 19C: real-Redis integration smoke
   - 19D: ParentRefresher failure-mode resilience
   - 19E: chaos-hacker adversarial review pass
   - 19F: counter monotonicity invariant
2. Stopper rename is hard, not aliased. The old MaxGenerationsStopper counted *epochs* (~8 mutants each); MaxMutantsStopper counts mutants. An alias would silently shrink runs ~8x. Delete the old class, delete the old config files, rename the global default from max_generations: 100 to max_mutants: 800 (preserves prior effective run length). Old configs fail loudly at Hydra compose time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(engine): single-counter total_mutants; drop refresh_pass; hard-rename stopper

Foundational refactor for JIT-refresh steady-state engine (plan task 2+3+4+7+8):

Engine
- EngineMetrics.total_generations -> total_mutants (single-counter progress)
- EngineSnapshot.total_generations -> total_mutants
- EngineSnapshot.refresh_pass field DELETED (multi-pass refresh removed)
- SteadyStateEngineConfig.refresh_passes field REMOVED
- steady_state._refresh_archive_programs: inlined one-pass body; multi-pass loop + per-pass snapshot bumps gone

Stopper (hard rename, no back-compat alias per Option A)
- MaxGenerationsStopper(max_generations=N) -> MaxMutantsStopper(max_mutants=N)
- config/stopper/max_generations*.yaml -> max_mutants*.yaml
- config/constants/evolution.yaml: max_generations: 100 -> max_mutants: 800 (preserves prior run length: 100 epochs x 8 mutants/epoch under steady-state)
- config/config.yaml stopper default: max_generations -> max_mutants

Manifest boundary preserved
- launch_generator.py: emits max_mutants={contract.max_generations} Hydra override
- Contract.max_generations stays (experiment-level concept)
- CMA-ES max_generations (optimizer hyperparam) unchanged
- watchdog/monitoring max_generations (experiment progress display) unchanged

Adversarial
- SharedBenchmarkFilteredLineageStage.compute_hash override DELETED (refresh_pass-aware cache invariant obsolete under JIT-refresh)

Tests
- Deleted: test_snapshot_refresh_pass.py, test_lineage_cache_invalidation.py, test_two_pass_mutation_context.py
- Vestigial "removed feature" assertion classes deleted per user directive

358 targeted tests pass. ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(progress): migrate MainRunSyncHook + monitoring to programs_processed

Task 5: MainRunSyncHook polls snap.programs_processed (was total_mutants).
_last_main_gen -> _last_main_progress; _get_min_gen -> _get_min_progress. Module docstring + log strings updated.

Task 6: redis_queries.get_generation -> get_programs_processed reading snap.programs_processed. collect_snapshot.gen now sourced from programs_processed; RunSnapshot.generation field name preserved for display compatibility.

programs_processed is the canonical cross-run progress signal under JIT-refresh: it counts mutants actually ingested into the archive (post-validation), not total mutants emitted. Prompt-coevo sync needs the former to ensure the main run has produced something usable before the prompt run advances.

Tests pass: tests/prompts/test_coevolution_sync.py (14), tests/monitoring/test_redis_queries.py (17).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(engine): ParentRefresher + ParentRefreshSelector ABC for JIT refresh

Adds the JIT DONE->QUEUED->DONE refresh helper that producer tasks call before mutating selected parents. Replaces the multi-pass _refresh_archive sweep removed in the prior commit.

Architecture (user directive 2026-05-12):

- ParentRefreshSelector: ABC choosing which programs to refresh given the producer's parent pick. DirectParentsSelector is the canonical default (refresh only the parents themselves). Future implementations may walk lineage to depth-k and order refresh in depth-batched waves so deepest ancestors finish before nearest parents flip.
- ParentRefresher: per-parent-id asyncio.Lock serialises overlapping concurrent refreshers. Batch transition flips all DONE targets to QUEUED atomically (no producer sees a half-flipped bundle), then polls mget() until every target is DONE. DISCARDED-on-input or DISCARDED-during-wait raises ValueError; vanished parents raise ValueError; absence-of-progress raises TimeoutError. Caller aborts the mutant and releases its in-flight slot rather than falling back to stale state.
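
The ParentRefresher contract described above can be sketched in isolation. This is a minimal illustration, not the gigaevo implementation: `SketchRefresher`, `fake_dag`, and the in-memory `states` dict are hypothetical stand-ins for the real class, the DagRunner, and Redis storage. Deduplicating ids and acquiring the per-id locks in sorted order come from later commits in this same message.

```python
import asyncio
import contextlib
from collections import defaultdict


class SketchRefresher:
    """Hypothetical stand-in for ParentRefresher (assumption, not the real API)."""

    def __init__(self, states: dict, timeout: float = 1.0, poll: float = 0.005):
        self.states = states                      # program id -> state string
        self.locks = defaultdict(asyncio.Lock)    # one lock per parent id
        self.timeout = timeout
        self.poll = poll
        self.flips = 0                            # DONE -> QUEUED transitions

    async def _wait_done(self, ids):
        # Poll until every target is DONE; fail loudly on DISCARDED or missing.
        while True:
            current = [self.states.get(pid) for pid in ids]
            if any(s is None or s == "DISCARDED" for s in current):
                raise ValueError("parent discarded or vanished during refresh")
            if all(s == "DONE" for s in current):
                return
            await asyncio.sleep(self.poll)

    async def refresh(self, parent_ids):
        ids = sorted(set(parent_ids))             # dedup, deterministic lock order
        held = []
        try:
            for pid in ids:
                lock = self.locks[pid]
                await lock.acquire()
                held.append(lock)
            if any(self.states.get(pid) in (None, "DISCARDED") for pid in ids):
                raise ValueError("parent discarded or vanished on input")
            for pid in ids:                       # batch flip, no await in between
                if self.states[pid] == "DONE":
                    self.states[pid] = "QUEUED"
                    self.flips += 1
            try:
                await asyncio.wait_for(self._wait_done(ids), self.timeout)
            except asyncio.TimeoutError:
                raise TimeoutError("no progress while waiting for parents") from None
        finally:
            for lock in held:
                lock.release()


async def fake_dag(states: dict):
    # Hypothetical DAG runner: promotes QUEUED programs back to DONE.
    while True:
        for pid, state in list(states.items()):
            if state == "QUEUED":
                states[pid] = "DONE"
        await asyncio.sleep(0.005)


async def demo():
    states = {"a": "DONE", "b": "DONE"}
    refresher = SketchRefresher(states)
    dag = asyncio.create_task(fake_dag(states))
    try:
        # Overlapping bundles, reversed order, one duplicate id.
        await asyncio.gather(
            refresher.refresh(["a", "b"]),
            refresher.refresh(["b", "a", "a"]),
        )
    finally:
        dag.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await dag
    return states, refresher.flips
```

Running `asyncio.run(demo())` serialises the two overlapping refreshes on the shared locks, so each parent is flipped once per refresh and both end in DONE.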
Tests: 11/11 pass (single/empty/batch/overlap/discarded/timeout/selector ABC contract/custom-selector-adds-targets/empty-selector-noop). FakeDag test helper provides QUEUED -> RUNNING -> DONE auto-promotion to exercise the refresh without a real DagRunner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(engine): SteadyStateEvolutionEngine composes dispatcher + ingestor + ParentRefresher

Replaces the 935-LOC epoch-driven engine with a thin composition of three new modules:

- gigaevo/evolution/engine/mutant_task.py (run_one_mutant): one mutant per async task; explicit slot-ownership invariant (try/finally guards the semaphore against partial-failure and cancellation)
- gigaevo/evolution/engine/dispatcher.py (dispatcher_loop): continuous spawn-and-forget producer; backpressure via _in_flight_sema only
- gigaevo/evolution/engine/ingestor.py (ingestor_loop + poll_and_ingest): long-lived ingestion loop with adaptive interval, batch DONE handling, leaked-id sweep, slot-release on ingest

Deletes (from steady_state.py): _mutation_loop, _produce_one_mutant, _get_cached_elites, _create_single_mutant, _ingestion_loop, _poll_and_ingest, _ingest_batch, _should_trigger_epoch, _epoch_refresh, _drain_in_flight, _drain_scoped, _refresh_archive_programs, _mutation_gate, _cached_elites, _elite_cache_lock, _processed_since_epoch, _epoch_mutants, _epoch_eligible_since (~800 LOC).

Config: drop refresh_passes + refresh_order from EngineConfig; hoist max_in_flight to the parent; SteadyStateEngineConfig now a Hydra alias. steady_state.yaml drops refresh_order + refresh_passes.

Tests: rewrite test_steady_state.py (736 → ~165 LOC) to cover construction (incl. _parent_refresher wiring), backpressure semaphore, generation cap stopping dispatcher_loop, restore from snapshot.
Skip modules pinned to deleted machinery: test_steady_state_determinism.py (epoch determinism, to be rewritten against new tick site), test_generation_boundary_emit.py (step() removal pending in Task 14). See spec §3, plan §13.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(engine): delete generational EvolutionEngine.step() / run() loop

evolution=default now wires SteadyStateEvolutionEngine. EvolutionEngine becomes an abstract base of shared helpers (snapshot, metrics, idle wait, hooks, stop context). BusedEvolutionEngine migrated to subclass SteadyStateEvolutionEngine with a periodic bus-drain background task.

Also persists total_mutants in the engine snapshot after each mutant production so resume picks up the correct generation counter; previously this happened inside step(), which is now gone.

See spec docs/superpowers/specs/2026-05-12-steady-state-engine-audit-and-redesign.md §3.6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(collector): set *_in_iteration aggregates to None under JIT engine

Each mutant has a unique iteration (= total_mutants_at_production), so cohort aggregates collapse to single-program windows. Schema field retained for plot/exporter compatibility; consumers needing windowed aggregates should compute them at plot time.

See spec §3.5 + §6.5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(engine): JIT-refresh polish - empty-archive backoff, metric wiring, vestigial GenerationBoundary

Wraps up Tasks 16-18 of the JIT-refresh refactor (plan: docs/superpowers/plans/2026-05-12-steady-state-true-jit-refresh.md).

gigaevo/evolution/engine/mutant_task.py
- Add asyncio.sleep(loop_interval) backoff when select_elites returns empty (population seeding / all rejected). Prevents dispatcher hot-spinning when the archive is empty.
- Wire submitted_for_refresh metric: record_reprocess_metrics(len(refreshed)) after ParentRefresher.refresh() succeeds.
  Previously the metric was orphaned (defined but never incremented under JIT-refresh).

gigaevo/monitoring/events.py
- Mark GenerationBoundary vestigial with explanatory docstring. The class schema is kept so legacy run logs still parse, but nothing in gigaevo/ emits this event under steady-state JIT-refresh.

config/constants/evolution.yaml, config/evolution/{default,steady_state}.yaml
gigaevo/evolution/engine/config.py
gigaevo/experiment/launch_generator.py
- Drop max_mutations_per_generation: under JIT-refresh there is no per-generation mutation cap; max_in_flight controls parallelism.

Tests adjusted for JIT-refresh floor-trigger semantics:
- Strict total_mutants == N replaced with >= N at ~12 sites across tests/integration/{test_mini_run,test_multigen_e2e,test_memory_e2e,test_acceptor_engine,test_advanced_scenarios,test_brittleness,test_complex_scenarios,test_engine_regression,test_ingest_regression,test_evolution_engine_edge_cases}.py and tests/concurrency/test_deadlock_prevention.py. JIT cap is a floor trigger: concurrent in-flight mutants may bring total_mutants slightly above max.
- Skip class-level on TestEmptyArchiveEngine, TestAllMutationsReturnNone, TestAllMutationsRaise, TestTransientMutationFailure (empty/zero-success scenarios cannot reach the cap under JIT-refresh).
- Skip class-level on TestEngineStepIntegration: the generational engine.step() entry point was deleted; deadlock-prevention under JIT-refresh is covered by the paranoia suite (Task 19A).
- Skip two engine.run() wiring tests in TestEnginePostRunHookWiring that hung on AsyncMock empty archive; the wiring is still covered by test_none_hook_defaults_to_null + test_custom_hook_is_stored.
- New tests/config/test_stopper_configs.py pins the MaxMutantsStopper Hydra targets and rejects MaxGenerationsStopper imports.
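
The floor-trigger behaviour behind the `>= N` assertions above can be illustrated with a toy dispatcher. This is a simplified sketch under stated assumptions: `run_toy_engine` is a hypothetical name, the slot is released when the mutant is produced (the real engine releases it on ingest), and a short sleep stands in for parent refresh plus mutation.

```python
import asyncio


async def run_toy_engine(cap: int, max_in_flight: int) -> int:
    """Toy model of a cap check combined with semaphore backpressure."""
    sema = asyncio.Semaphore(max_in_flight)
    total = 0
    tasks = set()

    async def one_mutant() -> None:
        nonlocal total
        try:
            await asyncio.sleep(0.001)   # stand-in for refresh + mutation
            total += 1                   # counter bumps after production
        finally:
            sema.release()               # slot is always returned

    # The cap is only checked before acquiring a slot, so mutants that are
    # already in flight when the cap is reached still finish and are counted.
    while total < cap:
        await sema.acquire()
        task = asyncio.create_task(one_mutant())
        tasks.add(task)
        task.add_done_callback(tasks.discard)

    await asyncio.gather(*tasks)
    return total
```

In this model the final count lands in the window `cap <= total <= cap + max_in_flight`, which is why an exact-equality assertion is too strict for a concurrent producer.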
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(specs): record JIT engine dry-run smoke results

§9 added with the Hydra config resolution table showing the new schema is canonical: SteadyStateEvolutionEngine + MaxMutantsStopper + max_in_flight, with no max_mutations_per_generation / refresh_pass / total_generations references. Closed experiment configs intentionally left unchanged. Live-cluster run deferred to post-merge follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(engine): concurrency stress + simulation suite (load × async patterns)

36-combo parametrised matrix exercising the JIT-refresh engine end-to-end against fakeredis storage and a timed fake DAG. Verifies six core invariants: no semaphore leak, _in_flight drains, total_mutants reaches the cap with bounded overshoot (≤ max_in_flight), programs_processed equals accepted+rejected, ParentRefresher flip count is bounded, and snapshot counters are monotonically non-decreasing.

Sweeps (max_in_flight, n_mutants, duration_dist, overlap_rate) across mif ∈ {1,4,16}, n ∈ {50,200}, dist ∈ {const,expo,heavy_tail}, ov ∈ {0,0.5}. The high-overlap arm seeds the archive with a single elite so concurrent producers contend on one parent and exercise the per-id ParentRefresher lock; the low-overlap arm seeds 2×mif elites so producers pick distinct parents.

Closes Task 19A from the steady-state JIT-refresh refactor plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(engine): cancellation + resume-after-kill invariants

Two new test files paired with one engine fix:

  * test_engine_cancellation.py: cancels run() mid-flight and after early start; verifies slot accounting (sema._value + |_in_flight| == max_in_flight), counters never regress, snapshot remains consistent.
  * test_engine_resume_after_kill.py: runs engine A to cap=5, tears it down, rebuilds engine B against the same fakeredis server, calls restore_state(), runs to cap=10.
    Verifies progress is strictly forward across the resume and the cap window includes bounded overshoot.

Engine fix: SteadyStateEvolutionEngine.run()'s finally clause now explicitly cancels the dispatcher and ingestor tasks. asyncio.wait() does not propagate cancellation into its waited tasks, so without this they leaked across an external run-task cancel, holding semaphore slots forever (the cancellation test caught this directly).

Closes Task 19B from the steady-state JIT-refresh refactor plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(engine): ParentRefresher failure-mode resilience

Adds four new failure-mode tests to test_refresh_parents.py:

  * No-timeout-default: with timeout_seconds=None, a brief DAG pause is absorbed and the refresh still completes successfully.
  * Mid-flight DISCARD: a parent flipped DISCARDED by another path during the await raises ValueError rather than returning stale state.
  * Mid-flight vanish: a parent removed from storage during the await raises ValueError.
  * Reversed input order: two concurrent refreshes on the same parent set with reversed input orderings both complete. The per-id locks are acquired in deterministic sorted order, so classic lock-order-inversion deadlocks are impossible.

Closes Task 19D from the steady-state JIT-refresh refactor plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): final ingestion sweep runs under cancellation

Chaos-hacker review identified two compounding High-severity bugs:

#1 cancellation between _in_flight.add and _write_snapshot permanently leaks slots (slot_transferred=True blocks per-task release).
#2 final-sweep loop in run() body is unreachable when CancelledError propagates from asyncio.wait().

Fix: move the final ingestion sweep into run()'s finally block with asyncio.shield to survive outer cancellation, bounded by max_in_flight + 1 passes to avoid hangs on QUEUED stragglers.
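
The `asyncio.wait()` behaviour behind the engine fix can be reproduced in isolation. The sketch below is illustrative only: `worker`, `run`, and `demo` are hypothetical names, and the two long-sleeping workers stand in for the dispatcher and ingestor loops.

```python
import asyncio


async def worker(name: str, log: list) -> None:
    try:
        await asyncio.sleep(3600)          # stand-in for a long-lived loop
    finally:
        log.append(f"{name} stopped")


async def run(log: list, tasks: list, explicit_cancel: bool) -> None:
    tasks.extend(
        asyncio.create_task(worker(name, log))
        for name in ("dispatcher", "ingestor")
    )
    try:
        await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    finally:
        if explicit_cancel:
            # asyncio.wait() leaves its tasks running when the awaiter is
            # cancelled, so teardown has to cancel them itself.
            for task in tasks:
                task.cancel()
            await asyncio.gather(*tasks, return_exceptions=True)


async def demo(explicit_cancel: bool):
    log, tasks = [], []
    runner = asyncio.create_task(run(log, tasks, explicit_cancel))
    await asyncio.sleep(0.01)
    runner.cancel()                        # external cancel of the run task
    await asyncio.gather(runner, return_exceptions=True)
    still_alive = sum(not task.done() for task in tasks)
    stopped = sorted(log)
    for task in tasks:                     # tidy up the leaky arm
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return still_alive, stopped
```

With `explicit_cancel=False` both workers outlive the cancelled `run()`; with `explicit_cancel=True` they are stopped before `run()` finishes unwinding.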
Also cancel dispatcher/ingestor tasks explicitly in finally: asyncio.wait() does not cancel its waited tasks when the outer coroutine is cancelled, so they could otherwise survive engine teardown and continue spawning mutants.

Regression test test_cancel_drains_done_programs_via_final_sweep asserts that DONE programs in _in_flight at cancel time are ingested by the sweep, with programs_processed advancing accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): serialise _write_snapshot to keep Redis in sync with memory

Chaos-hacker review finding #3 (Medium): concurrent mutant tasks call _write_snapshot from run_one_mutant after incrementing total_mutants. Without synchronisation, two writers can compute monotone versions v=N+1 and v=N+2 synchronously, then both await save_run_state. If the v=N+2 save lands first and v=N+1 lands second, Redis ends at v=N+1 with stale fields while the in-memory mirror sits at v=N+2. A crash resume then rehydrates the older v=N+1 and loses the latest updates.

Fix: wrap the model_copy + set_current_snapshot + storage.save_run_state in an asyncio.Lock so the per-call version bump and Redis write land atomically. Last-writer-wins still holds; only the ordering is guaranteed.

Regression test concurrent_write_snapshot_keeps_redis_and_memory_in_sync issues 50 concurrent _write_snapshot calls and asserts the Redis-persisted version equals the in-memory mirror's version at the end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(refresh): bound _locks dict via WeakValueDictionary

Chaos-hacker review finding #4 (Medium): ParentRefresher._locks was a plain dict that retained an asyncio.Lock per distinct parent id forever. On a multi-day run touching tens of thousands of mutants this leaks ~100 bytes/lock plus event-loop bookkeeping per entry, small in absolute terms but proportional to evolution history.
Fix: switch to weakref.WeakValueDictionary so locks are retained only while at least one in-flight refresh holds a strong reference. The lock contract is unchanged: concurrent refreshes for the same parent id still share the same lock, because the active caller's strong ref keeps the entry alive across reentrant lookups.

Regression test test_refresh_locks_dict_does_not_grow_unboundedly sequentially refreshes 20 distinct parents and asserts the dict shrinks back to (near-)empty after gc.collect().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(integration): real-Redis smoke for JIT-refresh engine (Task 19C)

Adds tests/integration/test_engine_real_redis.py: end-to-end smoke against an actual Redis at localhost:6379/0 (or REAL_REDIS_URL). Auto-skips when no Redis is reachable, so committing it is safe on machines without a local server.

What it verifies:
- The full dispatcher/ingestor/refresher/mutant-task pipeline survives real network round-trips (not just fakeredis fast-paths).
- Bounded overshoot holds with cap=6, max_in_flight=2.
- No semaphore slot leak at run end.
- Snapshot is persisted to Redis at the same version the in-memory mirror reports, i.e. the snapshot-lock fix actually serialises real Redis writes, not just fakeredis ones.

Uses a unique key prefix per run and SCAN+DELETE cleanup in fixture finally, so the test never clobbers another caller's data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): wall-clock bounded final sweep, patient on stragglers

The previous max_in_flight+1-pass bound terminated the final ingestion sweep before the DAG could flip QUEUED→RUNNING→DONE for the last few in-flight mutants on normal completion, leaking their semaphore slots (stress suite caught a 1-slot leak on high-mif runs).
Switch the sweep to a wall-clock deadline (5s) with loop_interval sleep between empty passes, while preserving the asyncio.shield + early-break on CancelledError that made the cancellation-safety fix work. The sleep itself is wrapped to bail on cancellation immediately.

All 36 stress combos + 82 paranoia tests now green; 555-test evolution sweep clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): rename final-sweep loop var to satisfy mypy

The cleanup loop reused `t` from the earlier ``for t in pending`` block (typed `Task[Any]`), but the cleanup iterates a tuple of `Task[Any] | None`. Renaming the variable removes the assignment-type conflict without changing behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(engine): apply PR #227 review fixes - naming + deprecated test cleanup

Address two review recommendations on the JIT-refresh refactor:

1. Naming consistency: make "generation" → "mutant" rename complete:
   - core.py: rename _reached_generation_cap → _reached_mutant_cap
   - core.py: 8 log prefixes "[EvolutionEngine] gen={}" → "mutants={}"
   - dispatcher.py: 2 call sites updated to _reached_mutant_cap
   - test_steady_state.py: section-comment reference updated
2. Remove deprecated tests left as @pytest.mark.skip after the refactor. These covered the old epoch/step()/run-loop machinery that no longer exists in the JIT-refresh engine. Removed in bulk via AST script matching skip reasons like "JIT-refresh", "step() removed", "Generational ...", "GenerationBoundary emission", "_refresh_archive_programs", "_create_mutants".
Whole files deleted (only contained deprecated tests):
- tests/evolution/test_steady_state_determinism.py
- tests/evolution/test_generation_boundary_emit.py

Surgical class/function removals (kept the rest of each file):
- tests/evolution/test_evolution_engine.py
- tests/evolution/test_evolution_engine_complex.py
- tests/evolution/test_resume.py
- tests/evolution/bus/test_engine.py
- tests/integration/test_acceptor_engine.py
- tests/integration/test_advanced_scenarios.py
- tests/integration/test_complex_scenarios.py
- tests/integration/test_evolution_engine_edge_cases.py

Net: 13 files changed, 16 insertions(+), 2207 deletions(-).

Verified: ruff check + format clean; targeted pytest sweep (tests/evolution/ + 4 integration files) green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(deps): unpin gigaevo-memory from private git URL - it's now public

The gigaevo-memory repo went public, so we can drop the `@ git+https://...@<commit-sha>#subdirectory=client/python` form and rely on the plain `gigaevo-memory` spec. This also unblocks CI's pip install step, which was failing on the private-repo username prompt:

    fatal: could not read Username for 'https://github.com': No such device or address

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(heilbron_adversarial): replace absolute-path symlinks with relative

The 9 symlinks under problems/heilbron_adversarial/{pop_a_gan,pop_a_soft,pop_b_soft}/{fallback,helper.py,initial_programs} were committed on 2026-05-02 with absolute targets baked in:

    /mnt/virtual_ai0001071-04017_SR004-nfs1/CFS-SR008/workspace/mathemage/gigaevo-core-internal/problems/heilbron_adversarial/pop_a/...

That path only exists on this NFS dev mount, so every CI runner saw dangling links and ruff bailed out with:

    E902 Failed to create cache key
    Cause: No such file or directory (os error 2)
    --> problems/heilbron_adversarial/pop_a_gan/helper.py

Replaced all 9 with relative siblings (e.g. ../pop_a/helper.py). The `_soft` and `_gan` problem variants reuse pop_a's / pop_b's helper.py + fallback/ + initial_programs/, same intent as before, now portable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): rewire post_step_hook + adjacent observability polish

PR #227 deleted EvolutionEngine.step(), which historically fired _post_step_hook once per generation. The kwarg + assignment in EvolutionEngine.__init__ became dead code: CompositionInjectionHook, the only production consumer (wired by 3 adversarial experiment launches), silently no-opped on every Arm A run.

Changes
-------
1. Re-wire _post_step_hook in poll_and_ingest: fires once per ingest sweep that adds >=1 program to the archive (the JIT analogue of the old per-generation boundary). Fault-isolated: a buggy hook can't abort ingestion, which has already committed to Redis.
2. H3 fix: ParentRefresher.timeout_seconds default None -> 600s. None default could strand a mutant forever on DAG-runner crash, leaking its in-flight semaphore slot.
3. Final-sweep observability: extract _final_ingestion_sweep() and emit WARNING with stuck-IDs when the 5s wall-clock deadline elapses before _in_flight drains. Operators previously had no signal that a run shut down with leaked slots.
4. Drop stale "JIT-refresh" / "epoch" docstring framing from config.py, core.py, mutant_task.py, steady_state.py.

Tests
-----
- 13 new SOTA tests in tests/evolution/test_post_step_hook_rewire.py cover hook firing semantics (added==0 / added>0 / mixed / failure / unset), finite-timeout default + override, and WARNING emission via loguru sink capture.
- Existing test_refresh_no_timeout_default_waits_through_brief_pause renamed + assertion updated for the new finite default.
Verification
------------
Full audit of evolution engine consumers ran clean:
- tests/evolution/ (1000+ tests, all pass)
- tests/integration/test_acceptor_engine,advanced_scenarios,complex_scenarios,evolution_engine_edge_cases (42 tests)
- tests/adversarial_pipeline/ (composition_injection, progress_sync, steady_state_adversarial_e2e)
- tests/memory/ (ideas_tracker_pipeline, engine_integration, dag_memory_flow, memory_e2e_pipeline)
- tests/concurrency/test_deadlock_prevention
- tests/integration/test_brittleness, mini_run, multigen_e2e, engine_regression, ingest_regression
- tests/prompts/test_coevolution_sync

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): two deadlock-class chaos-hacker findings + regressions

Closes the top two CRITICAL findings from the adversarial review on commit 130fdbb2 (chaos-hacker agent a79a294de8502a7d8).

1) ParentRefresher: dedup parents by id before sorting + acquiring.

   asyncio.Lock is NOT reentrant. If a parent bundle ever contains the same program id twice (any ParentSelector returning duplicates, or a future ParentRefreshSelector that walks lineage hitting the same id via two paths), _acquire_all would call acquire() twice on the same Lock from the same task and the mutant task hangs forever, holding its in-flight slot. Eventually the engine starves.

   Fix: fold duplicates by id inside refresh() before sort + lock acquisition. First-seen wins.

   Test: test_refresh_does_not_deadlock_on_duplicate_parent_ids and test_refresh_selector_emitting_duplicates_does_not_deadlock both would hang without this fix; with it, they complete and the parent flips exactly once.

2) _final_ingestion_sweep: track inner task explicitly so cancellation does not leak a detached poll_and_ingest.

   asyncio.shield(coro) only protects the inner coroutine from being cancelled; it does NOT prevent CancelledError from propagating to the awaiter.
   The previous code did `await asyncio.shield(poll_and_ingest(self))` and on cancellation broke out of the loop. The inner then continued as a detached Task, racing _post_run_hook.on_run_complete and engine teardown for access to storage, _in_flight, and the post_step_hook.

   Fix: wrap poll_and_ingest in an explicit asyncio.create_task; on outer cancellation, cancel the inner and wait_for(timeout=1.0) so no zombie coroutine outlives the method. New test test_cancellation_does_not_leak_inner_task asserts the inner's finally fires before we move on.

Chaos-hacker finding #1 (WeakValueDictionary GC race) was investigated and dismissed: any task awaiting `lk.acquire()` keeps `lk` strongly referenced on its suspended-coroutine frame, so the WeakValueDictionary entry cannot be reclaimed while a waiter exists. The race the report described requires a waiter without a strong ref, which is unreachable.

Verified all engine consumers green: evolution (1001 tests), integration (83), adversarial+concurrency+memory+prompts (1424).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(engine): drop dead code + fix cancel propagation in final sweep

Cycles 5-6 of the auto-optimize sprint:

Cycle 5: systems-architect proposal #5 (dead-code deletion):
- Delete EvolutionEngine.pause(), resume(), is_running(): zero callers anywhere in gigaevo/, tests/, tools/, experiments/. Verified with `git grep` (tests/evolution/test_strategy_base.py hits are on strategy.pause/resume, not engine.pause/resume).
- Delete _set_state(): one-line shim with zero internal callers.
- Delete _paused field: written but never read.
- Delete _run_start_mutants field + its dead write in steady_state.py:63: never consumed anywhere.

Cycle 6: chaos-hacker Findings 1 (HIGH) + 2 (MED) on 75203666:

Finding 1 (HIGH): _final_ingestion_sweep used `contextlib.suppress(BaseException)` around `wait_for(inner, 1.0)`.
That suppress catches asyncio.CancelledError, KeyboardInterrupt, and SystemExit, meaning a second cancellation (or SIGINT) during the inner-task cleanup was silently absorbed and the sweep returned "normally", letting `_post_run_hook.on_run_complete` run in a teardown context the supervisor never authorised.

  * Narrow to `suppress(Exception)` so only true exceptions (Redis transient, network blip) are tolerated during cleanup.
  * Track the cancel locally and re-raise CancelledError after the inner is settled and the (skipped) WARNING block, so the cancel reaches `run()`'s awaiter.
  * In `run()`'s finally, catch the re-raised CancelledError around the sweep call so the finalizer (`post_run_hook.on_run_complete`) still executes (cancellation is a shutdown signal, not a "skip cleanup" one), then re-raise.
  * Skip the "deadline elapsed" WARNING when sweep exits via cancel (the message is for diagnostics of leaked semaphore slots, not for shutdown-was-aborted).

Finding 2 (MED): docstring claimed `wait_for(timeout=1.0)` was a "tight" cap. In CPython 3.12 `wait_for` cancels the inner and then waits for it to honor the cancel, so wall-clock cost is bounded by inner cleanup latency, not the parameter. Updated docstring to say "best-effort timeout" and clarified that only `Exception` is suppressed (BaseException family: CancelledError, KeyboardInterrupt, SystemExit propagates intact).

New regression tests in tests/evolution/test_post_step_hook_rewire.py (TestFinalSweepCancellationSafety):

  * test_cancellation_propagates_to_awaiter pins Finding 1: cancel must reach the engine awaiter; sweep_task.cancelled() must be true.
  * test_normal_completion_returns_without_cancellederror pins the happy/timeout path so a future refactor of cancel plumbing doesn't accidentally raise on deadline-elapsed.
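
The difference between the two guards in Finding 1 can be shown with a small standalone sketch. It is not the engine code: `cleanup` and `cancel_reaches_awaiter` are hypothetical names, and the long sleep stands in for waiting on the inner task to settle.

```python
import asyncio
import contextlib


async def cleanup(guard) -> str:
    # Stand-in for the sweep's "wait for the inner task" step.
    with contextlib.suppress(guard):
        await asyncio.sleep(3600)
    return "returned normally"


async def cancel_reaches_awaiter(guard) -> bool:
    task = asyncio.create_task(cleanup(guard))
    await asyncio.sleep(0.01)      # let cleanup() reach its await
    task.cancel()                  # cancellation arrives during cleanup
    await asyncio.gather(task, return_exceptions=True)
    return task.cancelled()
```

Because `asyncio.CancelledError` derives from `BaseException`, `suppress(BaseException)` absorbs the cancel and the task finishes as if nothing happened, while `suppress(Exception)` lets it propagate so the task ends up cancelled.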
Verified clean:

  * tests/evolution/ + tests/integration/test_acceptor_engine.py + test_advanced_scenarios.py + test_complex_scenarios.py + test_evolution_engine_edge_cases.py → 1115 passed
  * tests/adversarial_pipeline/ + tests/concurrency/ + tests/memory/ + tests/prompts/ → 1581 passed, 5 skipped
  * ruff check + format clean on the full repo

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(engine): drop dead mutation_ids branch + dead fields, lock schema with extra=forbid

Cycle 7 of auto-optimize-loop on PR #227. Synthesizes systems-architect's stale-refs audit (12 ranked proposals, bundle 1-4 + 9) with the chaos-hacker LOW findings from cycle 6.

Production cleanups
- `_ingest_completed_programs(mutation_ids=...)` parameter dropped: the only production caller passed `mutation_ids=None`. The fast-discard branch had no live caller. Function now does one job: deserialize non-archive DONE programs, push through acceptor + strategy.
- `EngineConfig.generation_timeout` deleted. Documented "deprecated, no longer used" since 31b66de7 (2026-04-19); zero production reads.
- `EngineMetrics.errors_encountered` deleted. Zero production readers/writers; only test_engine_metrics.py mutated it. EngineSnapshot doesn't embed EngineMetrics, so no Redis-snapshot break.

Defense-in-depth
- `EngineConfig` now uses `extra="forbid"`. Future field deletions will crash callers passing the dead kwarg instead of silently dropping into Pydantic's default `extra="ignore"`. Verified safe for live Hydra configs (config/evolution/*.yaml only set declared fields).
- Swept 14 test sites still passing `generation_timeout=X`; chaos-hacker flagged these as silent semantic drift if `extra="forbid"` is added without the sweep.

Chaos-hacker LOW fixes (review of d5facada)
- `raise asyncio.CancelledError from None` on both sites in steady_state.py.
A Redis blip suppressed by the surrounding `contextlib.suppress(Exception)` no longer dangles in `__context__` and misleads the operator. - Tightened `test_cancellation_propagates_to_awaiter` assertion: drops the `cancelled() or (done() and exception() is CancelledError)` OR-branch. Probed: on Py3.12, `raise asyncio.CancelledError` inside a coroutine ALWAYS produces `task.cancelled() == True`, and calling `.exception()` on a cancelled task re-raises CancelledError (so the OR-branch was unreachable). Tightening is strictly safer; future regressions that break the `.cancelled()` contract now surface immediately. Test cleanup - Deleted `tests/evolution/test_ingest_mutation_ids.py` (299 LOC) — every test pinned the dead `mutation_ids` branch. - Removed stale "generation_timeout deprecated" zombie banner + module docstring entry in test_evolution_engine_complex.py. - Stripped `errors_encountered` assertions from test_engine_metrics.py. Verification - ruff: clean on touched dirs. - Tests green: * tests/evolution/ + selected integration (~700 tests, all dots) * tests/concurrency/test_deadlock_prevention.py (all dots, 3 skipped) * tests/integration/ + tests/benchmarks/ + tests/stages/ (all dots) * tests/concurrency/ + tests/memory/ + tests/adversarial_pipeline/ + tests/dag/ (all dots) - chaos-hacker adversarial review of this diff: 1 HIGH (the generation_timeout test-rot, fixed by the sweep above), 0 medium/low remaining. Verdict: ship. Adjacent finding (deferred) - pre-existing observability gap: a second cancel landing during `on_run_complete` skips the "[SteadyState] Stopped" log line. Net behavior (cancellation reaches the awaiter) is correct; only the log marker is missing. Out of scope for cycle 7. LOC: -394 +32 (net -362). Full bytes-on-disk delta dominated by the test_ingest_mutation_ids.py deletion (299 LOC). 
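The Task contract that the tightened assertion relies on can be probed with a short sketch. The names `self_cancel` and `probe` are illustrative only and are not taken from the test suite.

```python
import asyncio


async def self_cancel() -> None:
    """Coroutine that raises CancelledError on its own, without task.cancel()."""
    raise asyncio.CancelledError


async def probe() -> dict[str, bool]:
    task = asyncio.create_task(self_cancel())
    await asyncio.wait({task})  # wait for completion without re-raising
    try:
        task.exception()
        exception_call_raised = False
    except asyncio.CancelledError:
        # .exception() on a cancelled task raises instead of returning
        exception_call_raised = True
    return {
        "cancelled": task.cancelled(),
        "exception_call_raised": exception_call_raised,
    }


if __name__ == "__main__":
    print(asyncio.run(probe()))
```

If this holds on the interpreter in use, an assertion of the form `done() and exception() is CancelledError` can never be the branch that passes, which is the reason given above for dropping it.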
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(engine): drop dead error counters + step() vestige, inline helpers Cycle 8 quality pass on PR #227 — systems-architect proposals #1, #3, #6, #8 + partial #2. - Delete `elites_selection_errors` and `mutations_creation_errors` fields (always passed 0 in production — verified every call site) - Delete `record_elite_selection_metrics`, `record_mutation_metrics`, `record_reprocess_metrics` (single-line accumulators with one caller each after dropping the errors arg) - Inline `_pick_parents` helper (4-line single-caller wrapper) - Delete `SteadyStateEvolutionEngine.step()` NotImplementedError vestige and its test (no production caller; `run()` already raises in the abstract base) - Fix dated docstring `elites_selected` "across all generations" → "Total elites cumulatively selected for mutation" (JIT-refresh has no generations) - Update `tools/benchmarks/bench_multirun.py` call site for consistency Net: 32 insertions, 107 deletions (-75 LOC). All `tests/evolution/`, `tests/integration/`, `tests/concurrency/`, `tests/benchmarks/`, `tests/stages/`, `tests/memory/`, `tests/adversarial_pipeline/`, `tests/dag/` pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(engine): eliminate ghost-persist by inlining single-mutant primitive `generate_mutations(...)` wrapped `asyncio.gather(*tasks, return_exceptions=True)`. If the outer awaiter (typically `run_one_mutant`, spawned by the dispatcher and cancellable at engine teardown) was cancelled after a child's `storage.add(program)` succeeded but before `gather` returned, gather re-raised CancelledError to the caller — the child's `except BaseException` handler still returned `persisted_id`, but `results` was never bound. The program stayed in Redis with no `_in_flight` tracking → ghost. Refactor: extract `generate_one_mutation()` — a single-mutant primitive with no gather. `mutant_task.run_one_mutant` calls it directly. 
The function's `except BaseException` arm returns `persisted_id` to the caller without any gather to swallow it. The caller registers the id in `_in_flight` before the cancellation can re-propagate. `generate_mutations(...)` is retained as a sequential batch wrapper for the existing test suite (it loops over `generate_one_mutation` and breaks on CancelledError, returning accumulated ids). Production callers only ever passed `limit=1`, so there is no perf impact. Adds `tests/evolution/test_engine_ghost_persist.py` with 7 deterministic test cases covering: cancel-pre-persist (propagates cleanly), cancel- post-persist (id surfaced), cancel-mid-lineage (id surfaced), an integration test through `run_one_mutant` proving the id lands in `_in_flight`, a gather-cancel regression-guard demonstrating the historical failure mode, and backwards-compat checks for the batch wrapper. Files: 2 src changes (mutation.py refactor, mutant_task.py call-site), 1 new test file. 999/1000 evolution tests pass; 1 deselected test is a pre-existing failure unrelated to this change (patches a non-existent `steady_state.generate_mutations` symbol). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(engine): drop dead category banners in test_evolution_engine_complex The file had empty Category A/F/H/J banner comments left over after those categories' tests were removed. They created a false signal of "these areas are covered" without any actual test bodies. Drop them and the corresponding lines in the module docstring. No production code touched; all 11 tests in this file still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(engine): annotate inlined parents var to satisfy mypy The cycle-8 inlining of _pick_parents lost the helper's return type annotation. Now that the assignment uses `next(..., [])` as the default, mypy cannot infer the element type. Add an explicit `list[Program]` hint. No behavior change. 
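The ghost-persist failure mode described above can be reproduced in isolation. The sketch below is a toy model under stated assumptions: `generate_one`, `via_gather`, `via_direct` and the in-memory `store` / `in_flight` containers are hypothetical stand-ins for the engine's primitives, not its real code.

```python
import asyncio
from collections.abc import Callable, Coroutine
from typing import Any

Runner = Callable[[list[str], asyncio.Event, set[str]], Coroutine[Any, Any, None]]


async def generate_one(store: list[str], persisted: asyncio.Event) -> str:
    persisted_id = "mutant-1"
    store.append(persisted_id)  # stands in for storage.add(program)
    persisted.set()
    try:
        await asyncio.sleep(3600)  # post-persist work, e.g. lineage
    except BaseException:
        pass  # demo only: surface the id even when cancelled
    return persisted_id


async def via_gather(store: list[str], persisted: asyncio.Event, in_flight: set[str]) -> None:
    # A cancel of this awaiter makes gather raise CancelledError,
    # so `results` is never bound and the id is never tracked.
    results = await asyncio.gather(generate_one(store, persisted), return_exceptions=True)
    in_flight.update(r for r in results if isinstance(r, str))


async def via_direct(store: list[str], persisted: asyncio.Event, in_flight: set[str]) -> None:
    # No gather in between: the returned id reaches the caller.
    in_flight.add(await generate_one(store, persisted))


async def cancel_after_persist(runner: Runner) -> tuple[list[str], set[str]]:
    store: list[str] = []
    in_flight: set[str] = set()
    persisted = asyncio.Event()
    task = asyncio.create_task(runner(store, persisted, in_flight))
    await persisted.wait()
    task.cancel()
    await asyncio.wait({task})
    return store, in_flight
```

In the gather variant the program is persisted but never tracked (the ghost). In the direct variant the same cancel still leaves the id registered.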
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(engine): add SOTA invariant test suite for steady-state concurrency

Plugs 8 coverage gaps identified by test-obsessed-reviewer's audit. The new file `tests/evolution/test_engine_invariants.py` (440 LOC, 15 tests) guards the engine's 8 concurrency invariants (I1-I8):

Gap 1 (I1): cancel-between-acquire-and-slot-transfer releases slot
* test_cancel_before_elite_select_releases_slot
* test_cancel_during_parent_refresh_releases_slot

Gap 2 (I6): dispatcher cancel drains all active mutant tasks
* test_active_tasks_are_cancelled_on_dispatcher_cancel

Gap 3 (I7): ingestor uses fast interval (0.25*loop_interval) when saturated
* test_fast_interval_when_saturated
* test_slow_interval_when_idle (negative control)

Gap 4 (I6): post_run_hook fires even on cancellation
* test_hook_fires_when_run_cancelled

Gap 5 (I4): _in_flight_lock does not starve under contention
* test_many_waiters_all_progress (50 concurrent waiters, all land)

Gap 6 (I8): _await_idle treats DISCARDED as idle (not active)
* test_discarded_only_returns_idle
* test_await_idle_returns_promptly_with_only_discarded

Gap 7 (I5): snapshot version monotonic in Redis under concurrent writes
* test_concurrent_writes_versions_monotone (20 concurrent writes)
* test_in_memory_mirror_tracks_redis

Gap 8 (I1+I2): double-poll same id releases slot exactly once
* test_id_not_double_released
* test_leaked_id_swept_once

Bonus (I3 deterministic): slot_transferred flag is exclusive
* test_success_path_transfers_slot
* test_no_elite_releases_slot

All tests are deterministic: asyncio.Event for sync, no time.sleep polling, no flaky timing assumptions. The full suite runs in <1s.
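As an illustration of the slot invariants (I1, I3), the following sketch shows the "release unless transferred" pattern with `asyncio.Event` synchronisation instead of sleeps. `MiniEngine`, `run_one` and `demo` are hypothetical names for this example, not the engine's classes.

```python
import asyncio


class MiniEngine:
    """Toy model of one in-flight slot pool guarded by a transfer flag."""

    def __init__(self, max_in_flight: int) -> None:
        self.sema = asyncio.Semaphore(max_in_flight)
        self.in_flight: set[str] = set()
        self.started = asyncio.Event()

    async def run_one(self, mutant_id: str, work_s: float) -> None:
        await self.sema.acquire()
        slot_transferred = False
        try:
            self.started.set()
            await asyncio.sleep(work_s)    # stands in for select + refresh + mutate
            self.in_flight.add(mutant_id)  # slot ownership moves to the ingestor
            slot_transferred = True
        finally:
            if not slot_transferred:
                self.sema.release()        # failure or cancel: give the slot back


async def demo() -> tuple[bool, bool, set[str]]:
    engine = MiniEngine(max_in_flight=1)

    # Cancel between acquire and transfer: the slot must come back.
    task = asyncio.create_task(engine.run_one("a", 3600))
    await engine.started.wait()
    task.cancel()
    await asyncio.wait({task})
    free_after_cancel = not engine.sema.locked()

    # Success path: the slot stays held on behalf of the in-flight mutant.
    await engine.run_one("b", 0)
    held_after_success = engine.sema.locked()
    return free_after_cancel, held_after_success, engine.in_flight
```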
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): await metrics_collector cancel before storage.close

Without `await` after `_metrics_collector_task.cancel()`, the collector may still be mid `await storage.<call>` when `storage.close()` fires below, raising ConnectionClosedError into an orphan coroutine that has no caller. Bound the wait so a wedged collector cannot indefinitely block shutdown.

Add two regression tests:
- test_collector_finished_before_storage_close: asserts the collector's finally runs strictly before storage.close().
- test_wedged_collector_does_not_block_stop_forever: asserts stop() returns within the 2s wait_for budget even when the collector shields against cancel.

Cycle 11: chaos-hacker F4 finding from cycle 10 deadlock probe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(engine): drop redundant CancelledError arm + tidy Any import

Two micro-simplifications surfaced during cycle-12 quality review:

1. `dispatcher.py`: the explicit `except CancelledError: raise` arm was a no-op. `finally` runs regardless, and `CancelledError` propagates naturally without an explicit re-raise. Removing the dead arm keeps the loop's control flow obvious: try → finally.

2. `core.py`: TYPE_CHECKING-guarded `from typing import Any` was overhead for a singleton typing import (zero cost). Promoted to top-level.

Regression test added (`TestDispatcherFinallyCancelsSpawnedMutants`): monkey-patches `run_one_mutant` to a long-runner, cancels the dispatcher mid-flight, asserts the spawned mutant received `CancelledError` via the dispatcher's `finally` block. Pins the cancellation contract so a future refactor cannot accidentally swallow the cancel.

Total invariant tests now 18 (was 17 in cycle 11). ruff clean + full evolution+integration suite green.
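The try → finally shape of the dispatcher can be sketched as follows. This is a simplified model, assuming a dispatcher that tracks its spawned tasks in a set; `dispatcher`, `mutant` and `demo` are illustrative names rather than the real `dispatcher.py` code.

```python
import asyncio


async def dispatcher(spawn_count: int, cancelled_ids: list[int], ready: asyncio.Event) -> None:
    active: set[asyncio.Task[None]] = set()

    async def mutant(i: int) -> None:
        try:
            await asyncio.sleep(3600)  # stands in for a long-running mutant task
        except asyncio.CancelledError:
            cancelled_ids.append(i)    # record that the cancel was delivered
            raise

    try:
        for i in range(spawn_count):
            active.add(asyncio.create_task(mutant(i)))
        ready.set()
        await asyncio.sleep(3600)      # stands in for the dispatch loop
    finally:
        # No `except CancelledError: raise` arm is needed: finally runs on
        # cancel, and the CancelledError keeps propagating afterwards.
        for task in active:
            task.cancel()
        await asyncio.gather(*active, return_exceptions=True)


async def demo() -> tuple[bool, list[int]]:
    cancelled_ids: list[int] = []
    ready = asyncio.Event()
    task = asyncio.create_task(dispatcher(3, cancelled_ids, ready))
    await ready.wait()
    await asyncio.sleep(0)             # let the spawned mutants reach their await
    task.cancel()
    await asyncio.wait({task})
    return task.cancelled(), sorted(cancelled_ids)
```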
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(engine): close two orphan paths in _final_ingestion_sweep Cycle-13 chaos-hacker probe found a HIGH bug replaying the cycle-11 F4 shape through a different channel: the sweep's bounded-wait (`suppress(Exception)` + `wait_for(timeout=1.0)`) had TWO escape paths that left the `poll_and_ingest` inner task detached past the sweep's return: (1) Slow-cancel target — if inner takes >1s to honor cancel, wait_for raises TimeoutError (Exception subclass), suppressed silently; inner runs detached and races storage.close() in stop(). (2) Double-cancel — if a second cancel arrives during wait_for(inner), wait_for re-raises CancelledError (BaseException, NOT Exception); the suppress doesn't catch it, control exits the except arm with `cancelled=True; break` skipped; inner is detached. Both replay the cycle-11 metrics_collector orphan: ConnectionClosedError fires into a coroutine that has no caller to surface it. Fix (steady_state.py:174-205): explicit `suppress(CancelledError)` catches the double-cancel and routes through the cancelled-flag path; TimeoutError is logged as a WARNING so an operator can correlate the orphan risk with whatever stranded the inner task in Redis. Generic Exception still logs but does not let inner escape. Regression coverage (+2 tests, total now 20): - test_slow_cancel_inner_logs_timeout_but_no_orphan_on_normal_path monkey-patches poll_and_ingest to a slow-cancel target (re-shields the first cancel for 2s); asserts the WARN about "did not honor cancel" / "orphan" is logged. - test_double_cancel_routes_through_cancelled_flag — cancels the sweep twice in succession; asserts the inner task still received its CancelledError (no orphan). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(engine): persist-then-mirror snapshot write — no version skip on retry Cycle 14 of the PR #227 quality sprint. 
`EvolutionEngine._write_snapshot` previously incremented the in-memory mirror (`self._snapshot` + `set_current_snapshot`) BEFORE the Redis `save_run_state` call. On a transient Redis failure this left the mirror reflecting an unpersisted version: the next successful save then wrote version N+2, silently skipping N+1 in Redis. Resumers reading from Redis would see a gap that doesn't exist in any operator-visible log. Persist-then-mirror reorders the two operations so the in-memory mirror only advances after Redis confirms. If `save_run_state` raises, the mirror keeps the prior version, the next call retries the SAME version number, and Redis stays gap-free. Mirror is now always `≤` Redis — acceptable because Redis is the source of truth on resume. Tests (tests/evolution/test_engine_invariants.py::TestWriteSnapshotPersistThenMirror): - test_save_failure_leaves_mirror_at_old_version: asserts mirror stays at version 0 when save_run_state raises RuntimeError - test_successful_save_updates_mirror_and_redis_in_one_step: happy path - test_retry_after_failure_uses_same_version: asserts saved_versions == [1, 1] (mirror-then-save form would have produced [1, 2]) Regression: 1060/1060 tests pass across tests/evolution/ and the four integration suites (acceptor_engine, advanced_scenarios, complex_scenarios, evolution_engine_edge_cases). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(engine): bound post_step_hook to 300s — prevent ingestor wedge Long-lived post_step_hook (CompositionInjectionHook walks the full G archive) was previously awaited without a wall-clock bound: a hung hook (network call without timeout, infinite loop) would freeze the ingestor — no further sweeps fire, no new mutants land in the archive. Fix: wrap the hook call in `_run_bounded_post_step_hook`, which drives the hook via an explicit asyncio.Task and bounds it with `asyncio.wait(timeout=_POST_STEP_HOOK_TIMEOUT_S)`. 
On timeout we cancel + grace-wait + log; on outer cancel we cancel + await briefly + re-raise. Key load-bearing detail: ``asyncio.wait`` (NOT ``asyncio.wait_for``). ``wait_for`` cancels the inner task then awaits the cancel to be honored before raising TimeoutError, so a hook that catches CancelledError and keeps looping extends our wait indefinitely — defeating the bound. Plain ``wait`` returns at the deadline regardless of the inner task's state; we surface the orphan via the pending set and log "potential orphan coroutine; ingestor proceeding". Test suite adds TestPostStepHookTimeoutBound (5 tests): - fast_hook_completes_normally — happy path, default budget - hung_hook_cancelled_after_budget — sleeps 60s, monkeypatched to 0.1s budget, asserts WARN + hook_was_cancelled event set - uncooperative_hook_logs_orphan_warn — bounded-badness stubborn hook (swallows first cancel, honors second so test loop reaps it); asserts elapsed < 1.0s and both WARN lines fire - outer_cancel_propagates_to_hook — cancels poll_and_ingest mid- hook, asserts hook cancelled and sweep re-raises - default_timeout_is_generous — sanity: 60s ≤ T ≤ 3600s, 0.5s ≤ grace ≤ 30s Regression: 1060+ evolution+integration tests green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(engine): post_step_hook timeout knobs; iteration-window stats; deadlock stress * `EngineConfig` gains `post_step_hook_timeout_s` (default 300s) and `post_step_hook_cancel_grace_s` (default 2s) so the wall-clock bound on a single post-step hook invocation is tunable per run; ingestor no longer carries module-private magic constants. * `EvolutionaryStatisticsCollector` gains `iteration_window_size` (default 8). The iteration cohort aggregates now use a trailing window `[iter - N, iter]`, restoring the "stats over the last batch" signal that the old generational engine produced from per-generation cohorts. `N = 0` disables the feature and keeps the iteration fields None. 
* New deadlock-stress suite in `tests/evolution/test_refresh_parents.py` exercises 32-way same-parent storms, randomized-order overlapping batches, and cancel-mid-acquire on the per-id parent lock. * `tests/monitoring/test_experiment_monitor.py` helper now seeds both `total_mutants` and `programs_processed` — the latter is the field `RunSnapshot.generation` reads from, so the assertion-based tests pass against the current snapshot schema. * Scrub of historical refactor framing (cycle numbers, finding tags, in-flight rewire wording) from comments, docstrings and one filename; no behavioural change in those sites. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(llm): langfuse v4 handler init; pin langfuse>=4,<5 `LangchainCallbackHandler` no longer exposes `.client` in langfuse 4.x, so `handler.client.flush_at = 1` raises AttributeError at `MultiModelRouter.__init__` -> Hydra instantiation fails before the run even starts. Fix: configure the singleton `Langfuse` client with `flush_at=1, flush_interval=1` before constructing the handler — the handler picks it up via `get_client()` internally. Also tighten the pin (`langfuse>=2.0.0` was unconstrained upward and silently admitted v4) to `langfuse>=4.0.0,<5` so this API contract doesn't drift again without a deliberate bump. Pre-existing bug on main (introduced 2026-04-03, commit 51a14631); unrelated to the steady-state refactor branch but blocking E2E. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(run.py): drop stale cfg.max_generations reference The steady-state engine refactor deleted the generation/epoch concept; ``cfg.max_generations`` no longer exists, so the startup log on line 74 raised ``ConfigAttributeError`` and aborted every launch immediately after the engine printed its own start banner. Replaced with ``cfg.max_mutants`` — the top-level constant that backs ``MaxMutantsStopper``, which is the canonical termination signal now. 
The engine's own log already reports ``stopper=MaxMutantsStopper``; this just adds the bound. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(engine): extend per-parent-id lock through child-DAG via ParentRefreshTicket Producer→ingestor handoff for the per-parent-id lock acquired in ParentRefresher. The lock now spans refresh + mutate + child-DAG, not just refresh — closing the invariant "parents are not refreshed while a child of theirs is in flight." Why: a concurrent producer that selected the same parents could refresh them while another producer's child was mid-DAG (state=RUNNING, metrics={}). AncestrySelector picked up that unscored child as ancestry, triggering the "missing fitness key" warning the user reported on `run.py problem.name=heilbron llm=gemini3_flash`. Changes: * refresh.py — Add ParentRefreshTicket (idempotent release; holds per-parent-id locks in sorted order). New refresh_with_ticket() returns the ticket; back-compat refresh() now wraps it and auto-releases on return. Failure paths release any partially-acquired locks before re-raising. * mutant_task.py — Acquire ticket via refresh_with_ticket(); transfer atomically with _in_flight.add() under _in_flight_lock; finally-release ticket if not transferred (failure path). Two ownership-handoff invariants now documented in module docstring: slot + ticket. * steady_state.py — _inflight_tickets: dict[mutant_id, ticket] paired with _in_flight set. * ingestor.py — Pop tickets under _in_flight_lock atomically with slot release; release() outside the lock to keep the critical section short. Tests: * test_refresh_parents.py — Add TestRefreshWithTicket (6 tests): ticket holds lock until release, idempotent release, empty parents, back-compat refresh() auto-release, failure-path lock release. 
* test_engine_invariants.py — Add TestNoRefreshWhileChildInFlight (4 tests): second producer blocks until child ingested, failure-before-register releases ticket, accept/reject paths both release ticket, leaked child releases ticket. * test_engine_ghost_persist.py — Update _FakeEngine to implement the ticket API. * test_engine_invariants.py — Update two mocks to use refresh_with_ticket instead of refresh. Verified: all engine + refresh + invariant tests pass (95 cases); test_engine_stress.py passes its full 36-case parametrise sweep. * refactor(engine): collapse elite→parent indirection in mutant_task Source the elite pool size from parent_selector.num_parents instead of the now-vestigial max_elites_per_generation. With pool == num_parents, parent_selector.create_parent_iterator(elites) is a no-op shuffle, so mutant_task.py no longer needs to do next(iter(...), []) over it. - _select_elites_for_mutation → _select_parents_for_mutation, returns the actual parent set directly. - mutant_task.run_one_mutant calls it once; single empty-archive guard. - Stress test stub now honours the EvolutionStrategy.select_elites contract (return at most `total`); the old behaviour relied on the parent_iterator to subsample. max_elites_per_generation stays in EngineConfig for legacy YAML compatibility but is no longer read by the engine. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(cli): add `gigaevo profiler` subcommand for log flow profiling Parses an evolution runner log and emits two artifacts per run: - profile_<label>.txt -- pipeline summary (counts, refresh queue stats, per-program timeline) - profile_<label>.html -- interactive Plotly dashboard (lifecycle bars, stage sub-bars, refresh + re-eval bands, accept/reject bars) Resolution priority mirrors `logs`: --file <path> for arbitrary logs, positional labels under -e for manifest resolution, no-args + -e to profile every run in the manifest. Default output dir: experiments/<exp>/profiler/. 
Core renderer lives in gigaevo.monitoring.flow_profiler so the CLI is a thin wrapper.

Accept/reject markers use go.Bar (same width as the DAG span bar) instead of scatter markers, so they sit on the program's exact row at every zoom level. Min visual width clamped to 50ms (was 250ms) to keep sub-second events readable without smearing the early timeline. Footer explains queue-wait pathology referencing ParentRefresher._await_done() pinning in-flight slots during re-eval.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(scheduling): add CachedFirstPrioritizer for re-eval-first DAG launch

A program with non-empty stage_results has already been DAG-evaluated once, so on re-eval most of its stages will hit cached_skip and finish in milliseconds. Surfacing those to the front of the launch queue directly unblocks producer tasks that are pinned on ParentRefresher._await_done() (each pinned task holds an in-flight slot, so when N mutants x M-second refresh queues collide, throughput collapses even though per-DAG exec is near-zero).

The cache signal is sound: fresh mutants from Program.from_mutation_spec inherit default_factory=dict (empty), re-eval candidates retain the dict through batch_transition_by_ids (which only patches state + atomic_counter, program.py:281 -> redis_program_storage.py:632-633), and dag_runner.mget fetches without exclude=EXCLUDE_STAGE_RESULTS. No code path destroys the field.

Implements a two-tier partition: cached programs first, fresh second. Within each tier the input order is preserved -- Redis SMEMBERS hash order, which the runner uses upstream, has no meaningful semantics. No predictor needed -- the cache signal lives on the program itself.

7 new tests in tests/evolution/test_scheduling.py::TestCachedFirstPrioritizer.
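The two-tier partition can be sketched in a few lines. This is an illustration, not the `CachedFirstPrioritizer` source: the `Program` dataclass here is a hypothetical stand-in with only the `stage_results` field that the cache signal reads.

```python
from dataclasses import dataclass, field


@dataclass
class Program:
    """Hypothetical stand-in: only the fields the partition needs."""
    id: str
    stage_results: dict[str, object] = field(default_factory=dict)


def cached_first(programs: list[Program]) -> list[Program]:
    """Stable partition: programs with cached stage results go first.

    Input order is preserved inside each tier.
    """
    cached = [p for p in programs if p.stage_results]
    fresh = [p for p in programs if not p.stage_results]
    return cached + fresh


if __name__ == "__main__":
    queue = [
        Program("fresh-1"),
        Program("reeval-1", {"CallProgramFunction": "ok"}),
        Program("fresh-2"),
        Program("reeval-2", {"CallValidatorFunction": "ok"}),
    ]
    print([p.id for p in cached_first(queue)])
```

Two list comprehensions keep the partition stable without a sort key, which matches the stated goal of preserving input order within each tier.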
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(monitoring): emit LLM_CALL canonical event from MutationAgent

The MutationAgent overrides acall_llm to use a structured_llm pathway, which bypassed the base BaseStrategyAgent._emit_event(LLMCall(...)) call. As a result, /flow-profiler had no MutationAgent timings; only Lineage and Insights showed up in canonical event aggregations.

Add a finally-block emission that records latency, token usage, model, attempt count, and error_type on both success and failure, matching the contract used by the base agent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(profiler): utilization view: LLM/exec overlap + mutation archetypes

Adds a torch-profiler-style "is the LLM fully hidden behind exec stages" signal to /flow-profiler. Three new primitives:

* LLMCallEvent dataclass + LLM_CALL_RE: parse every canonical `[LLM_CALL] {json}` line into (stage, end, duration_ms, ok, …).
* classify_stage(name): bucket stages into llm / exec / orchestration. LLM stages (LineageStage, InsightsStage, *Agent canonical names) and program-exec stages (CallProgramFunction, CallValidatorFunction) are the two sides of the overlap; orchestration is excluded.
* compute_utilization(...): interval-union math returning total_llm_s, total_exec_s, overlap_s, overlap_efficiency = overlap / min(L, E), plus peak_concurrent_dags and per-archetype accept/reject counts.

Also:

* parse_log returns (programs, refreshes, llm_events), a 3-tuple.
* MUT_RE captures the optional `(model=…, archetype=…, prompt_id=…)` suffix already emitted by the mutation operator, attaching it to Program.mutation_archetype / .mutation_model.
* format_summary_text gains a "Utilization" section + archetype table.
* render_full_html gains a colored efficiency stat-bar (red <30%, amber <60%, green ≥60%) and an archetype frequency table above the plot.
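The interval-union math behind `overlap_efficiency = overlap / min(L, E)` can be sketched as below. This is a hedged reconstruction from the description above, not the body of `compute_utilization`; the helper names (`merge`, `total`, `intersection_length`) are assumptions made for the example.

```python
Interval = tuple[float, float]


def merge(intervals: list[Interval]) -> list[Interval]:
    """Union of possibly overlapping intervals as sorted, disjoint spans."""
    merged: list[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total(spans: list[Interval]) -> float:
    """Wall-clock length covered by disjoint spans."""
    return sum(end - start for start, end in spans)


def intersection_length(a: list[Interval], b: list[Interval]) -> float:
    """Length of the intersection of two sorted, disjoint span lists."""
    i = j = 0
    overlap = 0.0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            overlap += hi - lo
        # Advance whichever span ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return overlap


def overlap_efficiency(llm: list[Interval], exec_: list[Interval]) -> float:
    """overlap / min(L, E), or 0.0 when either side is empty."""
    llm_u, exec_u = merge(llm), merge(exec_)
    denom = min(total(llm_u), total(exec_u))
    return intersection_length(llm_u, exec_u) / denom if denom > 0 else 0.0
```

Merging first means concurrent LLM calls are counted once as wall time, which is what makes the ratio read as "how much of the shorter side is hidden behind the longer one".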
Smoke on experiments/heilbron/v1-honest-repro/run_A2_G.log:

LLM wall 76640s · exec wall 44860s · overlap 40377s (90% of min(L,E))
peak concurrent DAGs: 11 · 2421 LLM events (116 failed)
Computational Reinvention 91a/76r/49o · Guided Innovation 73a/61r/38o
Harmful Pattern Removal 12a/5r/4o · Solution Space Exploration 10a/4r/3o

19 new tests, all green; CLI smoke (10 tests) still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: ruff format follow-up on test_mutation_agent

Pre-push hook caught residual formatting in the LLM_CALL emission tests added in c336eb58; reformat only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(profiler): drop experiment-branded subtitle from page header

The HTML header used to render `<h1>flow profile · A2_G</h1>` followed by a `<span class="sub">heilbron/v1-honest-repro / A2_G</span>` next to it, which made the generic profiler tool look "branded" with whatever experiment was being analyzed. Drop the prominent subtitle and relegate the source path to a small muted `source: ...` line in the footer. The browser tab title and h1 are now clean: just `flow profile · <label>`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(monitoring): live flow profiler daemon for run.py

Adds gigaevo/monitoring/live_profiler.py, a small helper that spawns a daemon thread to periodically re-render the running experiment log into profile_live.html inside the Hydra output dir. Writes are atomic (.tmp + os.replace) so a browser reload mid-render never sees a partial file, and exceptions on one tick are logged but don't kill the loop.

run.py picks up the new helper with a single line after setup_logger, which keeps the entry point minimal as requested.

Tests cover the render-once contract, daemon-thread bootstrap, lazy log-creation wait, and atomic-write residue.
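The `.tmp` + `os.replace` write pattern mentioned for the live profiler looks roughly like this. The function name `atomic_write_text` and the temp-file naming are assumptions for the sketch; only the use of `os.replace` comes from the commit message.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(target: Path, text: str) -> None:
    """Write `text` next to `target`, then swap it into place in one step."""
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(text, encoding="utf-8")
    # os.replace overwrites an existing target; on the same filesystem the
    # rename is atomic, so readers see the old file or the new one.
    os.replace(tmp, target)


def demo() -> tuple[str, list[str]]:
    with tempfile.TemporaryDirectory() as workdir:
        target = Path(workdir) / "profile_live.html"
        atomic_write_text(target, "<p>tick 1</p>")
        atomic_write_text(target, "<p>tick 2</p>")
        leftover = sorted(p.name for p in Path(workdir).iterdir())
        return target.read_text(encoding="utf-8"), leftover
```

Keeping the temporary file in the same directory as the target matters, because a rename across filesystems is not atomic.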
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(profiler): inline Plotly so HTML renders in sandboxed previews

VS Code's HTML preview extension (and other sandboxed/offline viewers) blocks external <script src="cdn.plot.ly/..."> loads, so the previous include_plotlyjs="cdn" produced a blank page in those environments. Switch to include_plotlyjs="inline" which embeds plotly.js directly into the document. File grows from ~50KB to ~4.7MB, but it now renders anywhere: VS Code preview, archived run artifacts, offline shares.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(specs): mutation-throughput two-semaphore redesign

Decouple "LLM/refresh in flight" from "produced-but-not-ingested" so the DAG sees a freed slot back-to-back with the next ingest, without waiting for a fresh refresh+LLM round-trip. Single tunable (max_in_flight=N) sizes both semaphores. Steady-state pipeline depth ~2N mutants: ~N producers (mix of LLM-running and ready-result-held), ~N buffered (DAG queue + running). Ticket ownership and orphan-window equivalence preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(plans): two-sema mutation-throughput implementation plan

11 TDD tasks: config docstring rewrite (1), engine init + log line + sweep doc updates (2), dispatcher producer_sema (3), mutant_task buffer-sema acquire-after-LLM with paired finally (4), ingestor buffer-sema release (5), ghost-persist test migration (6), slot-leak chaos invariants (7), JIT DAG-refill behavioral property (8), resume-after-kill (9), real-Redis end-to-end smoke (10), full-sweep + push (11).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(engine): rewrite max_in_flight docstring for two-sema semantics

Field name unchanged; semantics now apply symmetrically to producer and buffer pools. Steady-state pipeline depth ~2N.
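A toy model helps to see where the ~2N pipeline depth comes from. The sketch below assumes the shape described in the spec (producer slot held across refresh + LLM, buffer slot acquired afterwards and released by the ingestor); `TwoSemaPipeline`, `settle` and `demo` are hypothetical names, not engine code.

```python
import asyncio


class TwoSemaPipeline:
    """Toy model: one knob sizes both the producer pool and the buffer pool."""

    def __init__(self, max_in_flight: int) -> None:
        self.producer_sema = asyncio.Semaphore(max_in_flight)
        self.buffer_sema = asyncio.Semaphore(max_in_flight)
        self.in_flight: set[str] = set()

    async def produce(self, mutant_id: str) -> None:
        async with self.producer_sema:        # released on every exit path
            await asyncio.sleep(0)            # stands in for refresh + LLM call
            await self.buffer_sema.acquire()  # waits while N mutants await ingestion
            self.in_flight.add(mutant_id)     # buffer slot now belongs to the ingestor

    def ingest(self, mutant_id: str) -> None:
        if mutant_id in self.in_flight:       # release each buffer slot exactly once
            self.in_flight.remove(mutant_id)
            self.buffer_sema.release()


async def settle(rounds: int = 20) -> None:
    """Yield to the event loop so ready tasks can make progress."""
    for _ in range(rounds):
        await asyncio.sleep(0)


async def demo(max_in_flight: int = 2, mutants: int = 5) -> tuple[list[int], bool]:
    pipe = TwoSemaPipeline(max_in_flight)
    tasks = [asyncio.create_task(pipe.produce(f"m{i}")) for i in range(mutants)]
    buffered: list[int] = []
    await settle()
    for _ in range(mutants):
        buffered.append(len(pipe.in_flight))  # never exceeds max_in_flight
        pipe.ingest(min(pipe.in_flight))
        await settle()
    await asyncio.gather(*tasks)
    all_free = not pipe.producer_sema.locked() and not pipe.buffer_sema.locked()
    return buffered, all_free
```

With N = 2, the buffer stays full while two more producers hold finished results and wait, so each ingest is followed immediately by a refill without a new LLM round-trip.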
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(engine): replace _in_flight_sema with _producer_sema + _buffer_sema Two-semaphore backpressure for the steady-state engine. _producer_sema caps concurrent LLM/refresh tasks; _buffer_sema caps produced-but-not-yet-ingested mutants. Both sized to existing max_in_flight knob — no new config surface. Touched: steady_state.py (init + log + sweep doc), dispatcher.py (acquire _producer_sema), mutant_task.py (acquire _buffer_sema after LLM, paired release in finally), ingestor.py (release _buffer_sema on DONE/DISCARDED). Ghost-persist test still pinned to old single-sema model — migration lives in T6 to keep this commit reviewable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(engine): migrate test suite from _in_flight_sema to two-sema pair T2-T5 (95129056) replaced the single _in_flight_sema with _producer_sema (dispatcher-side, always released in finally) and _buffer_sema (producer acquires post-LLM, ingestor releases on DONE/DISCARDED). Migrate every remaining test reference: - caller-protocol acquire/release → _producer_sema (mirrors dispatcher) - slot-accounting + len(_in_flight) conservation → _buffer_sema (_in_flight membership is gated by _buffer_sema in the new model) - 'all slots returned' assertions → both pools at full capacity Test intent preserved; semantics translated 1:1 to the new model. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(engine): T7 - Slot-leak chaos test for two-sema architecture Add comprehensive chaos test suite validating slot conservation under adversarial timings and concurrent access patterns. 
Tests verify that the two-semaphore model (producer_sema + buffer_sema) maintains invariants across: - Race conditions: rapid acquire/release cycles, concurrent transfers - Backpressure: ingestor slow-release blocking producer - Cancellation: mid-acquire/mid-flight cancellation with proper cleanup - Edge cases: minimal (max_in_flight=1), large (max_in_flight=100), full drain Key invariant validated: semaphore values stay in [0, max_in_flight] range and in-flight mutants do not exceed max_in_flight, proving no slot leak across dispatcher, producer, and ingestor phases. 15 ne…
1 parent e3c5d69 commit 7fbe06e

142 files changed

Lines changed: 18276 additions & 6380 deletions

config/config.yaml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ defaults:
   - ideas_tracker: none # Post-run ideas analysis (override with ideas_tracker=default)
   - memory: none # Memory provider (override with memory=local for treatment arms)
   - aggregator: none # NullAggregator sentinel (override with aggregator=heilbron_improver|heilbron_constructor)
-  - stopper: max_generations # Pluggable stopper (override with stopper=wall_clock, stopper=fitness_plateau, ...)
+  - stopper: max_mutants # Pluggable stopper (override with stopper=wall_clock, stopper=fitness_plateau, ...)
   - _self_

 # Problem metadata (set via command line or problem-specific config)

config/constants/evolution.yaml

Lines changed: 2 additions & 2 deletions
@@ -3,11 +3,11 @@

 loop_interval: 1.0
 max_elites_per_generation: 5
-max_mutations_per_generation: 8
 sync_min_delta: 8
 num_parents: 2
 mutation_mode: rewrite
-max_generations: 100
+max_mutants: 800
 strip_comments_and_docstrings: false
 pre_step_hook: null
 post_step_hook: null
+max_in_flight: 8
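The new default of `max_mutants: 800` lines up with the old pair of values removed in this hunk. The arithmetic is sketched below under the assumption that every old generation produced exactly `max_mutations_per_generation` mutants (the commit message only says roughly 8 per epoch); `equivalent_max_mutants` is a hypothetical helper, not a function in the repository.

```python
def equivalent_max_mutants(max_generations: int, mutations_per_generation: int) -> int:
    """Convert an epoch-based budget into a per-mutant budget.

    Assumes each generation yields exactly `mutations_per_generation` mutants.
    """
    if max_generations < 0 or mutations_per_generation < 0:
        raise ValueError("budgets must be non-negative")
    return max_generations * mutations_per_generation


# Old defaults removed by this diff.
old_defaults = {"max_generations": 100, "max_mutations_per_generation": 8}

new_max_mutants = equivalent_max_mutants(
    old_defaults["max_generations"],
    old_defaults["max_mutations_per_generation"],
)
print(new_max_mutants)  # 800, matching the new max_mutants default
```

The same conversion applies to any override that previously set `max_generations`: carrying the number over unchanged as `max_mutants` would shorten the run by about the per-generation factor.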

config/evolution/default.yaml

Lines changed: 3 additions & 3 deletions
@@ -17,13 +17,13 @@ program_acceptor:

 # Engine config
 engine_config:
-  _target_: gigaevo.evolution.engine.EngineConfig
+  _target_: gigaevo.evolution.engine.SteadyStateEngineConfig
   loop_interval: ${loop_interval}
   max_elites_per_generation: ${max_elites_per_generation}
-  max_mutations_per_generation: ${max_mutations_per_generation}
   program_acceptor: ${program_acceptor}
   parent_selector: ${parent_selector}
   stopper: ${stopper}
+  max_in_flight: ${max_in_flight}

 metrics_tracker:
   _target_: gigaevo.utils.metrics_tracker.MetricsTracker
@@ -33,7 +33,7 @@ metrics_tracker:

 # Evolution engine
 evolution_engine:
-  _target_: gigaevo.evolution.engine.EvolutionEngine
+  _target_: gigaevo.evolution.engine.SteadyStateEvolutionEngine
   storage: ${ref:redis_storage}
   strategy: ${evolution_strategy}
   mutation_operator: ${mutation_operator}

config/evolution/steady_state.yaml

Lines changed: 0 additions & 3 deletions
@@ -11,13 +11,10 @@ engine_config:
   _target_: gigaevo.evolution.engine.SteadyStateEngineConfig
   loop_interval: ${loop_interval}
   max_elites_per_generation: ${max_elites_per_generation}
-  max_mutations_per_generation: ${max_mutations_per_generation}
   program_acceptor: ${program_acceptor}
   parent_selector: ${parent_selector}
   stopper: ${stopper}
   max_in_flight: 8
-  refresh_order: fifo
-  refresh_passes: 1

 # Override engine target
 evolution_engine:
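The diffs above drop `max_mutations_per_generation`, `refresh_order`, and `refresh_passes`, and replace `max_generations` with `max_mutants`. A local override file that still sets any of these can be checked with a small scan like the one below. This is a hypothetical helper for illustration only: `REMOVED_KEYS` and `find_removed_keys` are not part of gigaevo or Hydra, and the sample config is invented.

```python
from collections.abc import Mapping

# Keys removed or renamed in this commit's config diffs (None = no replacement shown).
REMOVED_KEYS = {
    "max_mutations_per_generation": None,
    "max_generations": "max_mutants",
    "refresh_order": None,
    "refresh_passes": None,
}


def find_removed_keys(cfg: Mapping, prefix: str = "") -> list[str]:
    """Return dotted paths of every removed key found anywhere in a nested mapping."""
    hits = []
    for key, value in cfg.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if key in REMOVED_KEYS:
            hits.append(path)
        if isinstance(value, Mapping):
            hits.extend(find_removed_keys(value, path))
    return hits


# Invented example of a stale override.
stale_cfg = {
    "max_generations": 100,
    "engine_config": {"max_in_flight": 8, "refresh_order": "fifo", "refresh_passes": 1},
}

for path in find_removed_keys(stale_cfg):
    replacement = REMOVED_KEYS[path.rsplit(".", 1)[-1]]
    hint = f"use {replacement}" if replacement else "no replacement"
    print(f"{path}: removed ({hint})")
```

A scan like this is only a convenience for finding stale keys ahead of time; it says nothing about how the engine itself reacts to them.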

config/experiment/base.yaml

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ defaults:
   - /algorithm: single_island
   - /pipeline: auto
   - /runner: default
-  - /scheduling: fifo
+  - /scheduling: cached_first
   - /loader: directory
   - /logging: tensorboard
   - /metrics: default

config/experiment/migration_bus.yaml

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ defaults:
   - /algorithm: single_island
   - /pipeline: auto
   - /runner: default
-  - /scheduling: fifo
+  - /scheduling: cached_first
   - /loader: directory
   - /logging: tensorboard
   - /metrics: default

config/experiment/prompt_coevolution.yaml

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ defaults:
   - /algorithm: single_island
   - /pipeline: auto
   - /runner: default
-  - /scheduling: fifo
+  - /scheduling: cached_first
   - /loader: directory
   - /logging: tensorboard
   - /metrics: default

config/experiment/steady_state.yaml

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ defaults:
   - /algorithm: single_island
   - /pipeline: auto
   - /runner: default
-  - /scheduling: fifo
+  - /scheduling: cached_first
   - /loader: directory
   - /logging: tensorboard
   - /metrics: default

config/experiment/steady_state_adversarial.yaml

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ defaults:
   - /algorithm: single_island
   - /pipeline: adversarial_coevo_ss
   - /runner: default
-  - /scheduling: fifo
+  - /scheduling: cached_first
   - /loader: directory
   - /logging: tensorboard
   - /metrics: default

config/experiment/steady_state_bus.yaml

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ defaults:
   - /algorithm: single_island
   - /pipeline: auto
   - /runner: default
-  - /scheduling: fifo
+  - /scheduling: cached_first
   - /loader: directory
   - /logging: tensorboard
   - /metrics: default
