Commit 7fbe06e
refactor(engine): true JIT-refresh steady-state engine (#227)
* docs(specs): steady-state engine audit + true-JIT-refresh redesign
Captures the current overlapping concepts in gigaevo/evolution/engine/
(epoch vs generation, two flags gating one loop, three drain paths, two
ingestion paths, multi-pass refresh) and proposes a redesign where the
only post-seed DONE->QUEUED flip happens for the parents picked for a
single mutation. Counter consolidates to total_mutants; epoch concept
goes away entirely; file split brings each module to ~250 LOC with a
single responsibility.
Draft for user review on refactor/steady-state-true-jit-refresh.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(specs): refine steady-state redesign — async stream, multi-parent, iteration axis
- Recast §3.2 as a continuous async stream: dispatcher + per-mutant tasks
+ ingestor (spawn-and-forget), not a sequential loop.
- Generalise refresh path for num_parents > 1 (RandomParentSelector and
AllCombinationsParentSelector both take num_parents); per-parent lock
to prevent double-flip on overlapping selections.
- Pin Program.iteration semantics as total_mutants_at_production (denser
plot axis) and flag *_in_iteration cohort aggregates in collector.py
as a plan-level migration item.
- Rename module split to dispatcher.py / mutant_task.py / ingestor.py.
- Add risks for multi-parent backpressure starvation and cohort aggregate
collapse.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(plans): true-JIT-refresh steady-state engine — 21-task implementation plan
TDD-sequenced refactor of gigaevo/evolution/engine/steady_state.py per
docs/superpowers/specs/2026-05-12-steady-state-engine-audit-and-redesign.md:
- Delete epoch concept, gate, drain barrier
- Single total_mutants counter (rename total_generations)
- Refresh only selected parents JIT, not whole archive
- Continuous async stream: dispatcher + mutant_task + ingestor
- Module split: engine.py / dispatcher.py / mutant_task.py / ingestor.py / refresh.py
- Drop refresh_passes / refresh_order / refresh_pass / epoch_trigger_count
- Keep MaxGenerationsStopper as deprecated alias of MaxMutantsStopper
- Migrate config/evolution/default.yaml to SteadyStateEvolutionEngine
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(plans): paranoia tasks 19A-19F + hard-rename stopper (Option A)
Two clarifications to the steady-state JIT refactor plan:
1. Add Tasks 19A-19F before the smoke + PR tasks:
- 19A: concurrency stress + load/async simulation suite
- 19B: cancellation invariants + resume-after-kill
- 19C: real-Redis integration smoke
- 19D: ParentRefresher failure-mode resilience
- 19E: chaos-hacker adversarial review pass
- 19F: counter monotonicity invariant
2. Stopper rename is hard, not aliased. The old MaxGenerationsStopper
counted *epochs* (~8 mutants each); MaxMutantsStopper counts mutants.
An alias would silently shrink runs ~8x. Delete the old class, delete
the old config files, rename the global default from
max_generations: 100 to max_mutants: 800 (preserves prior effective
run length). Old configs fail loudly at Hydra compose time.
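The conversion behind the new default can be checked by hand. A minimal sketch, assuming the ~8 mutants per epoch quoted above; the helper name is hypothetical and not part of gigaevo.

```python
def epochs_to_mutants(max_generations: int, mutants_per_epoch: int = 8) -> int:
    """Convert an epoch-counted budget into a mutant-counted budget."""
    return max_generations * mutants_per_epoch


# Old global default: 100 epochs. The new default has to count mutants.
new_default = epochs_to_mutants(100)

# A silent alias would have reused 100 as a mutant budget instead.
shrink_factor = new_default / 100
```

Here new_default is 800 and shrink_factor is 8.0, which is the ~8x shrink an alias would have introduced.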
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(engine): single-counter total_mutants; drop refresh_pass; hard-rename stopper
Foundational refactor for JIT-refresh steady-state engine (plan task 2+3+4+7+8):
Engine
- EngineMetrics.total_generations -> total_mutants (single-counter progress)
- EngineSnapshot.total_generations -> total_mutants
- EngineSnapshot.refresh_pass field DELETED (multi-pass refresh removed)
- SteadyStateEngineConfig.refresh_passes field REMOVED
- steady_state._refresh_archive_programs: inlined one-pass body; multi-pass
loop + per-pass snapshot bumps gone
Stopper (hard rename, no back-compat alias per Option A)
- MaxGenerationsStopper(max_generations=N) -> MaxMutantsStopper(max_mutants=N)
- config/stopper/max_generations*.yaml -> max_mutants*.yaml
- config/constants/evolution.yaml: max_generations: 100 -> max_mutants: 800
(preserves prior run length: 100 epochs x 8 mutants/epoch under steady-state)
- config/config.yaml stopper default: max_generations -> max_mutants
Manifest boundary preserved
- launch_generator.py: emits max_mutants={contract.max_generations} Hydra override
- Contract.max_generations stays (experiment-level concept)
- CMA-ES max_generations (optimizer hyperparam) unchanged
- watchdog/monitoring max_generations (experiment progress display) unchanged
Adversarial
- SharedBenchmarkFilteredLineageStage.compute_hash override DELETED
(refresh_pass-aware cache invariant obsolete under JIT-refresh)
Tests
- Deleted: test_snapshot_refresh_pass.py, test_lineage_cache_invalidation.py,
test_two_pass_mutation_context.py
- Vestigial "removed feature" assertion classes deleted per user directive
358 targeted tests pass. ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(progress): migrate MainRunSyncHook + monitoring to programs_processed
Task 5: MainRunSyncHook polls snap.programs_processed (was total_mutants).
_last_main_gen -> _last_main_progress; _get_min_gen -> _get_min_progress.
Module docstring + log strings updated.
Task 6: redis_queries.get_generation -> get_programs_processed reading
snap.programs_processed. collect_snapshot.gen now sourced from
programs_processed; RunSnapshot.generation field name preserved for
display compatibility.
programs_processed is the canonical cross-run progress signal under JIT-
refresh: it counts mutants actually ingested into the archive (post-validation),
not total mutants emitted. Prompt-coevo sync needs the former to ensure the
main run has produced something usable before the prompt run advances.
Tests pass: tests/prompts/test_coevolution_sync.py (14), tests/monitoring/test_redis_queries.py (17).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(engine): ParentRefresher + ParentRefreshSelector ABC for JIT refresh
Adds the JIT DONE->QUEUED->DONE refresh helper that producer tasks call
before mutating selected parents. Replaces the multi-pass _refresh_archive
sweep removed in the prior commit.
Architecture (user directive 2026-05-12):
- ParentRefreshSelector: ABC choosing which programs to refresh given the
producer's parent pick. DirectParentsSelector is the canonical default
(refresh only the parents themselves). Future implementations may walk
lineage to depth-k and order refresh in depth-batched waves so deepest
ancestors finish before nearest parents flip.
- ParentRefresher: per-parent-id asyncio.Lock serialises overlapping
concurrent refreshers. Batch transition flips all DONE targets to QUEUED
atomically (no producer sees a half-flipped bundle), then polls mget()
until every target is DONE. DISCARDED-on-input or DISCARDED-during-wait
raises ValueError; vanished parents raise ValueError; absence-of-progress
raises TimeoutError. Caller aborts the mutant and releases its in-flight
slot rather than falling back to stale state.
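The flip-then-poll contract described above might look roughly like the sketch below. It is an illustration only: FakeStorage, flip_to_queued, fake_dag and the refresh signature are hypothetical stand-ins, the per-parent locks are left out, and a plain overall deadline stands in for the absence-of-progress timeout.

```python
import asyncio
from enum import Enum


class State(Enum):
    QUEUED = "QUEUED"
    DONE = "DONE"
    DISCARDED = "DISCARDED"


class FakeStorage:
    """Hypothetical in-memory stand-in for the real program storage."""

    def __init__(self, states):
        self.states = dict(states)

    async def mget(self, ids):
        return [self.states.get(pid) for pid in ids]

    async def flip_to_queued(self, ids):
        # No await between writes, so no other task sees a half-flipped batch.
        for pid in ids:
            self.states[pid] = State.QUEUED


def _check_alive(ids, states):
    for pid, state in zip(ids, states):
        if state is None:
            raise ValueError(f"parent {pid} vanished")
        if state is State.DISCARDED:
            raise ValueError(f"parent {pid} is DISCARDED")


async def refresh(storage, ids, poll_s=0.01, timeout_s=1.0):
    """Flip DONE parents to QUEUED, then wait until every target is DONE."""
    before = await storage.mget(ids)
    _check_alive(ids, before)
    await storage.flip_to_queued(
        [pid for pid, state in zip(ids, before) if state is State.DONE]
    )
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    while True:
        current = await storage.mget(ids)
        _check_alive(ids, current)
        if all(state is State.DONE for state in current):
            return list(ids)
        if loop.time() >= deadline:
            raise TimeoutError(f"refresh of {ids} did not finish")
        await asyncio.sleep(poll_s)


async def fake_dag(storage, delay_s=0.02):
    """Promote every QUEUED program to DONE after a short delay."""
    await asyncio.sleep(delay_s)
    for pid, state in list(storage.states.items()):
        if state is State.QUEUED:
            storage.states[pid] = State.DONE
```

A caller would treat ValueError and TimeoutError the same way: abort the mutant and release its slot, rather than mutate from stale parent state.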
Tests: 11/11 pass (single/empty/batch/overlap/discarded/timeout/selector
ABC contract/custom-selector-adds-targets/empty-selector-noop). FakeDag
test helper provides QUEUED -> RUNNING -> DONE auto-promotion to exercise
the refresh without a real DagRunner.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(engine): SteadyStateEvolutionEngine composes dispatcher + ingestor + ParentRefresher
Replaces the 935-LOC epoch-driven engine with a thin composition of
three new modules:
- gigaevo/evolution/engine/mutant_task.py — run_one_mutant: one mutant
per async task; explicit slot-ownership invariant (try/finally guards
the semaphore against partial-failure and cancellation)
- gigaevo/evolution/engine/dispatcher.py — dispatcher_loop: continuous
spawn-and-forget producer; backpressure via _in_flight_sema only
- gigaevo/evolution/engine/ingestor.py — ingestor_loop + poll_and_ingest:
long-lived ingestion loop with adaptive interval, batch DONE handling,
leaked-id sweep, slot-release on ingest
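How the three modules share one semaphore can be sketched as follows. This is a toy model under stated assumptions, not the engine code: MiniEngine and its fields are hypothetical, mutation work is a sleep, and "ingestion" just pops ids. The function names only mirror the ones listed above.

```python
import asyncio


class MiniEngine:
    """Hypothetical stand-in holding only the state this sketch needs."""

    def __init__(self, max_in_flight, max_mutants):
        self.max_in_flight = max_in_flight
        self.max_mutants = max_mutants
        self.sema = asyncio.Semaphore(max_in_flight)
        self.in_flight = set()
        self.total_mutants = 0
        self.programs_processed = 0


async def run_one_mutant(engine):
    """Produce one mutant; release the slot unless ownership was handed on."""
    slot_transferred = False
    try:
        await asyncio.sleep(0.001)            # stand-in for mutation work
        engine.total_mutants += 1
        engine.in_flight.add(engine.total_mutants)
        slot_transferred = True               # the ingestor owns the slot now
    finally:
        if not slot_transferred:
            engine.sema.release()


async def dispatcher_loop(engine):
    """Spawn mutant tasks continuously; the semaphore is the only brake."""
    tasks = set()
    while engine.total_mutants < engine.max_mutants:
        await engine.sema.acquire()
        if engine.total_mutants >= engine.max_mutants:
            engine.sema.release()             # cap was reached while waiting
            break
        task = asyncio.create_task(run_one_mutant(engine))
        tasks.add(task)
        task.add_done_callback(tasks.discard)
    await asyncio.gather(*tasks)


async def ingestor_loop(engine, stop):
    """Ingest finished mutants and hand their slots back."""
    while not (stop.is_set() and not engine.in_flight):
        while engine.in_flight:
            engine.in_flight.pop()
            engine.programs_processed += 1
            engine.sema.release()
        await asyncio.sleep(0.001)


async def run(max_in_flight, max_mutants):
    engine = MiniEngine(max_in_flight, max_mutants)
    stop = asyncio.Event()
    ingestor = asyncio.create_task(ingestor_loop(engine, stop))
    await dispatcher_loop(engine)
    stop.set()
    await ingestor
    return engine
```

Because tasks already spawned keep running after the cap check passes, total_mutants can end slightly above max_mutants; in this toy model the overshoot stays below max_in_flight.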
Deletes (from steady_state.py): _mutation_loop, _produce_one_mutant,
_get_cached_elites, _create_single_mutant, _ingestion_loop,
_poll_and_ingest, _ingest_batch, _should_trigger_epoch, _epoch_refresh,
_drain_in_flight, _drain_scoped, _refresh_archive_programs,
_mutation_gate, _cached_elites, _elite_cache_lock, _processed_since_epoch,
_epoch_mutants, _epoch_eligible_since (~800 LOC).
Config: drop refresh_passes + refresh_order from EngineConfig; hoist
max_in_flight to the parent; SteadyStateEngineConfig now a Hydra alias.
steady_state.yaml drops refresh_order + refresh_passes.
Tests: rewrite test_steady_state.py (736 → ~165 LOC) to cover
construction (incl. _parent_refresher wiring), backpressure semaphore,
generation cap stopping dispatcher_loop, restore from snapshot. Skip
modules pinned to deleted machinery: test_steady_state_determinism.py
(epoch determinism — to be rewritten against new tick site),
test_generation_boundary_emit.py (step() removal pending in Task 14).
See spec §3, plan §13.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(engine): delete generational EvolutionEngine.step() / run() loop
evolution=default now wires SteadyStateEvolutionEngine. EvolutionEngine
becomes an abstract base of shared helpers (snapshot, metrics, idle wait,
hooks, stop context). BusedEvolutionEngine migrated to subclass
SteadyStateEvolutionEngine with a periodic bus-drain background task.
Also persists total_mutants in the engine snapshot after each mutant
production so resume picks up the correct mutant counter; previously
this happened inside step(), which is now gone.
See spec docs/superpowers/specs/2026-05-12-steady-state-engine-audit-and-redesign.md §3.6.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(collector): set *_in_iteration aggregates to None under JIT engine
Each mutant has a unique iteration (= total_mutants_at_production), so cohort
aggregates collapse to single-program windows. Schema field retained for
plot/exporter compatibility; consumers needing windowed aggregates should
compute them at plot time. See spec §3.5 + §6.5.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(engine): JIT-refresh polish — empty-archive backoff, metric wiring, vestigial GenerationBoundary
Wraps up Tasks 16-18 of the JIT-refresh refactor (plan:
docs/superpowers/plans/2026-05-12-steady-state-true-jit-refresh.md).
gigaevo/evolution/engine/mutant_task.py
- Add asyncio.sleep(loop_interval) backoff when select_elites returns
empty (population seeding / all rejected). Prevents dispatcher
hot-spinning when the archive is empty.
- Wire submitted_for_refresh metric: record_reprocess_metrics(len(refreshed))
after ParentRefresher.refresh() succeeds. Previously the metric was
orphaned (defined but never incremented under JIT-refresh).
gigaevo/monitoring/events.py
- Mark GenerationBoundary vestigial with explanatory docstring. The class
schema is kept so legacy run logs still parse, but nothing in gigaevo/
emits this event under steady-state JIT-refresh.
config/constants/evolution.yaml, config/evolution/{default,steady_state}.yaml
gigaevo/evolution/engine/config.py
gigaevo/experiment/launch_generator.py
- Drop max_mutations_per_generation — under JIT-refresh there is no
per-generation mutation cap; max_in_flight controls parallelism.
Tests adjusted for JIT-refresh floor-trigger semantics:
- Strict total_mutants == N replaced with >= N at ~12 sites across
tests/integration/{test_mini_run,test_multigen_e2e,test_memory_e2e,
test_acceptor_engine,test_advanced_scenarios,test_brittleness,
test_complex_scenarios,test_engine_regression,test_ingest_regression,
test_evolution_engine_edge_cases}.py and tests/concurrency/
test_deadlock_prevention.py. JIT cap is a floor trigger — concurrent
in-flight mutants may bring total_mutants slightly above max.
- Skip class-level on TestEmptyArchiveEngine, TestAllMutationsReturnNone,
TestAllMutationsRaise, TestTransientMutationFailure (empty/zero-success
scenarios cannot reach the cap under JIT-refresh).
- Skip class-level on TestEngineStepIntegration — the generational
engine.step() entry point was deleted; deadlock-prevention under
JIT-refresh is covered by the paranoia suite (Task 19A).
- Skip two engine.run() wiring tests in TestEnginePostRunHookWiring
that hung on AsyncMock empty archive; the wiring is still covered by
test_none_hook_defaults_to_null + test_custom_hook_is_stored.
- New tests/config/test_stopper_configs.py pins the MaxMutantsStopper
Hydra targets and rejects MaxGenerationsStopper imports.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(specs): record JIT engine dry-run smoke results
§9 added with the Hydra config resolution table showing the new schema
is canonical: SteadyStateEvolutionEngine + MaxMutantsStopper +
max_in_flight, with no max_mutations_per_generation / refresh_pass /
total_generations references. Closed experiment configs intentionally
left unchanged.
Live-cluster run deferred to post-merge follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(engine): concurrency stress + simulation suite (load × async patterns)
36-combo parametrised matrix exercising the JIT-refresh engine end-to-end
against fakeredis storage and a timed fake DAG. Verifies six core
invariants: no semaphore leak, _in_flight drains, total_mutants reaches
the cap with bounded overshoot (≤ max_in_flight), programs_processed
equals accepted+rejected, ParentRefresher flip count is bounded, and
snapshot counters are monotonically non-decreasing.
Sweeps (max_in_flight, n_mutants, duration_dist, overlap_rate) across
mif ∈ {1,4,16}, n ∈ {50,200}, dist ∈ {const,expo,heavy_tail}, ov ∈
{0,0.5}. The high-overlap arm seeds the archive with a single elite so
concurrent producers contend on one parent and exercise the per-id
ParentRefresher lock; the low-overlap arm seeds 2×mif elites so producers
pick distinct parents.
Closes Task 19A from the steady-state JIT-refresh refactor plan.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(engine): cancellation + resume-after-kill invariants
Two new test files paired with one engine fix:
* test_engine_cancellation.py — cancels run() mid-flight and after early
start; verifies slot accounting (sema._value + |_in_flight| ==
max_in_flight), counters never regress, snapshot remains consistent.
* test_engine_resume_after_kill.py — runs engine A to cap=5, tears it
down, rebuilds engine B against the same fakeredis server, calls
restore_state(), runs to cap=10. Verifies progress is strictly forward
across the resume and the cap window includes bounded overshoot.
Engine fix: SteadyStateEvolutionEngine.run()'s finally clause now
explicitly cancels the dispatcher and ingestor tasks. asyncio.wait()
does not propagate cancellation into its waited tasks, so without this
they leaked across an external run-task cancel, holding semaphore slots
forever (the cancellation test caught this directly).
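The asyncio.wait() behaviour behind the engine fix can be reproduced in isolation. A minimal sketch, assuming two long-lived child tasks standing in for the dispatcher and ingestor; worker, run and count_leaked are hypothetical names.

```python
import asyncio


async def worker():
    await asyncio.sleep(3600)   # stand-in for a long-lived loop


async def run(children, cancel_in_finally):
    children.extend(asyncio.create_task(worker()) for _ in range(2))
    try:
        await asyncio.wait(children, return_when=asyncio.FIRST_COMPLETED)
    finally:
        if cancel_in_finally:
            for task in children:
                task.cancel()
            await asyncio.gather(*children, return_exceptions=True)


async def count_leaked(cancel_in_finally):
    """Cancel run() from outside and count children that are still alive."""
    children = []
    outer = asyncio.create_task(run(children, cancel_in_finally))
    await asyncio.sleep(0.01)
    outer.cancel()
    await asyncio.gather(outer, return_exceptions=True)
    leaked = [task for task in children if not task.done()]
    for task in leaked:             # tidy up so the demo itself leaks nothing
        task.cancel()
    await asyncio.gather(*children, return_exceptions=True)
    return len(leaked)
```

asyncio.run(count_leaked(False)) reports both children still running after the outer cancel; with the explicit cancel in finally the count drops to zero.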
Closes Task 19B from the steady-state JIT-refresh refactor plan.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(engine): ParentRefresher failure-mode resilience
Adds four new failure-mode tests to test_refresh_parents.py:
* No-timeout-default: with timeout_seconds=None, a brief DAG pause is
absorbed and the refresh still completes successfully.
* Mid-flight DISCARD: a parent flipped DISCARDED by another path during
the await raises ValueError rather than returning stale state.
* Mid-flight vanish: a parent removed from storage during the await
raises ValueError.
* Reversed input order: two concurrent refreshes on the same parent
set with reversed input orderings both complete — the per-id locks
are acquired in deterministic sorted order, so classic
lock-order-inversion deadlocks are impossible.
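The reversed-input case can be shown with two locks. A minimal sketch, assuming plain string ids; hold_all and both_complete are hypothetical helpers, and the unsorted arm exists only to show the deadlock that sorting avoids.

```python
import asyncio
from contextlib import AsyncExitStack


async def hold_all(locks, ids, sort_ids, hold_s=0.01):
    """Acquire one lock per id, pausing between acquisitions."""
    order = sorted(ids) if sort_ids else list(ids)
    async with AsyncExitStack() as stack:
        for pid in order:
            await stack.enter_async_context(locks[pid])
            await asyncio.sleep(hold_s)   # widen the race window


async def both_complete(sort_ids, timeout_s=0.5):
    """Run two refresh-like tasks with reversed inputs; True if both finish."""
    locks = {"a": asyncio.Lock(), "b": asyncio.Lock()}
    pair = asyncio.gather(
        hold_all(locks, ["a", "b"], sort_ids),
        hold_all(locks, ["b", "a"], sort_ids),
    )
    try:
        await asyncio.wait_for(pair, timeout_s)
        return True
    except asyncio.TimeoutError:
        return False
```

With sort_ids=True both tasks take "a" before "b" and finish; with sort_ids=False each holds one lock while waiting for the other, and the call times out.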
Closes Task 19D from the steady-state JIT-refresh refactor plan.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): final ingestion sweep runs under cancellation
Chaos-hacker review identified two compounding High-severity bugs:
#1 cancellation between _in_flight.add and _write_snapshot permanently
leaks slots (slot_transferred=True blocks per-task release).
#2 final-sweep loop in run() body is unreachable when CancelledError
propagates from asyncio.wait().
Fix: move the final ingestion sweep into run()'s finally block with
asyncio.shield to survive outer cancellation, bounded by
max_in_flight + 1 passes to avoid hangs on QUEUED stragglers.
Also cancel dispatcher/ingestor tasks explicitly in finally — asyncio.wait()
does not cancel its waited tasks when the outer coroutine is cancelled, so
they could otherwise survive engine teardown and continue spawning mutants.
Regression test test_cancel_drains_done_programs_via_final_sweep asserts
that DONE programs in _in_flight at cancel time are ingested by the sweep,
with programs_processed advancing accordingly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): serialise _write_snapshot to keep Redis in sync with memory
Chaos-hacker review finding #3 (Medium): concurrent mutant tasks call
_write_snapshot from run_one_mutant after incrementing total_mutants.
Without synchronisation, two writers can compute monotone versions
v=N+1 and v=N+2 synchronously, then both await save_run_state — if the
v=N+2 save lands first and v=N+1 lands second, Redis ends at v=N+1
with stale fields while the in-memory mirror sits at v=N+2. A crash
resume then rehydrates the older v=N+1 and loses the latest updates.
Fix: wrap the model_copy + set_current_snapshot + storage.save_run_state
in an asyncio.Lock so the per-call version bump and Redis write land
atomically. Last-writer-wins still holds; only the ordering is
guaranteed.
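The out-of-order write can be reproduced with a toy writer. A minimal sketch, assuming a single integer version and a sleep in place of the Redis round-trip; SnapshotWriter and its fields are hypothetical, not the engine's API.

```python
import asyncio


class SnapshotWriter:
    """Hypothetical model: `version` is memory, `persisted` is Redis."""

    def __init__(self, use_lock):
        self.version = 0
        self.persisted = 0
        self._lock = asyncio.Lock() if use_lock else None

    async def _bump_and_save(self, save_latency_s):
        self.version += 1
        version = self.version
        await asyncio.sleep(save_latency_s)   # stand-in for the Redis write
        self.persisted = version              # last write to land wins

    async def write(self, save_latency_s):
        if self._lock is None:
            await self._bump_and_save(save_latency_s)
        else:
            async with self._lock:
                await self._bump_and_save(save_latency_s)


async def demo(use_lock):
    writer = SnapshotWriter(use_lock)
    # The first writer is slow, so without a lock its save lands last.
    await asyncio.gather(writer.write(0.02), writer.write(0.005))
    return writer.version, writer.persisted
```

Without the lock the demo ends with memory at version 2 and the persisted copy at version 1; with the lock both end at 2.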
Regression test concurrent_write_snapshot_keeps_redis_and_memory_in_sync
issues 50 concurrent _write_snapshot calls and asserts the Redis-persisted
version equals the in-memory mirror's version at the end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(refresh): bound _locks dict via WeakValueDictionary
Chaos-hacker review finding #4 (Medium): ParentRefresher._locks was a
plain dict that retained an asyncio.Lock per distinct parent id forever.
On a multi-day run touching tens of thousands of mutants this leaks
~100 bytes/lock plus event-loop bookkeeping per entry — small in
absolute terms but proportional to evolution history.
Fix: switch to weakref.WeakValueDictionary so locks are retained only
while at least one in-flight refresh holds a strong reference. The lock
contract is unchanged — concurrent refreshes for the same parent id
still share the same lock, because the active caller's strong ref keeps
the entry alive across reentrant lookups.
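A lock table built this way could look like the sketch below. LockTable and refresh are hypothetical names, and the example relies on CPython reclaiming an unreferenced lock promptly; gc.collect() is called only as a precaution.

```python
import asyncio
import gc
import weakref


class LockTable:
    """Hypothetical per-id lock registry that holds only weak references."""

    def __init__(self):
        self._locks = weakref.WeakValueDictionary()

    def get(self, pid):
        lock = self._locks.get(pid)
        if lock is None:
            lock = asyncio.Lock()
            self._locks[pid] = lock
        return lock   # the caller's reference keeps the entry alive


async def refresh(table, pid):
    lock = table.get(pid)   # strong reference for the duration of the refresh
    async with lock:
        await asyncio.sleep(0)


async def demo(n_parents=20):
    table = LockTable()
    for i in range(n_parents):
        await refresh(table, f"p{i}")
    gc.collect()
    return len(table._locks)
```

While any caller still holds the lock object, a second table.get() for the same id returns that same lock; once the last holder is gone the entry disappears.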
Regression test test_refresh_locks_dict_does_not_grow_unboundedly
sequentially refreshes 20 distinct parents and asserts the dict shrinks
back to (near-)empty after gc.collect().
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(integration): real-Redis smoke for JIT-refresh engine (Task 19C)
Adds tests/integration/test_engine_real_redis.py: end-to-end smoke
against an actual Redis at localhost:6379/0 (or REAL_REDIS_URL).
Auto-skips when no Redis is reachable, so the test is safe to commit:
machines without a local server skip it instead of failing.
What it verifies:
- The full dispatcher/ingestor/refresher/mutant-task pipeline survives
real network round-trips (not just fakeredis fast-paths).
- Bounded overshoot holds with cap=6, max_in_flight=2.
- No semaphore slot leak at run end.
- Snapshot is persisted to Redis at the same version the in-memory
mirror reports — i.e. the snapshot-lock fix actually serialises real
Redis writes, not just fakeredis ones.
Uses a unique key prefix per run and SCAN+DELETE cleanup in fixture
finally, so the test never clobbers another caller's data.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): wall-clock bounded final sweep, patient on stragglers
The previous max_in_flight+1-pass bound terminated the final ingestion
sweep before the DAG could flip QUEUED→RUNNING→DONE for the last few
in-flight mutants on normal completion, leaking their semaphore slots
(stress suite caught a 1-slot leak on high-mif runs).
Switch the sweep to a wall-clock deadline (5s) with loop_interval sleep
between empty passes, while preserving the asyncio.shield + early-break
on CancelledError that made the cancellation-safety fix work. The sleep
itself is wrapped to bail on cancellation immediately.
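The deadline-bounded loop can be sketched on its own, leaving out the shield and cancellation handling. Assumptions: poll_once is a hypothetical callable returning how many programs it ingested, in_flight is a plain set of ids, and the event-loop clock stands in for the wall clock.

```python
import asyncio


async def final_sweep(poll_once, in_flight, deadline_s=5.0, interval_s=0.01):
    """Poll until in_flight drains or the deadline elapses."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + deadline_s
    while in_flight and loop.time() < deadline:
        ingested = await poll_once()
        if ingested == 0:
            await asyncio.sleep(interval_s)   # be patient with stragglers
    return sorted(in_flight)                  # ids still stuck, if any


async def demo(done_after_s, deadline_s):
    in_flight = {"m1", "m2"}
    done = set()

    async def dag():
        await asyncio.sleep(done_after_s)
        done.update(in_flight)                # both mutants finish together

    async def poll_once():
        ready = in_flight & done
        in_flight.difference_update(ready)
        return len(ready)

    dag_task = asyncio.create_task(dag())
    stuck = await final_sweep(poll_once, in_flight, deadline_s)
    dag_task.cancel()
    await asyncio.gather(dag_task, return_exceptions=True)
    return stuck
```

A pass-count bound gives up after a fixed number of empty polls regardless of how long the DAG needs; the deadline form keeps polling for stragglers until time actually runs out.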
All 36 stress combos + 82 paranoia tests now green; 555-test evolution
sweep clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): rename final-sweep loop var to satisfy mypy
The cleanup loop reused `t` from the earlier `for t in pending` block
(typed `Task[Any]`), but the cleanup iterates a tuple of
`Task[Any] | None`. Renaming the variable removes the assignment-type
conflict without changing behavior.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(engine): apply PR #227 review fixes — naming + deprecated test cleanup
Address two review recommendations on the JIT-refresh refactor:
1. Naming consistency — make "generation" → "mutant" rename complete:
- core.py: rename _reached_generation_cap → _reached_mutant_cap
- core.py: 8 log prefixes "[EvolutionEngine] gen={}" → "mutants={}"
- dispatcher.py: 2 call sites updated to _reached_mutant_cap
- test_steady_state.py: section-comment reference updated
2. Remove deprecated tests left as @pytest.mark.skip after the refactor.
These covered the old epoch/step()/run-loop machinery that no longer
exists in the JIT-refresh engine. Removed in bulk via AST script
matching skip reasons like "JIT-refresh", "step() removed",
"Generational ...", "GenerationBoundary emission",
"_refresh_archive_programs", "_create_mutants".
Whole files deleted (only contained deprecated tests):
- tests/evolution/test_steady_state_determinism.py
- tests/evolution/test_generation_boundary_emit.py
Surgical class/function removals (kept the rest of each file):
- tests/evolution/test_evolution_engine.py
- tests/evolution/test_evolution_engine_complex.py
- tests/evolution/test_resume.py
- tests/evolution/bus/test_engine.py
- tests/integration/test_acceptor_engine.py
- tests/integration/test_advanced_scenarios.py
- tests/integration/test_complex_scenarios.py
- tests/integration/test_evolution_engine_edge_cases.py
Net: 13 files changed, 16 insertions(+), 2207 deletions(-).
Verified: ruff check + format clean; targeted pytest sweep
(tests/evolution/ + 4 integration files) green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(deps): unpin gigaevo-memory from private git URL — it's now public
The gigaevo-memory repo went public, so we can drop the
`@ git+https://...@<commit-sha>#subdirectory=client/python` form and
rely on the plain `gigaevo-memory` spec. This also unblocks CI's pip
install step, which was failing on the private-repo username prompt:
fatal: could not read Username for 'https://github.com':
No such device or address
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(heilbron_adversarial): replace absolute-path symlinks with relative
The 9 symlinks under problems/heilbron_adversarial/{pop_a_gan,pop_a_soft,
pop_b_soft}/{fallback,helper.py,initial_programs} were committed on
2026-05-02 with absolute targets baked in:
/mnt/virtual_ai0001071-04017_SR004-nfs1/CFS-SR008/workspace/mathemage/
gigaevo-core-internal/problems/heilbron_adversarial/pop_a/...
That path only exists on this NFS dev mount, so every CI runner saw
dangling links and ruff bailed out with:
E902 Failed to create cache key
Cause: No such file or directory (os error 2)
--> problems/heilbron_adversarial/pop_a_gan/helper.py
Replaced all 9 with relative siblings (e.g. ../pop_a/helper.py).
The `_soft` and `_gan` problem variants reuse pop_a's / pop_b's
helper.py + fallback/ + initial_programs/, same intent as before, now
portable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): rewire post_step_hook + adjacent observability polish
PR #227 deleted EvolutionEngine.step(), which historically fired
_post_step_hook once per generation. The kwarg + assignment in
EvolutionEngine.__init__ became dead code: CompositionInjectionHook —
the only production consumer, wired by 3 adversarial experiment launches
— silently no-opped on every Arm A run.
Changes
-------
1. Re-wire _post_step_hook in poll_and_ingest: fires once per ingest
sweep that adds >=1 program to the archive (the JIT analogue of the
old per-generation boundary). Fault-isolated — a buggy hook can't
abort ingestion, which has already committed to Redis.
2. H3 fix: ParentRefresher.timeout_seconds default None -> 600s.
None default could strand a mutant forever on DAG-runner crash,
leaking its in-flight semaphore slot.
3. Final-sweep observability: extract _final_ingestion_sweep() and
emit WARNING with stuck-IDs when the 5s wall-clock deadline elapses
before _in_flight drains. Operators previously had no signal that
a run shut down with leaked slots.
4. Drop stale "JIT-refresh" / "epoch" docstring framing from
config.py, core.py, mutant_task.py, steady_state.py.
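The firing rule in item 1 can be sketched as a small helper. This is an illustration under assumptions: the hook is modelled as a zero-argument async callable and fire_post_step_hook is a hypothetical name; the real hook signature is not shown in this log.

```python
import asyncio
import logging

log = logging.getLogger("post_step_hook_sketch")


async def fire_post_step_hook(hook, added):
    """Call the hook once per sweep that added programs; never let it raise."""
    if hook is None or added < 1:
        return False
    try:
        await hook()
    except Exception:
        # Ingestion has already committed, so a hook failure is only logged.
        log.exception("post_step_hook failed; ingestion continues")
    return True
```

Catching Exception rather than BaseException keeps cancellation visible to the caller while still isolating ordinary hook bugs.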
Tests
-----
- 13 new SOTA tests in tests/evolution/test_post_step_hook_rewire.py
cover hook firing semantics (added==0 / added>0 / mixed / failure /
unset), finite-timeout default + override, and WARNING emission via
loguru sink capture.
- Existing test_refresh_no_timeout_default_waits_through_brief_pause
renamed + assertion updated for the new finite default.
Verification
------------
Full audit of evolution engine consumers ran clean:
- tests/evolution/ (1000+ tests, all pass)
- tests/integration/test_acceptor_engine,advanced_scenarios,
complex_scenarios,evolution_engine_edge_cases (42 tests)
- tests/adversarial_pipeline/ (composition_injection, progress_sync,
steady_state_adversarial_e2e)
- tests/memory/ (ideas_tracker_pipeline, engine_integration,
dag_memory_flow, memory_e2e_pipeline)
- tests/concurrency/test_deadlock_prevention
- tests/integration/test_brittleness, mini_run, multigen_e2e,
engine_regression, ingest_regression
- tests/prompts/test_coevolution_sync
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): two deadlock-class chaos-hacker findings + regressions
Closes the top two CRITICAL findings from the adversarial review on
commit 130fdbb2 (chaos-hacker agent a79a294de8502a7d8).
1) ParentRefresher: dedup parents by id before sorting + acquiring.
asyncio.Lock is NOT reentrant. If a parent bundle ever contains the
same program id twice (any ParentSelector returning duplicates, or a
future ParentRefreshSelector that walks lineage hitting the same id
via two paths), _acquire_all would call acquire() twice on the same
Lock from the same task and the mutant task hangs forever, holding
its in-flight slot. Eventually the engine starves.
Fix: fold duplicates by id inside refresh() before sort + lock
acquisition. First-seen wins. Test:
test_refresh_does_not_deadlock_on_duplicate_parent_ids and
test_refresh_selector_emitting_duplicates_does_not_deadlock both
would hang without this fix; with it, they complete and the parent
flips exactly once.
2) _final_ingestion_sweep: track inner task explicitly so cancellation
does not leak a detached poll_and_ingest.
asyncio.shield(coro) only protects the inner coroutine from being
cancelled — it does NOT prevent CancelledError from propagating to
the awaiter. The previous code did `await asyncio.shield(poll_and_
ingest(self))` and on cancellation broke out of the loop. The inner
then continued as a detached Task, racing _post_run_hook.on_run_
complete and engine teardown for access to storage, _in_flight, and
the post_step_hook.
Fix: wrap poll_and_ingest in an explicit asyncio.create_task; on
outer cancellation, cancel the inner and wait_for(timeout=1.0) so
no zombie coroutine outlives the method. New test
test_cancellation_does_not_leak_inner_task asserts the inner's
finally fires before we move on.
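The difference between shielding a bare coroutine and tracking the inner task can be seen in a short experiment. A sketch under assumptions: poll_and_ingest here is a sleep-based stand-in, and gather(..., return_exceptions=True) is used to settle the inner task where the fix above uses wait_for(timeout=1.0).

```python
import asyncio


async def poll_and_ingest(log):
    try:
        await asyncio.sleep(0.05)      # stand-in for a storage round-trip
        log.append("inner finished")
    finally:
        log.append("inner finally")


async def sweep_shielded(log):
    try:
        await asyncio.shield(poll_and_ingest(log))
    except asyncio.CancelledError:
        log.append("sweep exited")     # the inner task is still running


async def sweep_tracked(log):
    inner = asyncio.create_task(poll_and_ingest(log))
    try:
        await asyncio.shield(inner)
    except asyncio.CancelledError:
        inner.cancel()
        await asyncio.gather(inner, return_exceptions=True)   # settle it
        log.append("sweep exited")
        raise                          # let the cancel reach the awaiter


async def demo(sweep):
    log = []
    task = asyncio.create_task(sweep(log))
    await asyncio.sleep(0.01)
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    at_exit = list(log)
    await asyncio.sleep(0.1)           # let any detached inner task finish
    return at_exit, log
```

In the shielded variant the log at exit holds only "sweep exited" and the inner entries arrive later; in the tracked variant "inner finally" is logged before the sweep exits and nothing is appended afterwards.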
Chaos-hacker finding #1 (WeakValueDictionary GC race) was investigated
and dismissed: any task awaiting `lk.acquire()` keeps `lk` strongly
referenced on its suspended-coroutine frame, so the WeakValueDictionary
entry cannot be reclaimed while a waiter exists. The race the report
described requires a waiter without a strong ref, which is unreachable.
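The non-reentrancy behind fix 1) is easy to reproduce. A minimal sketch, assuming parents are plain dicts with an "id" key; dedup_by_id and can_lock_all are hypothetical helpers, and a timeout stands in for the hang.

```python
import asyncio


def dedup_by_id(parents):
    """Fold duplicate ids (first-seen wins) and return them in sorted order."""
    unique = {}
    for parent in parents:
        unique.setdefault(parent["id"], parent)
    return [unique[pid] for pid in sorted(unique)]


async def can_lock_all(parents, timeout_s=0.2):
    """Return True if one task can acquire a lock per listed parent."""
    locks = {parent["id"]: asyncio.Lock() for parent in parents}

    async def acquire_all():
        for parent in parents:
            await locks[parent["id"]].acquire()

    try:
        await asyncio.wait_for(acquire_all(), timeout_s)
        return True
    except asyncio.TimeoutError:
        return False


bundle = [{"id": "p1", "v": 1}, {"id": "p0", "v": 2}, {"id": "p1", "v": 3}]
```

can_lock_all(bundle) times out on the second acquire of "p1", while can_lock_all(dedup_by_id(bundle)) completes; the surviving "p1" entry is the first one seen.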
Verified all engine consumers green: evolution (1001 tests),
integration (83), adversarial+concurrency+memory+prompts (1424).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(engine): drop dead code + fix cancel propagation in final sweep
Cycles 5-6 of the auto-optimize sprint:
Cycle 5 — systems-architect proposal #5 (dead-code deletion):
- Delete EvolutionEngine.pause(), resume(), is_running() — zero callers
anywhere in gigaevo/, tests/, tools/, experiments/. Verified with
`git grep` (tests/evolution/test_strategy_base.py hits are on
strategy.pause/resume, not engine.pause/resume).
- Delete _set_state() — one-line shim with zero internal callers.
- Delete _paused field — written but never read.
- Delete _run_start_mutants field + its dead write in
steady_state.py:63 — never consumed anywhere.
Cycle 6 — chaos-hacker Findings 1 (HIGH) + 2 (MED) on 75203666:
Finding 1 (HIGH): _final_ingestion_sweep used
`contextlib.suppress(BaseException)` around `wait_for(inner, 1.0)`.
That suppress catches asyncio.CancelledError, KeyboardInterrupt, and
SystemExit — meaning a second cancellation (or SIGINT) during the
inner-task cleanup was silently absorbed and the sweep returned
"normally", letting `_post_run_hook.on_run_complete` run in a teardown
context the supervisor never authorised.
* Narrow to `suppress(Exception)` so only true exceptions (Redis
transient, network blip) are tolerated during cleanup.
* Track the cancel locally and re-raise CancelledError after the
inner is settled and the (skipped) WARNING block — so the cancel
reaches `run()`'s awaiter.
* In `run()`'s finally, catch the re-raised CancelledError around
the sweep call so the finalizer (`post_run_hook.on_run_complete`)
still executes — cancellation is a shutdown signal, not a "skip
cleanup" one — then re-raise.
* Skip the "deadline elapsed" WARNING when sweep exits via cancel
(the message is for diagnostics of leaked semaphore slots, not
for shutdown-was-aborted).
Finding 2 (MED): docstring claimed `wait_for(timeout=1.0)` was a
"tight" cap. In CPython 3.12 `wait_for` cancels the inner and then
waits for it to honor the cancel — wall-clock cost is bounded by
inner cleanup latency, not the parameter. Updated docstring to say
"best-effort timeout" and clarified that only `Exception` is
suppressed (BaseException family — CancelledError, KeyboardInterrupt,
SystemExit — propagates intact).
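The effect of narrowing the suppress can be checked directly. A minimal sketch: settle is a hypothetical stand-in for the cleanup wait, and the long sleep plays the role of waiting on the inner task.

```python
import asyncio
import contextlib


async def settle(suppressed):
    with contextlib.suppress(suppressed):
        await asyncio.sleep(3600)   # stand-in for waiting on the inner task
    return "returned normally"


async def cancel_reaches_awaiter(suppressed):
    """Cancel the task mid-wait and report whether it ended up cancelled."""
    task = asyncio.create_task(settle(suppressed))
    await asyncio.sleep(0.01)
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return task.cancelled()
```

With suppress(BaseException) the cancel is absorbed and the task completes normally; with suppress(Exception) the CancelledError passes through and the task reports cancelled.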
New regression tests in tests/evolution/test_post_step_hook_rewire.py
(TestFinalSweepCancellationSafety):
* test_cancellation_propagates_to_awaiter — pins Finding 1: cancel
must reach the engine awaiter; sweep_task.cancelled() must be true.
* test_normal_completion_returns_without_cancellederror — pins the
happy/timeout path so a future refactor of cancel plumbing doesn't
accidentally raise on deadline-elapsed.
Verified clean:
* tests/evolution/ + tests/integration/test_acceptor_engine.py +
test_advanced_scenarios.py + test_complex_scenarios.py +
test_evolution_engine_edge_cases.py → 1115 passed
* tests/adversarial_pipeline/ + tests/concurrency/ + tests/memory/ +
tests/prompts/ → 1581 passed, 5 skipped
* ruff check + format clean on the full repo
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(engine): drop dead mutation_ids branch + dead fields, lock schema with extra=forbid
Cycle 7 of auto-optimize-loop on PR #227. Synthesizes systems-architect's
stale-refs audit (12 ranked proposals, bundle 1-4 + 9) with the chaos-hacker
LOW findings from cycle 6.
Production cleanups
- `_ingest_completed_programs(mutation_ids=...)` parameter dropped — the only
production caller passed `mutation_ids=None`. The fast-discard branch had
no live caller. Function now does one job: deserialize non-archive DONE
programs, push through acceptor + strategy.
- `EngineConfig.generation_timeout` deleted. Documented "deprecated, no
longer used" since 31b66de7 (2026-04-19); zero production reads.
- `EngineMetrics.errors_encountered` deleted. Zero production readers/writers;
only test_engine_metrics.py mutated it. EngineSnapshot doesn't embed
EngineMetrics, so no Redis-snapshot break.
Defense-in-depth
- `EngineConfig` now uses `extra="forbid"`. Future field deletions will
crash callers passing the dead kwarg instead of silently dropping into
Pydantic's default `extra="ignore"`. Verified safe for live Hydra configs
(config/evolution/*.yaml only set declared fields).
- Swept 14 test sites still passing `generation_timeout=X` — chaos-hacker
flagged these as silent semantic drift if `extra="forbid"` is added without
the sweep.
Chaos-hacker LOW fixes (review of d5facada)
- `raise asyncio.CancelledError from None` on both sites in steady_state.py.
A Redis blip suppressed by the surrounding `contextlib.suppress(Exception)`
no longer dangles in `__context__` and misleads the operator.
- Tightened `test_cancellation_propagates_to_awaiter` assertion: drops the
`cancelled() or (done() and exception() is CancelledError)` OR-branch.
Probed: on Py3.12, `raise asyncio.CancelledError` inside a coroutine ALWAYS
produces `task.cancelled() == True`, and calling `.exception()` on a
cancelled task re-raises CancelledError (so the OR-branch was unreachable).
Tightening is strictly safer; future regressions that break the
`.cancelled()` contract now surface immediately.
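The `.cancelled()` behaviour described above can be reproduced with a few lines of plain asyncio. This is a sketch with hypothetical names (`reraise_after_cleanup`, `demo`), not the engine code. `from None` sets `__suppress_context__`, so the default traceback no longer prints the earlier error as the cause of the cancel.

```python
import asyncio


async def reraise_after_cleanup() -> None:
    try:
        raise ConnectionError("redis blip during cleanup")
    except ConnectionError:
        # `from None` suppresses the "During handling of the above
        # exception ..." chain in the default traceback.
        raise asyncio.CancelledError from None


async def demo():
    task = asyncio.ensure_future(reraise_after_cleanup())
    await asyncio.wait({task})
    try:
        task.exception()
        exception_call_raised = False
    except asyncio.CancelledError:
        # .exception() on a cancelled task raises instead of returning.
        exception_call_raised = True
    return task.cancelled(), exception_call_raised
```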
Test cleanup
- Deleted `tests/evolution/test_ingest_mutation_ids.py` (299 LOC) — every
test pinned the dead `mutation_ids` branch.
- Removed stale "generation_timeout deprecated" zombie banner + module
docstring entry in test_evolution_engine_complex.py.
- Stripped `errors_encountered` assertions from test_engine_metrics.py.
Verification
- ruff: clean on touched dirs.
- Tests green:
* tests/evolution/ + selected integration (~700 tests, all dots)
* tests/concurrency/test_deadlock_prevention.py (all dots, 3 skipped)
* tests/integration/ + tests/benchmarks/ + tests/stages/ (all dots)
* tests/concurrency/ + tests/memory/ + tests/adversarial_pipeline/ +
tests/dag/ (all dots)
- chaos-hacker adversarial review of this diff: 1 HIGH (the
generation_timeout test-rot, fixed by the sweep above), 0 medium/low
remaining. Verdict: ship.
Adjacent finding (deferred)
- pre-existing observability gap: a second cancel landing during
`on_run_complete` skips the "[SteadyState] Stopped" log line.
Net behavior (cancellation reaches the awaiter) is correct; only the
log marker is missing. Out of scope for cycle 7.
LOC: -394 +32 (net -362). Full bytes-on-disk delta dominated by the
test_ingest_mutation_ids.py deletion (299 LOC).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(engine): drop dead error counters + step() vestige, inline helpers
Cycle 8 quality pass on PR #227 — systems-architect proposals #1, #3, #6, #8 +
partial #2.
- Delete `elites_selection_errors` and `mutations_creation_errors` fields
(always passed 0 in production — verified every call site)
- Delete `record_elite_selection_metrics`, `record_mutation_metrics`,
`record_reprocess_metrics` (single-line accumulators with one caller each
after dropping the errors arg)
- Inline `_pick_parents` helper (4-line single-caller wrapper)
- Delete `SteadyStateEvolutionEngine.step()` NotImplementedError vestige and
its test (no production caller; `run()` already raises in the abstract base)
- Fix dated docstring `elites_selected` "across all generations" → "Total
elites cumulatively selected for mutation" (JIT-refresh has no generations)
- Update `tools/benchmarks/bench_multirun.py` call site for consistency
Net: 32 insertions, 107 deletions (-75 LOC). All `tests/evolution/`,
`tests/integration/`, `tests/concurrency/`, `tests/benchmarks/`,
`tests/stages/`, `tests/memory/`, `tests/adversarial_pipeline/`,
`tests/dag/` pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): eliminate ghost-persist by inlining single-mutant primitive
`generate_mutations(...)` wrapped `asyncio.gather(*tasks,
return_exceptions=True)`. If the outer awaiter (typically `run_one_mutant`,
spawned by the dispatcher and cancellable at engine teardown) was
cancelled after a child's `storage.add(program)` succeeded but before
`gather` returned, gather re-raised CancelledError to the caller — the
child's `except BaseException` handler still returned `persisted_id`,
but `results` was never bound. The program stayed in Redis with no
`_in_flight` tracking → ghost.
Refactor: extract `generate_one_mutation()` — a single-mutant primitive
with no gather. `mutant_task.run_one_mutant` calls it directly. The
function's `except BaseException` arm returns `persisted_id` to the
caller without any gather to swallow it. The caller registers the id in
`_in_flight` before the cancellation can re-propagate.
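The failure mode and the fix can be sketched side by side. Everything here is illustrative: `generate_one`, `via_gather`, `direct` and the `(persisted_id, was_cancelled)` return shape are assumptions made for this example, not the signatures in mutation.py.

```python
import asyncio


async def generate_one(store: set):
    """Hypothetical single-mutant primitive: returns (persisted_id, was_cancelled)."""
    persisted_id = None
    try:
        store.add("mutant-1")      # stand-in for storage.add(program)
        persisted_id = "mutant-1"
        await asyncio.sleep(60)    # stand-in for post-persist work
        return persisted_id, False
    except BaseException:
        return persisted_id, True  # surface the id even on cancel


async def via_gather(store: set, in_flight: set) -> None:
    # Historical shape: if this awaiter is cancelled, gather raises
    # CancelledError here and `results` is never bound.
    results = await asyncio.gather(generate_one(store), return_exceptions=True)
    in_flight.update(r[0] for r in results if isinstance(r, tuple) and r[0])


async def direct(store: set, in_flight: set) -> None:
    persisted_id, was_cancelled = await generate_one(store)
    if persisted_id is not None:
        in_flight.add(persisted_id)  # register before re-propagating
    if was_cancelled:
        raise asyncio.CancelledError


async def cancel_midway(runner) -> set:
    store, in_flight = set(), set()
    task = asyncio.create_task(runner(store, in_flight))
    await asyncio.sleep(0.01)        # let the child persist
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return store - in_flight         # ids persisted but untracked (ghosts)
```

With `via_gather` the persisted id is left untracked. With `direct` the ghost set is empty.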
`generate_mutations(...)` is retained as a sequential batch wrapper for
the existing test suite (it loops over `generate_one_mutation` and
breaks on CancelledError, returning accumulated ids). Production
callers only ever passed `limit=1`, so there is no perf impact.
Adds `tests/evolution/test_engine_ghost_persist.py` with 7 deterministic
test cases covering: cancel-pre-persist (propagates cleanly), cancel-
post-persist (id surfaced), cancel-mid-lineage (id surfaced), an
integration test through `run_one_mutant` proving the id lands in
`_in_flight`, a gather-cancel regression-guard demonstrating the
historical failure mode, and backwards-compat checks for the batch
wrapper.
Files: 2 src changes (mutation.py refactor, mutant_task.py call-site),
1 new test file. 999/1000 evolution tests pass; 1 deselected test is a
pre-existing failure unrelated to this change (patches a non-existent
`steady_state.generate_mutations` symbol).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(engine): drop dead category banners in test_evolution_engine_complex
The file had empty Category A/F/H/J banner comments left over after
those categories' tests were removed. They created a false signal of
"these areas are covered" without any actual test bodies. Drop them
and the corresponding lines in the module docstring.
No production code touched; all 11 tests in this file still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): annotate inlined parents var to satisfy mypy
The cycle-8 inlining of _pick_parents lost the helper's return type
annotation. Now that the assignment uses `next(..., [])` as the default,
mypy cannot infer the element type. Add an explicit `list[Program]` hint.
No behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(engine): add SOTA invariant test suite for steady-state concurrency
Plugs 8 coverage gaps identified by test-obsessed-reviewer's audit. The
new file `tests/evolution/test_engine_invariants.py` (440 LOC, 15 tests)
guards the engine's 8 concurrency invariants (I1-I8):
Gap 1 (I1) — cancel-between-acquire-and-slot-transfer releases slot
* test_cancel_before_elite_select_releases_slot
* test_cancel_during_parent_refresh_releases_slot
Gap 2 (I6) — dispatcher cancel drains all active mutant tasks
* test_active_tasks_are_cancelled_on_dispatcher_cancel
Gap 3 (I7) — ingestor uses fast interval (0.25*loop_interval) when saturated
* test_fast_interval_when_saturated
* test_slow_interval_when_idle (negative control)
Gap 4 (I6) — post_run_hook fires even on cancellation
* test_hook_fires_when_run_cancelled
Gap 5 (I4) — _in_flight_lock does not starve under contention
* test_many_waiters_all_progress (50 concurrent waiters, all land)
Gap 6 (I8) — _await_idle treats DISCARDED as idle (not active)
* test_discarded_only_returns_idle
* test_await_idle_returns_promptly_with_only_discarded
Gap 7 (I5) — snapshot version monotonic in Redis under concurrent writes
* test_concurrent_writes_versions_monotone (20 concurrent writes)
* test_in_memory_mirror_tracks_redis
Gap 8 (I1+I2) — double-poll same id releases slot exactly once
* test_id_not_double_released
* test_leaked_id_swept_once
Bonus (I3 deterministic) — slot_transferred flag is exclusive
* test_success_path_transfers_slot
* test_no_elite_releases_slot
All tests are deterministic — asyncio.Event for sync, no time.sleep
polling, no flaky timing assumptions. The full suite runs in <1s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): await metrics_collector cancel before storage.close
Without `await` after `_metrics_collector_task.cancel()`, the collector
may still be mid `await storage.<call>` when `storage.close()` fires
below — raising ConnectionClosedError into an orphan coroutine that has
no caller. Bound the wait so a wedged collector cannot indefinitely
block shutdown.
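A minimal sketch of the ordering, assuming a hypothetical `stop()` helper and a `close_storage` callable. It uses `asyncio.wait` for the bounded wait for brevity, whereas the commit refers to a 2s `wait_for` budget.

```python
import asyncio


async def stop(collector: asyncio.Task, close_storage, budget_s: float = 2.0) -> bool:
    """Cancel the collector, wait (bounded) for it to unwind, then close storage.

    Returns True when the collector finished inside the budget.
    """
    collector.cancel()
    # asyncio.wait does not raise on timeout; unfinished tasks land in `pending`.
    _done, pending = await asyncio.wait({collector}, timeout=budget_s)
    await close_storage()
    return not pending


async def demo() -> list:
    events = []

    async def collector_loop():
        try:
            await asyncio.sleep(60)  # stand-in for `await storage.<call>`
        finally:
            events.append("collector finished")

    async def close_storage():
        events.append("storage closed")

    task = asyncio.create_task(collector_loop())
    await asyncio.sleep(0)           # let the collector start
    await stop(task, close_storage)
    return events
```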
Add two regression tests:
- test_collector_finished_before_storage_close: asserts the collector's
finally runs strictly before storage.close().
- test_wedged_collector_does_not_block_stop_forever: asserts stop()
returns within the 2s wait_for budget even when the collector
shields against cancel.
Cycle 11: chaos-hacker F4 finding from cycle 10 deadlock probe.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(engine): drop redundant CancelledError arm + tidy Any import
Two micro-simplifications surfaced during cycle-12 quality review:
1. `dispatcher.py`: the explicit `except CancelledError: raise` arm was a
no-op — `finally` runs regardless, and `CancelledError` propagates
naturally without an explicit re-raise. Removing the dead arm keeps
the loop's control flow obvious: try → finally.
2. `core.py`: TYPE_CHECKING-guarded `from typing import Any` was overhead
for a singleton typing import (zero cost). Promoted to top-level.
Regression test added (`TestDispatcherFinallyCancelsSpawnedMutants`):
monkey-patches `run_one_mutant` to a long-runner, cancels the dispatcher
mid-flight, asserts the spawned mutant received `CancelledError` via the
dispatcher's `finally` block. Pins the cancellation contract so a future
refactor cannot accidentally swallow the cancel.
Total invariant tests now 18 (was 17 in cycle 11).
ruff clean + full evolution+integration suite green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): close two orphan paths in _final_ingestion_sweep
Cycle-13 chaos-hacker probe found a HIGH bug replaying the cycle-11 F4
shape through a different channel: the sweep's bounded-wait
(`suppress(Exception)` + `wait_for(timeout=1.0)`) had TWO escape paths
that left the `poll_and_ingest` inner task detached past the sweep's
return:
(1) Slow-cancel target — if inner takes >1s to honor cancel, wait_for
raises TimeoutError (Exception subclass), suppressed silently;
inner runs detached and races storage.close() in stop().
(2) Double-cancel — if a second cancel arrives during wait_for(inner),
wait_for re-raises CancelledError (BaseException, NOT Exception);
the suppress doesn't catch it, control exits the except arm with
`cancelled=True; break` skipped; inner is detached.
Both replay the cycle-11 metrics_collector orphan: ConnectionClosedError
fires into a coroutine that has no caller to surface it.
Fix (steady_state.py:174-205): explicit `suppress(CancelledError)`
catches the double-cancel and routes through the cancelled-flag path;
TimeoutError is logged as a WARNING so an operator can correlate the
orphan risk with whatever stranded the inner task in Redis. Generic
Exception still logs but does not let inner escape.
Regression coverage (+2 tests, total now 20):
- test_slow_cancel_inner_logs_timeout_but_no_orphan_on_normal_path
monkey-patches poll_and_ingest to a slow-cancel target (re-shields
the first cancel for 2s); asserts the WARN about "did not honor
cancel" / "orphan" is logged.
- test_double_cancel_routes_through_cancelled_flag — cancels the sweep
twice in succession; asserts the inner task still received its
CancelledError (no orphan).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): persist-then-mirror snapshot write — no version skip on retry
Cycle 14 of the PR #227 quality sprint.
`EvolutionEngine._write_snapshot` previously incremented the in-memory
mirror (`self._snapshot` + `set_current_snapshot`) BEFORE the Redis
`save_run_state` call. On a transient Redis failure this left the mirror
reflecting an unpersisted version: the next successful save then wrote
version N+2, silently skipping N+1 in Redis. Resumers reading from Redis
would see a gap that doesn't exist in any operator-visible log.
Persist-then-mirror reorders the two operations so the in-memory mirror
only advances after Redis confirms. If `save_run_state` raises, the mirror
keeps the prior version, the next call retries the SAME version number,
and Redis stays gap-free. Mirror is now always `≤` Redis — acceptable
because Redis is the source of truth on resume.
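The reordering can be illustrated with a small stand-alone class. `SnapshotWriter` and its `save` callable are hypothetical names for this sketch. The real method is `EvolutionEngine._write_snapshot` and its internals are not reproduced here.

```python
class SnapshotWriter:
    def __init__(self, save):
        self._save = save   # callable(version) -> None; may raise
        self.version = 0    # in-memory mirror

    def write(self) -> bool:
        candidate = self.version + 1
        try:
            self._save(candidate)   # persist first
        except Exception:
            return False            # mirror untouched; a retry reuses `candidate`
        self.version = candidate    # advance the mirror only after the store confirms
        return True


def demo():
    attempted = []

    def flaky_save(version: int) -> None:
        attempted.append(version)
        if len(attempted) == 1:
            raise RuntimeError("transient failure")

    writer = SnapshotWriter(flaky_save)
    first = writer.write()
    second = writer.write()
    return attempted, first, second, writer.version
```

The retry attempts the same version number, matching the `[1, 1]` expectation named in the tests below.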
Tests (tests/evolution/test_engine_invariants.py::TestWriteSnapshotPersistThenMirror):
- test_save_failure_leaves_mirror_at_old_version: asserts mirror stays
at version 0 when save_run_state raises RuntimeError
- test_successful_save_updates_mirror_and_redis_in_one_step: happy path
- test_retry_after_failure_uses_same_version: asserts saved_versions ==
[1, 1] (mirror-then-save form would have produced [1, 2])
Regression: 1060/1060 tests pass across tests/evolution/ and the four
integration suites (acceptor_engine, advanced_scenarios, complex_scenarios,
evolution_engine_edge_cases).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): bound post_step_hook to 300s — prevent ingestor wedge
Long-lived post_step_hook (CompositionInjectionHook walks the full G
archive) was previously awaited without a wall-clock bound: a hung
hook (network call without timeout, infinite loop) would freeze the
ingestor — no further sweeps fire, no new mutants land in the archive.
Fix: wrap the hook call in `_run_bounded_post_step_hook`, which drives
the hook via an explicit asyncio.Task and bounds it with
`asyncio.wait(timeout=_POST_STEP_HOOK_TIMEOUT_S)`. On timeout we
cancel + grace-wait + log; on outer cancel we cancel + await briefly
+ re-raise.
Key load-bearing detail: ``asyncio.wait`` (NOT ``asyncio.wait_for``).
``wait_for`` cancels the inner task then awaits the cancel to be
honored before raising TimeoutError, so a hook that catches
CancelledError and keeps looping extends our wait indefinitely —
defeating the bound. Plain ``wait`` returns at the deadline regardless
of the inner task's state; we surface the orphan via the pending set
and log "potential orphan coroutine; ingestor proceeding".
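The difference between the two primitives can be sketched as follows. `run_bounded` is a hypothetical stand-in for `_run_bounded_post_step_hook`. The return labels and the demo hooks are assumptions made for this example, and logging is omitted.

```python
import asyncio


async def run_bounded(hook, timeout_s: float, grace_s: float) -> str:
    task = asyncio.ensure_future(hook())
    try:
        _done, pending = await asyncio.wait({task}, timeout=timeout_s)
        if pending:
            task.cancel()  # deadline hit: request cancellation
            _done, pending = await asyncio.wait({task}, timeout=grace_s)
            # A hook that ignores the cancel is reported, not awaited forever.
            return "orphaned" if pending else "stopped"
    except asyncio.CancelledError:
        # Outer cancel: pass it on to the hook, wait briefly, re-raise.
        task.cancel()
        await asyncio.wait({task}, timeout=grace_s)
        raise
    task.result()  # re-raise a hook failure, if any
    return "completed"


async def fast():
    return None


async def hung():
    await asyncio.sleep(60)


async def stubborn():
    try:
        await asyncio.sleep(60)
    except asyncio.CancelledError:
        await asyncio.sleep(0.3)  # swallows the cancel and keeps going
```

Because `asyncio.wait` returns at the deadline whatever the task is doing, the stubborn hook costs at most `timeout_s + grace_s`.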
Test suite adds TestPostStepHookTimeoutBound (5 tests):
- fast_hook_completes_normally — happy path, default budget
- hung_hook_cancelled_after_budget — sleeps 60s, monkeypatched to
0.1s budget, asserts WARN + hook_was_cancelled event set
- uncooperative_hook_logs_orphan_warn — bounded-badness stubborn
hook (swallows first cancel, honors second so test loop reaps it);
asserts elapsed < 1.0s and both WARN lines fire
- outer_cancel_propagates_to_hook — cancels poll_and_ingest mid-
hook, asserts hook cancelled and sweep re-raises
- default_timeout_is_generous — sanity: 60s ≤ T ≤ 3600s,
0.5s ≤ grace ≤ 30s
Regression: 1060+ evolution+integration tests green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(engine): post_step_hook timeout knobs; iteration-window stats; deadlock stress
* `EngineConfig` gains `post_step_hook_timeout_s` (default 300s) and
`post_step_hook_cancel_grace_s` (default 2s) so the wall-clock bound on
a single post-step hook invocation is tunable per run; ingestor no
longer carries module-private magic constants.
* `EvolutionaryStatisticsCollector` gains `iteration_window_size`
(default 8). The iteration cohort aggregates now use a trailing window
`[iter - N, iter]`, restoring the "stats over the last batch" signal
that the old generational engine produced from per-generation cohorts.
`N = 0` disables the feature and keeps the iteration fields None.
* New deadlock-stress suite in `tests/evolution/test_refresh_parents.py`
exercises 32-way same-parent storms, randomized-order overlapping
batches, and cancel-mid-acquire on the per-id parent lock.
* `tests/monitoring/test_experiment_monitor.py` helper now seeds both
`total_mutants` and `programs_processed` — the latter is the field
`RunSnapshot.generation` reads from, so the assertion-based tests
pass against the current snapshot schema.
* Scrub of historical refactor framing (cycle numbers, finding tags,
in-flight rewire wording) from comments, docstrings and one filename;
no behavioural change in those sites.
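The trailing-window aggregate from the `iteration_window_size` item can be sketched as below. `window_mean` and the dict-of-fitness input are assumptions for illustration. The collector's real fields and aggregates are not shown in this commit message.

```python
from statistics import mean


def window_mean(fitness_by_iteration: dict, current_iter: int, window: int):
    """Mean fitness over iterations in [current_iter - window, current_iter].

    Returns None when window == 0 (feature disabled) or the window is empty.
    """
    if window == 0:
        return None
    values = [
        fitness
        for iteration, fitness in fitness_by_iteration.items()
        if current_iter - window <= iteration <= current_iter
    ]
    return mean(values) if values else None
```

For example, with fitness recorded at iterations 1, 5, 9 and 10, a window of 8 at iteration 10 covers iterations 2 through 10 and so averages the last three values.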
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(llm): langfuse v4 handler init; pin langfuse>=4,<5
`LangchainCallbackHandler` no longer exposes `.client` in langfuse 4.x,
so `handler.client.flush_at = 1` raises AttributeError at
`MultiModelRouter.__init__` -> Hydra instantiation fails before the run
even starts. Fix: configure the singleton `Langfuse` client with
`flush_at=1, flush_interval=1` before constructing the handler — the
handler picks it up via `get_client()` internally.
Also tighten the pin (`langfuse>=2.0.0` was unconstrained upward and
silently admitted v4) to `langfuse>=4.0.0,<5` so this API contract
doesn't drift again without a deliberate bump.
Pre-existing bug on main (introduced 2026-04-03, commit 51a14631);
unrelated to the steady-state refactor branch but blocking E2E.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(run.py): drop stale cfg.max_generations reference
The steady-state engine refactor deleted the generation/epoch concept;
``cfg.max_generations`` no longer exists, so the startup log on line 74
raised ``ConfigAttributeError`` and aborted every launch immediately
after the engine printed its own start banner.
Replaced with ``cfg.max_mutants`` — the top-level constant that backs
``MaxMutantsStopper``, which is the canonical termination signal now.
The engine's own log already reports ``stopper=MaxMutantsStopper``;
this just adds the bound.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): extend per-parent-id lock through child-DAG via ParentRefreshTicket
Producer→ingestor handoff for the per-parent-id lock acquired in
ParentRefresher. The lock now spans refresh + mutate + child-DAG, not
just refresh — closing the invariant "parents are not refreshed while a
child of theirs is in flight."
Why: a concurrent producer that selected the same parents could refresh
them while another producer's child was mid-DAG (state=RUNNING,
metrics={}). AncestrySelector picked up that unscored child as
ancestry, triggering the "missing fitness key" warning the user
reported on `run.py problem.name=heilbron llm=gemini3_flash`.
Changes:
* refresh.py — Add ParentRefreshTicket (idempotent release; holds
per-parent-id locks in sorted order). New refresh_with_ticket()
returns the ticket; back-compat refresh() now wraps it and
auto-releases on return. Failure paths release any partially-acquired
locks before re-raising.
* mutant_task.py — Acquire ticket via refresh_with_ticket();
transfer atomically with _in_flight.add() under _in_flight_lock;
finally-release ticket if not transferred (failure path). Two
ownership-handoff invariants now documented in module docstring:
slot + ticket.
* steady_state.py — _inflight_tickets: dict[mutant_id, ticket]
paired with _in_flight set.
* ingestor.py — Pop tickets under _in_flight_lock atomically with
slot release; release() outside the lock to keep the critical
section short.
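The two properties named for the ticket (locks taken in sorted order, idempotent release) can be sketched with plain `asyncio.Lock` objects. `ParentLocks` and `Ticket` are hypothetical names and this is not the `ParentRefreshTicket` implementation.

```python
import asyncio
from collections import defaultdict


class Ticket:
    def __init__(self, locks):
        self._locks = list(locks)

    def release(self) -> None:
        # Swap the list out first so a second call releases nothing.
        locks, self._locks = self._locks, []
        for lock in locks:
            lock.release()


class ParentLocks:
    def __init__(self):
        self._locks = defaultdict(asyncio.Lock)

    async def acquire(self, parent_ids) -> Ticket:
        held = []
        try:
            # One global order for every producer avoids lock-order deadlock.
            for pid in sorted(set(parent_ids)):
                lock = self._locks[pid]
                await lock.acquire()
                held.append(lock)
        except BaseException:
            for lock in held:  # failure path: drop partially acquired locks
                lock.release()
            raise
        return Ticket(held)


async def demo() -> list:
    locks = ParentLocks()
    order = []

    async def producer(name, ids, hold_s):
        ticket = await locks.acquire(ids)
        order.append(f"{name} acquired")
        await asyncio.sleep(hold_s)  # refresh + mutate + child DAG
        order.append(f"{name} released")
        ticket.release()
        ticket.release()             # idempotent: second call is a no-op

    await asyncio.gather(
        producer("A", ["p2", "p1"], 0.02),
        producer("B", ["p1", "p2"], 0.0),
    )
    return order
```

Producer B selects the same parents in the opposite order and still waits until A's ticket is released.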
Tests:
* test_refresh_parents.py — Add TestRefreshWithTicket (6 tests):
ticket holds lock until release, idempotent release, empty parents,
back-compat refresh() auto-release, failure-path lock release.
* test_engine_invariants.py — Add TestNoRefreshWhileChildInFlight
(4 tests): second producer blocks until child ingested,
failure-before-register releases ticket, accept/reject paths both
release ticket, leaked child releases ticket.
* test_engine_ghost_persist.py — Update _FakeEngine to implement the
ticket API.
* test_engine_invariants.py — Update two mocks to use
refresh_with_ticket instead of refresh.
Verified: all engine + refresh + invariant tests pass (95 cases);
test_engine_stress.py passes its full 36-case parametrise sweep.
* refactor(engine): collapse elite→parent indirection in mutant_task
Source the elite pool size from parent_selector.num_parents instead of
the now-vestigial max_elites_per_generation. With pool == num_parents,
parent_selector.create_parent_iterator(elites) is a no-op shuffle, so
mutant_task.py no longer needs to do next(iter(...), []) over it.
- _select_elites_for_mutation → _select_parents_for_mutation, returns
the actual parent set directly.
- mutant_task.run_one_mutant calls it once; single empty-archive guard.
- Stress test stub now honours the EvolutionStrategy.select_elites
contract (return at most `total`); the old behaviour relied on the
parent_iterator to subsample.
max_elites_per_generation stays in EngineConfig for legacy YAML
compatibility but is no longer read by the engine.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(cli): add `gigaevo profiler` subcommand for log flow profiling
Parses an evolution runner log and emits two artifacts per run:
- profile_<label>.txt -- pipeline summary (counts, refresh queue stats,
per-program timeline)
- profile_<label>.html -- interactive Plotly dashboard (lifecycle bars,
stage sub-bars, refresh + re-eval bands, accept/reject bars)
Resolution priority mirrors `logs`: --file <path> for arbitrary logs,
positional labels under -e for manifest resolution, no-args + -e to
profile every run in the manifest. Default output dir:
experiments/<exp>/profiler/.
Core renderer lives in gigaevo.monitoring.flow_profiler so the CLI is a
thin wrapper. Accept/reject markers use go.Bar (same width as the DAG
span bar) instead of scatter markers, so they sit on the program's
exact row at every zoom level. Min visual width clamped to 50ms (was
250ms) to keep sub-second events readable without smearing the early
timeline. Footer explains queue-wait pathology referencing
ParentRefresher._await_done() pinning in-flight slots during re-eval.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(scheduling): add CachedFirstPrioritizer for re-eval-first DAG launch
A program with non-empty stage_results has already been DAG-evaluated once,
so on re-eval most of its stages will hit cached_skip and finish in
milliseconds. Surfacing those to the front of the launch queue directly
unblocks producer tasks that are pinned on ParentRefresher._await_done()
(each pinned task holds an in-flight slot, so when N mutants x M-second
refresh queues collide, throughput collapses even though per-DAG exec is
near-zero).
The cache signal is sound: fresh mutants from Program.from_mutation_spec
inherit default_factory=dict (empty), re-eval candidates retain the dict
through batch_transition_by_ids (which only patches state + atomic_counter,
program.py:281 -> redis_program_storage.py:632-633), and dag_runner.mget
fetches without exclude=EXCLUDE_STAGE_RESULTS. No code path destroys the
field.
Implements a two-tier partition: cached programs first, fresh second.
Within each tier the input order is preserved -- Redis SMEMBERS hash
order, which the runner uses upstream, has no meaningful semantics.
No predictor needed -- the cache signal lives on the program itself.
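A minimal sketch of the two-tier partition, assuming a simplified `Prog` record with only the fields that matter here. The real `Program` model and the prioritizer interface are not reproduced.

```python
from dataclasses import dataclass, field


@dataclass
class Prog:
    id: str
    stage_results: dict = field(default_factory=dict)


def cached_first(programs: list) -> list:
    """Programs with non-empty stage_results first; input order kept per tier."""
    cached = [p for p in programs if p.stage_results]
    fresh = [p for p in programs if not p.stage_results]
    return cached + fresh
```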
7 new tests in tests/evolution/test_scheduling.py::TestCachedFirstPrioritizer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(monitoring): emit LLM_CALL canonical event from MutationAgent
The MutationAgent overrides acall_llm to use a structured_llm pathway,
which bypassed the base BaseStrategyAgent._emit_event(LLMCall(...)) call.
As a result, /flow-profiler had no MutationAgent timings — only Lineage
and Insights showed up in canonical event aggregations.
Add a finally-block emission that records latency, token usage, model,
attempt count, and error_type on both success and failure, matching the
contract used by the base agent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(profiler): utilization view — LLM/exec overlap + mutation archetypes
Adds a torch-profiler-style "is the LLM fully hidden behind exec stages"
signal to /flow-profiler. Three new primitives:
* LLMCallEvent dataclass + LLM_CALL_RE — parse every canonical
`[LLM_CALL] {json}` line into (stage, end, duration_ms, ok, …).
* classify_stage(name) — bucket stages into llm / exec / orchestration.
LLM stages (LineageStage, InsightsStage, *Agent canonical names) and
program-exec stages (CallProgramFunction, CallValidatorFunction) are
the two sides of the overlap; orchestration is excluded.
* compute_utilization(...) — interval-union math returning total_llm_s,
total_exec_s, overlap_s, overlap_efficiency = overlap / min(L, E),
plus peak_concurrent_dags and per-archetype accept/reject counts.
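The interval-union arithmetic behind `overlap_efficiency = overlap / min(L, E)` can be sketched as below. The helper names and the `(start, end)` tuple input are assumptions. The real `compute_utilization(...)` signature is elided in the message and not reproduced here.

```python
def merge(intervals):
    """Union of (start, end) intervals as a sorted list of disjoint spans."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def total(spans):
    return sum(end - start for start, end in spans)


def intersection(a, b):
    """Total length shared by two lists of disjoint, sorted spans."""
    i = j = 0
    shared = 0.0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            shared += hi - lo
        # Advance whichever span ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def utilization(llm_intervals, exec_intervals):
    llm, exe = merge(llm_intervals), merge(exec_intervals)
    total_llm, total_exec = total(llm), total(exe)
    overlap = intersection(llm, exe)
    denom = min(total_llm, total_exec)
    return total_llm, total_exec, overlap, (overlap / denom if denom else 0.0)
```

Merging first means concurrent LLM calls are counted once as wall time, which is what makes the ratio comparable to the "LLM wall" and "exec wall" figures in the smoke output.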
Also:
* parse_log returns (programs, refreshes, llm_events) — 3-tuple.
* MUT_RE captures the optional `(model=…, archetype=…, prompt_id=…)`
suffix already emitted by the mutation operator, attaching it to
Program.mutation_archetype / .mutation_model.
* format_summary_text gains a "Utilization" section + archetype table.
* render_full_html gains a colored efficiency stat-bar (red <30%, amber
<60%, green ≥60%) and an archetype frequency table above the plot.
Smoke on experiments/heilbron/v1-honest-repro/run_A2_G.log:
LLM wall 76640s · exec wall 44860s · overlap 40377s (90% of min(L,E))
peak concurrent DAGs: 11 · 2421 LLM events (116 failed)
Computational Reinvention 91a/76r/49o · Guided Innovation 73a/61r/38o
Harmful Pattern Removal 12a/5r/4o · Solution Space Exploration 10a/4r/3o
19 new tests, all green; CLI smoke (10 tests) still green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: ruff format follow-up on test_mutation_agent
Pre-push hook caught residual formatting in the LLM_CALL emission tests
added in c336eb58; reformat only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(profiler): drop experiment-branded subtitle from page header
The HTML header used to render `<h1>flow profile · A2_G</h1>` followed by
a `<span class="sub">heilbron/v1-honest-repro / A2_G</span>` next to it,
which made the generic profiler tool look "branded" with whatever
experiment was being analyzed.
Drop the prominent subtitle and relegate the source path to a small
muted `source: ...` line in the footer. The browser tab title and h1
are now clean — just `flow profile · <label>`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(monitoring): live flow profiler daemon for run.py
Adds gigaevo/monitoring/live_profiler.py — a small helper that spawns a
daemon thread to periodically re-render the running experiment log into
profile_live.html inside the Hydra output dir. Writes are atomic
(.tmp + os.replace) so a browser reload mid-render never sees a partial
file, and exceptions on one tick are logged but don't kill the loop.
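The `.tmp` + `os.replace` pattern mentioned above looks roughly like this. `atomic_write_text` and `demo` are hypothetical names and the sketch assumes the temporary file sits on the same filesystem as the target.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(text, encoding="utf-8")
    # os.replace swaps the file in one step, so a reader sees either the
    # old or the new content, never a partial write.
    os.replace(tmp, path)


def demo():
    with tempfile.TemporaryDirectory() as directory:
        target = Path(directory) / "profile_live.html"
        atomic_write_text(target, "<html>v1</html>")
        atomic_write_text(target, "<html>v2</html>")
        names = sorted(p.name for p in Path(directory).iterdir())
        return target.read_text(encoding="utf-8"), names
```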
run.py picks up the new helper with a single line after setup_logger —
keeps the entry point minimal as requested.
Tests cover the render-once contract, daemon-thread bootstrap, lazy
log-creation wait, and atomic-write residue.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(profiler): inline Plotly so HTML renders in sandboxed previews
VS Code's HTML preview extension (and other sandboxed/offline viewers)
blocks external <script src="cdn.plot.ly/..."> loads, so the previous
include_plotlyjs="cdn" produced a blank page in those environments.
Switch to include_plotlyjs="inline" which embeds plotly.js directly into
the document. File grows from ~50KB to ~4.7MB, but it now renders
anywhere — VS Code preview, archived run artifacts, offline shares.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(specs): mutation-throughput two-semaphore redesign
Decouple "LLM/refresh in flight" from "produced-but-not-ingested"
so the DAG sees a freed slot back-to-back with the next ingest,
without waiting for a fresh refresh+LLM round-trip.
Single tunable (max_in_flight=N) sizes both semaphores. Steady-state
pipeline depth ~2N mutants: ~N producers (mix of LLM-running and
ready-result-held), ~N buffered (DAG queue + running). Ticket
ownership and orphan-window equivalence preserved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(plans): two-sema mutation-throughput implementation plan
11 TDD tasks: config docstring rewrite (1), engine init + log line +
sweep doc updates (2), dispatcher producer_sema (3), mutant_task
buffer-sema acquire-after-LLM with paired finally (4), ingestor
buffer-sema release (5), ghost-persist test migration (6), slot-leak
chaos invariants (7), JIT DAG-refill behavioral property (8),
resume-after-kill (9), real-Redis end-to-end smoke (10), full-sweep
+ push (11).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(engine): rewrite max_in_flight docstring for two-sema semantics
Field name unchanged; semantics now apply symmetrically to producer
and buffer pools. Steady-state pipeline depth ~2N.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(engine): replace _in_flight_sema with _producer_sema + _buffer_sema
Two-semaphore backpressure for the steady-state engine. _producer_sema caps
concurrent LLM/refresh tasks; _buffer_sema caps produced-but-not-yet-ingested
mutants. Both sized to existing max_in_flight knob — no new config surface.
Touched: steady_state.py (init + log + sweep doc), dispatcher.py (acquire
_producer_sema), mutant_task.py (acquire _buffer_sema after LLM, paired
release in finally), ingestor.py (release _buffer_sema on DONE/DISCARDED).
Ghost-persist test still pinned to old single-sema model — migration lives
in T6 to keep this commit reviewable.
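The two-pool backpressure can be sketched with two `asyncio.Semaphore` objects sized from one knob. This is a simplified model under stated assumptions: `run_pipeline`, the queue hand-off and the sleep durations are invented for illustration. The real dispatcher, mutant task and ingestor are separate modules with ticket and `_in_flight` bookkeeping that is omitted here.

```python
import asyncio


async def run_pipeline(n_mutants: int, max_in_flight: int):
    producer_sema = asyncio.Semaphore(max_in_flight)  # LLM/refresh in flight
    buffer_sema = asyncio.Semaphore(max_in_flight)    # produced, not yet ingested
    produced = asyncio.Queue()
    buffered = 0
    peak = 0

    async def mutant_task(i: int) -> None:
        nonlocal buffered, peak
        try:
            await asyncio.sleep(0.001)   # stand-in for refresh + LLM call
            await buffer_sema.acquire()  # taken only after the LLM returns
            buffered += 1
            peak = max(peak, buffered)
            produced.put_nowait(i)       # hand off to the ingestor
        finally:
            producer_sema.release()      # producer slot is always returned

    async def dispatcher() -> None:
        tasks = []
        for i in range(n_mutants):
            await producer_sema.acquire()
            tasks.append(asyncio.create_task(mutant_task(i)))
        await asyncio.gather(*tasks)

    async def ingestor() -> None:
        nonlocal buffered
        for _ in range(n_mutants):
            await produced.get()
            await asyncio.sleep(0.002)   # stand-in for DAG evaluation
            buffered -= 1
            buffer_sema.release()        # slot freed on DONE/DISCARDED

    await asyncio.gather(dispatcher(), ingestor())
    return peak, buffered
```

The buffered count never exceeds `max_in_flight`, and it returns to zero once every mutant has been ingested.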
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(engine): migrate test suite from _in_flight_sema to two-sema pair
T2-T5 (95129056) replaced the single _in_flight_sema with _producer_sema
(dispatcher-side, always released in finally) and _buffer_sema (producer
acquires post-LLM, ingestor releases on DONE/DISCARDED). Migrate every
remaining test reference:
- caller-protocol acquire/release → _producer_sema (mirrors dispatcher)
- slot-accounting + len(_in_flight) conservation → _buffer_sema
(_in_flight membership is gated by _buffer_sema in the new model)
- 'all slots returned' assertions → both pools at full capacity
Test intent preserved; semantics translated 1:1 to the new model.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(engine): T7 - Slot-leak chaos test for two-sema architecture
Add comprehensive chaos test suite validating slot conservation under
adversarial timings and concurrent access patterns. Tests verify that
the two-semaphore model (producer_sema + buffer_sema) maintains
invariants across:
- Race conditions: rapid acquire/release cycles, concurrent transfers
- Backpressure: ingestor slow-release blocking producer
- Cancellation: mid-acquire/mid-flight cancellation with proper cleanup
- Edge cases: minimal (max_in_flight=1), large (max_in_flight=100), full drain
Key invariant validated: semaphore values stay in [0, max_in_flight] range
and in-flight mutants do not exceed max_in_flight, proving no slot leak
across dispatcher, producer, and ingestor phases.
15 ne…
1 parent e3c5d69 commit 7fbe06e
142 files changed
Lines changed: 18276 additions & 6380 deletions
File tree
- config
- constants
- evolution
- experiment
- pipeline
- scheduling
- stopper
- docs/superpowers
- plans
- specs
- gigaevo
- adversarial
- cli
- entrypoint
- evolution
- bus
- engine
- mutation
- scheduling
- experiment
- llm
- agents
- monitoring
- programs/stages
- prompts/coevolution
- runner
- utils
- problems/heilbron_adversarial
- pop_a_gan
- pop_a_soft
- pop_b_soft
- tests
- adversarial_pipeline
- benchmarks
- cli
- concurrency
- config
- database
- entrypoint
- evolution
- bus
- experiment
- integration
- llm
- memory
- monitoring
- prompts
- stages