refactor(config): replace hydra/omegaconf with typed pydantic+tyro #21
Open
GrigoryEvko wants to merge 943 commits into
…rns AnyCard
Core change: normalize_memory_card returns MemoryCard | ProgramCard (Pydantic
models) instead of dict[str, Any]. All internal code uses attribute access
(card.description) not dict access (card.get("description")).
Production code changes:
- card_conversion.py: normalize_memory_card returns AnyCard; card_to_concept_content,
build_entity_meta, format_search_results, is_program_card all accept AnyCard
- memory.py: self.memory_cards is dict[str, AnyCard]; _persist_index serializes
via model_dump(); _synthesize_results uses model attribute access; save_card
accepts dict | AnyCard at boundary
- memory_write_example.py: load_memory_cards normalizes all output to AnyCard
- models.py: ProgramCard gains keywords, strategy, links fields;
validate_assignment=True for mutability
Boundary pattern:
- External input (JSON, API responses, user dicts) → normalize_memory_card → AnyCard
- Internal operations → attribute access (card.field)
- Serialization (JSON, API) → card.model_dump() at the boundary
- card_update_dedup.py stays dict-based (LLM output parsing) — callers pass
model_dump() when crossing the boundary
813 tests pass, ruff check + format clean.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor: ideas_tracker cleanup — loguru + sys.path removal
refactor: dict → Pydantic — normalize_memory_card returns AnyCard
Create gigaevo/memory/__init__.py with curated public API:
- AmemGamMemory, MemoryCard, ProgramCard, AnyCard, ConnectedIdea
- normalize_memory_card, GigaEvoMemoryBase
- LocalMemorySnapshot, MemoryCardExplanation, Strategy
Update gigaevo/memory/shared_memory/__init__.py with same exports.
Users can now import from `gigaevo.memory` instead of deep paths:
    from gigaevo.memory import AmemGamMemory, MemoryCard
5 tests verify: __all__ completeness, package imports, subpackage imports, normalize roundtrip, AmemGamMemory construction.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Test gap: add LocalMemorySnapshot and Strategy to test_import_from_package_root (previously 2/10 exports untested for importability)
- Circular import fragility: change `from gigaevo.memory import config` to `import gigaevo.memory.config as config` in 3 files; this avoids relying on partial parent-package init during the import chain
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor(memory): public API exports
Move all memory-related test files from tests/ root into tests/memory/ subdirectory. Zero test loss: 483 tests before, 483 tests after.
Files moved:
- test_amem_gam_memory.py (67 tests)
- test_card_update_dedup_extended.py (75 tests)
- test_memory_api_search.py (21 tests)
- test_memory_card_update_dedup.py (6 tests)
- test_memory_contracts.py (21 tests)
- test_memory_cycle5.py (17 tests)
- test_memory_deeper.py (21 tests)
- test_memory_e2e_scenarios.py (21 tests)
- test_memory_engine_interaction.py (10 tests)
- test_memory_full_agentic.py (15 tests)
- test_memory_integration.py (26 tests)
- test_memory_known_bugs.py (20 tests)
- test_memory_models.py (16 tests)
- test_memory_operator_integration.py (14 tests)
- test_memory_public_api.py (5 tests)
- test_memory_with_fake_agentic.py (24 tests)
- test_memory_write_example_extended.py (22 tests)
- test_memory_write_program_cards.py (3 tests)
- test_normalize_memory_card.py (66 tests)
- test_pydantic_cards.py (13 tests)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion refactor(memory): consolidate test files into tests/memory/
Add `from __future__ import annotations` to all 41 memory module files that were missing it. Remove duplicate `_safe_get` from a_mem_memory_creation.py (now imports from utils.py). Auto-fix 4 UP037 violations (unnecessary quoted type annotations now that future annotations are active). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Was `-> dict[str, Any]` but actually returns `AnyCard` (Pydantic model). Found by chaos-hacker review — prevents TypeError trap for future callers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor(memory): type quality improvements
Replaced all print() calls with loguru logger in 6 files:
- A_mem/agentic_memory/memory_system.py: 4 prints → logger
- A_mem/agent/agent_class.py: 2 prints → logger
- GAM_root/gam/agents/research_agent.py: 38 prints → logger
- GAM_root/gam/retriever/index_retriever.py: 2 prints → logger
- GAM_root/gam/schemas/page.py: 2 prints → logger
- GAM_root/gam/schemas/memory.py: 2 prints → logger
Also removed old-style `logger = logging.getLogger(__name__)` in memory_system.py (replaced by loguru import).
509 tests pass, ruff clean.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor: replace 50 print() with loguru in A_mem + GAM_root
exp: hover/no-deep-retrieval — ablation of retrieve_deep (k=10)
Rename map:
- test_normalize_memory_card → test_card_normalization
- test_memory_card_update_dedup → test_card_dedup
- test_card_update_dedup_extended → test_card_dedup_edge_cases
- test_amem_gam_memory → test_memory_backend
- test_memory_deeper → test_memory_backend_internal
- test_memory_full_agentic → test_memory_backend_agentic
- test_memory_with_fake_agentic → test_memory_backend_fakes
- test_memory_cycle5 → test_api_sync
- test_memory_operator_integration → test_mutation_operator
- test_memory_engine_interaction → test_engine_integration
- test_memory_write_example_extended → test_write_pipeline
- test_memory_write_program_cards → test_write_programs
- test_memory_e2e_scenarios → test_scenarios
- test_memory_integration → test_roundtrip
- test_memory_known_bugs → test_edge_cases
- test_concept_api_client → test_api_client (moved to tests/memory/)
- test_openai_inference → test_llm_inference (moved to tests/memory/)
- test_data_components, test_runtime_config (moved to tests/memory/)
666 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor: rename memory test files — descriptive names
Every suppression replaced with proper typing:
- Stage base: ClassVar[type[StageIO]] instead of unbound TypeVars
- json.py: single dumps/loads definitions with types.ModuleType backend
- LLM agents: TypedDict fields widened to accept None (truthful initial state)
- Redis coevolution: _get_redis() returns AsyncRedis instead of object
- DAG/engine/trackers: invariant assertions replacing silent suppression
- analyzer.py: fixed wrong return type (dict → IncomingIdeas)
4468 tests pass, lint clean.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor: remove all 27 type: ignore comments
…p Dynamic Chains New problem variant: chains/hover/full7_no_deep (7-step max, standard retrieval only). Design approved by Reviewer-2. Two-phase protocol: Phase A builds memory bank, Phase B tests memory-augmented vs standard mutation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tr]) RecordCardExtended.aliases is list[dict[str, dict[str, str|list[str]]]] but MemoryCard.aliases expects list[str]. The _to_list() helper passed dicts through unchanged, causing Pydantic validation crash at the memory write pipeline step after ideas_tracker completes. Added _flatten_aliases() that extracts description strings from the nested dict format while preserving plain string aliases. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RecordCardExtended.aliases is list[dict] (version history with nested
{experiment_id: {description, programs, explanations}}). The Pydantic
migration (846299e) incorrectly typed MemoryCard.aliases as list[str],
causing a validation crash when memory_write_pipeline passes ideas_tracker
output through normalize_memory_card.
Root cause: Pydantic migration assumed aliases are simple strings, but
Petr's original design uses them as structured version history. Fix the
type at the model level instead of adding a flattening adapter.
Reverts the _flatten_aliases band-aid from the previous commit.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nLab#2, PR #161)
Added tests that verify normalize_memory_card and load_memory_cards handle ideas_tracker's structured alias format (list[dict] version history) without crashing. These tests would have caught the Pydantic type mismatch that crashed the memory write pipeline (aliases: list[str] → list[dict]).
Tests added:
- test_aliases_with_ideas_tracker_dict_format (test_normalize_memory_card.py)
- test_aliases_mixed_types (test_normalize_memory_card.py)
- test_ideas_tracker_dict_aliases_preserved (test_memory_write_example_extended.py)
All 91 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Redis WATCH deprecation: use pipeline context in archive_storage.py
- AsyncMock coroutine warnings: set storage.snapshot = MagicMock() (bump() is sync)
- ast.Str deprecation: use ast.Constant only (Python 3.14 compat)
- Optuna ExperimentalWarning: suppress around TPESampler/PedAnovaImportanceEvaluator
- Unclosed file handles: pathlib.Path.read_text() in test_scheduling.py
- matplotlib tight_layout: layout="tight" on subplots() in comparison.py
- Island __len__ RuntimeWarning: suppress in intentional error test
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous literal token was a real, live OpenRouter credential committed to the source tree. Switch both musique and musique_retrieval shared configs to read OPENROUTER_API_KEY from the environment so the value is no longer redistributed with the repo. The committed key must still be revoked and rotated.
…cstrings Rewrites the lineage-race and ingestion-atomicity class docstrings to describe the present-tense contract being exercised rather than narrating prior root-cause investigations or pinning to brittle source-file locations.
pydantic 2.x has no module-level ``configure`` API; the call has been raising ``AttributeError`` since the rename and the surrounding try/except swallowed it on every import. Removes the block, the import, and the silenced exception path.
… arg The Top-N path called list.sort() with a key returning float | None which raises TypeError if any program has None fitness. The previous filter suppressed that in practice, but a stray None would crash mid-sort. The fallback now substitutes -inf so None values sink to the end deterministically. _walk_lineage accepted a metric argument it never consulted; the chain walk depends only on parent edges. Drop it from the signature and call sites (including the regression tests).
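A minimal sketch of the None-safe sort key described in this commit. The `Program` record and the `top_n` helper are hypothetical stand-ins, not the code in tools/lineage.py; only the `-inf` substitution comes from the commit message.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Program:
    # Hypothetical record type used only for this sketch.
    name: str
    fitness: Optional[float]


def fitness_key(program: Program) -> float:
    """Map a missing fitness to -inf so the key is always a float."""
    return -math.inf if program.fitness is None else program.fitness


def top_n(programs: list, n: int) -> list:
    # Descending sort: best fitness first, None-fitness programs sink to the end.
    return sorted(programs, key=fitness_key, reverse=True)[:n]


programs = [Program("a", 0.4), Program("b", None), Program("c", 0.9)]
ranked_names = [p.name for p in top_n(programs, 3)]
print(ranked_names)  # ['c', 'a', 'b']
```

Because the key never returns None, the comparison inside the sort cannot raise TypeError regardless of what the upstream filter lets through.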
…nce check
_compute_pareto_front extracted fitness values inside the inner O(N^2) loop, re-running extract_fitness_values twice per pair. Pre-extract once per program then iterate over the cached vectors, matching the pattern already used by ParetoFrontArchiveRemover.order_candidates.
Micro-benchmark with 3 fitness keys (N candidates):
    N=50: 9.53 ms -> 0.64 ms (14.9x)
    N=200: 109.32 ms -> 4.77 ms (22.9x)
    N=500: 365.29 ms -> 15.80 ms (23.1x)
tests/evolution/test_migrant_selectors.py: 10 passed.
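The pre-extraction pattern can be sketched as follows. This is an illustration only: it assumes every objective is maximized, represents programs as plain dicts, and uses hypothetical names (`dominates`, `pareto_front`) rather than the repository's functions. The pairwise dominance check stays quadratic; only the per-program extraction is hoisted out of the loop.

```python
def dominates(a: tuple, b: tuple) -> bool:
    """a dominates b: no worse on every objective, strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(programs: list, keys: list) -> list:
    # Extract every fitness vector exactly once, outside the pairwise loop.
    vectors = [tuple(p[k] for k in keys) for p in programs]
    front = []
    for i, candidate in enumerate(vectors):
        dominated = any(
            dominates(other, candidate)
            for j, other in enumerate(vectors)
            if j != i
        )
        if not dominated:
            front.append(programs[i])
    return front


population = [
    {"id": "p1", "acc": 0.9, "speed": 0.2},
    {"id": "p2", "acc": 0.5, "speed": 0.8},
    {"id": "p3", "acc": 0.4, "speed": 0.1},  # dominated by both p1 and p2
]
front_ids = [p["id"] for p in pareto_front(population, ["acc", "speed"])]
print(front_ids)  # ['p1', 'p2']
```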
spec_from_file_location returns ModuleSpec | None and its loader is Optional. The previous direct .loader.exec_module() chain would raise AttributeError instead of a useful message if either was None. The assert documents the invariant and gives pyright the narrowing it needs.
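A small sketch of the narrowing described above, using only the standard library. The `load_module` helper and the temporary `plugin_example.py` file are made up for the illustration.

```python
import importlib.util
import tempfile
from pathlib import Path
from types import ModuleType


def load_module(name: str, path: Path) -> ModuleType:
    spec = importlib.util.spec_from_file_location(name, path)
    # spec and spec.loader are both Optional; narrow them before use so a
    # failure names the path instead of surfacing as an AttributeError.
    assert spec is not None and spec.loader is not None, f"cannot import {path}"
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "plugin_example.py"
    source.write_text("VALUE = 42\n")
    plugin = load_module("plugin_example", source)

print(plugin.VALUE)  # 42
```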
…h threading.Lock The round-robin index was advanced under no synchronization. Concurrent callers from multiple OS threads could read the same value, write the same successor, and either repeat or skip an island. Wrap the RMW in a threading.Lock and pin the local index for the return. Covered by a new test that drives select_island from eight threads, each with its own event loop, and asserts a perfectly uniform island histogram across 200 advances.
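The locking pattern and the uniform-histogram check can be sketched like this. `RoundRobinSelector` is a simplified, synchronous stand-in (the real selector is driven from event loops), and the island names are invented.

```python
import threading
from collections import Counter


class RoundRobinSelector:
    """Hypothetical minimal selector, not the repository's class."""

    def __init__(self, islands: list) -> None:
        self._islands = islands
        self._idx = 0
        self._lock = threading.Lock()

    def select(self) -> str:
        # The read-modify-write of the index is atomic under the lock; the
        # chosen index is pinned in a local before the lock is released.
        with self._lock:
            idx = self._idx
            self._idx = (idx + 1) % len(self._islands)
        return self._islands[idx]


selector = RoundRobinSelector(["a", "b", "c", "d"])
picks = []
picks_lock = threading.Lock()


def worker() -> None:
    for _ in range(25):
        choice = selector.select()
        with picks_lock:
            picks.append(choice)


threads = [threading.Thread(target=worker) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

histogram = Counter(picks)
print(histogram)  # 200 advances over 4 islands: 50 picks each
```

Every call observes a distinct, consecutive index, so 200 advances over 4 islands always yield 50 picks per island whatever the thread interleaving.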
…action RidgePredictor.predict previously held the model lock across the extractor.extract call and the sklearn predict call. Extraction is a pure, potentially expensive operation that does not touch any of the predictor's mutable state, so serializing concurrent predictions through it is wasted contention. Snapshot the (model, feature_keys) pair under the lock, then release it before extracting features and invoking predict on the captured local references. The sklearn model is immutable after fit, so the captured reference remains valid for the duration of the call. The no-model fallback behaviour is preserved exactly. A new probe test acquires the predictor lock non-blocking from inside a custom extractor and asserts it is free on every concurrent call, so a regression that re-introduces lock-held extraction would surface as a False in the recorded lock-state list.
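A sketch of the snapshot-then-release pattern, with a probe in the spirit of the test described above. The `Predictor` class, its dict-based "model", and the `0.0` fallback are assumptions for illustration; the real predictor wraps an sklearn model.

```python
import threading


class Predictor:
    """Hypothetical stand-in; the model here is just a dict of weights."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._model = None  # replaced wholesale by fit(), never mutated in place
        self._feature_keys = ()

    def fit(self, weights: dict) -> None:
        with self._lock:
            self._model = dict(weights)
            self._feature_keys = tuple(weights)

    def predict(self, raw: dict, extract) -> float:
        # Snapshot the (model, feature_keys) pair under the lock, then release.
        with self._lock:
            model, keys = self._model, self._feature_keys
        if model is None:
            return 0.0  # assumed no-model fallback
        features = extract(raw, keys)  # runs with the lock free
        return sum(model[k] * features[k] for k in keys)


predictor = Predictor()
predictor.fit({"x": 2.0, "y": -1.0})
lock_states = []


def probing_extract(raw: dict, keys: tuple) -> dict:
    # Probe: a non-blocking acquire succeeds only if predict released the lock.
    acquired = predictor._lock.acquire(blocking=False)
    lock_states.append(acquired)
    if acquired:
        predictor._lock.release()
    return {k: raw.get(k, 0.0) for k in keys}


result = predictor.predict({"x": 3.0, "y": 1.0}, probing_extract)
print(result, lock_states)  # 5.0 [True]
```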
EvolutionaryStatisticsCollector._process re-filtered the full population by iteration metadata for every program in the snapshot, repeating the O(N) scan N times. Bucket programs by iteration once in _ensure_population_cache (alongside the existing per-generation cache) then look up the iteration entry by key. Skipping programs whose iteration metadata is absent preserves the existing None-iteration fallback when the snapshot excludes metadata.
Micro-benchmark of the filter pattern (M iterations across N programs):
    N=200 M=5: 2.99 ms -> 0.03 ms (~114x)
    N=1000 M=10: 111.59 ms -> 0.20 ms (~570x)
    N=5000 M=50: 8410.70 ms -> 2.62 ms (~3205x)
tests/stages/test_collector.py: 29 passed.
tests/benchmarks/test_collector_scaling.py: 12 passed.
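The bucketing idea, reduced to a standalone sketch. Programs are plain dicts with an optional `"iteration"` entry; that shape and the helper name are assumptions, not the collector's actual data model.

```python
from collections import defaultdict


def bucket_by_iteration(programs: list) -> dict:
    """Group programs by iteration in one pass over the population."""
    buckets = defaultdict(list)
    for program in programs:
        iteration = program.get("iteration")
        if iteration is None:
            continue  # programs without iteration metadata are skipped
        buckets[iteration].append(program)
    return dict(buckets)


population = [
    {"id": "p1", "iteration": 0},
    {"id": "p2", "iteration": 1},
    {"id": "p3", "iteration": 0},
    {"id": "p4"},  # no iteration metadata
]
buckets = bucket_by_iteration(population)

# Each per-program lookup is now a dict access instead of a full rescan.
peers_of_p1 = [p["id"] for p in buckets[0]]
print(peers_of_p1)  # ['p1', 'p3']
```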
The stepwise tool-step path passed (ref, outer_context, step_outputs) to _resolve_reference but omitted the per-sample dict, so $sample.X references silently resolved to the empty string. Latent today because no enabled stepwise consumer depends on $sample.* yet, but it is a correctness landmine for future tool inputs that need sample fields. Add a regression test covering the stepwise dispatch path plus the existing reference-resolution branches.
Every public field on the typed config schemas gains a one-sentence description. The CLI's --help layer (tyro) reads these and renders them next to each flag, so end users see what every override does instead of just the default value. Covers algorithm, engine, experiment, llm, logging, migration_bus, pipeline, problem, prompt, redis, runner, and scheduling. Internal fields kept under a clear class-level docstring (the discriminated-union markers and structural list fields whose semantics are explained in the class header) are left alone.
_process_sample read client.call_logs[0], which both IndexErrors when no log was appended and silently drops every retry attempt beyond the first. The retry decorator on LLMClient.__call__ can push multiple entries (each successful API hit appends one) before the call that yielded the returned response, so the existing read understated the sample's budget consumption. Introduce a private aggregator that sums prompt_tokens, completion_tokens, cost, and cost_utilization across all per-attempt entries, and falls back to a zero CallLog on the empty-list branch. The fix is contained to utils.py and does not touch the fenced client.
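A sketch of such an aggregator. The `CallLog` dataclass below is a hypothetical shape whose field names follow the commit message; the real type may carry more fields.

```python
from dataclasses import dataclass


@dataclass
class CallLog:
    # Hypothetical shape; field names taken from the commit message.
    prompt_tokens: int = 0
    completion_tokens: int = 0
    cost: float = 0.0
    cost_utilization: float = 0.0


def aggregate_call_logs(logs: list) -> CallLog:
    """Sum usage across every attempt; an empty list yields a zero CallLog."""
    total = CallLog()
    for log in logs:
        total.prompt_tokens += log.prompt_tokens
        total.completion_tokens += log.completion_tokens
        total.cost += log.cost
        total.cost_utilization += log.cost_utilization
    return total


attempts = [
    CallLog(prompt_tokens=100, completion_tokens=20, cost=0.25, cost_utilization=0.5),
    CallLog(prompt_tokens=110, completion_tokens=30, cost=0.5, cost_utilization=0.25),
]
total = aggregate_call_logs(attempts)
print(total.prompt_tokens, total.completion_tokens, total.cost)  # 210 50 0.75
```

Reading `attempts[0]` alone would have reported 100 prompt tokens and a cost of 0.25, and would have raised IndexError on an empty list.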
remove_boxed previously used bare ``assert`` statements to enforce
boxed-expression shape: ``\boxed{42xyz`` (trailing garbage) and
``\boxed{42`` (missing closing brace) both raised AssertionError, and
under ``python -O`` the assertions are stripped — turning structural
checks into silent fall-through that corrupts the returned slice.
Replace the asserts with explicit ``return None`` guards, matching the
existing "no boxed found" branch. Well-formed input keeps producing the
same string; malformed input now folds into the standard extraction-
failure path that callers already handle by counting None predictions.
Applied to all three sibling copies (chains/aime, prompts/aime,
prompts/gsm8k) and covered by a parametrized regression suite.
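The guard-based shape can be sketched as below. This is a simplified illustration that handles only the `\boxed{...}` form; the three sibling copies in the repository may accept additional variants.

```python
from typing import Optional


def remove_boxed(s: str) -> Optional[str]:
    """Return the content of a well-formed \\boxed{...}, else None."""
    left = "\\boxed{"
    if not s.startswith(left):
        return None  # no boxed prefix
    if not s.endswith("}"):
        return None  # truncated or followed by trailing garbage
    return s[len(left):-1]


print(remove_boxed("\\boxed{42}"))     # 42
print(remove_boxed("\\boxed{42"))      # None
print(remove_boxed("\\boxed{42}xyz"))  # None
```

Unlike `assert`, these checks survive `python -O`, so malformed input takes the same None path in every interpreter mode.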
Each experiment module now states the problem, the algorithm / pipeline / engine / LLM choice it showcases, and any unusual constraint in 2-4 lines so a user scanning the experiments/ directory can pick the right starting point without reading the body. ``runner_presets`` gains the same compose-into-experiment example the other ``*_presets`` modules already carry so the surface is uniform across the preset layer.
The TYPE_CHECKING guard contained only a 'pass' placeholder. Remove it along with the unused TYPE_CHECKING import.
Fix B007 in tools/throughput_plot.py and tools/wizard/__main__.py where the loop control variable is discarded inside the body.
PIE790: each exception class already has a docstring, which satisfies the suite's body requirement on its own.
Three chain validators return `(metrics, failures)` tuples but advertise `-> dict`. The runtime contract in `CallValidatorFunction.parse_output` already accepts both shapes, so behaviour is unchanged — this is a pure annotation/docstring repair so type-checkers and readers see the actual return type. Touched: chains/hover/static, chains/hotpotqa/static_ra, chains/hotpotqa/static_a.
The per-instruction loop wrote the None-stripped kwargs dict back into `input["kwargs"][index]`. Because `DataFrame.to_dict(orient="records")` shares the underlying list cells with the source frame, that write poisoned the dataset for any subsequent validate() call that reused the cached frame. Filter into a local dict instead; the dataset stays pristine across iterations.
…scorable `calculate_fitness` returned `None` when no rule had multi-class coverage. The selectors call `extract_fitness_values`, which negates `value` for minimization objectives — a `None` propagates as a `TypeError` on `-None`. Substitute `0.0` so degenerate batches surface as "no signal" rather than crashing the engine, and annotate the function with `-> float` to document the contract.
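The failure mode and the fix, as a self-contained sketch. Both functions are simplified, hypothetical stand-ins: `negate_for_minimization` only mimics the negation the commit attributes to `extract_fitness_values`, and the mean-of-scores fitness is invented for the example.

```python
from typing import Optional


def negate_for_minimization(value: Optional[float]) -> float:
    # Stand-in for the selector-side negation; fails on None.
    return -value


def calculate_fitness(per_rule_scores: list) -> float:
    # Degenerate batch (nothing scorable): report "no signal" instead of None.
    if not per_rule_scores:
        return 0.0
    return sum(per_rule_scores) / len(per_rule_scores)


try:
    negate_for_minimization(None)
    crashed = False
except TypeError:
    crashed = True

safe = negate_for_minimization(calculate_fitness([]))
print(crashed, safe == 0.0)  # True True
```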
`tyro.cli(..., args=["--help"])` always raises `SystemExit(0)` via argparse, so the trailing `return 0` could never execute. Remove the dead line and document the exit semantics inline so future readers don't reintroduce the assumption that control falls through.
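The exit semantics can be observed with plain argparse, which the commit names as the layer that raises. The sketch below deliberately does not call tyro; the parser and its `--seed` flag are invented for the demonstration.

```python
import argparse
import contextlib
import io


def help_exit_code() -> int:
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("--seed", type=int, default=0, help="hypothetical flag")
    with contextlib.redirect_stdout(io.StringIO()):  # swallow the help text
        try:
            parser.parse_args(["--help"])
        except SystemExit as exc:
            return exc.code
    return -1  # never reached: --help always exits


exit_code = help_exit_code()
print(exit_code)  # 0
```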
- redis/metrics._flatten_numbers: the ternary on key construction had identical 'then' and 'else' expressions; collapse to a single literal.
- tools/lineage: tools/**/*.py already ignores E402 globally, so the per-import noqa: E402 directives are redundant.
RUF059: serve_until_signal discards the 'done' set returned by asyncio.wait, and trajectory only reads prev_v from the trailing improvement_points tuple. Prefix with underscore so the intent is visible at the unpack site.
redis-py stubs share signatures between the sync and async clients, so return types widen to Union[Awaitable[T], T]. The sync client always returns the concrete value, but pyright narrows on the Awaitable side and flags every lrange/hgetall/keys site. Add typing.cast narrowings where the call sites are; behaviour at runtime is unchanged.
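A sketch of the narrowing, with a hypothetical `lrange` stand-in whose annotation mimics the widened stub signature. No Redis client is involved; the point is only that `typing.cast` affects the type checker and returns its argument unchanged at runtime.

```python
from typing import Awaitable, List, Union, cast


def lrange(values: List[str]) -> Union[Awaitable[List[str]], List[str]]:
    # Hypothetical stand-in: annotated like the shared sync/async stub,
    # but always returns the concrete list, as a sync client would.
    return values


raw = lrange(["a", "b"])
items = cast(List[str], raw)  # narrows for the type checker only
joined = ",".join(items)
print(joined)  # a,b
```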
The smoothed array is built across five branches; one path yields a pandas Series.values whose dtype the numpy stubs cannot align with the boolean-indexed __setitem__ signature. asarray pins the runtime type without changing the produced values.
…arnings MagicMock spoofs isinstance checks by rebinding __class__; the type checker rejects the assignment, but the runtime pattern is documented behaviour. Annotate the two assignments with the standard misc ignore.
… literal The constant was annotated Final[str], so preset builders passed list[str] into BehaviorSpaceConfig.binning_types whose declared list[BinningType] is invariant. Re-typing the constant against the schema literal lets the presets typecheck without runtime change. The import is aliased with an underscore prefix so the defaults namespace stays free of foreign symbols.
init_composite returns CompositeLogger, which is a sibling of GenericLogger under LogWriter rather than a subclass. The previous GenericLogger return annotation misrepresented the concrete return and broke type narrowing on every caller; the unit test already asserts isinstance(writer, CompositeLogger).
TL;DR
Replaces the Hydra/OmegaConf YAML configuration layer with typed Pydantic v2 schemas driven by a `tyro` CLI entry point. The runtime no longer parses YAML or resolves string interpolation; every experiment is a regular Python module that builds an `ExperimentConfig` and hands it to a renamed entry point at `run.py`.
Beyond the cutover itself, the audit surfaced and fixed 29 pre-existing bugs and latent defects, including a leaked API key in run-ID hashes, a race condition in the round-robin island selector, two O(N²) hot paths now 14–23× and 100–3200× faster, and a class of `assert`-based validators that silently corrupted output under `python -O`.
What this delivers
- Typed Pydantic v2 schemas under `gigaevo/config/schemas/` cover every subsystem. Mistyped flags fail at parse time instead of mid-experiment.
- `python run.py <experiment> --help` lists every overridable field with its description; `Field(description=...)` is present on every user-facing field.
- `gigaevo/config/{algorithm,engine,llm,pipeline,problem,runner}_presets.py` compose into experiments via plain function calls; IDE jump-to-definition, autocomplete, and type-checker support all work.
- Each run's resolved config is written to `output_dir/{experiment_id}/config.json`, where `experiment_id = sha256(model_dump_json())[:12]`: stable across reruns of the same config, distinct under any change.
- `gigaevo/sweep.py` runs a parameter grid as N independent subprocesses, immune to GIL / global-state / CUDA-context leaks across runs.
- Drops `hydra-core` + `omegaconf`; adds the lighter `tyro`. Faster import, fewer transitive deps.
- No `register_resolvers` autouse needed.
- The reference experiments in `experiments/` demonstrate the patterns: env-driven secrets via `default_factory`, discriminated-union variants selected by `kind=...`, preset composition, override syntax.
Performance wins
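Two long-standing O(N²) hot paths were reworked so that the repeated extraction and filtering work runs once per program (O(N)) instead of once per pair, each with an explicit micro-benchmark:
- `_compute_pareto_front` in `gigaevo/evolution/strategies/migrant_selectors.py`: 14.9× to 23.1× faster (N=50 to N=500 candidates, 3 fitness keys).
- `EvolutionaryStatisticsCollector._process` in `gigaevo/programs/stages/collector.py`: ~114× to ~3205× faster (N=200, M=5 to N=5000, M=50).
- `ChainFeatureExtractor.extract` regex pass in `gigaevo/evolution/scheduling/feature_extractor.py` (`re.finditer` walks).
Reliability: bugs and latent defects fixed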
The cutover audit went deep across the whole repo. 29 distinct issues were found and fixed:
Security / credentials
- `ChatOpenAIConfig.api_key` leaked into reproducibility artefacts. The key was being serialised into `output_dir/<experiment_id>/config.json`, and worse, into `experiment_id` itself (sha256 of `model_dump_json()`), so run IDs depended on credential rotation. Pydantic `exclude=True` keeps the key in-memory only.
- A live OpenRouter credential was committed in `problems/chains/musique/shared_config.py` and `problems/chains/musique_retrieval/shared_config.py`. Replaced with `os.environ.get("OPENROUTER_API_KEY", "")`. (The leaked key remains in git history and should be rotated by its owner; it is a third-party key, not anyone's on this team.)
Correctness
- `RecordCardExtended.__init__` shadowed the dataclass init and never applied `field(default_factory=...)` defaults. Reading `card.usage`, `card.keywords`, `card.evolution_statistics`, `card.works_with`, or `card.links` raised `AttributeError`; `dataclasses.asdict(card)` exploded.
- `change_motivation` was mandatory in body but missing from `required_fields`, so `import_idea_extended(is_forced=False)` was dead on arrival. Lock-in tests added.
- `RoundRobinIslandSelector._idx` race condition. Concurrent threads double-skipped or repeated islands. Added `threading.Lock`; 8-thread × 25-call uniform-histogram test asserts exact balance.
- `RidgePredictor.predict` held the model lock across CPU-bound extract. Snapshot under lock, release, extract + predict on captured locals. Concurrency lock-in test added.
- `DagRunner.stop` cancelled `_metrics_collector_task` without await → "Task was destroyed but it is pending" warnings + writer-ref retained past `storage.close()`. Now awaits with `suppress(CancelledError)`.
- `DagRunner._launch` fire-and-forget cancel on failed transition tasks → tasks lingered pending until GC. Routed through `_cancel_task` (which awaits with timeout).
- `cfg.problem.build()` called twice in `build_object_graph` → double `metrics.yaml` reload per graph build. Threaded the already-built `ProblemContext`.
- `sweep._run_one` aborted the entire pool on worker-spawn OSError (E2BIG/EMFILE/ENOMEM); one bad spawn dropped every queued sibling run. Now logs and returns 1 to preserve "best-effort across all runs" semantics.
- `_dump_resolved_config` orphaned `.config.*.tmp` files on write failure. Wrapped in `try/finally` with idempotent unlink.
- `chain_runner._run_chain_on_dataset_stepwise:429` was dropping the `sample` argument to `_resolve_reference`, so `$sample.X` references silently resolved to `""`. Latent landmine: the only consumer (`musique_retrieval`) routes through the non-stepwise variant today, but any future stepwise consumer would have been silently broken. Threaded `dataset[i]` through. 11 unit tests; reverted patch confirms the regression.
- `problems/prompts/utils.py:158` `client.call_logs[0]` dropped retry call logs and would `IndexError` on empty. New `_aggregate_call_logs` helper sums across all attempts; returns a zero `CallLog` on empty input. 3 unit tests.
- `remove_boxed` in 3 problem helpers used bare `assert s[:len(left)] == left` and `assert s[-1] == "}"`. Under regular Python they raised `AssertionError` on `\boxed{42` (truncated) or `\boxed{42}xyz` (trailing garbage) and crashed the entire `extract_answer` loop in `validate.py`. Under `python -O` the assertions were stripped, yielding a corrupted slice. Replaced with explicit `return None`. 21 parametrized tests.
- `problems/prompts/ifbench/validate.py:37` mutated source DataFrame in place. `to_dict(orient="records")` shares list-cell references with the source DataFrame; the in-place rewrite corrupted subsequent iterations. Replaced with a local binding.
- `problems/prompts/jigsaw_community_rules/validate.py:46` returned `None` fitness on degenerate input; the downstream consumer in `strategies/utils.py:79` does `-value`, which crashes on `None`. Returns `0.0` now.
- `-> dict` annotations on `(metrics, failures)` tuple returns: the annotation lied about the contract. Fixed `hover/static`, `hotpotqa/static_ra`, `hotpotqa/static_a`.
- `gigaevo/__init__.py` had a dead `pydantic.config.configure(compile="jit")` call; that API has never existed in pydantic 2.x. The surrounding `try/except Exception: pass` silently swallowed `AttributeError` on every package import. Removed.
Latent bugs (would have bitten under specific conditions)
- `tools/lineage.py:226` sort key returned `float | None` → `TypeError` if any program had None fitness. Use `-math.inf` substitute.
- `tools/lineage.py::_walk_lineage` no cycle guard → infinite loop on corrupted parent chain (A→B→A or self-loop). Visited-set guard + 5 regression tests.
- `tools/redis2pd.py` non-atomic `df.to_csv` → corrupt CSVs on concurrent runs or interrupts. Added `_atomic_write_csv` (tempfile + `os.replace`).
- `tools/{utils,fitness_vs_time,throughput_plot}.py`: the throughput plotter scaled the leak with fan-out. Wrapped in `try/finally`.
- `asyncio.get_event_loop()` at 11 callsites across `test_bandit.py`, `test_coevolution_pipeline.py`, `test_redis_storage.py`, `test_wrapper_enhanced.py`. Bites whenever any prior event loop in the thread has been closed (exactly what `pytest-asyncio` does between tests); the no-current-loop case is deprecated since Python 3.12 and raises `RuntimeError` in Python 3.14. Migrated to `asyncio.run()` / `get_running_loop()`.
- `stage_timeout` accepted on 6 builder schemas whose runtime constructor ignored it: silent user surprise. Moved to the two builders that actually consume it; lock-in tests reject the field on the others via `extra="forbid"`.
- `DEFAULT_BINNING_TYPE: Final[str]` mistyped against `BinningType = Literal["linear"]` → 5 invariance errors across `algorithm_presets.py`. Retyped.
- `LoggingConfig.build_writer` annotated `-> GenericLogger` but returned `CompositeLogger` (siblings under `LogWriter`, not subclasses). The existing test already asserts `CompositeLogger`; annotation was the lie. Corrected.
- `experiments/prompt_coevolution.py` `main_redis_db=0` was a literal; overriding `--redis.db N` on tyro broke the coevolved-prompt fetcher (main wrote to DB N while the fetcher stayed at 0). Threaded `redis.db` through both sides.
Why now
The Hydra layer was leaking OmegaConf semantics into the runtime: `MISSING` sentinels reached object construction, `${ref:X}` resolution ran lazily and produced unhelpful tracebacks, the global resolver registry made test isolation awkward, and YAML interpolation was being asked to do work that wanted real Python expressions. A typed config model removes that whole class of problem.
What changes for users
Entry point rename
YAML → Python experiment
A YAML experiment becomes a Python module exposing a single `experiment()` function that returns `ExperimentConfig`. The 9 reference experiments in `experiments/` show the patterns.
Overrides
Use tyro's dotted-path syntax instead of Hydra's `+key=value`.
`python run.py experiments/<file>.py --help` lists every overridable field with its description.
Sweeps
`gigaevo/sweep.py` runs a parameter grid as N independent subprocesses, isolating GIL / global-state issues. Each run is invoked exactly as a normal `run.py` invocation; sweep definitions are Python dicts.
Schema surface
- `schemas/experiment.py`: `ExperimentConfig` root, `experiment_id` hash, cross-field validators
- `schemas/algorithm.py`
- `schemas/engine.py`: `BusedEngineConfig`
- `schemas/pipeline.py`: (`default`, `auto`, `context`, `optuna_opt`, `cma_opt`, `algotune_speed`, `structural_metrics`, `problem_specific`)
- `schemas/llm.py`: `ChatOpenAIConfig` / bandit / heterogeneous router discriminated union
- `schemas/redis.py` + `schemas/migration_bus.py`
- `schemas/problem.py`, `schemas/prompt.py`, `schemas/logging.py`, `schemas/scheduling.py`, `schemas/runner.py`
`Field(description=...)` is present on every user-facing field so `--help` is self-documenting.
Test plan
- Full suite (excluding the `tests/test_tools/test_manifest.py` collection error unrelated to this branch) passes locally: ~5,900 tests green.
- `python run.py <experiment> --dry-run`.
- `python run.py --help` and `python run.py <experiment> --help` produce informative output (Field descriptions rendered inline).
- `cfg.model_dump_json() → ExperimentConfig.model_validate_json(...)` is identity across the reference experiments.
- Round-robin island selector race (`threading.Lock`-guarded counter, 8-thread × 25-call uniform-histogram assertion).
- `RidgePredictor.predict` not holding the lock across `extract`.
- `_compute_pareto_front` (14–23×) and `EvolutionaryStatisticsCollector` snapshot processing (114–3205×).
Conflict map with open PRs
This branch deletes the YAML tree and reshapes the config surface — that overlaps several open PRs at file-level only; the intent is orthogonal in every case:
- `gigaevo/config/helpers.py` was reshaped here; `gigaevo/utils/text_sanitize.py` is unchanged in this branch
- `pytest.ini`, `pyproject.toml`
- `gigaevo/entrypoint/constants.py`
- `gigaevo/runner/dag_runner.py` unchanged here
- `gigaevo/config/helpers.py`, `gigaevo/entrypoint/default_pipelines.py`
- `pyproject.toml`, `gigaevo/llm/models.py`, `gigaevo/infra/*`: config layer untouched
- `asyncio.get_event_loop()` callers; #19 superset on `main` is preferred
- `gigaevo/config/schemas/*` or `experiments/*`
No PR is blocked by this branch; merge order is reviewer's preference.
Out of scope / follow-ups
- `tests/test_tools/test_manifest.py` references a `tools.experiment.manifest` module that has never existed in the repo (its production module was never committed); pre-existing collection error. Not addressed here.
- `# type: ignore[misc]` comments on `MagicMock.__class__` rebinding in tests: documented pattern, won't repay refactoring.
- `tools/status.py` / `tools/fitness_vs_time.py` redis-py `Awaitable` type stubs are a known false-positive class; documented via `typing.cast`, no runtime effect.