fix: resolve all 21 sub-issues of friction-report tracker #245 by sriumcp · Pull Request #267 · AI-native-Systems-Research/agentic-strategy-evolution

sriumcp · 2026-06-01T14:11:25Z

Summary

External campaign-author friction report from running the
paper-memorytime-mirage campaign on nous against BLIS surfaced
21 distinct points of friction. This PR resolves the entire tracker
in a single coherent change, organized by theme.

Spec-fidelity (F1, F2, F3, F4, F10, F13, F20): closes the gap left by the HUMAN_DESIGN_GATE bypass under --auto-approve. campaign.locked_parameters (#246 / F1) hard-fails bundle deviations regardless of --auto-approve. locked_workload (#265 / F20) does the same for workload yamls. The methodology prompt now declares the campaign > target-repo-docs hierarchy.
Apparatus discipline (F7, F14, F16): methodology prompt edits cover the attribution-vs-totals distinction (with the BLIS runningBatch/RequestMap worked example), the unlocked-parameters audit, and the rehearsal-as-instrument positive pattern.
Lifecycle / portability (F5, F11, F12, F19, F21): the per-phase silence threshold (#264 / F19; DESIGN >= 600s, EXECUTE_ANALYZE lower, previously phase-agnostic) closes the active stall where DESIGN's heavy reasoning trips an EXECUTE_ANALYZE-tuned watchdog. F21 introduces cumulative.patch + campaign.derived_from + nous lineage for cross-campaign code reuse.
Reproducibility (F17, F18): reproducibility_metadata auto-captured at INIT (target repo commit, hardware-config sha, language versions, latency-config snapshots). nous package tarballs work_dir + reproduce.sh + Dockerfile + README.
Hygiene (F6, F8, F9, F15): worktree_extras tracked-path warning, nous resume diagnostic, nous clean --orphaned, physical_realism_check schema.
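
As a rough illustration of the spec-fidelity theme, the locked_parameters check can be pictured as follows. This is a minimal sketch, not the code in orchestrator/validate.py: the function names, the flat dict shape, the value 4096, and the choice to treat a missing parameter as a deviation are all assumptions made for the example.

```python
from typing import Any


def find_locked_deviations(
    locked: dict[str, Any], bundle_params: dict[str, Any]
) -> list[str]:
    """Return one message per locked parameter the bundle changed or omitted.

    Hypothetical helper: the real schema may nest parameters differently.
    """
    deviations = []
    for name, expected in locked.items():
        if name not in bundle_params:
            # Assumption: omitting a locked parameter counts as a deviation.
            deviations.append(f"{name}: locked to {expected!r} but missing from bundle")
        elif bundle_params[name] != expected:
            deviations.append(
                f"{name}: locked to {expected!r}, bundle has {bundle_params[name]!r}"
            )
    return deviations


def validate_bundle(
    locked: dict[str, Any], bundle_params: dict[str, Any], auto_approve: bool = False
) -> None:
    """Raise on any locked-parameter deviation.

    auto_approve is accepted but deliberately never consulted, mirroring the
    "hard-fails regardless of --auto-approve" behavior described above.
    """
    deviations = find_locked_deviations(locked, bundle_params)
    if deviations:
        raise ValueError("locked_parameters violated: " + "; ".join(deviations))


if __name__ == "__main__":
    campaign_locked = {"total_kv_blocks": 4096}  # illustrative value only
    print(find_locked_deviations(campaign_locked, {"total_kv_blocks": 2048}))
```

The point of the sketch is the control flow: the approval flag never reaches the comparison, so a bundle cannot drift from the campaign spec just because the human gate was bypassed.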

The full per-F-entry → file map is in docs/friction-245-resolution.md. Every change is tagged in code with (#NNN / F<n>) so git blame + the issue tracker form a complete audit trail.

Why this PR is correct

A newcomer (or AI agent) navigating cold should be able to verify each F-entry's resolution in three steps:

Open docs/friction-245-resolution.md.
Find the F-entry's row; click through to the file/test the row references.
Confirm that the code at that location carries the inline (#NNN / F<n>) tag for cross-reference.
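
For the third step, the inline tags can also be located mechanically. The snippet below is a sketch that assumes tags appear literally as "(#246 / F1)", with optional spaces around the slash; the regex and the helper name are illustrative and not part of the PR.

```python
import re

# Matches tags such as "(#246 / F1)" or "(#264/F19)".
TAG_RE = re.compile(r"\(#(\d+)\s*/\s*F(\d+)\)")


def find_audit_tags(text: str) -> list[tuple[int, int, int]]:
    """Return (line_number, issue_number, f_entry) for every tag in text."""
    found = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in TAG_RE.finditer(line):
            found.append((lineno, int(match.group(1)), int(match.group(2))))
    return found


if __name__ == "__main__":
    sample = "x = 1  # (#246 / F1)\ny = 2\n# see (#264/F19)"
    for lineno, issue, f_entry in find_audit_tags(sample):
        print(f"line {lineno}: issue #{issue}, F{f_entry}")
```

Running this over a source file and comparing the result against the rows in docs/friction-245-resolution.md is one way to spot an F-entry whose tag is missing.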

The architectural primitives are deliberately small and orthogonal:

One validator function per F (in orchestrator/validate.py).
One CLI subcommand per F when CLI-shaped (lineage, clean, package).
One schema field per F with documentation in-place.

No refactors. Every behavior change is opt-in (legacy campaigns without locked_parameters keep working unchanged), and every schema addition is backward-compatible (oneOf'd into the existing fields where shape changed).

New modules

orchestrator/reproducibility.py — F17 capture + per-iter snapshots
orchestrator/lineage.py — F21 cumulative.patch + derived_from resolution
orchestrator/plot_specs.py — F18 figure pipeline

New CLI surface

nous stop --immediate — F5 event-boundary halt
nous lineage <run_id> — F21 inheritance inspection
nous clean --orphaned — F9 stale-worktree cleanup
nous package <run_id> — F18 paper artifact tarball

New schema fields

Where	Field	Issue
campaign	`locked_parameters`	#246 / F1
campaign	`locked_workload`	#265 / F20
campaign	`derived_from`	#266 / F21
campaign	`plot_specs`	#263 / F18
campaign	`reproducibility_metadata` (auto-populated)	#262 / F17
campaign	`sdk_timeouts.turn_silence_threshold_seconds` (now accepts per-phase map)	#264 / F19
bundle	`experiment_spec.physical_realism_check`	#260 / F15
bundle	`experiment_spec.unlocked_parameters_audit`	#261 / F16
bundle	`workload_changes_from_canonical`	#265 / F20
bundle	`experiment_spec.rehearsal_subset.depth_overrides`	#248 / F3
bundle	`timing_observations.recommended_turn_silence_threshold_seconds` (per-phase map)	#264 / F19
state	`reproducibility_metadata`	#262 / F17

Test plan

pytest tests/test_friction_245.py — 32 new tests covering F1, F3, F4, F11, F15, F17, F18, F19, F20, F21 (the F-entries with behavioral changes; F2/F7/F16 are prompt-text changes; F5/F6/F8/F9/F10/F12/F13/F14 are CLI/docs/runtime side effects).
pytest tests/ — 1278 tests passing, 2 skipped, 0 regressions.
python -c "import yaml, jsonschema; ..." — schema additions parse and validate against jsonschema.

Authoring discipline (for posterity)

The PR introduces docs/campaign-authoring-guide.md which captures:

The "what to lock" inventory (avoiding the reactive failure mode that took five rounds to discover total_kv_blocks mattered in paper-memorytime-mirage).
The pre-lock unit-check discipline (a closed-form sanity check before locking parameters).
The rehearsal-as-instrument pattern (turning iter-1 into a diagnostic probe instead of a binary pass/fail).

CLAUDE.md and docs/architecture.md link to it.

Out of scope

uv.lock is untracked at session start (pre-existing). I deliberately did not commit it — that's a separate concern from #245 resolution.

🤖 Generated with Claude Code

…ystems-Research#245

Closes AI-native-Systems-Research#246 Closes AI-native-Systems-Research#247 Closes AI-native-Systems-Research#248 Closes AI-native-Systems-Research#249 Closes AI-native-Systems-Research#250 Closes AI-native-Systems-Research#251 Closes AI-native-Systems-Research#252 Closes AI-native-Systems-Research#253 Closes AI-native-Systems-Research#254 Closes AI-native-Systems-Research#255 Closes AI-native-Systems-Research#256 Closes AI-native-Systems-Research#257 Closes AI-native-Systems-Research#258 Closes AI-native-Systems-Research#259 Closes AI-native-Systems-Research#260 Closes AI-native-Systems-Research#261 Closes AI-native-Systems-Research#262 Closes AI-native-Systems-Research#263 Closes AI-native-Systems-Research#264 Closes AI-native-Systems-Research#265 Closes AI-native-Systems-Research#266 Closes AI-native-Systems-Research#245

External campaign-author friction report from running the paper-memorytime-mirage campaign on nous against BLIS surfaced 21 distinct points of friction, clustered around five themes:

A. Spec-fidelity (F1, F2, F3, F4, F10, F13, F20): nous validated *self-consistency* (executor matches bundle) but not *spec-fidelity* (bundle matches campaign) under --auto-approve. Headline architectural primitive: campaign.locked_parameters (AI-native-Systems-Research#246/F1) hard-fails any deviation, regardless of --auto-approve. Adoption: locked_workload (AI-native-Systems-Research#265/F20), unlocked_parameters_audit (AI-native-Systems-Research#261/F16), methodology hierarchy (AI-native-Systems-Research#247/F2), depth_overrides+invalidates (AI-native-Systems-Research#248/F3), gate-summary diff (AI-native-Systems-Research#249/F4), auto-approve safety docs (AI-native-Systems-Research#255/F10), create-campaign scaffold + authoring guide (AI-native-Systems-Research#258/F13).

B. Apparatus discipline (F7, F14, F16): invariants must validate ATTRIBUTION, not upstream totals. Methodology prompt sections in design.md/execute_analyze.md cover the bug-class question with the BLIS runningBatch vs RequestMap worked example. Authoring guide covers rehearsal-as-instrument and pre-lock unit checks.

C. Lifecycle / portability (F5, F11, F12, F19, F21): per-phase silence threshold (AI-native-Systems-Research#264/F19) closes the active stall where DESIGN's heavy reasoning trips an EXECUTE_ANALYZE-tuned watchdog. F21 lands cross-campaign code reuse via cumulative.patch + derived_from + nous lineage. F5 stop --immediate, F11 high-BUILD warning, F12 asyncio race fix.

D. Reproducibility (F17, F18): reproducibility_metadata auto-captured at INIT (target repo commit, hardware-config sha, language versions, latency-config snapshots). nous package tarballs work_dir + reproduce.sh + Dockerfile + README for paper artifact evaluation.

E. Hygiene (F6, F8, F9, F15): F6 worktree_extras tracked-path warning at campaign load; F8 nous resume diagnostic for work_dir confusion; F9 nous clean --orphaned; F15 physical_realism_check schema + soft warning.

See docs/friction-245-resolution.md for the per-F-entry → file map. Every change is tagged in code with (#NNN / F<n>) so git blame + issue tracker form a complete audit trail.

New modules:
orchestrator/reproducibility.py: F17 capture
orchestrator/lineage.py: F21 cumulative patches + derived_from
orchestrator/plot_specs.py: F18 figure pipeline

New CLI subcommands: nous lineage, nous clean, nous package, nous stop --immediate

New schema fields:
campaign: locked_parameters, locked_workload, derived_from, plot_specs, reproducibility_metadata, sdk_timeouts.turn_silence_threshold_seconds (per-phase map)
bundle: physical_realism_check, unlocked_parameters_audit, workload_changes_from_canonical, rehearsal_subset.depth_overrides, timing_observations.recommended_turn_silence_threshold_seconds (per-phase map)

Tests: 32 new in tests/test_friction_245.py, 1278 total passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
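
To make the reproducibility theme concrete, here is a minimal sketch of capturing a target-repo commit and a language version at INIT, then attaching them on a first-capture-wins basis. It is not the code in orchestrator/reproducibility.py: the helper names (capture_repo_commit, capture_metadata, attach_once) and the in-memory dict standing in for state are assumptions, and the hardware-config sha and latency-config snapshots mentioned above are left out.

```python
import platform
import subprocess
from typing import Optional


def capture_repo_commit(repo_dir: str) -> Optional[str]:
    """Return the HEAD commit of repo_dir, or None if it cannot be determined."""
    try:
        result = subprocess.run(
            ["git", "-C", str(repo_dir), "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git missing, or repo_dir is not a git repository.
        return None
    return result.stdout.strip()


def capture_metadata(repo_dir: str) -> dict:
    """Collect a small subset of the metadata listed above (illustrative)."""
    return {
        "repo_commit": capture_repo_commit(repo_dir),
        "language_versions": {"python": platform.python_version()},
    }


def attach_once(state: dict, metadata: dict) -> dict:
    """Attach metadata only if none is recorded yet (first capture wins)."""
    state.setdefault("reproducibility_metadata", metadata)
    return state


if __name__ == "__main__":
    print(attach_once({}, capture_metadata(".")))
```

Keeping the attach step idempotent means a resumed run does not overwrite the snapshot taken at INIT, which is the property a later reader needs in order to reproduce the original conditions.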

…native-Systems-Research#267

Addresses every finding from the multi-agent PR review:

Critical:
- C1: drop ``except ImportError: pass`` for orchestrator.lineage in iteration.py. Hoist imports to module top: a broken intra-package module is a self-inflicted bug, not an optional dependency, and the preceding comment promised "loud failure" anyway.
- C2: ``_validate_locked_workload`` now surfaces malformed/unreadable workload yamls as deviations instead of silently skipping them. Catching workload drift is the whole point of F20.

Important:
- I1: ``invoke_plot_specs`` no longer guesses ``campaign_yaml_dir`` from work_dir's parent. Threaded ``campaign_path`` through ``setup_work_dir`` (recorded in state.json["config_ref"]) and a new ``_campaign_yaml_dir_from_state`` reader resolves it for the finalize step. Plot script paths now resolve correctly in production, not just tests.
- I2: ``_pick_interpreter`` rewritten as ``_build_command``, returning the proper argv list. The previous version invoked executable scripts with themselves as ``argv[1]``.
- I3: aclose cleanup's broad ``except Exception: pass`` now logs at WARNING instead of swallowing silently. The narrow tuple of documented races (TimeoutError, CancelledError, RuntimeError, GeneratorExit) is still silent; only the unknown-class fallback gains observability.
- I4-I10: 25 new behavioral tests covering F4 auto-approve diff emission, F19 ``_resolve_turn_silence_threshold`` per-phase + scalar back-compat (uncovered a real bug, fix below), F17 attach_to_state idempotency + repo_dirty capture + snapshot_iter_files, F11 boundary at total_files=4/5 + formula assertion, F12 RuntimeError + non-documented exception handling, F20 declared-deviation pin, F21 apply_derived_from_patch round-trip + cumulative.patch.error sidecar.

Bug surfaced and fixed by F19 tests:
- When ``sdk_timeouts.turn_silence_threshold_seconds`` was unset entirely, the old init applied 600 to every phase (defeating F19's per-phase split). Now distinguished from the explicit scalar form: absent → per-phase defaults stand; explicit scalar → applied uniformly (back-compat).

Code-reviewer suggestions:
- S1: ``attach_to_state`` is no longer dead code. Docstring updated; RuntimeError on JSON decode failure (was: silent return).
- S2: ``nous package`` stages reproduce.sh / Dockerfile / PACKAGE_README.md to a temp directory and tars from there; the work_dir on disk is unchanged.
- S3: redundant ``except (OSError, Exception)`` in ``summarize_lineage`` replaced with narrow handlers that record ``campaign_yaml_error`` into the summary so ``nous lineage`` shows the operator why derived_from couldn't be determined.
- S4: ``campaign.plot_specs`` schema description corrected: runs per-iteration during finalize, not as a separate end-of-campaign rollup.
- S5: snapshot guard short-circuit removed; idempotency lives in ``snapshot_iter_files`` itself.

Comment errors fixed:
- ``reproducibility.attach_to_state`` 24h re-init claim removed.
- ``_emit_high_build_warning`` docstring no longer claims an unimplemented OR clause.
- ``compute_campaign_spec_diff`` docstring says "five sub-keys", not "three".
- campaign schema: silence-threshold fallback chain corrected.
- ``nous resume`` error: dropped the false "re-emit reproducibility metadata" claim (first-capture-wins).
- sdk_dispatch path-walk comment expanded to count all four ``.parent`` calls. Dead ``except (IndexError, ValueError)`` replaced with a real boundary check.
- ``--immediate`` help: "tool-call return" → "event boundary".
- README precondition list: principles.json clarification moved out of the numbered list (not a precondition).
- BLIS quote in execute_analyze.md gains a "see repo_commit in reproducibility_metadata to reproduce the snapshot" pointer.
- ``_walk_locked_workload`` tenant-tracking ternary documented.
- ``_resolve_turn_silence_threshold`` docstring acknowledges that step 3 and step 4 of the resolution chain are merged in ``_phase_silence_thresholds``.

Cumulative.patch failure visibility:
- ``emit_cumulative_patch`` now writes a ``patches/cumulative.patch.error`` sidecar with git stderr when emission fails. ``summarize_lineage`` reads it; ``nous lineage`` surfaces the message. Without the sidecar, a failed emission was a single warning line in orchestrator.log that downstream ``derived_from`` campaigns would silently miss months later.

Tests: 1303 passing (was 1278), 2 skipped, 0 regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
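
The F19 distinction between an absent setting and an explicit scalar can be sketched as below. This is an illustration under stated assumptions, not the body of ``_resolve_turn_silence_threshold``: the function name, the 300-second EXECUTE_ANALYZE default, and the rule that a partial map falls back to defaults for unlisted phases are all invented for the example. Only the 600-second DESIGN figure and the absent/scalar semantics come from the text above.

```python
from typing import Optional, Union

# DESIGN >= 600s comes from the PR text; 300 for EXECUTE_ANALYZE is illustrative.
DEFAULT_THRESHOLDS = {"DESIGN": 600, "EXECUTE_ANALYZE": 300}

Configured = Optional[Union[int, float, dict]]


def resolve_thresholds(configured: Configured = None) -> dict:
    """Resolve per-phase silence thresholds (seconds) from a campaign setting.

    absent (None)   -> per-phase defaults stand
    explicit scalar -> applied uniformly to every phase (back-compat)
    per-phase map   -> overrides defaults phase by phase (assumed behavior)
    """
    if configured is None:
        return dict(DEFAULT_THRESHOLDS)
    if isinstance(configured, bool):
        # bool is a subclass of int; reject it rather than treat True as 1 second.
        raise TypeError("threshold must be a number or a per-phase map")
    if isinstance(configured, (int, float)):
        return {phase: configured for phase in DEFAULT_THRESHOLDS}
    if isinstance(configured, dict):
        unknown = set(configured) - set(DEFAULT_THRESHOLDS)
        if unknown:
            raise ValueError(f"unknown phases: {sorted(unknown)}")
        resolved = dict(DEFAULT_THRESHOLDS)
        resolved.update(configured)
        return resolved
    raise TypeError("threshold must be a number or a per-phase map")


if __name__ == "__main__":
    print(resolve_thresholds())                  # defaults per phase
    print(resolve_thresholds(900))               # uniform scalar
    print(resolve_thresholds({"DESIGN": 1200}))  # partial per-phase map
```

The bug described above corresponds to collapsing the first two branches: if the absent case is initialised with a scalar 600, the per-phase defaults never take effect.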

sriumcp merged commit a30f09c into AI-native-Systems-Research:reflective Jun 1, 2026
2 checks passed

sriumcp mentioned this pull request Jun 1, 2026

review: address all PR #267 review findings (critical / important / suggestions) #268

Open

3 tasks

sriumcp merged 1 commit into AI-native-Systems-Research:reflective from sriumcp:friction-report-245
