fix: resolve all 21 sub-issues of friction-report tracker #245 #267

Merged
sriumcp merged 1 commit into AI-native-Systems-Research:reflective from sriumcp:friction-report-245 on Jun 1, 2026

@sriumcp sriumcp commented Jun 1, 2026

Summary

External campaign-author friction report from running the
paper-memorytime-mirage campaign on nous against BLIS surfaced
21 distinct points of friction. This PR resolves the entire tracker
in a single coherent change, organized by theme.

  • Spec-fidelity (F1, F2, F3, F4, F10, F13, F20): closes the gap left when --auto-approve bypasses HUMAN_DESIGN_GATE. campaign.locked_parameters (#246 / F1) hard-fails bundle deviations regardless of --auto-approve. locked_workload (#265 / F20) does the same for workload yamls. The methodology prompt now declares the campaign > target-repo-docs hierarchy.
  • Apparatus discipline (F7, F14, F16): methodology prompt edits cover the attribution-vs-totals distinction (with the BLIS runningBatch/RequestMap worked example), the unlocked-parameters audit, and the rehearsal-as-instrument positive pattern.
  • Lifecycle / portability (F5, F11, F12, F19, F21): the per-phase silence threshold (#264 / F19) closes the active stall where DESIGN's heavy reasoning trips an EXECUTE_ANALYZE-tuned watchdog. F21 introduces cumulative.patch + campaign.derived_from + nous lineage for cross-campaign code reuse.
  • Reproducibility (F17, F18): reproducibility_metadata auto-captured at INIT (target repo commit, hardware-config sha, language versions, latency-config snapshots). nous package tarballs work_dir + reproduce.sh + Dockerfile + README.
  • Hygiene (F6, F8, F9, F15): worktree_extras tracked-path warning, nous resume diagnostic, nous clean --orphaned, physical_realism_check schema.
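
The spec-fidelity rule in the first bullet can be sketched as follows. This is a minimal illustration, not the code in orchestrator/validate.py: the names `find_locked_deviations`, `enforce_locked_parameters`, and `LockedParameterError`, the flat dict shapes, and the choice to treat a missing key as a deviation are all assumptions made for the example.

```python
from typing import Any


class LockedParameterError(ValueError):
    """Hypothetical error type: a bundle deviates from a campaign-locked value."""


def find_locked_deviations(locked: dict[str, Any], bundle: dict[str, Any]) -> list[str]:
    """Return one message per locked key whose bundle value is missing or different."""
    missing = object()  # sentinel so a locked value of None is still comparable
    deviations = []
    for key, expected in sorted(locked.items()):
        actual = bundle.get(key, missing)
        if actual is missing:
            deviations.append(f"{key}: locked to {expected!r}, absent from bundle")
        elif actual != expected:
            deviations.append(f"{key}: locked to {expected!r}, bundle has {actual!r}")
    return deviations


def enforce_locked_parameters(
    campaign: dict[str, Any],
    bundle_params: dict[str, Any],
    auto_approve: bool = False,
) -> None:
    """Hard-fail on any deviation; auto_approve is accepted but deliberately ignored."""
    # Legacy campaigns without locked_parameters lock nothing, so they keep working.
    locked = campaign.get("locked_parameters") or {}
    deviations = find_locked_deviations(locked, bundle_params)
    if deviations:
        raise LockedParameterError("; ".join(deviations))
```

The point of the design is that `auto_approve` never enters the decision: a campaign that locks `total_kv_blocks` fails the same way whether or not a human gate is in the loop.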

The full per-F-entry → file map is in docs/friction-245-resolution.md. Every change is tagged in code with (#NNN / F<n>) so git blame + the issue tracker form a complete audit trail.

Why this PR is correct

A newcomer (or AI agent) navigating cold should be able to verify each F-entry's resolution in three steps:

  1. Open docs/friction-245-resolution.md.
  2. Find the F-entry's row; click through to the file/test the row references.
  3. Confirm the inline (#NNN / F<n>) tag in the referenced code for cross-reference.

The architectural primitives are deliberately small and orthogonal:

  • One validator function per F (in orchestrator/validate.py).
  • One CLI subcommand per F when CLI-shaped (lineage, clean, package).
  • One schema field per F with documentation in-place.

No refactors. Every behavior change is opt-in (legacy campaigns without locked_parameters keep working unchanged), and every schema addition is backward-compatible (oneOf'd into the existing fields where shape changed).

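New modules

  • orchestrator/reproducibility.py: F17 capture
  • orchestrator/lineage.py: F21 cumulative patches + derived_from
  • orchestrator/plot_specs.py: F18 figure pipeline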

New CLI surface

  • nous stop --immediate — F5 event-boundary halt
  • nous lineage <run_id> — F21 inheritance inspection
  • nous clean --orphaned — F9 stale-worktree cleanup
  • nous package <run_id> — F18 paper artifact tarball
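
For orientation, the four entry points above could be wired with argparse roughly as below. This is a sketch only: the subcommand and flag names come from the list, but the parser layout, help strings, and the assumption that `stop` and `clean` take no positional argument are hypothetical, not the actual nous CLI definition.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the four entry points listed above."""
    parser = argparse.ArgumentParser(prog="nous")
    sub = parser.add_subparsers(dest="command", required=True)

    stop = sub.add_parser("stop", help="halt a running campaign")
    stop.add_argument("--immediate", action="store_true",
                      help="halt at the next event boundary (F5)")

    lineage = sub.add_parser("lineage", help="inspect inheritance (F21)")
    lineage.add_argument("run_id")

    clean = sub.add_parser("clean", help="remove leftover state")
    clean.add_argument("--orphaned", action="store_true",
                       help="remove stale worktrees (F9)")

    package = sub.add_parser("package", help="build a paper artifact tarball (F18)")
    package.add_argument("run_id")
    return parser
```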

New schema fields

| Where | Field | Issue |
| --- | --- | --- |
| campaign | locked_parameters | #246 / F1 |
| campaign | locked_workload | #265 / F20 |
| campaign | derived_from | #266 / F21 |
| campaign | plot_specs | #263 / F18 |
| campaign | reproducibility_metadata (auto-populated) | #262 / F17 |
| campaign | sdk_timeouts.turn_silence_threshold_seconds (now accepts per-phase map) | #264 / F19 |
| bundle | experiment_spec.physical_realism_check | #260 / F15 |
| bundle | experiment_spec.unlocked_parameters_audit | #261 / F16 |
| bundle | workload_changes_from_canonical | #265 / F20 |
| bundle | experiment_spec.rehearsal_subset.depth_overrides | #248 / F3 |
| bundle | timing_observations.recommended_turn_silence_threshold_seconds (per-phase map) | #264 / F19 |
| state | reproducibility_metadata | #262 / F17 |

Test plan

  • pytest tests/test_friction_245.py — 32 new tests covering F1, F3, F4, F11, F15, F17, F18, F19, F20, F21 (the F-entries with behavioral changes; F2/F7/F16 are prompt-text changes; F5/F6/F8/F9/F10/F12/F13/F14 are CLI/docs/runtime side effects).
  • pytest tests/ — 1278 tests passing, 2 skipped, 0 regressions.
  • python -c "import yaml, jsonschema; ..." — schema additions parse and validate against jsonschema.

Authoring discipline (for posterity)

The PR introduces docs/campaign-authoring-guide.md which captures:

  • The "what to lock" inventory (avoiding the reactive failure mode that took five rounds to discover total_kv_blocks mattered in paper-memorytime-mirage).
  • The pre-lock unit-check discipline (a closed-form sanity check before locking parameters).
  • The rehearsal-as-instrument pattern (turning iter-1 into a diagnostic probe instead of a binary pass/fail).

CLAUDE.md and docs/architecture.md link to it.

Out of scope

uv.lock was already untracked at session start (pre-existing). I deliberately did not commit it; that is a separate concern from resolving #245.

🤖 Generated with Claude Code

…ystems-Research#245

Closes AI-native-Systems-Research#246
Closes AI-native-Systems-Research#247
Closes AI-native-Systems-Research#248
Closes AI-native-Systems-Research#249
Closes AI-native-Systems-Research#250
Closes AI-native-Systems-Research#251
Closes AI-native-Systems-Research#252
Closes AI-native-Systems-Research#253
Closes AI-native-Systems-Research#254
Closes AI-native-Systems-Research#255
Closes AI-native-Systems-Research#256
Closes AI-native-Systems-Research#257
Closes AI-native-Systems-Research#258
Closes AI-native-Systems-Research#259
Closes AI-native-Systems-Research#260
Closes AI-native-Systems-Research#261
Closes AI-native-Systems-Research#262
Closes AI-native-Systems-Research#263
Closes AI-native-Systems-Research#264
Closes AI-native-Systems-Research#265
Closes AI-native-Systems-Research#266
Closes AI-native-Systems-Research#245

External campaign-author friction report from running the
paper-memorytime-mirage campaign on nous against BLIS surfaced 21
distinct points of friction, clustered around five themes:

A. Spec-fidelity (F1, F2, F3, F4, F10, F13, F20) — nous validated
   *self-consistency* (executor matches bundle) but not *spec-fidelity*
   (bundle matches campaign) under --auto-approve. Headline
   architectural primitive: campaign.locked_parameters (AI-native-Systems-Research#246/F1)
   hard-fails any deviation, regardless of --auto-approve. Adoption:
   locked_workload (AI-native-Systems-Research#265/F20), unlocked_parameters_audit (AI-native-Systems-Research#261/F16),
   methodology hierarchy (AI-native-Systems-Research#247/F2), depth_overrides+invalidates
   (AI-native-Systems-Research#248/F3), gate-summary diff (AI-native-Systems-Research#249/F4), auto-approve safety docs
   (AI-native-Systems-Research#255/F10), create-campaign scaffold + authoring guide (AI-native-Systems-Research#258/F13).

B. Apparatus discipline (F7, F14, F16) — invariants must validate
   ATTRIBUTION, not upstream totals. Methodology prompt sections in
   design.md/execute_analyze.md cover the bug-class question with the
   BLIS runningBatch vs RequestMap worked example. Authoring guide
   covers rehearsal-as-instrument and pre-lock unit checks.

C. Lifecycle / portability (F5, F11, F12, F19, F21) — per-phase
   silence threshold (AI-native-Systems-Research#264/F19) closes the active stall where DESIGN's
   heavy reasoning trips an EXECUTE_ANALYZE-tuned watchdog. F21 lands
   cross-campaign code reuse via cumulative.patch + derived_from +
   nous lineage. F5 stop --immediate, F11 high-BUILD warning, F12
   asyncio race fix.

D. Reproducibility (F17, F18) — reproducibility_metadata auto-captured
   at INIT (target repo commit, hardware-config sha, language versions,
   latency-config snapshots). nous package tarballs work_dir +
   reproduce.sh + Dockerfile + README for paper artifact evaluation.

E. Hygiene (F6, F8, F9, F15) — F6 worktree_extras tracked-path
   warning at campaign load; F8 nous resume diagnostic for work_dir
   confusion; F9 nous clean --orphaned; F15 physical_realism_check
   schema + soft warning.
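
Theme D's capture-at-INIT behavior, combined with the first-capture-wins rule noted in the follow-up review, can be illustrated with a small sketch. The function names `capture_metadata` and `attach_metadata_once`, their signatures, and the metadata keys other than `repo_commit` are assumptions for this example; the real logic lives in orchestrator/reproducibility.py and may differ.

```python
import json
import platform
import sys
from pathlib import Path


def capture_metadata(repo_commit: str) -> dict:
    """Collect a small, illustrative subset of reproducibility metadata."""
    return {
        "repo_commit": repo_commit,
        "python_version": platform.python_version(),
        "platform": sys.platform,
    }


def attach_metadata_once(state_path: Path, metadata: dict) -> bool:
    """Record metadata in state.json unless it is already there.

    Returns True when the metadata was written and False when an earlier
    capture was kept (first capture wins). Raises RuntimeError if state.json
    cannot be parsed, rather than returning silently.
    """
    try:
        state = json.loads(state_path.read_text())
    except json.JSONDecodeError as exc:
        raise RuntimeError(f"cannot parse {state_path}: {exc}") from exc
    if "reproducibility_metadata" in state:
        return False
    state["reproducibility_metadata"] = metadata
    state_path.write_text(json.dumps(state, indent=2))
    return True
```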

See docs/friction-245-resolution.md for the per-F-entry → file map.
Every change is tagged in code with (#NNN / F<n>) so git blame +
issue tracker form a complete audit trail.

New modules:
  orchestrator/reproducibility.py  — F17 capture
  orchestrator/lineage.py          — F21 cumulative patches + derived_from
  orchestrator/plot_specs.py       — F18 figure pipeline
New CLI subcommands:
  nous lineage, nous clean, nous package, nous stop --immediate
New schema fields:
  campaign:  locked_parameters, locked_workload, derived_from,
             plot_specs, reproducibility_metadata, sdk_timeouts.turn_silence_threshold_seconds (per-phase map)
  bundle:    physical_realism_check, unlocked_parameters_audit,
             workload_changes_from_canonical, rehearsal_subset.depth_overrides,
             timing_observations.recommended_turn_silence_threshold_seconds (per-phase map)

Tests: 32 new in tests/test_friction_245.py, 1278 total passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sriumcp sriumcp merged commit a30f09c into AI-native-Systems-Research:reflective Jun 1, 2026
2 checks passed
sriumcp added a commit to sriumcp/agentic-strategy-evolution that referenced this pull request Jun 1, 2026
…native-Systems-Research#267

Addresses every finding from the multi-agent PR review:

Critical:
- C1: drop ``except ImportError: pass`` for orchestrator.lineage in
  iteration.py. Hoist imports to module top — a broken intra-package
  module is a self-inflicted bug, not an optional dependency, and the
  preceding comment promised "loud failure" anyway.
- C2: ``_validate_locked_workload`` now surfaces malformed/unreadable
  workload yamls as deviations instead of silently skipping them.
  Catching workload drift is the whole point of F20.
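
The C2 change amounts to turning load failures into reported deviations. A minimal sketch of that pattern follows; it is not the body of ``_validate_locked_workload``. The name `check_locked_workload`, the flat key comparison, and the pluggable `parse` argument are assumptions, and `json.loads` stands in for a YAML loader only so the example runs on the standard library.

```python
import json
from pathlib import Path
from typing import Any, Callable


def check_locked_workload(
    path: Path,
    locked: dict[str, Any],
    parse: Callable[[str], Any] = json.loads,
) -> list[str]:
    """Return deviations for one workload file; load failures count as deviations."""
    try:
        doc = parse(path.read_text())
    except OSError as exc:
        return [f"{path}: unreadable ({exc})"]
    except ValueError as exc:
        # json.JSONDecodeError subclasses ValueError; a YAML loader would need
        # its own parse-error class caught here as well.
        return [f"{path}: malformed ({exc})"]
    if not isinstance(doc, dict):
        return [f"{path}: expected a mapping, got {type(doc).__name__}"]
    return [
        f"{path}: {key} locked to {want!r}, found {doc.get(key)!r}"
        for key, want in locked.items()
        if doc.get(key) != want
    ]
```

Because every failure path returns a non-empty list, a caller that hard-fails on any deviation can no longer be bypassed by a file that simply fails to load.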

Important:
- I1: ``invoke_plot_specs`` no longer guesses ``campaign_yaml_dir``
  from work_dir's parent. Threaded ``campaign_path`` through
  ``setup_work_dir`` (recorded in state.json["config_ref"]) and a
  new ``_campaign_yaml_dir_from_state`` reader resolves it for the
  finalize step. Plot script paths now resolve correctly in
  production, not just tests.
- I2: ``_pick_interpreter`` rewritten as ``_build_command``,
  returning the proper argv list. The previous version invoked
  executable scripts with themselves as ``argv[1]``.
- I3: aclose cleanup's broad ``except Exception: pass`` now logs
  at WARNING instead of swallowing silently. The narrow tuple of
  documented races (TimeoutError, CancelledError, RuntimeError,
  GeneratorExit) is still silent — only the unknown-class fallback
  gains observability.
- I4-I10: 25 new behavioral tests covering F4 auto-approve diff
  emission, F19 ``_resolve_turn_silence_threshold`` per-phase +
  scalar back-compat (uncovered a real bug — fix below), F17
  attach_to_state idempotency + repo_dirty capture +
  snapshot_iter_files, F11 boundary at total_files=4/5 +
  formula assertion, F12 RuntimeError + non-documented exception
  handling, F20 declared-deviation pin, F21 apply_derived_from_patch
  round-trip + cumulative.patch.error sidecar.
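
The I2 fix concerns how the argv list for a plot script is assembled. The sketch below shows one way to build it so that the script path appears exactly once; it is an illustration under assumptions, not the actual ``_build_command``. The rule used here (run files with an execute bit directly, run other `.py` files through the current interpreter) and the name `build_command` are hypothetical.

```python
import stat
import sys
from pathlib import Path

EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def build_command(script: Path, args: list[str]) -> list[str]:
    """Return the full argv list for running a script.

    Executable files are invoked directly, so the script path is argv[0] and
    is not repeated as argv[1]. Non-executable .py files are run through the
    interpreter that is executing this process.
    """
    if script.stat().st_mode & EXEC_BITS:
        return [str(script), *args]
    if script.suffix == ".py":
        return [sys.executable, str(script), *args]
    raise ValueError(f"{script} is neither executable nor a .py file")
```

Returning the complete list from one function, rather than picking an interpreter and letting the caller append the script, removes the opportunity to add the script path twice.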

Bug surfaced and fixed by F19 tests:
- When ``sdk_timeouts.turn_silence_threshold_seconds`` was unset
  entirely, the old init applied 600 to every phase (defeating
  F19's per-phase split). Now distinguished from the explicit
  scalar form: absent → per-phase defaults stand;
  explicit scalar → applied uniformly (back-compat).
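
The three configuration shapes described above (absent, explicit scalar, per-phase map) can be resolved as in the following sketch. It is not the real ``_resolve_turn_silence_threshold``: the function name, the `PHASE_DEFAULTS` table, and the fallback for phases missing from a map are assumptions. The DESIGN value of 600 follows the F19 issue title; the EXECUTE_ANALYZE value is a placeholder, since the source only says it is lower.

```python
from collections.abc import Mapping
from typing import Union

# DESIGN >= 600s is from the F19 issue title; 300.0 is an illustrative placeholder.
PHASE_DEFAULTS = {"DESIGN": 600.0, "EXECUTE_ANALYZE": 300.0}

Configured = Union[None, int, float, Mapping]


def resolve_turn_silence_threshold(configured: Configured, phase: str) -> float:
    """Resolve the silence threshold in seconds for one phase.

    None (key absent)  -> the per-phase default stands.
    Mapping            -> the phase's entry, else the per-phase default (assumed).
    Scalar             -> applied uniformly to every phase (back-compat).
    """
    if configured is None:
        return PHASE_DEFAULTS[phase]
    if isinstance(configured, Mapping):
        return float(configured.get(phase, PHASE_DEFAULTS[phase]))
    return float(configured)
```

The bug described above corresponds to collapsing the first case into the third: treating an absent key as if the operator had written a scalar 600.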

Code-reviewer suggestions:
- S1: ``attach_to_state`` is no longer dead code. Docstring updated;
  RuntimeError on JSON decode failure (was: silent return).
- S2: ``nous package`` stages reproduce.sh / Dockerfile /
  PACKAGE_README.md to a temp directory and tars from there —
  the work_dir on disk is unchanged.
- S3: redundant ``except (OSError, Exception)`` in
  ``summarize_lineage`` replaced with narrow handlers that record
  ``campaign_yaml_error`` into the summary so ``nous lineage``
  shows the operator why derived_from couldn't be determined.
- S4: ``campaign.plot_specs`` schema description corrected — runs
  per-iteration during finalize, not as a separate end-of-campaign
  rollup.
- S5: snapshot guard short-circuit removed; idempotency lives in
  ``snapshot_iter_files`` itself.
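
The staging approach in S2 can be sketched with the standard library as below. This is an illustration, not the ``nous package`` implementation: the function name `package_run`, the `extras` mapping of file name to content, and the archive layout (extras placed inside the top-level run directory) are assumptions.

```python
import tarfile
import tempfile
from pathlib import Path


def package_run(work_dir: Path, out_path: Path, extras: dict[str, str]) -> Path:
    """Build a .tar.gz of work_dir plus generated files without touching work_dir.

    Generated files (for example reproduce.sh or a Dockerfile) are written to
    a temporary staging directory and added to the archive from there.
    """
    with tempfile.TemporaryDirectory() as staging:
        staging_dir = Path(staging)
        for name, content in extras.items():
            (staging_dir / name).write_text(content)
        with tarfile.open(out_path, "w:gz") as tar:
            # The run directory goes in under its own name, recursively.
            tar.add(work_dir, arcname=work_dir.name)
            for name in sorted(extras):
                tar.add(staging_dir / name, arcname=f"{work_dir.name}/{name}")
    return out_path
```

Keeping generated files out of the work_dir means packaging a run twice, or packaging and then resuming, sees the same on-disk state.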

Comment errors fixed:
- ``reproducibility.attach_to_state`` 24h re-init claim removed.
- ``_emit_high_build_warning`` docstring no longer claims an
  unimplemented OR clause.
- ``compute_campaign_spec_diff`` docstring says "five sub-keys",
  not "three".
- campaign schema: silence-threshold fallback chain corrected.
- ``nous resume`` error: dropped the false "re-emit reproducibility
  metadata" claim (first-capture-wins).
- sdk_dispatch path-walk comment expanded to count all four
  ``.parent`` calls. Dead ``except (IndexError, ValueError)``
  replaced with a real boundary check.
- ``--immediate`` help: "tool-call return" → "event boundary".
- README precondition list: principles.json clarification moved
  out of the numbered list (not a precondition).
- BLIS quote in execute_analyze.md gains a "see repo_commit in
  reproducibility_metadata to reproduce the snapshot" pointer.
- ``_walk_locked_workload`` tenant-tracking ternary documented.
- ``_resolve_turn_silence_threshold`` docstring acknowledges that
  step 3 and step 4 of the resolution chain are merged in
  ``_phase_silence_thresholds``.

Cumulative.patch failure visibility:
- ``emit_cumulative_patch`` now writes a
  ``patches/cumulative.patch.error`` sidecar with git stderr when
  emission fails. ``summarize_lineage`` reads it; ``nous lineage``
  surfaces the message. Without the sidecar, a failed emission
  was a single warning line in orchestrator.log that downstream
  ``derived_from`` campaigns would silently miss months later.
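
The sidecar mechanism can be sketched as a pair of small functions: one records the outcome of a patch emission, the other reads the failure back for display. The names `record_patch_result` and `read_patch_error`, the explicit `returncode`/`stdout`/`stderr` parameters, and the choice to clear a stale sidecar on success are assumptions for illustration; only the file names come from the text above.

```python
from pathlib import Path
from typing import Optional

PATCH_NAME = "cumulative.patch"
ERROR_NAME = "cumulative.patch.error"


def record_patch_result(patches_dir: Path, returncode: int, stdout: str, stderr: str) -> bool:
    """Persist the result of a git patch emission; return True on success.

    On failure, git's stderr is written to a sidecar next to where the patch
    would have been, so the failure stays visible after the log has scrolled away.
    """
    patches_dir.mkdir(parents=True, exist_ok=True)
    patch = patches_dir / PATCH_NAME
    sidecar = patches_dir / ERROR_NAME
    if returncode == 0:
        patch.write_text(stdout)
        sidecar.unlink(missing_ok=True)  # assumed: a later success clears an old error
        return True
    sidecar.write_text(stderr or f"git exited with status {returncode}")
    return False


def read_patch_error(patches_dir: Path) -> Optional[str]:
    """Return the recorded failure message, or None if no sidecar exists."""
    sidecar = patches_dir / ERROR_NAME
    return sidecar.read_text() if sidecar.exists() else None
```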

Tests: 1303 passing (was 1278), 2 skipped, 0 regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>