Skip to content

Tighten effectiveness evidence report traceability#33

Merged
baskduf merged 3 commits into
baskduf:mainfrom
jihwan4155:evidence/effectiveness-small-pass
Jun 4, 2026
Merged

Tighten effectiveness evidence report traceability#33
baskduf merged 3 commits into
baskduf:mainfrom
jihwan4155:evidence/effectiveness-small-pass

Conversation

@jihwan4155
Copy link
Copy Markdown
Contributor

I updated the PR to address the requested changes:

  • Renamed the report to docs/examples/effectiveness-small-evidence-report.md so it is included by scripts/check_effectiveness_plan.py.
  • Updated the docs/evaluation.md link to the renamed report.
  • Strengthened traceability in the task outcome records by replacing weak local refs with stable repository/commit references.
  • Narrowed the drift metric by separating the single docs-drift violation from broader review/feedback-loop gaps.
  • Reworded the report claim to stay narrow: operational outcome evidence, not proof of general agent effectiveness.

Checks run:

  • python scripts/check_docs_drift.py
  • python scripts/check_structure.py
  • python scripts/check_encoding_hygiene.py
  • python scripts/check_effectiveness_plan.py
  • python scripts/check_decision_memory.py
  • python scripts/harness_doctor.py --target .
  • python -m unittest discover -s tests

All checks passed. harness_doctor.py --target . reports 100/100 baseline evidence.

Copy link
Copy Markdown
Owner

@baskduf baskduf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tightening the evidence report. I don’t think this is ready to approve yet because the main verification claim is still broken.

Blocking:

  • docs/examples/effectiveness-small-evidence-report.md is not actually included by scripts/check_effectiveness_plan.py. The checker only treats Markdown files as reports when the filename contains the contiguous substring effectiveness-report, but this filename contains effectiveness-small-evidence-report, so is_report(...) returns false. That means CI can pass while skipping this new report.
    Please either rename the file to something like effectiveness-report-small-evidence.md, or update the checker and tests to intentionally include this naming pattern.

Also worth fixing before merge:

  • The task outcome records now include jihwan4155/recipe-api@99af81..., but that repo/commit is not accessible to me via GitHub, and the aggregate report still says Repository refs compared: local practice branch snapshots. This weakens the traceability goal from docs/evaluation.md.
  • The aggregate Review gaps detected count looks ambiguous: the report says 3, but task 001 lists 3 review gaps and task 002 also records the missing PATCH tests as a review gap. Please reconcile the count or narrow the metric definition.

Checks I ran on the PR head:

  • python3 -m unittest discover -s tests
  • python3 -m py_compile ...
  • python3 scripts/check_docs_drift.py
  • python3 scripts/check_structure.py
  • python3 scripts/check_encoding_hygiene.py
  • python3 scripts/check_effectiveness_plan.py
  • python3 scripts/check_failure_memory.py
  • python3 scripts/check_decision_memory.py --base ca367dbd3da9e89aa2653cf26bcdd00180b792a9
  • python3 scripts/harness_doctor.py --target .

All passed, but the first issue means the new evidence report is currently not being validated.

@baskduf
Copy link
Copy Markdown
Owner

baskduf commented Jun 4, 2026

I don’t think this is ready to approve yet, but this should be a small fix.

The blocking issue is that the new report is still not included by scripts/check_effectiveness_plan.py: the checker only recognizes Markdown files whose names contain the contiguous substring effectiveness-report, while this file is named effectiveness-small-evidence-report.md. So CI can pass while skipping the new report.

Please either rename it to something like effectiveness-report-small-evidence.md, or update the checker/tests to intentionally include this naming pattern.

Non-blocking: it would also help to tighten the source refs / aggregate count wording, but the validation mismatch is the main thing blocking approval.

@jihwan4155
Copy link
Copy Markdown
Contributor Author

I updated the PR to address the blocking validation issue and the additional evidence-quality concerns.

Changes made:

  • Renamed the report to docs/examples/effectiveness-report-small-evidence.md so it is included by scripts/check_effectiveness_plan.py.
  • Updated the docs/evaluation.md link to the renamed report.
  • Updated the report language to keep the claim narrow and separate harness health from observed task outcomes.
  • Separated the single docs-drift violation from broader review/feedback-loop gaps.
  • Reconciled the review gap count so the aggregate report matches the task outcome records.
  • Updated the task outcome records/report traceability language.

Checks run:

  • python scripts/check_docs_drift.py
  • python scripts/check_structure.py
  • python scripts/check_encoding_hygiene.py
  • python scripts/check_effectiveness_plan.py
  • python scripts/check_failure_memory.py
  • python scripts/check_decision_memory.py
  • python scripts/harness_doctor.py --target .
  • python -m unittest discover -s tests

All checks pass locally. python -m unittest discover -s tests ran 116 tests with OK (skipped=1).

@baskduf
Copy link
Copy Markdown
Owner

baskduf commented Jun 4, 2026

The previous blockers look resolved now.

  • The effectiveness report is discovered by check_effectiveness_plan.py.
  • The review-gap count is reconciled.
  • The referenced source commit jihwan4155/recipe-api@99af81bf0da4a8bfecb19e5ca0af817b276f49b6 is now reachable.

I’m approving this from a review standpoint. Please make sure the GitHub Actions Harness Check is approved/re-run and passing before merge, since the latest PR run is currently action-required rather than green.

Copy link
Copy Markdown
Owner

@baskduf baskduf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous blockers are resolved: the effectiveness report is discovered by the checker, the review-gap count is reconciled, and the referenced source commit is now reachable. Local harness checks passed in review.

@baskduf baskduf merged commit 2e7bad5 into baskduf:main Jun 4, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants