diff --git a/docs/evaluation.md b/docs/evaluation.md
index 367cf64..e66d66b 100644
--- a/docs/evaluation.md
+++ b/docs/evaluation.md
@@ -95,3 +95,7 @@
 Treat the data as operational evidence, not a controlled scientific study. Small repositories and changing agents can introduce noise. The useful signal is whether the same classes of mistakes become less frequent, easier to detect, or cheaper to correct after the harness becomes part of the repository.
+
+## Example Evidence Passes
+
+- [Small harness outcome evidence report](examples/effectiveness-report-small-evidence.md) records three harnessed task outcomes and summarizes a narrow operational evidence pass without treating Harness Doctor scores or passing checks as proof of agent effectiveness.
\ No newline at end of file
diff --git a/docs/examples/effectiveness-report-small-evidence.md b/docs/examples/effectiveness-report-small-evidence.md
new file mode 100644
index 0000000..601b41a
--- /dev/null
+++ b/docs/examples/effectiveness-report-small-evidence.md
@@ -0,0 +1,114 @@
+# Small Harness Effectiveness Evidence Pass
+
+## Target
+
+- Repository: local Django REST Framework recipe-api practice target
+- Stack and framework: Python, Django, Django REST Framework
+- Evaluation date or window: 2026-06-03
+- Agent or model: AI coding agent with human review
+- Evaluation mode: harnessed-only tracking
+
+## Primary Metric
+
+The primary metric for this pass is whether review-relevant agent outcome gaps became observable and correctable through harness artifacts and checks.
+
+This pass tracks:
+
+- wrong-file edits
+- repeated known mistakes
+- first-pass verification result
+- drift violations detected
+- human rework minutes
+- reverted files
+
+Harness Doctor scores and passing checks are recorded only as harness health signals. They are not treated as proof of agent effectiveness.
+
+## Task Set
+
+| Task ID | Scenario | Expected boundary | Common failure |
+| --- | --- | --- | --- |
+| recipe-api-harness-adoption-cleanup | Refine initial harness adoption | AGENTS.md, docs, .harness, scripts, README, tests | Generic scaffolding, weak tests, docs drift |
+| recipe-api-add-category-feature | Add Category model/API support | recipes app, migration, tests, README, domain docs, decision record | Missing migration, wrong-file edits, missing decision memory |
+| recipe-api-category-update-test-hardening | Add category PATCH coverage and fix dependency ADR | tests and one ADR | Incomplete verification coverage, truncated record |
+
+## Results
+
+| Metric | Baseline | Harnessed | Delta |
+| --- | --- | --- | --- |
+| Wrong-file edits | Not available | 0 across 3 recorded tasks | Inconclusive; no baseline |
+| Repeated mistakes | Not available | 0 repeated known mistakes observed | Inconclusive; no baseline |
+| First-pass verification success | Not available | 1 pass, 1 fail-then-pass, 1 pass-with-review-gap | Mixed; review still needed |
+| Docs-drift violations detected | Not available | 1 docs-drift violation detected and fixed | Narrow positive signal |
+| Review gaps detected | Not available | 4 review or feedback-loop gaps detected and fixed | Positive operational signal |
+| Human rework minutes | Not available | Approx. 55 minutes across 3 tasks | Initial benchmark only |
+| Reverted files | Not available | 0 | Inconclusive; no baseline |
+
+The four review gaps were:
+- README existed but was not useful project documentation.
+- Django tests initially found zero tests.
+- Some generated harness guidance remained too generic for the Django REST Framework target.
+- Category update and clear behavior was documented but not covered by PATCH tests on the first feature pass.
+
+## Run Log
+
+| Condition | Task ID | Run | Verification result | Notes |
+| --- | --- | --- | --- | --- |
+| harnessed-only | recipe-api-harness-adoption-cleanup | recipe-api-001 | failed_then_passed_after_review | Review exposed zero tests, README quality issue, one docs-drift issue, and generic scaffold residue. |
+| harnessed-only | recipe-api-add-category-feature | recipe-api-002 | passed_with_review_gap | Migration, docs, and ADR were present; review found missing PATCH tests. |
+| harnessed-only | recipe-api-category-update-test-hardening | recipe-api-003 | passed | Follow-up stayed within expected files and completed missing coverage/ADR cleanup. |
+
+## Source Records
+
+- Task outcome records reviewed:
+  - `docs/examples/task-outcomes/001-recipe-api-harness-adoption.yaml`
+  - `docs/examples/task-outcomes/002-recipe-api-category-feature.yaml`
+  - `docs/examples/task-outcomes/003-recipe-api-category-update-tests.yaml`
+- Repository refs compared: jihwan4155/recipe-api@99af81bf0da4a8bfecb19e5ca0af817b276f49b6
+- Prompt refs compared: local adoption, review, refresh, and feature prompts
+- Target-local decision records reviewed:
+  - Initial API design decision
+  - Python dependency management decision
+  - Recipe category model/API decision
+- Verification commands compared:
+  - `python manage.py check`
+  - `python manage.py test`
+  - `python manage.py makemigrations --check --dry-run`
+  - `python scripts/check_docs_drift.py`
+  - `python scripts/check_structure.py`
+  - `python scripts/check_decision_memory.py --fail-on-warning`
+  - `python scripts/check_encoding_hygiene.py`
+
+## Interpretation
+
+### What improved
+
+- Harness review and refresh made weak tests, one docs-drift issue, incomplete README content, and decision-memory gaps visible.
+- The Category feature stayed within expected Django app boundaries and included migration, tests, README, domain glossary, and decision memory.
+- Follow-up work was narrow and did not require reverting unrelated files.
+
+### What did not improve
+
+- First-pass verification was not consistently complete; the Category feature still missed PATCH category update and clear tests.
+- There is no pre-harness baseline for the same tasks, so improvement cannot be quantified.
+
+### Confounders or limitations
+
+- This is a small harnessed-only evidence pass, not a controlled experiment.
+- Human review remained active and may have prevented or corrected issues before they became committed defects.
+- Metrics such as human rework minutes are approximate.
+- The tasks came from a small practice repository, not a production system.
+
+### Narrow claim
+
+This pass provides operational evidence that harness artifacts made review and verification gaps more observable and easier to correct in a small Django REST Framework practice workflow.
+
+It does not prove that harness adoption generally improves agent effectiveness.
+
+## Follow-Up
+
+- Next review window: next 2-3 comparable Django or TodayBus dogfood tasks
+- Owner or reviewer: maintainer or dogfood reviewer
+- Related target-local decision records:
+  - Initial API design decision
+  - Python dependency management decision
+  - Recipe category model/API decision
\ No newline at end of file
diff --git a/docs/examples/task-outcomes/001-recipe-api-harness-adoption.yaml b/docs/examples/task-outcomes/001-recipe-api-harness-adoption.yaml
new file mode 100644
index 0000000..0e070a4
--- /dev/null
+++ b/docs/examples/task-outcomes/001-recipe-api-harness-adoption.yaml
@@ -0,0 +1,72 @@
+schema_version: 1
+
+target:
+  repository: jihwan4155/recipe-api
+  repository_ref: jihwan4155/recipe-api@99af81bf0da4a8bfecb19e5ca0af817b276f49b6
+  stack_or_framework: Django REST Framework
+  date: 2026-06-03
+  agent_or_model: AI coding agent
+  reviewer: human maintainer
+
+task:
+  id: recipe-api-harness-adoption-cleanup
+  run_id: recipe-api-001
+  prompt_summary: Apply and refine a minimal Django harness adoption.
+  prompt_ref: local harness adoption and review prompts
+  prompt_hash: not recorded
+  comparable_task_group: django-harness-adoption
+  condition: harnessed-only
+  expected_boundary:
+    - AGENTS.md
+    - docs/**
+    - .harness/**
+    - scripts/**
+    - recipes/tests.py
+    - README.md
+  known_failure_mode: Generic scaffold drift and weak feedback loops after initial adoption.
+
+harness_context:
+  harness_doctor_score: recorded locally but not treated as effectiveness proof
+  harness_source:
+    kit_url: https://github.com/baskduf/harness-starter-kit
+    kit_commit: baskduf/harness-starter-kit@94e416b354facffafead6bbb9691af1598139389
+    source_tracking_ref: recipe-api/harness-starter-kit
+  relevant_instructions:
+    - AGENTS.md
+    - docs/conventions/coding.md
+  relevant_constraints:
+    - python manage.py check
+    - python manage.py test
+    - python scripts/check_docs_drift.py
+    - python scripts/check_decision_memory.py --fail-on-warning
+  relevant_memory_records:
+    - docs/decisions/001-initial-api-design.md
+    - docs/failures/000-template.md
+
+outcome:
+  files_changed:
+    - AGENTS.md
+    - README.md
+    - docs/conventions/coding.md
+    - docs/decisions/001-initial-api-design.md
+    - docs/domain/glossary.md
+    - recipes/tests.py
+  wrong_file_edits: 0
+  repeated_known_mistake: false
+  verification_command: python manage.py check && python manage.py test && python scripts/check_docs_drift.py
+  first_pass_verification:
+    result: failed_then_passed_after_review
+  drift_violations_detected:
+    - docs/domain/glossary.md used an API route string that docs-drift treated as a missing path.
+  review_gaps_detected:
+    - README existed but was not useful project documentation.
+    - Django tests initially found zero tests.
+    - Some generated harness guidance remained too generic for the Django REST Framework target.
+  human_rework_minutes: 35
+  reverted_files: []
+  notes: Review surfaced weak tests, generic scaffold text, and docs drift. The harness made those gaps explicit but did not prevent all first-pass issues.
+
+follow_up:
+  harness_change_needed: false
+  decision_or_failure_record: No failure record added; review findings were one-time adoption cleanup.
+  include_in_effectiveness_report: true
\ No newline at end of file
diff --git a/docs/examples/task-outcomes/002-recipe-api-category-feature.yaml b/docs/examples/task-outcomes/002-recipe-api-category-feature.yaml
new file mode 100644
index 0000000..35223a8
--- /dev/null
+++ b/docs/examples/task-outcomes/002-recipe-api-category-feature.yaml
@@ -0,0 +1,72 @@
+schema_version: 1
+
+target:
+  repository: jihwan4155/recipe-api
+  repository_ref: jihwan4155/recipe-api@99af81bf0da4a8bfecb19e5ca0af817b276f49b6
+  stack_or_framework: Django REST Framework
+  date: 2026-06-03
+  agent_or_model: AI coding agent
+  reviewer: human maintainer
+
+task:
+  id: recipe-api-add-category-feature
+  run_id: recipe-api-002
+  prompt_summary: Add Category support to the Recipe API.
+  prompt_ref: local feature prompt requiring AGENTS.md and docs review
+  prompt_hash: not recorded
+  comparable_task_group: django-model-api-change
+  condition: harnessed-only
+  expected_boundary:
+    - recipes/models.py
+    - recipes/admin.py
+    - recipes/serializers.py
+    - recipes/tests.py
+    - recipes/migrations/**
+    - README.md
+    - docs/domain/glossary.md
+    - docs/decisions/**
+  known_failure_mode: Model changes without migrations, missing durable decision memory, or API docs drift.
+
+harness_context:
+  harness_doctor_score: recorded locally but not treated as effectiveness proof
+  harness_source:
+    kit_url: https://github.com/baskduf/harness-starter-kit
+    kit_commit: baskduf/harness-starter-kit@94e416b354facffafead6bbb9691af1598139389
+    source_tracking_ref: recipe-api/harness-starter-kit
+  relevant_instructions:
+    - AGENTS.md
+    - docs/conventions/coding.md
+  relevant_constraints:
+    - python manage.py makemigrations --check --dry-run
+    - python manage.py check
+    - python manage.py test
+    - python scripts/check_decision_memory.py --fail-on-warning
+  relevant_memory_records:
+    - docs/decisions/001-initial-api-design.md
+    - docs/decisions/003-add-recipe-categories.md
+    - docs/domain/glossary.md
+
+outcome:
+  files_changed:
+    - recipes/models.py
+    - recipes/admin.py
+    - recipes/serializers.py
+    - recipes/tests.py
+    - recipes/migrations/0002_category_recipe_category.py
+    - README.md
+    - docs/domain/glossary.md
+    - docs/decisions/003-add-recipe-categories.md
+  wrong_file_edits: 0
+  repeated_known_mistake: false
+  verification_command: python manage.py check && python manage.py test && python manage.py makemigrations --check --dry-run
+  first_pass_verification:
+    result: passed_with_review_gap
+  drift_violations_detected: []
+  human_rework_minutes: 15
+  reverted_files: []
+  notes: Agent respected file boundaries and created migration plus decision memory. Review found missing PATCH tests for assigning and clearing category_id.
+
+follow_up:
+  harness_change_needed: false
+  decision_or_failure_record: docs/decisions/003-add-recipe-categories.md
+  include_in_effectiveness_report: true
\ No newline at end of file
diff --git a/docs/examples/task-outcomes/003-recipe-api-category-update-tests.yaml b/docs/examples/task-outcomes/003-recipe-api-category-update-tests.yaml
new file mode 100644
index 0000000..17082e6
--- /dev/null
+++ b/docs/examples/task-outcomes/003-recipe-api-category-update-tests.yaml
@@ -0,0 +1,58 @@
+schema_version: 1
+
+target:
+  repository: jihwan4155/recipe-api
+  repository_ref: jihwan4155/recipe-api@99af81bf0da4a8bfecb19e5ca0af817b276f49b6
+  stack_or_framework: Django REST Framework
+  date: 2026-06-03
+  agent_or_model: AI coding agent
+  reviewer: human maintainer
+
+task:
+  id: recipe-api-category-update-test-hardening
+  run_id: recipe-api-003
+  prompt_summary: Add PATCH tests for assigning and clearing recipe categories, and fix truncated dependency ADR.
+  prompt_ref: local review follow-up prompt
+  prompt_hash: not recorded
+  comparable_task_group: django-test-hardening
+  condition: harnessed-only
+  expected_boundary:
+    - recipes/tests.py
+    - docs/decisions/002-use-requirements-file.md
+  known_failure_mode: Incomplete test coverage for documented PUT/PATCH behavior and truncated decision record.
+
+harness_context:
+  harness_doctor_score: recorded locally but not treated as effectiveness proof
+  harness_source:
+    kit_url: https://github.com/baskduf/harness-starter-kit
+    kit_commit: baskduf/harness-starter-kit@94e416b354facffafead6bbb9691af1598139389
+    source_tracking_ref: recipe-api/harness-starter-kit
+  relevant_instructions:
+    - AGENTS.md
+    - docs/conventions/coding.md
+  relevant_constraints:
+    - python manage.py check
+    - python manage.py test
+    - python scripts/check_docs_drift.py
+  relevant_memory_records:
+    - docs/decisions/002-use-requirements-file.md
+    - docs/decisions/003-add-recipe-categories.md
+
+outcome:
+  files_changed:
+    - recipes/tests.py
+    - docs/decisions/002-use-requirements-file.md
+  wrong_file_edits: 0
+  repeated_known_mistake: false
+  verification_command: python manage.py check && python manage.py test && python scripts/check_docs_drift.py
+  first_pass_verification:
+    result: passed
+  drift_violations_detected: []
+  human_rework_minutes: 5
+  reverted_files: []
+  notes: Follow-up tightened test coverage and completed a truncated ADR without broad rewrites.
+
+follow_up:
+  harness_change_needed: false
+  decision_or_failure_record: Existing ADR updated.
+  include_in_effectiveness_report: true
\ No newline at end of file
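
Review note: the Results table in the report can be cross-checked against the three task-outcome records with a few lines of Python. The sketch below is not part of the change set and not a script shipped with the harness kit. The `OUTCOMES` list and the `summarize` helper are hypothetical names, the flattened field names (`first_pass_result`, `drift_violations`, `review_gaps_listed`) are illustrative shorthand for the YAML keys, and the values are copied by hand from the records in this diff rather than parsed from the files.

```python
from collections import Counter

# Hand-copied from the three task-outcome YAML records in this diff.
# The dict layout is an assumption for illustration, not the kit's schema.
OUTCOMES = [
    {
        "run_id": "recipe-api-001",
        "wrong_file_edits": 0,
        "repeated_known_mistake": False,
        "first_pass_result": "failed_then_passed_after_review",
        "drift_violations": 1,      # one entry under drift_violations_detected
        "review_gaps_listed": 3,    # three entries under review_gaps_detected
        "human_rework_minutes": 35,
        "reverted_files": [],
    },
    {
        "run_id": "recipe-api-002",
        "wrong_file_edits": 0,
        "repeated_known_mistake": False,
        "first_pass_result": "passed_with_review_gap",
        "drift_violations": 0,
        "review_gaps_listed": 0,    # the record has no review_gaps_detected key
        "human_rework_minutes": 15,
        "reverted_files": [],
    },
    {
        "run_id": "recipe-api-003",
        "wrong_file_edits": 0,
        "repeated_known_mistake": False,
        "first_pass_result": "passed",
        "drift_violations": 0,
        "review_gaps_listed": 0,
        "human_rework_minutes": 5,
        "reverted_files": [],
    },
]


def summarize(outcomes):
    """Roll per-task records up into the totals used by the Results table."""
    return {
        "tasks": len(outcomes),
        "wrong_file_edits": sum(o["wrong_file_edits"] for o in outcomes),
        "repeated_known_mistakes": sum(
            1 for o in outcomes if o["repeated_known_mistake"]
        ),
        "first_pass_results": dict(
            Counter(o["first_pass_result"] for o in outcomes)
        ),
        "drift_violations": sum(o["drift_violations"] for o in outcomes),
        "review_gaps_listed": sum(o["review_gaps_listed"] for o in outcomes),
        "human_rework_minutes": sum(o["human_rework_minutes"] for o in outcomes),
        "reverted_files": sum(len(o["reverted_files"]) for o in outcomes),
    }


if __name__ == "__main__":
    for key, value in summarize(OUTCOMES).items():
        print(f"{key}: {value}")
```

The totals agree with the report for wrong-file edits (0), docs-drift violations (1), human rework minutes (35 + 15 + 5 = 55), and reverted files (0). The records list only three review gaps under `review_gaps_detected`; the fourth gap counted in the report (missing PATCH tests) appears only in the `notes` field of run `recipe-api-002`, so a purely mechanical roll-up would report 3 rather than 4.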