| 1 | +# Small Harness Effectiveness Evidence Pass |
| 2 | + |
| 3 | +## Target |
| 4 | + |
| 5 | +- Repository: local Django REST Framework recipe-api practice target |
| 6 | +- Stack and framework: Python, Django, Django REST Framework |
| 7 | +- Evaluation date or window: 2026-06-03 |
| 8 | +- Agent or model: AI coding agent with human review |
| 9 | +- Evaluation mode: harnessed-only tracking |
| 10 | + |
| 11 | +## Primary Metric |
| 12 | + |
| 13 | +The primary metric for this pass is whether harness artifacts and checks made review-relevant gaps in agent outcomes observable and correctable. |
| 14 | + |
| 15 | +This pass tracks: |
| 16 | + |
| 17 | +- wrong-file edits |
| 18 | +- repeated known mistakes |
| 19 | +- first-pass verification result |
| 20 | +- docs-drift violations detected |
| 21 | +- human rework minutes |
| 22 | +- reverted files |
| 23 | + |
| 24 | +Harness Doctor scores and passing checks are recorded only as harness health signals. They are not treated as proof of agent effectiveness. |
| 25 | + |
| 26 | +## Task Set |
| 27 | + |
| 28 | +| Task ID | Scenario | Expected boundary | Common failure | |
| 29 | +| --- | --- | --- | --- | |
| 30 | +| recipe-api-harness-adoption-cleanup | Refine initial harness adoption | AGENTS.md, docs, .harness, scripts, README, tests | Generic scaffolding, weak tests, docs drift | |
| 31 | +| recipe-api-add-category-feature | Add Category model/API support | recipes app, migration, tests, README, domain docs, decision record | Missing migration, wrong-file edits, missing decision memory | |
| 32 | +| recipe-api-category-update-test-hardening | Add category PATCH coverage and fix dependency ADR | tests and one ADR | Incomplete verification coverage, truncated record | |
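The "Expected boundary" column can be turned into a mechanical wrong-file check. The following is a minimal sketch, not the harness's actual implementation: the function name `wrong_file_edits` and every path in the usage note are hypothetical, chosen only to illustrate a "tests and one ADR" boundary.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def wrong_file_edits(changed_files: Iterable[str], allowed_roots: Iterable[str]) -> List[str]:
    """Return the changed files that fall outside every allowed root.

    Hypothetical helper: a file is inside the boundary when it equals an
    allowed root or sits somewhere beneath one.
    """
    roots = [PurePosixPath(root) for root in allowed_roots]
    flagged = []
    for name in changed_files:
        path = PurePosixPath(name)
        inside = any(path == root or root in path.parents for root in roots)
        if not inside:
            flagged.append(name)
    return flagged
```

For example, with the assumed roots `recipes/tests` and `docs/decisions`, a change set that also touches `recipes/views.py` would report that one file as a wrong-file edit. The changed-file list could come from `git diff --name-only`.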
| 33 | + |
| 34 | +## Results |
| 35 | + |
| 36 | +| Metric | Baseline | Harnessed | Delta | |
| 37 | +| --- | --- | --- | --- | |
| 38 | +| Wrong-file edits | Not available | 0 across 3 recorded tasks | Inconclusive; no baseline | |
| 39 | +| Repeated mistakes | Not available | 0 repeated known mistakes observed | Inconclusive; no baseline | |
| 40 | +| First-pass verification success | Not available | 1 pass, 1 fail-then-pass, 1 pass-with-review-gap | Mixed; review still needed | |
| 41 | +| Docs-drift violations detected | Not available | 1 docs-drift violation detected and fixed | Narrow positive signal | |
| 42 | +| Review gaps detected | Not available | 4 review or feedback-loop gaps detected and fixed | Positive operational signal | |
| 43 | +| Human rework minutes | Not available | Approx. 55 minutes across 3 tasks | Initial benchmark only | |
| 44 | +| Reverted files | Not available | 0 | Inconclusive; no baseline | |
| 45 | + |
| 46 | +The four review gaps were: |
| 47 | +- README existed but was not useful project documentation. |
| 48 | +- Django tests initially found zero tests. |
| 49 | +- Some generated harness guidance remained too generic for the Django REST Framework target. |
| 50 | +- Category update and clear behavior was documented but not covered by PATCH tests on the first feature pass. |
| 51 | + |
| 52 | +## Run Log |
| 53 | + |
| 54 | +| Condition | Task ID | Run | Verification result | Notes | |
| 55 | +| --- | --- | --- | --- | --- | |
| 56 | +| harnessed-only | recipe-api-harness-adoption-cleanup | recipe-api-001 | failed_then_passed_after_review | Review exposed zero tests, README quality issue, one docs-drift issue, and generic scaffold residue. | |
| 57 | +| harnessed-only | recipe-api-add-category-feature | recipe-api-002 | passed_with_review_gap | Migration, docs, and ADR were present; review found missing PATCH tests. | |
| 58 | +| harnessed-only | recipe-api-category-update-test-hardening | recipe-api-003 | passed | Follow-up stayed within expected files and completed missing coverage/ADR cleanup. | |
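The first-pass verification row in the Results table can be derived from the run log with a small tally. This is a sketch under one assumption: only the exact result `passed` counts as a clean first pass. The run IDs and result strings are copied from the table above; the function name `summarize` is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (run ID, verification result) pairs copied from the run log table.
RUN_LOG: List[Tuple[str, str]] = [
    ("recipe-api-001", "failed_then_passed_after_review"),
    ("recipe-api-002", "passed_with_review_gap"),
    ("recipe-api-003", "passed"),
]


def summarize(run_log: List[Tuple[str, str]]) -> Dict[str, object]:
    """Count verification results; only the exact result 'passed' is a clean first pass."""
    counts = Counter(result for _, result in run_log)
    clean = counts["passed"]
    return {
        "runs": len(run_log),
        "clean_first_pass": clean,
        "needed_review": len(run_log) - clean,
        "by_result": dict(counts),
    }
```

Calling `summarize(RUN_LOG)` gives one clean first pass and two runs that needed review, which matches the "1 pass, 1 fail-then-pass, 1 pass-with-review-gap" entry.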
| 59 | + |
| 60 | +## Source Records |
| 61 | + |
| 62 | +- Task outcome records reviewed: |
| 63 | + - `docs/examples/task-outcomes/001-recipe-api-harness-adoption.yaml` |
| 64 | + - `docs/examples/task-outcomes/002-recipe-api-category-feature.yaml` |
| 65 | + - `docs/examples/task-outcomes/003-recipe-api-category-update-tests.yaml` |
| 66 | +- Repository refs compared: jihwan4155/recipe-api@99af81bf0da4a8bfecb19e5ca0af817b276f49b6 |
| 67 | +- Prompt refs compared: local adoption, review, refresh, and feature prompts |
| 68 | +- Target-local decision records reviewed: |
| 69 | + - Initial API design decision |
| 70 | + - Python dependency management decision |
| 71 | + - Recipe category model/API decision |
| 72 | +- Verification commands compared: |
| 73 | + - `python manage.py check` |
| 74 | + - `python manage.py test` |
| 75 | + - `python manage.py makemigrations --check --dry-run` |
| 76 | + - `python scripts/check_docs_drift.py` |
| 77 | + - `python scripts/check_structure.py` |
| 78 | + - `python scripts/check_decision_memory.py --fail-on-warning` |
| 79 | + - `python scripts/check_encoding_hygiene.py` |
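The verification commands above could be run as one batch so that each run records the same set of results. The sketch below assumes it is executed from the repository root and that every command signals failure through a non-zero exit code. The names `run_verification` and `default_runner` are hypothetical and not part of the harness.

```python
import subprocess
import sys
from typing import Callable, Dict, List, Sequence

# Argument lists for the commands listed above; the current interpreter
# stands in for "python".
VERIFICATION_COMMANDS: List[List[str]] = [
    ["manage.py", "check"],
    ["manage.py", "test"],
    ["manage.py", "makemigrations", "--check", "--dry-run"],
    ["scripts/check_docs_drift.py"],
    ["scripts/check_structure.py"],
    ["scripts/check_decision_memory.py", "--fail-on-warning"],
    ["scripts/check_encoding_hygiene.py"],
]


def default_runner(args: Sequence[str]) -> int:
    """Run one command with the current interpreter and return its exit code."""
    return subprocess.run([sys.executable, *args], check=False).returncode


def run_verification(
    runner: Callable[[Sequence[str]], int] = default_runner,
) -> Dict[str, bool]:
    """Run every command and map it to True (exit code 0) or False."""
    results: Dict[str, bool] = {}
    for args in VERIFICATION_COMMANDS:
        results[" ".join(args)] = runner(args) == 0
    return results
```

The `runner` parameter is there so the loop can be exercised without a Django project. In normal use, `run_verification()` with no arguments runs the real commands, and `all(run_verification().values())` gives a single pass or fail signal for the run log.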
| 80 | + |
| 81 | +## Interpretation |
| 82 | + |
| 83 | +### What improved |
| 84 | + |
| 85 | +- Harness review and refresh made weak tests, one docs-drift issue, incomplete README content, and decision-memory gaps visible. |
| 86 | +- The Category feature stayed within expected Django app boundaries and included migration, tests, README, domain glossary, and decision memory. |
| 87 | +- Follow-up work was narrow and did not require reverting unrelated files. |
| 88 | + |
| 89 | +### What did not improve |
| 90 | + |
| 91 | +- First-pass verification was not consistently complete; the Category feature still lacked PATCH tests for updating and clearing a category. |
| 92 | +- There is no pre-harness baseline for the same tasks, so improvement cannot be quantified. |
| 93 | + |
| 94 | +### Confounders or limitations |
| 95 | + |
| 96 | +- This is a small harnessed-only evidence pass, not a controlled experiment. |
| 97 | +- Human review remained active and may have prevented or corrected issues before they became committed defects. |
| 98 | +- Metrics such as human rework minutes are approximate. |
| 99 | +- The tasks came from a small practice repository, not a production system. |
| 100 | + |
| 101 | +### Narrow claim |
| 102 | + |
| 103 | +This pass provides operational evidence that harness artifacts made review and verification gaps more observable and easier to correct in a small Django REST Framework practice workflow. |
| 104 | + |
| 105 | +It does not prove that harness adoption generally improves agent effectiveness. |
| 106 | + |
| 107 | +## Follow-Up |
| 108 | + |
| 109 | +- Next review window: next 2-3 comparable Django or TodayBus dogfood tasks |
| 110 | +- Owner or reviewer: maintainer or dogfood reviewer |
| 111 | +- Related target-local decision records: |
| 112 | + - Initial API design decision |
| 113 | + - Python dependency management decision |
| 114 | + - Recipe category model/API decision |