Commit 2e7bad5

Merge pull request #33 from jihwan4155/evidence/effectiveness-small-pass
Tighten effectiveness evidence report traceability
2 parents ca367db + bc63ca3 commit 2e7bad5

5 files changed

Lines changed: 320 additions & 0 deletions

docs/evaluation.md

Lines changed: 4 additions & 0 deletions
@@ -95,3 +95,7 @@ Treat the data as operational evidence, not a controlled scientific study. Small
repositories and changing agents can introduce noise. The useful signal is
whether the same classes of mistakes become less frequent, easier to detect, or
cheaper to correct after the harness becomes part of the repository.

## Example Evidence Passes

- [Small harness outcome evidence report](examples/effectiveness-report-small-evidence.md) records three harnessed task outcomes and summarizes a narrow operational evidence pass without treating Harness Doctor scores or passing checks as proof of agent effectiveness.
Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
# Small Harness Effectiveness Evidence Pass

## Target

- Repository: local Django REST Framework recipe-api practice target
- Stack and framework: Python, Django, Django REST Framework
- Evaluation date or window: 2026-06-03
- Agent or model: AI coding agent with human review
- Evaluation mode: harnessed-only tracking

## Primary Metric

The primary metric for this pass is whether review-relevant agent outcome gaps became observable and correctable through harness artifacts and checks.

This pass tracks:

- wrong-file edits
- repeated known mistakes
- first-pass verification result
- drift violations detected
- human rework minutes
- reverted files

Harness Doctor scores and passing checks are recorded only as harness health signals. They are not treated as proof of agent effectiveness.

## Task Set

| Task ID | Scenario | Expected boundary | Common failure |
| --- | --- | --- | --- |
| recipe-api-harness-adoption-cleanup | Refine initial harness adoption | AGENTS.md, docs, .harness, scripts, README, tests | Generic scaffolding, weak tests, docs drift |
| recipe-api-add-category-feature | Add Category model/API support | recipes app, migration, tests, README, domain docs, decision record | Missing migration, wrong-file edits, missing decision memory |
| recipe-api-category-update-test-hardening | Add category PATCH coverage and fix dependency ADR | tests and one ADR | Incomplete verification coverage, truncated record |

## Results

| Metric | Baseline | Harnessed | Delta |
| --- | --- | --- | --- |
| Wrong-file edits | Not available | 0 across 3 recorded tasks | Inconclusive; no baseline |
| Repeated mistakes | Not available | 0 repeated known mistakes observed | Inconclusive; no baseline |
| First-pass verification success | Not available | 1 pass, 1 fail-then-pass, 1 pass-with-review-gap | Mixed; review still needed |
| Docs-drift violations detected | Not available | 1 docs-drift violation detected and fixed | Narrow positive signal |
| Review gaps detected | Not available | 4 review or feedback-loop gaps detected and fixed | Positive operational signal |
| Human rework minutes | Not available | Approx. 55 minutes across 3 tasks | Initial benchmark only |
| Reverted files | Not available | 0 | Inconclusive; no baseline |

The four review gaps were:
- README existed but was not useful project documentation.
- Django tests initially found zero tests.
- Some generated harness guidance remained too generic for the Django REST Framework target.
- Category update and clear behavior was documented but not covered by PATCH tests on the first feature pass.
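The harnessed column of the Results table can be cross-checked against the per-task outcome records. The following is a minimal sketch, assuming the records have already been loaded into plain dictionaries. The `RECORDS` layout and the `summarize` helper are hypothetical names used for illustration, not part of the harness kit; the values are copied from the three task outcome records in this commit.

```python
from collections import Counter

# Values copied from the three task outcome records in this commit.
# The flat dict layout is an assumption for illustration only.
RECORDS = [
    {"run_id": "recipe-api-001", "result": "failed_then_passed_after_review",
     "wrong_file_edits": 0, "human_rework_minutes": 35,
     "reverted_files": [], "drift_violations": 1},
    {"run_id": "recipe-api-002", "result": "passed_with_review_gap",
     "wrong_file_edits": 0, "human_rework_minutes": 15,
     "reverted_files": [], "drift_violations": 0},
    {"run_id": "recipe-api-003", "result": "passed",
     "wrong_file_edits": 0, "human_rework_minutes": 5,
     "reverted_files": [], "drift_violations": 0},
]


def summarize(records):
    """Roll per-task outcome fields up into harnessed-column totals."""
    return {
        "tasks": len(records),
        "wrong_file_edits": sum(r["wrong_file_edits"] for r in records),
        "human_rework_minutes": sum(r["human_rework_minutes"] for r in records),
        "reverted_files": sum(len(r["reverted_files"]) for r in records),
        "drift_violations": sum(r["drift_violations"] for r in records),
        # Tally of first-pass verification results, one entry per distinct value.
        "first_pass_results": dict(Counter(r["result"] for r in records)),
    }


if __name__ == "__main__":
    for key, value in summarize(RECORDS).items():
        print(f"{key}: {value}")
```

The totals only restate what the records already contain. With no baseline condition, they remain an initial benchmark rather than a measured improvement.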

## Run Log

| Condition | Task ID | Run | Verification result | Notes |
| --- | --- | --- | --- | --- |
| harnessed-only | recipe-api-harness-adoption-cleanup | recipe-api-001 | failed_then_passed_after_review | Review exposed zero tests, README quality issue, one docs-drift issue, and generic scaffold residue. |
| harnessed-only | recipe-api-add-category-feature | recipe-api-002 | passed_with_review_gap | Migration, docs, and ADR were present; review found missing PATCH tests. |
| harnessed-only | recipe-api-category-update-test-hardening | recipe-api-003 | passed | Follow-up stayed within expected files and completed missing coverage/ADR cleanup. |

## Source Records

- Task outcome records reviewed:
  - `docs/examples/task-outcomes/001-recipe-api-harness-adoption.yaml`
  - `docs/examples/task-outcomes/002-recipe-api-category-feature.yaml`
  - `docs/examples/task-outcomes/003-recipe-api-category-update-tests.yaml`
- Repository refs compared: jihwan4155/recipe-api@99af81bf0da4a8bfecb19e5ca0af817b276f49b6
- Prompt refs compared: local adoption, review, refresh, and feature prompts
- Target-local decision records reviewed:
  - Initial API design decision
  - Python dependency management decision
  - Recipe category model/API decision
- Verification commands compared:
  - `python manage.py check`
  - `python manage.py test`
  - `python manage.py makemigrations --check --dry-run`
  - `python scripts/check_docs_drift.py`
  - `python scripts/check_structure.py`
  - `python scripts/check_decision_memory.py --fail-on-warning`
  - `python scripts/check_encoding_hygiene.py`
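The verification commands listed above can be run in one pass so that a first-pass result is recorded consistently. The following is a rough sketch, assuming the commands are executed from the root of the target repository. The `run_checks` helper is a hypothetical name, not a script shipped with the harness kit, and stopping at the first failure is a design choice of this sketch rather than documented harness behavior.

```python
import shlex
import subprocess
import sys

# Commands copied from the Source Records list; they only work inside the
# target repository, where manage.py and the scripts/ checks exist.
VERIFICATION_COMMANDS = [
    "python manage.py check",
    "python manage.py test",
    "python manage.py makemigrations --check --dry-run",
    "python scripts/check_docs_drift.py",
    "python scripts/check_structure.py",
    "python scripts/check_decision_memory.py --fail-on-warning",
    "python scripts/check_encoding_hygiene.py",
]


def run_checks(commands, cwd=None):
    """Run commands in order and return (command, exit_code) pairs.

    Stops after the first non-zero exit code, so the last pair in the
    result identifies the failing check, if any.
    """
    results = []
    for command in commands:
        argv = shlex.split(command)
        # Use the current interpreter in place of a bare "python" so the
        # checks run in the same environment as this script.
        if argv and argv[0] == "python":
            argv[0] = sys.executable
        completed = subprocess.run(argv, cwd=cwd)
        results.append((command, completed.returncode))
        if completed.returncode != 0:
            break
    return results
```

Calling `run_checks(VERIFICATION_COMMANDS, cwd="path/to/recipe-api")` would return one pair per executed command. A run counts as a clean first pass only if every exit code is zero, and even then the Run Log shows that review can still find gaps the checks do not cover.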

## Interpretation

### What improved

- Harness review and refresh made weak tests, one docs-drift issue, incomplete README content, and decision-memory gaps visible.
- The Category feature stayed within expected Django app boundaries and included migration, tests, README, domain glossary, and decision memory.
- Follow-up work was narrow and did not require reverting unrelated files.

### What did not improve

- First-pass verification was not consistently complete; the Category feature still missed PATCH category update and clear tests.
- There is no pre-harness baseline for the same tasks, so improvement cannot be quantified.

### Confounders or limitations

- This is a small harnessed-only evidence pass, not a controlled experiment.
- Human review remained active and may have prevented or corrected issues before they became committed defects.
- Metrics such as human rework minutes are approximate.
- The tasks came from a small practice repository, not a production system.

### Narrow claim

This pass provides operational evidence that harness artifacts made review and verification gaps more observable and easier to correct in a small Django REST Framework practice workflow.

It does not prove that harness adoption generally improves agent effectiveness.

## Follow-Up

- Next review window: next 2-3 comparable Django or TodayBus dogfood tasks
- Owner or reviewer: maintainer or dogfood reviewer
- Related target-local decision records:
  - Initial API design decision
  - Python dependency management decision
  - Recipe category model/API decision
Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
schema_version: 1

target:
  repository: jihwan4155/recipe-api
  repository_ref: jihwan4155/recipe-api@99af81bf0da4a8bfecb19e5ca0af817b276f49b6
  stack_or_framework: Django REST Framework
  date: 2026-06-03
  agent_or_model: AI coding agent
  reviewer: human maintainer

task:
  id: recipe-api-harness-adoption-cleanup
  run_id: recipe-api-001
  prompt_summary: Apply and refine a minimal Django harness adoption.
  prompt_ref: local harness adoption and review prompts
  prompt_hash: not recorded
  comparable_task_group: django-harness-adoption
  condition: harnessed-only
  expected_boundary:
    - AGENTS.md
    - docs/**
    - .harness/**
    - scripts/**
    - recipes/tests.py
    - README.md
  known_failure_mode: Generic scaffold drift and weak feedback loops after initial adoption.

harness_context:
  harness_doctor_score: recorded locally but not treated as effectiveness proof
  harness_source:
    kit_url: https://github.com/baskduf/harness-starter-kit
    kit_commit: baskduf/harness-starter-kit@94e416b354facffafead6bbb9691af1598139389
    source_tracking_ref: recipe-api/harness-starter-kit
  relevant_instructions:
    - AGENTS.md
    - docs/conventions/coding.md
  relevant_constraints:
    - python manage.py check
    - python manage.py test
    - python scripts/check_docs_drift.py
    - python scripts/check_decision_memory.py --fail-on-warning
  relevant_memory_records:
    - docs/decisions/001-initial-api-design.md
    - docs/failures/000-template.md

outcome:
  files_changed:
    - AGENTS.md
    - README.md
    - docs/conventions/coding.md
    - docs/decisions/001-initial-api-design.md
    - docs/domain/glossary.md
    - recipes/tests.py
  wrong_file_edits: 0
  repeated_known_mistake: false
  verification_command: python manage.py check && python manage.py test && python scripts/check_docs_drift.py
  first_pass_verification:
    result: failed_then_passed_after_review
  drift_violations_detected:
    - docs/domain/glossary.md used an API route string that docs-drift treated as a missing path.
  review_gaps_detected:
    - README existed but was not useful project documentation.
    - Django tests initially found zero tests.
    - Some generated harness guidance remained too generic for the Django REST Framework target.
  human_rework_minutes: 35
  reverted_files: []
  notes: Review surfaced weak tests, generic scaffold text, and docs drift. The harness made those gaps explicit but did not prevent all first-pass issues.

follow_up:
  harness_change_needed: false
  decision_or_failure_record: No failure record added; review findings were one-time adoption cleanup.
  include_in_effectiveness_report: true
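The record above does not say how `wrong_file_edits` was counted. One plausible way to derive it is to match each entry in `files_changed` against the `expected_boundary` patterns. The following is a minimal sketch under that assumption; the `wrong_file_edits` function is a hypothetical helper, not the kit's implementation, and it uses `fnmatch` semantics, where `*` also matches path separators.

```python
from fnmatch import fnmatchcase

# Patterns and paths copied from the recipe-api-001 record above.
EXPECTED_BOUNDARY = [
    "AGENTS.md",
    "docs/**",
    ".harness/**",
    "scripts/**",
    "recipes/tests.py",
    "README.md",
]

FILES_CHANGED = [
    "AGENTS.md",
    "README.md",
    "docs/conventions/coding.md",
    "docs/decisions/001-initial-api-design.md",
    "docs/domain/glossary.md",
    "recipes/tests.py",
]


def wrong_file_edits(files_changed, expected_boundary):
    """Return the changed paths that match none of the boundary patterns.

    fnmatchcase is case-sensitive on every platform, and its "*" wildcard
    crosses "/" boundaries, so "docs/**" covers nested files under docs/.
    """
    return [
        path
        for path in files_changed
        if not any(fnmatchcase(path, pattern) for pattern in expected_boundary)
    ]


if __name__ == "__main__":
    outside = wrong_file_edits(FILES_CHANGED, EXPECTED_BOUNDARY)
    print(f"wrong_file_edits: {len(outside)}")
    for path in outside:
        print(f"  outside boundary: {path}")
```

A check like this only confirms that edits stayed inside the declared boundary. It says nothing about whether the boundary itself was drawn correctly for the task.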
Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
schema_version: 1

target:
  repository: jihwan4155/recipe-api
  repository_ref: jihwan4155/recipe-api@99af81bf0da4a8bfecb19e5ca0af817b276f49b6
  stack_or_framework: Django REST Framework
  date: 2026-06-03
  agent_or_model: AI coding agent
  reviewer: human maintainer

task:
  id: recipe-api-add-category-feature
  run_id: recipe-api-002
  prompt_summary: Add Category support to the Recipe API.
  prompt_ref: local feature prompt requiring AGENTS.md and docs review
  prompt_hash: not recorded
  comparable_task_group: django-model-api-change
  condition: harnessed-only
  expected_boundary:
    - recipes/models.py
    - recipes/admin.py
    - recipes/serializers.py
    - recipes/tests.py
    - recipes/migrations/**
    - README.md
    - docs/domain/glossary.md
    - docs/decisions/**
  known_failure_mode: Model changes without migrations, missing durable decision memory, or API docs drift.

harness_context:
  harness_doctor_score: recorded locally but not treated as effectiveness proof
  harness_source:
    kit_url: https://github.com/baskduf/harness-starter-kit
    kit_commit: baskduf/harness-starter-kit@94e416b354facffafead6bbb9691af1598139389
    source_tracking_ref: recipe-api/harness-starter-kit
  relevant_instructions:
    - AGENTS.md
    - docs/conventions/coding.md
  relevant_constraints:
    - python manage.py makemigrations --check --dry-run
    - python manage.py check
    - python manage.py test
    - python scripts/check_decision_memory.py --fail-on-warning
  relevant_memory_records:
    - docs/decisions/001-initial-api-design.md
    - docs/decisions/003-add-recipe-categories.md
    - docs/domain/glossary.md

outcome:
  files_changed:
    - recipes/models.py
    - recipes/admin.py
    - recipes/serializers.py
    - recipes/tests.py
    - recipes/migrations/0002_category_recipe_category.py
    - README.md
    - docs/domain/glossary.md
    - docs/decisions/003-add-recipe-categories.md
  wrong_file_edits: 0
  repeated_known_mistake: false
  verification_command: python manage.py check && python manage.py test && python manage.py makemigrations --check --dry-run
  first_pass_verification:
    result: passed_with_review_gap
  drift_violations_detected: []
  human_rework_minutes: 15
  reverted_files: []
  notes: Agent respected file boundaries and created migration plus decision memory. Review found missing PATCH tests for assigning and clearing category_id.

follow_up:
  harness_change_needed: false
  decision_or_failure_record: docs/decisions/003-add-recipe-categories.md
  include_in_effectiveness_report: true
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
schema_version: 1

target:
  repository: jihwan4155/recipe-api
  repository_ref: jihwan4155/recipe-api@99af81bf0da4a8bfecb19e5ca0af817b276f49b6
  stack_or_framework: Django REST Framework
  date: 2026-06-03
  agent_or_model: AI coding agent
  reviewer: human maintainer

task:
  id: recipe-api-category-update-test-hardening
  run_id: recipe-api-003
  prompt_summary: Add PATCH tests for assigning and clearing recipe categories, and fix truncated dependency ADR.
  prompt_ref: local review follow-up prompt
  prompt_hash: not recorded
  comparable_task_group: django-test-hardening
  condition: harnessed-only
  expected_boundary:
    - recipes/tests.py
    - docs/decisions/002-use-requirements-file.md
  known_failure_mode: Incomplete test coverage for documented PUT/PATCH behavior and truncated decision record.

harness_context:
  harness_doctor_score: recorded locally but not treated as effectiveness proof
  harness_source:
    kit_url: https://github.com/baskduf/harness-starter-kit
    kit_commit: baskduf/harness-starter-kit@94e416b354facffafead6bbb9691af1598139389
    source_tracking_ref: recipe-api/harness-starter-kit
  relevant_instructions:
    - AGENTS.md
    - docs/conventions/coding.md
  relevant_constraints:
    - python manage.py check
    - python manage.py test
    - python scripts/check_docs_drift.py
  relevant_memory_records:
    - docs/decisions/002-use-requirements-file.md
    - docs/decisions/003-add-recipe-categories.md

outcome:
  files_changed:
    - recipes/tests.py
    - docs/decisions/002-use-requirements-file.md
  wrong_file_edits: 0
  repeated_known_mistake: false
  verification_command: python manage.py check && python manage.py test && python scripts/check_docs_drift.py
  first_pass_verification:
    result: passed
  drift_violations_detected: []
  human_rework_minutes: 5
  reverted_files: []
  notes: Follow-up tightened test coverage and completed a truncated ADR without broad rewrites.

follow_up:
  harness_change_needed: false
  decision_or_failure_record: Existing ADR updated.
  include_in_effectiveness_report: true
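All three records leave `prompt_hash` as `not recorded`, so the prompts can be traced only through the free-text `prompt_ref` values. If a later pass wants to fill that field, one option is a content hash of the prompt text. The following is a sketch of that idea only: the `prompt_hash` function, the whitespace normalization, and the `sha256:` prefix are all assumptions for illustration, since the schema's expected hash format is not stated in this commit.

```python
import hashlib


def prompt_hash(prompt_text: str) -> str:
    """Return a SHA-256 digest of the prompt text, prefixed with "sha256:".

    Line endings are normalized to "\n" and outer whitespace is stripped,
    so the same prompt saved on different platforms hashes identically.
    """
    normalized = prompt_text.replace("\r\n", "\n").strip()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"


if __name__ == "__main__":
    # Hypothetical prompt text; the real local prompts are not part of this commit.
    example = "Add Category support to the Recipe API.\n"
    print(prompt_hash(example))
```

A hash of this kind would let two runs be confirmed as using the same prompt without committing the prompt itself to the repository.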
