4 changes: 4 additions & 0 deletions docs/evaluation.md
@@ -95,3 +95,7 @@ Treat the data as operational evidence, not a controlled scientific study. Small
repositories and changing agents can introduce noise. The useful signal is
whether the same classes of mistakes become less frequent, easier to detect, or
cheaper to correct after the harness becomes part of the repository.

## Example Evidence Passes

- [Small harness outcome evidence report](examples/effectiveness-report-small-evidence.md) records three harnessed task outcomes and summarizes a narrow operational evidence pass without treating Harness Doctor scores or passing checks as proof of agent effectiveness.
114 changes: 114 additions & 0 deletions docs/examples/effectiveness-report-small-evidence.md
@@ -0,0 +1,114 @@
# Small Harness Effectiveness Evidence Pass

## Target

- Repository: local Django REST Framework recipe-api practice target
- Stack and framework: Python, Django, Django REST Framework
- Evaluation date or window: 2026-06-03
- Agent or model: AI coding agent with human review
- Evaluation mode: harnessed-only tracking

## Primary Metric

The primary metric for this pass is whether review-relevant agent outcome gaps became observable and correctable through harness artifacts and checks.

This pass tracks:

- wrong-file edits
- repeated known mistakes
- first-pass verification result
- drift violations detected
- human rework minutes
- reverted files

Harness Doctor scores and passing checks are recorded only as harness health signals. They are not treated as proof of agent effectiveness.

## Task Set

| Task ID | Scenario | Expected boundary | Common failure |
| --- | --- | --- | --- |
| recipe-api-harness-adoption-cleanup | Refine initial harness adoption | AGENTS.md, docs, .harness, scripts, README, tests | Generic scaffolding, weak tests, docs drift |
| recipe-api-add-category-feature | Add Category model/API support | recipes app, migration, tests, README, domain docs, decision record | Missing migration, wrong-file edits, missing decision memory |
| recipe-api-category-update-test-hardening | Add category PATCH coverage and fix dependency ADR | tests and one ADR | Incomplete verification coverage, truncated record |
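The "Expected boundary" column is what the wrong-file-edit metric is measured against. As a minimal sketch of that comparison, assuming boundaries are written as glob-style patterns such as `docs/**` (as in the task outcome records), a changed file counts as a wrong-file edit when it matches none of the patterns. The helper name `wrong_file_edits` is hypothetical and is not part of the harness kit.

```python
from fnmatch import fnmatchcase


def wrong_file_edits(files_changed, expected_boundary):
    """Return the changed files that match none of the boundary patterns.

    Hypothetical helper for illustration only. fnmatchcase lets `*` match
    path separators, so a pattern like `docs/**` also covers nested paths.
    """
    return [
        path
        for path in files_changed
        if not any(fnmatchcase(path, pattern) for pattern in expected_boundary)
    ]


# Values copied from the recipe-api-001 task outcome record.
boundary = [
    "AGENTS.md",
    "docs/**",
    ".harness/**",
    "scripts/**",
    "recipes/tests.py",
    "README.md",
]
changed = [
    "AGENTS.md",
    "README.md",
    "docs/conventions/coding.md",
    "docs/decisions/001-initial-api-design.md",
    "docs/domain/glossary.md",
    "recipes/tests.py",
]

outside = wrong_file_edits(changed, boundary)
print(f"wrong-file edits: {len(outside)}")
```

A file such as `recipes/views.py` would be reported as outside this boundary, because no pattern in the list covers it.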

## Results

| Metric | Baseline | Harnessed | Delta |
| --- | --- | --- | --- |
| Wrong-file edits | Not available | 0 across 3 recorded tasks | Inconclusive; no baseline |
| Repeated mistakes | Not available | 0 repeated known mistakes observed | Inconclusive; no baseline |
| First-pass verification success | Not available | 1 pass, 1 fail-then-pass, 1 pass-with-review-gap | Mixed; review still needed |
| Docs-drift violations detected | Not available | 1 docs-drift violation detected and fixed | Narrow positive signal |
| Review gaps detected | Not available | 4 review or feedback-loop gaps detected and fixed | Positive operational signal |
| Human rework minutes | Not available | Approx. 55 minutes across 3 tasks | Initial benchmark only |
| Reverted files | Not available | 0 | Inconclusive; no baseline |

The four review gaps were:
- README existed but was not useful project documentation.
- Django tests initially found zero tests.
- Some generated harness guidance remained too generic for the Django REST Framework target.
- Category update and clear behavior was documented but not covered by PATCH tests on the first feature pass.
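The harnessed column can be reproduced by summing the three task outcome records. The sketch below is illustrative only: it hard-codes the recorded values as plain dictionaries rather than parsing the YAML files, and the `summarize` function and the dictionary key names are assumptions, not part of the harness kit.

```python
from collections import Counter

# Values copied from the three task outcome records (runs 001 to 003).
records = [
    {
        "run_id": "recipe-api-001",
        "wrong_file_edits": 0,
        "repeated_known_mistake": False,
        "first_pass_result": "failed_then_passed_after_review",
        "drift_violations": 1,
        "human_rework_minutes": 35,
        "reverted_files": [],
    },
    {
        "run_id": "recipe-api-002",
        "wrong_file_edits": 0,
        "repeated_known_mistake": False,
        "first_pass_result": "passed_with_review_gap",
        "drift_violations": 0,
        "human_rework_minutes": 15,
        "reverted_files": [],
    },
    {
        "run_id": "recipe-api-003",
        "wrong_file_edits": 0,
        "repeated_known_mistake": False,
        "first_pass_result": "passed",
        "drift_violations": 0,
        "human_rework_minutes": 5,
        "reverted_files": [],
    },
]


def summarize(records):
    """Aggregate per-task outcome values into report-level totals."""
    return {
        "tasks": len(records),
        "wrong_file_edits": sum(r["wrong_file_edits"] for r in records),
        "repeated_mistakes": sum(1 for r in records if r["repeated_known_mistake"]),
        "first_pass_results": Counter(r["first_pass_result"] for r in records),
        "drift_violations": sum(r["drift_violations"] for r in records),
        "human_rework_minutes": sum(r["human_rework_minutes"] for r in records),
        "reverted_files": sum(len(r["reverted_files"]) for r in records),
    }


summary = summarize(records)
for metric, value in summary.items():
    print(f"{metric}: {value}")
```

The rework total is 35 + 15 + 5 = 55 minutes, which matches the approximate figure in the table. With no baseline records, the same function has nothing to compare against, which is why most deltas stay inconclusive.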

## Run Log

| Condition | Task ID | Run | Verification result | Notes |
| --- | --- | --- | --- | --- |
| harnessed-only | recipe-api-harness-adoption-cleanup | recipe-api-001 | failed_then_passed_after_review | Review exposed zero tests, README quality issue, one docs-drift issue, and generic scaffold residue. |
| harnessed-only | recipe-api-add-category-feature | recipe-api-002 | passed_with_review_gap | Migration, docs, and ADR were present; review found missing PATCH tests. |
| harnessed-only | recipe-api-category-update-test-hardening | recipe-api-003 | passed | Follow-up stayed within expected files and completed missing coverage/ADR cleanup. |

## Source Records

- Task outcome records reviewed:
  - `docs/examples/task-outcomes/001-recipe-api-harness-adoption.yaml`
  - `docs/examples/task-outcomes/002-recipe-api-category-feature.yaml`
  - `docs/examples/task-outcomes/003-recipe-api-category-update-tests.yaml`
- Repository refs compared: jihwan4155/recipe-api@99af81bf0da4a8bfecb19e5ca0af817b276f49b6
- Prompt refs compared: local adoption, review, refresh, and feature prompts
- Target-local decision records reviewed:
  - Initial API design decision
  - Python dependency management decision
  - Recipe category model/API decision
- Verification commands compared:
  - `python manage.py check`
  - `python manage.py test`
  - `python manage.py makemigrations --check --dry-run`
  - `python scripts/check_docs_drift.py`
  - `python scripts/check_structure.py`
  - `python scripts/check_decision_memory.py --fail-on-warning`
  - `python scripts/check_encoding_hygiene.py`
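One way to run these checks as a single first-pass gate is sketched below. It assumes the commands are run from the target repository root and that the first failure should stop the run, mirroring the `&&` chaining in the task outcome records' `verification_command` fields. The `run_checks` helper and its injectable `run` parameter are hypothetical, added here so the logic can be exercised without a Django project.

```python
import shlex
import subprocess

# Commands copied from the list above.
VERIFICATION_COMMANDS = [
    "python manage.py check",
    "python manage.py test",
    "python manage.py makemigrations --check --dry-run",
    "python scripts/check_docs_drift.py",
    "python scripts/check_structure.py",
    "python scripts/check_decision_memory.py --fail-on-warning",
    "python scripts/check_encoding_hygiene.py",
]


def run_checks(commands, run=subprocess.run):
    """Run commands in order and stop at the first non-zero exit code.

    Returns a list of (command, returncode) pairs for the commands that ran.
    `run` defaults to subprocess.run and can be replaced with a stub.
    """
    results = []
    for command in commands:
        completed = run(shlex.split(command))
        results.append((command, completed.returncode))
        if completed.returncode != 0:
            break
    return results


def first_pass_ok(results, commands):
    """True only if every command ran and each one exited with code 0."""
    return len(results) == len(commands) and all(code == 0 for _, code in results)
```

Calling `run_checks(VERIFICATION_COMMANDS)` from the repository root and passing the result to `first_pass_ok` gives a single pass/fail signal. As the report notes, a passing run is a harness health signal, not evidence that review gaps are absent.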

## Interpretation

### What improved

- Harness review and refresh made weak tests, one docs-drift issue, incomplete README content, and decision-memory gaps visible.
- The Category feature stayed within expected Django app boundaries and included migration, tests, README, domain glossary, and decision memory.
- Follow-up work was narrow and did not require reverting unrelated files.

### What did not improve

- First-pass verification was not consistently complete; the Category feature still missed PATCH category update and clear tests.
- There is no pre-harness baseline for the same tasks, so improvement cannot be quantified.

### Confounders or limitations

- This is a small harnessed-only evidence pass, not a controlled experiment.
- Human review remained active and may have prevented or corrected issues before they became committed defects.
- Metrics such as human rework minutes are approximate.
- The tasks came from a small practice repository, not a production system.

### Narrow claim

This pass provides operational evidence that harness artifacts made review and verification gaps more observable and easier to correct in a small Django REST Framework practice workflow.

It does not prove that harness adoption generally improves agent effectiveness.

## Follow-Up

- Next review window: next 2-3 comparable Django or TodayBus dogfood tasks
- Owner or reviewer: maintainer or dogfood reviewer
- Related target-local decision records:
  - Initial API design decision
  - Python dependency management decision
  - Recipe category model/API decision
72 changes: 72 additions & 0 deletions docs/examples/task-outcomes/001-recipe-api-harness-adoption.yaml
@@ -0,0 +1,72 @@
schema_version: 1

target:
  repository: jihwan4155/recipe-api
  repository_ref: jihwan4155/recipe-api@99af81bf0da4a8bfecb19e5ca0af817b276f49b6
  stack_or_framework: Django REST Framework
  date: 2026-06-03
  agent_or_model: AI coding agent
  reviewer: human maintainer

task:
  id: recipe-api-harness-adoption-cleanup
  run_id: recipe-api-001
  prompt_summary: Apply and refine a minimal Django harness adoption.
  prompt_ref: local harness adoption and review prompts
  prompt_hash: not recorded
  comparable_task_group: django-harness-adoption
  condition: harnessed-only
  expected_boundary:
    - AGENTS.md
    - docs/**
    - .harness/**
    - scripts/**
    - recipes/tests.py
    - README.md
  known_failure_mode: Generic scaffold drift and weak feedback loops after initial adoption.

harness_context:
  harness_doctor_score: recorded locally but not treated as effectiveness proof
  harness_source:
    kit_url: https://github.com/baskduf/harness-starter-kit
    kit_commit: baskduf/harness-starter-kit@94e416b354facffafead6bbb9691af1598139389
    source_tracking_ref: recipe-api/harness-starter-kit
  relevant_instructions:
    - AGENTS.md
    - docs/conventions/coding.md
  relevant_constraints:
    - python manage.py check
    - python manage.py test
    - python scripts/check_docs_drift.py
    - python scripts/check_decision_memory.py --fail-on-warning
  relevant_memory_records:
    - docs/decisions/001-initial-api-design.md
    - docs/failures/000-template.md

outcome:
  files_changed:
    - AGENTS.md
    - README.md
    - docs/conventions/coding.md
    - docs/decisions/001-initial-api-design.md
    - docs/domain/glossary.md
    - recipes/tests.py
  wrong_file_edits: 0
  repeated_known_mistake: false
  verification_command: python manage.py check && python manage.py test && python scripts/check_docs_drift.py
  first_pass_verification:
    result: failed_then_passed_after_review
  drift_violations_detected:
    - docs/domain/glossary.md used an API route string that docs-drift treated as a missing path.
  review_gaps_detected:
    - README existed but was not useful project documentation.
    - Django tests initially found zero tests.
    - Some generated harness guidance remained too generic for the Django REST Framework target.
  human_rework_minutes: 35
  reverted_files: []
  notes: Review surfaced weak tests, generic scaffold text, and docs drift. The harness made those gaps explicit but did not prevent all first-pass issues.

follow_up:
  harness_change_needed: false
  decision_or_failure_record: No failure record added; review findings were one-time adoption cleanup.
  include_in_effectiveness_report: true
72 changes: 72 additions & 0 deletions docs/examples/task-outcomes/002-recipe-api-category-feature.yaml
@@ -0,0 +1,72 @@
schema_version: 1

target:
  repository: jihwan4155/recipe-api
  repository_ref: jihwan4155/recipe-api@99af81bf0da4a8bfecb19e5ca0af817b276f49b6
  stack_or_framework: Django REST Framework
  date: 2026-06-03
  agent_or_model: AI coding agent
  reviewer: human maintainer

task:
  id: recipe-api-add-category-feature
  run_id: recipe-api-002
  prompt_summary: Add Category support to the Recipe API.
  prompt_ref: local feature prompt requiring AGENTS.md and docs review
  prompt_hash: not recorded
  comparable_task_group: django-model-api-change
  condition: harnessed-only
  expected_boundary:
    - recipes/models.py
    - recipes/admin.py
    - recipes/serializers.py
    - recipes/tests.py
    - recipes/migrations/**
    - README.md
    - docs/domain/glossary.md
    - docs/decisions/**
  known_failure_mode: Model changes without migrations, missing durable decision memory, or API docs drift.

harness_context:
  harness_doctor_score: recorded locally but not treated as effectiveness proof
  harness_source:
    kit_url: https://github.com/baskduf/harness-starter-kit
    kit_commit: baskduf/harness-starter-kit@94e416b354facffafead6bbb9691af1598139389
    source_tracking_ref: recipe-api/harness-starter-kit
  relevant_instructions:
    - AGENTS.md
    - docs/conventions/coding.md
  relevant_constraints:
    - python manage.py makemigrations --check --dry-run
    - python manage.py check
    - python manage.py test
    - python scripts/check_decision_memory.py --fail-on-warning
  relevant_memory_records:
    - docs/decisions/001-initial-api-design.md
    - docs/decisions/003-add-recipe-categories.md
    - docs/domain/glossary.md

outcome:
  files_changed:
    - recipes/models.py
    - recipes/admin.py
    - recipes/serializers.py
    - recipes/tests.py
    - recipes/migrations/0002_category_recipe_category.py
    - README.md
    - docs/domain/glossary.md
    - docs/decisions/003-add-recipe-categories.md
  wrong_file_edits: 0
  repeated_known_mistake: false
  verification_command: python manage.py check && python manage.py test && python manage.py makemigrations --check --dry-run
  first_pass_verification:
    result: passed_with_review_gap
  drift_violations_detected: []
  human_rework_minutes: 15
  reverted_files: []
  notes: Agent respected file boundaries and created migration plus decision memory. Review found missing PATCH tests for assigning and clearing category_id.

follow_up:
  harness_change_needed: false
  decision_or_failure_record: docs/decisions/003-add-recipe-categories.md
  include_in_effectiveness_report: true
58 changes: 58 additions & 0 deletions docs/examples/task-outcomes/003-recipe-api-category-update-tests.yaml
@@ -0,0 +1,58 @@
schema_version: 1

target:
  repository: jihwan4155/recipe-api
  repository_ref: jihwan4155/recipe-api@99af81bf0da4a8bfecb19e5ca0af817b276f49b6
  stack_or_framework: Django REST Framework
  date: 2026-06-03
  agent_or_model: AI coding agent
  reviewer: human maintainer

task:
  id: recipe-api-category-update-test-hardening
  run_id: recipe-api-003
  prompt_summary: Add PATCH tests for assigning and clearing recipe categories, and fix truncated dependency ADR.
  prompt_ref: local review follow-up prompt
  prompt_hash: not recorded
  comparable_task_group: django-test-hardening
  condition: harnessed-only
  expected_boundary:
    - recipes/tests.py
    - docs/decisions/002-use-requirements-file.md
  known_failure_mode: Incomplete test coverage for documented PUT/PATCH behavior and truncated decision record.

harness_context:
  harness_doctor_score: recorded locally but not treated as effectiveness proof
  harness_source:
    kit_url: https://github.com/baskduf/harness-starter-kit
    kit_commit: baskduf/harness-starter-kit@94e416b354facffafead6bbb9691af1598139389
    source_tracking_ref: recipe-api/harness-starter-kit
  relevant_instructions:
    - AGENTS.md
    - docs/conventions/coding.md
  relevant_constraints:
    - python manage.py check
    - python manage.py test
    - python scripts/check_docs_drift.py
  relevant_memory_records:
    - docs/decisions/002-use-requirements-file.md
    - docs/decisions/003-add-recipe-categories.md

outcome:
  files_changed:
    - recipes/tests.py
    - docs/decisions/002-use-requirements-file.md
  wrong_file_edits: 0
  repeated_known_mistake: false
  verification_command: python manage.py check && python manage.py test && python scripts/check_docs_drift.py
  first_pass_verification:
    result: passed
  drift_violations_detected: []
  human_rework_minutes: 5
  reverted_files: []
  notes: Follow-up tightened test coverage and completed a truncated ADR without broad rewrites.

follow_up:
  harness_change_needed: false
  decision_or_failure_record: Existing ADR updated.
  include_in_effectiveness_report: true