Commit 2e73a0f

docs: refine effectiveness report classification
1 parent 73fea8f commit 2e73a0f

3 files changed

Lines changed: 32 additions & 3 deletions

docs/evaluation.md

Lines changed: 11 additions & 1 deletion
@@ -45,6 +45,12 @@ Use one of these modes:
 Record the mode in the effectiveness report. Do not infer improvement from
 harnessed-only tracking until there is a later comparison point.
 
+Separate non-comparable setup runs from product-task outcomes. Adoption,
+template setup, placeholder-prompt, or other workflow-preparation records can be
+useful operational evidence, but they should not enter comparable product-task
+counts unless they had a concrete task, expected boundary, known failure mode,
+and verification command.
+
 ## Metrics
 
 | Metric | Definition | Example observation |
@@ -74,6 +80,10 @@ harnessed-only tracking until there is a later comparison point.
 7. For each task outcome record, include the repository ref, prompt reference,
 run id, reviewer, harness source, and verification command so later reviewers
 can tell whether two runs are actually comparable.
+8. In aggregate reports, compare expected boundaries with actual changed files,
+distinguish unknown human rework from 0 minutes, and treat
+`include_in_effectiveness_report` as separate from inclusion in comparable
+product-task counts.
 
 ## Minimum Adoption-Time Plan
 
@@ -98,4 +108,4 @@ cheaper to correct after the harness becomes part of the repository.
 
 ## Example Evidence Passes
 
-- [Small harness outcome evidence report](examples/effectiveness-report-small-evidence.md) records three harnessed task outcomes and summarizes a narrow operational evidence pass without treating Harness Doctor scores or passing checks as proof of agent effectiveness.
+- [Small harness outcome evidence report](examples/effectiveness-report-small-evidence.md) records three harnessed task outcomes and summarizes a narrow operational evidence pass without treating Harness Doctor scores or passing checks as proof of agent effectiveness.
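
Review note: the classification rule added to `docs/evaluation.md` can be read as a simple eligibility check. The sketch below is one possible reading, not code from this repository. The field names (`task`, `expected_boundary`, `known_failure_mode`, `verification_command`) are hypothetical stand-ins for whatever a task outcome record actually stores, and treating a bare `TODO` as unfilled is an assumption based on the templates.

```python
# Hypothetical field names; the real task-outcome schema may differ.
REQUIRED_FIELDS = (
    "task",
    "expected_boundary",
    "known_failure_mode",
    "verification_command",
)


def is_filled(value):
    """Treat None, empty strings, and bare TODO placeholders as unfilled."""
    if value is None:
        return False
    text = str(value).strip()
    return text != "" and text.upper() != "TODO"


def is_comparable_product_task(record):
    """Return True only when all four required fields are filled in."""
    return all(is_filled(record.get(name)) for name in REQUIRED_FIELDS)


def split_runs(records):
    """Separate comparable product-task runs from non-comparable setup runs."""
    comparable, setup = [], []
    for record in records:
        if is_comparable_product_task(record):
            comparable.append(record)
        else:
            setup.append(record)
    return comparable, setup
```

Under this reading, a setup run is still preserved (it lands in the second list) and is simply kept out of the comparable count.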

docs/templates/effectiveness-report.md

Lines changed: 20 additions & 2 deletions
@@ -6,7 +6,7 @@
 - Stack and framework: TODO
 - Evaluation date or window: TODO
 - Agent or model: TODO
-- Evaluation mode: TODO: baseline versus harnessed, or harnessed-only tracking
+- Evaluation mode: TODO: baseline-vs-harnessed, harnessed-only-initial-benchmark, or other named mode
 
 ## Task Set
 
@@ -25,12 +25,28 @@
 | Human rework minutes | TODO | TODO | TODO |
 | Reverted files | TODO | TODO | TODO |
 
+## Non-Comparable Setup Runs
+
+Use this section for adoption, template setup, placeholder-prompt, or other
+non-product observations that should be preserved but excluded from comparable
+product-task counts.
+
+| Run | Reason excluded | Use in metrics |
+| --- | --- | --- |
+| TODO | TODO | Excluded from comparable product-task count |
+
 ## Run Log
 
 | Condition | Task ID | Run | Verification result | Notes |
 | --- | --- | ---: | --- | --- |
 | TODO | TODO | TODO | TODO | TODO |
 
+## Changed-Files Consistency
+
+| Task ID | Expected boundary | Actual changed files | Wrong-file edit result |
+| --- | --- | --- | --- |
+| TODO | TODO | TODO | TODO |
+
 ## Source Records
 
 - Task outcome records reviewed: TODO: list `docs/effectiveness/task-outcomes/...`
@@ -40,10 +56,12 @@
 
 ## Interpretation
 
-- What improved: TODO
+- Observed benchmark: TODO: for harnessed-only reports, summarize observations without claiming improvement
+- What improved: TODO: use only when a comparable baseline or later comparison supports this
 - What did not improve: TODO
 - Confounders or limitations: TODO
 - Harness changes to make next: TODO
+- Human rework interpretation: TODO: distinguish unknown from 0 minutes
 
 ## Follow-Up
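
Review note: the new Changed-Files Consistency table and the "distinguish unknown from 0 minutes" guidance could be filled in with helpers along these lines. This is a minimal sketch under stated assumptions, not the project's tooling: it assumes an expected boundary is expressed as a list of glob patterns and that unknown rework is recorded as `None`. Both representations, and the function names, are hypothetical.

```python
from fnmatch import fnmatch


def wrong_file_edits(expected_boundary, changed_files):
    """Return changed files that match none of the expected boundary patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatch(path, pattern) for pattern in expected_boundary)
    ]


def summarize_rework(minutes_per_run):
    """Summarize human rework, keeping unknown (None) apart from 0 minutes."""
    known = [m for m in minutes_per_run if m is not None]
    return {
        "known_runs": len(known),
        "unknown_runs": len(minutes_per_run) - len(known),
        "total_known_minutes": sum(known),
        # No mean is reported when nothing was measured.
        "mean_known_minutes": sum(known) / len(known) if known else None,
    }
```

Reporting the unknown count next to the known total keeps a missing measurement from being read as "no rework was needed".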

docs/templates/task-outcome.yaml

Lines changed: 1 addition & 0 deletions
@@ -56,3 +56,4 @@ follow_up:
 harness_change_needed: TODO
 decision_or_failure_record: TODO
 include_in_effectiveness_report: TODO
+include_in_comparable_product_task_count: TODO
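
Review note: because `include_in_comparable_product_task_count` is now a separate key from `include_in_effectiveness_report`, an aggregate report would tally the two independently. The sketch below illustrates that under assumptions: records are treated as flat dictionaries already loaded from YAML (the real nesting is not visible in this hunk), the values are assumed to be booleans or `true`/`false` strings, and anything else (including the template's `TODO`) counts as undecided.

```python
REPORT_KEY = "include_in_effectiveness_report"
COMPARABLE_KEY = "include_in_comparable_product_task_count"


def as_flag(value):
    """Interpret a value as True, False, or None when still undecided."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        text = value.strip().lower()
        if text in ("true", "yes"):
            return True
        if text in ("false", "no"):
            return False
    return None  # e.g. the template's TODO placeholder


def tally(records):
    """Count report inclusion and comparable-count inclusion separately."""
    counts = {"in_report": 0, "comparable": 0, "undecided": 0}
    for record in records:
        in_report = as_flag(record.get(REPORT_KEY))
        comparable = as_flag(record.get(COMPARABLE_KEY))
        if in_report is None or comparable is None:
            counts["undecided"] += 1
        if in_report:
            counts["in_report"] += 1
        if comparable:
            counts["comparable"] += 1
    return counts
```

A setup run can therefore appear in the report (`in_report`) without inflating the comparable product-task count.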
