docs/evaluation.md
Lines changed: 11 additions & 1 deletion
@@ -45,6 +45,12 @@ Use one of these modes:
 Record the mode in the effectiveness report. Do not infer improvement from
 harnessed-only tracking until there is a later comparison point.
 
+Separate non-comparable setup runs from product-task outcomes. Adoption,
+template setup, placeholder-prompt, or other workflow-preparation records can be
+useful operational evidence, but they should not enter comparable product-task
+counts unless they had a concrete task, expected boundary, known failure mode,
+and verification command.
+
 ## Metrics
 
 | Metric | Definition | Example observation |
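
Review note: the inclusion rule added in this hunk (a record enters comparable product-task counts only if it had a concrete task, expected boundary, known failure mode, and verification command) could be checked mechanically. The sketch below is one possible reading, not the repository's implementation. The `RunRecord` class and all of its field names are hypothetical and chosen only to mirror the four criteria in the added paragraph.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunRecord:
    """Hypothetical outcome record; field names are assumptions, not the repo's schema."""
    run_id: str
    task: Optional[str] = None
    expected_boundary: Optional[List[str]] = None
    known_failure_mode: Optional[str] = None
    verification_command: Optional[str] = None


def is_comparable_product_task(record: RunRecord) -> bool:
    """True only when all four criteria from the added paragraph are present and non-empty."""
    required = (
        record.task,
        record.expected_boundary,
        record.known_failure_mode,
        record.verification_command,
    )
    return all(bool(value) for value in required)


def split_records(records: List[RunRecord]):
    """Separate comparable product-task records from setup or preparation records."""
    comparable = [r for r in records if is_comparable_product_task(r)]
    setup_only = [r for r in records if not is_comparable_product_task(r)]
    return comparable, setup_only
```

Under this reading, a template-setup run with no verification command stays available as operational evidence in `setup_only` but never contributes to the comparable count.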
@@ -74,6 +80,10 @@ harnessed-only tracking until there is a later comparison point.
 7. For each task outcome record, include the repository ref, prompt reference,
 run id, reviewer, harness source, and verification command so later reviewers
 can tell whether two runs are actually comparable.
+8. In aggregate reports, compare expected boundaries with actual changed files,
+distinguish unknown human rework from 0 minutes, and treat
+`include_in_effectiveness_report` as separate from inclusion in comparable
+product-task counts.
 
 ## Minimum Adoption-Time Plan
 
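
Review note: the three aggregate-report rules in the new step 8 can be illustrated with a short sketch. Everything here is an assumption made for illustration. Expected boundaries are modeled as path prefixes, unknown human rework is modeled as `None`, and the `comparable_product_task` key is a hypothetical companion to the `include_in_effectiveness_report` flag named in the diff.

```python
from typing import Dict, List, Optional


def out_of_boundary_files(expected_boundary: List[str], changed_files: List[str]) -> List[str]:
    """Return changed files that fall outside every expected path prefix.

    Prefixes should end with "/" so that "src/" does not also match "src2/".
    """
    return sorted(
        path
        for path in changed_files
        if not any(path.startswith(prefix) for prefix in expected_boundary)
    )


def summarize_rework(rework_minutes: List[Optional[int]]) -> Dict[str, int]:
    """Keep unknown rework (None) separate from a measured 0 minutes."""
    known = [m for m in rework_minutes if m is not None]
    return {
        "known_runs": len(known),
        "unknown_runs": len(rework_minutes) - len(known),
        "total_known_minutes": sum(known),
    }


def count_inclusion(records: List[dict]) -> Dict[str, int]:
    """Count report inclusion and comparable-task inclusion independently."""
    return {
        "in_report": sum(1 for r in records if r.get("include_in_effectiveness_report")),
        "comparable": sum(1 for r in records if r.get("comparable_product_task")),
    }
```

Treating `None` as its own bucket avoids averaging unknown rework in as zero, which would make harnessed runs look cheaper than the evidence supports.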
@@ -98,4 +108,4 @@ cheaper to correct after the harness becomes part of the repository.
 
 ## Example Evidence Passes
 
--[Small harness outcome evidence report](examples/effectiveness-report-small-evidence.md) records three harnessed task outcomes and summarizes a narrow operational evidence pass without treating Harness Doctor scores or passing checks as proof of agent effectiveness.
+-[Small harness outcome evidence report](examples/effectiveness-report-small-evidence.md) records three harnessed task outcomes and summarizes a narrow operational evidence pass without treating Harness Doctor scores or passing checks as proof of agent effectiveness.