docs: refine effectiveness report classification

baskduf · baskduf · commit 2e73a0fda06e · 2026-06-04T16:52:12.000+09:00
diff --git a/docs/evaluation.md b/docs/evaluation.md
@@ -45,6 +45,12 @@ Use one of these modes:
 Record the mode in the effectiveness report. Do not infer improvement from
 harnessed-only tracking until there is a later comparison point.
 
+Separate non-comparable setup runs from product-task outcomes. Adoption,
+template setup, placeholder-prompt, or other workflow-preparation records can be
+useful operational evidence, but they should not enter comparable product-task
+counts unless they have a concrete task, expected boundary, known failure mode,
+and verification command.
+
 ## Metrics
 
 | Metric | Definition | Example observation |
@@ -74,6 +80,10 @@ harnessed-only tracking until there is a later comparison point.
 7. For each task outcome record, include the repository ref, prompt reference,
    run id, reviewer, harness source, and verification command so later reviewers
    can tell whether two runs are actually comparable.
+8. In aggregate reports, compare expected boundaries with actual changed files,
+   distinguish unknown human rework from 0 minutes, and treat
+   `include_in_effectiveness_report` as separate from inclusion in comparable
+   product-task counts.
 
 ## Minimum Adoption-Time Plan
 
@@ -98,4 +108,4 @@ cheaper to correct after the harness becomes part of the repository.
 
 ## Example Evidence Passes
 
-- [Small harness outcome evidence report](examples/effectiveness-report-small-evidence.md) records three harnessed task outcomes and summarizes a narrow operational evidence pass without treating Harness Doctor scores or passing checks as proof of agent effectiveness.
+- [Small harness outcome evidence report](examples/effectiveness-report-small-evidence.md) records three harnessed task outcomes and summarizes a narrow operational evidence pass without treating Harness Doctor scores or passing checks as proof of agent effectiveness.
diff --git a/docs/templates/effectiveness-report.md b/docs/templates/effectiveness-report.md
@@ -6,7 +6,7 @@
 - Stack and framework: TODO
 - Evaluation date or window: TODO
 - Agent or model: TODO
-- Evaluation mode: TODO: baseline versus harnessed, or harnessed-only tracking
+- Evaluation mode: TODO: baseline-vs-harnessed, harnessed-only-initial-benchmark, or other named mode
 
 ## Task Set
 
@@ -25,12 +25,28 @@
 | Human rework minutes | TODO | TODO | TODO |
 | Reverted files | TODO | TODO | TODO |
 
+## Non-Comparable Setup Runs
+
+Use this section for adoption, template setup, placeholder-prompt, or other
+non-product observations that should be preserved but excluded from comparable
+product-task counts.
+
+| Run | Reason excluded | Use in metrics |
+| --- | --- | --- |
+| TODO | TODO | Excluded from comparable product-task count |
+
 ## Run Log
 
 | Condition | Task ID | Run | Verification result | Notes |
 | --- | --- | ---: | --- | --- |
 | TODO | TODO | TODO | TODO | TODO |
 
+## Changed-Files Consistency
+
+| Task ID | Expected boundary | Actual changed files | Wrong-file edit result |
+| --- | --- | --- | --- |
+| TODO | TODO | TODO | TODO |
+
 ## Source Records
 
 - Task outcome records reviewed: TODO: list `docs/effectiveness/task-outcomes/...`
@@ -40,10 +56,12 @@
 
 ## Interpretation
 
-- What improved: TODO
+- Observed benchmark: TODO: for harnessed-only reports, summarize observations without claiming improvement
+- What improved: TODO: use only when a comparable baseline or later comparison supports this
 - What did not improve: TODO
 - Confounders or limitations: TODO
 - Harness changes to make next: TODO
+- Human rework interpretation: TODO: distinguish unknown from 0 minutes
 
 ## Follow-Up
 
diff --git a/docs/templates/task-outcome.yaml b/docs/templates/task-outcome.yaml
@@ -56,3 +56,4 @@ follow_up:
   harness_change_needed: TODO
   decision_or_failure_record: TODO
   include_in_effectiveness_report: TODO
+  include_in_comparable_product_task_count: TODO
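
The aggregation rules this change describes (keep setup runs out of comparable product-task counts, distinguish unknown human rework from 0 minutes, and compare expected boundaries with actual changed files) could be sketched roughly as follows. This is an illustrative sketch, not part of the commit. Only the two `include_in_*` keys under `follow_up` come from the template. The function name `summarize_outcomes` and the keys `task_id`, `human_rework_minutes`, `expected_boundary`, and `actual_changed_files` are hypothetical, and the sketch assumes unknown rework is stored as `None` and that boundaries are lists of exact file paths.

```python
from typing import Any


def summarize_outcomes(records: list[dict[str, Any]]) -> dict[str, Any]:
    """Aggregate task outcome records without mixing setup runs into product-task counts."""

    def flag(record: dict[str, Any], key: str) -> bool:
        # Only an explicit True counts; a leftover "TODO" or a missing key does not.
        return record.get("follow_up", {}).get(key) is True

    reported = [r for r in records if flag(r, "include_in_effectiveness_report")]
    comparable = [
        r for r in reported if flag(r, "include_in_comparable_product_task_count")
    ]

    # Unknown rework (assumed to be None) is counted separately, never as 0 minutes.
    known_minutes = [
        r["human_rework_minutes"]
        for r in comparable
        if r.get("human_rework_minutes") is not None
    ]

    # Files changed outside the expected boundary, using exact path matching.
    out_of_boundary: dict[str, list[str]] = {}
    for r in comparable:
        expected = set(r.get("expected_boundary", []))
        actual = set(r.get("actual_changed_files", []))
        extra = sorted(actual - expected)
        if extra:
            out_of_boundary[r["task_id"]] = extra

    return {
        "reported_runs": len(reported),
        "comparable_product_tasks": len(comparable),
        "non_comparable_runs": len(reported) - len(comparable),
        "rework_minutes_known_total": sum(known_minutes),
        "rework_unknown_count": len(comparable) - len(known_minutes),
        "out_of_boundary_files": out_of_boundary,
    }
```

Keeping the two flags independent means a setup run can still appear in the report (for example in the Non-Comparable Setup Runs table) while staying out of the comparable product-task count.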