Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
94 changes: 94 additions & 0 deletions docs/checklists/dogfood-evidence-adoption.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
# Dogfood Evidence Adoption Checklist

Use this checklist before adding a target repository as a dogfood report,
README badge, lifecycle result, validation note, or effectiveness example in
this kit.

Dogfood evidence should make the kit easier to evaluate. It should not turn a
single target run into an unsupported effectiveness claim.

## Required Before Adoption

- Source tracking exists and names the kit source, commit, applied profile, and
adoption or setup context, usually in `.harness/source.json`.
- The target repository commit or PR being cited is stable and linkable.
- The report separates non-comparable setup work from comparable product-task
outcomes.
- Each counted product task has a task outcome record with repository ref,
prompt ref or prompt hash, expected boundary, known failure mode, files
changed, first-pass verification, final verification, and inclusion flags.
- The target normal completion gate is named from the target's real workflow.
- Deterministic, local, non-network, reasonably fast behavior checks are either
included in that normal gate or have a recorded reason for focused/manual
placement.
- Live API, credential, provider-uptime, visual, device, slow, watcher, or
otherwise fragile checks are kept outside the normal gate unless the target
intentionally expects them in normal verification.
- Failure records exist for non-transient failed setup checks, failed harness
checks, recurring agent mistakes, cross-environment mismatches, or high-risk
bug paths that should not recur.
- Each failure record names a regression test, fixture, smoke check, lint rule,
drift check, CI gate, or manual review point that detects or prevents
recurrence, or explains why no check is practical.
- Aggregate reports state clearly whether the evidence is baseline-vs-harnessed
or harnessed-only tracking.
- Harnessed-only reports explicitly say they do not prove effectiveness
improvement without a later comparison point.

## Required Checks

Run the target's normal gate and this kit's report validators before adopting
the evidence:

```bash
python scripts/check_harness.py
python /path/to/harness-starter-kit/scripts/check_effectiveness_plan.py
python /path/to/harness-starter-kit/scripts/check_failure_memory.py
```

Use the target's real normal gate if it is not `python scripts/check_harness.py`.
For JavaScript targets, this might be `npm run check:harness`; for framework
targets, it might be `make test`, `just check`, Maven, Gradle, Django, or
another local command.

## Reject Or Defer Adoption When

- The evidence relies on local-only paths without stable repository refs or
prompt hashes.
- Setup failures are excluded from metrics but not evaluated for failure
memory.
- A template or placeholder task outcome is included in the effectiveness report
or comparable product-task count.
- The aggregate report says product tasks are complete while also saying no
product-task records are complete.
- The report uses Harness Doctor, passing checks, or fixture tests as proof of
agent effectiveness.
- The target adopted starter-kit defaults blindly instead of preserving its own
architecture, package manager, docs, commands, and conventions.
- The example would require copying target-specific architecture into generic
templates.

## Report Placement

Use the smallest durable placement that fits the evidence:

- `docs/examples/effectiveness-report-<target>-dogfood.md` for an aggregate
dogfood report.
- `docs/examples/lifecycle-pilot-results.md` for a short lifecycle or dogfood
summary.
- `docs/evaluation.md` for the example index.
- `docs/validation.md` when the target is used as validation or dogfood
evidence.
- README badges only when the target repository is public and intentionally
maintained as dogfood evidence.

## Review Questions

- Does the report preserve the target repository as the source of truth?
- Does it count only comparable product-task outcomes?
- Does it name the target's real normal gate and gate-placement decisions?
- Does it record misses honestly, including wrong-file edits and failed first
verification?
- Does it link failure memory to detection or prevention?
- Does it avoid claiming improvement unless there is a comparable baseline or
later comparison window?
1 change: 1 addition & 0 deletions docs/component-map.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ This map connects harness engineering concepts to files in a target repository.
| External API work recipe | server-only API boundary, redaction, live/mock fallback, and smoke checks | `docs/checklists/external-api-work.md` |
| Decision and failure memory guidance | examples for when to record ADRs, failure notes, domain docs, or final-report notes | `docs/checklists/decision-failure-memory.md` |
| Verification script patterns | custom smoke checks and transparent `check:harness` composition | `docs/checklists/verification-scripts.md` |
| Dogfood evidence adoption | source tracking, task outcome, failure memory, gate placement, and claim-boundary review | `docs/checklists/dogfood-evidence-adoption.md` |
| Stack-specific rules | lint/type/pre-commit/framework snippets | `templates/profiles/*` |
| Stack profile guide | available profiles and how to treat snippets as reference material | `docs/profiles.md` |
| Profile absorption | checklist for turning profile snippets into project rules | `docs/checklists/profile-absorption.md` |
Expand Down
91 changes: 91 additions & 0 deletions docs/decisions/0005-validate-dogfood-evidence-consistency.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,91 @@
# 0005. Validate Dogfood Evidence Consistency Before Adoption

## Status

Accepted

## Date

2026-06-06

## Context

Dogfood evidence is useful only when it preserves the difference between
harness health, setup evidence, comparable product-task outcomes, and actual
agent effectiveness.

During Harness ERP dogfood review, the evidence was directionally strong but
initially exposed two adoption-quality gaps:

- an aggregate effectiveness report could say product-task runs were complete
while later text still said no product-task records were complete
- a task outcome template could accidentally keep inclusion flags enabled and
contaminate future mechanical counts

The target repository remained the source of truth, and the right response was
not to make adoption automatic. The kit needed a small validation and checklist
layer so future dogfood evidence can be accepted or deferred using repeatable
criteria.

## Decision

Extend `scripts/check_effectiveness_plan.py` to validate dogfood evidence
consistency:

- effectiveness reports that claim completed product-task outcomes must not
also contain stale "no completed records yet" language or "record outcomes as
they run" follow-up language
- task outcome templates or placeholder task outcomes must not be included in
effectiveness reports or comparable product-task counts

Ship the same checker behavior in
`templates/generic/scripts/check_effectiveness_plan.py` so target repositories
receive the guard during adoption.

Add `docs/checklists/dogfood-evidence-adoption.md` as a prompt-first review
checklist for deciding whether a dogfood target should become a report,
lifecycle note, validation note, or README badge in this kit.

## Rationale

- These checks catch concrete evidence-quality gaps without inferring
effectiveness improvement from passing tests or Harness Doctor scores.
- The validation remains lightweight and local, using the same standard-library
checker style as the rest of the kit.
- The checklist keeps dogfood adoption prompt-first and reviewable instead of
making the installer or checker copy target-specific architecture into
generic templates.
- Template inclusion flags are high-risk for future aggregation because a
parser can count them even when a human reader understands they are
placeholders.

## Alternatives Considered

- Manual review only: rejected because the Harness ERP review showed the same
stale aggregate text and template inclusion risk can survive until a later
reviewer notices it.
- Parse every task outcome as full YAML: rejected for now because the kit avoids
external dependencies and only needs a few scalar fields for this guard.
- Require baseline-vs-harnessed evidence before dogfood adoption: rejected
because harnessed-only dogfood is still useful operational evidence when it
is labeled correctly and does not claim improvement.

## Consequences

- `scripts/check_effectiveness_plan.py` now checks selected task outcome YAML
records in addition to adoption and effectiveness Markdown reports.
- Dogfood reports that contain stale aggregate completion language fail local
validation.
- Target-local template task outcome files must set inclusion flags to false,
unknown, TODO, or another non-truthy value.
- Future dogfood adoption should cite or run the dogfood evidence checklist
before adding README badges or validation examples.

## Agent Guidance

When adding dogfood evidence to this kit, run the target's normal gate and this
kit's effectiveness and failure-memory validators. Do not adopt the evidence
when template task outcomes are countable, stale aggregate language contradicts
the completed records, setup failures have not been evaluated for failure
memory, or the report implies effectiveness improvement without a comparison
point.
5 changes: 5 additions & 0 deletions docs/evaluation.md
Original file line number Diff line number Diff line change
Expand Up @@ -111,3 +111,8 @@ cheaper to correct after the harness becomes part of the repository.
- [Small harness outcome evidence report](examples/effectiveness-report-small-evidence.md) records three harnessed task outcomes and summarizes a narrow operational evidence pass without treating Harness Doctor scores or passing checks as proof of agent effectiveness.
- [TodayBus harnessed-only dogfood benchmark](examples/effectiveness-report-todaybus-dogfood.md) records three product-task outcomes, excludes a non-comparable setup run, and treats the result as an initial benchmark rather than proof of effectiveness improvement.
- [Harness ERP Spring/Maven dogfood benchmark](examples/effectiveness-report-harness-erp-dogfood.md) records five backend product-task outcomes, one honest boundary miss, prompt hashes, failure-memory linkage, and source tracking as initial benchmark evidence rather than proof of effectiveness improvement.

Before adding a new dogfood report to this kit, use
[`docs/checklists/dogfood-evidence-adoption.md`](checklists/dogfood-evidence-adoption.md)
to verify source tracking, task outcomes, failure memory, gate placement, and
claim boundaries.
Original file line number Diff line number Diff line change
Expand Up @@ -69,4 +69,5 @@ outcome:
follow_up:
harness_change_needed: false
decision_or_failure_record: No failure record added; review findings were one-time adoption cleanup.
include_in_effectiveness_report: true
include_in_effectiveness_report: true
include_in_comparable_product_task_count: false
Original file line number Diff line number Diff line change
Expand Up @@ -69,4 +69,5 @@ outcome:
follow_up:
harness_change_needed: false
decision_or_failure_record: docs/decisions/003-add-recipe-categories.md
include_in_effectiveness_report: true
include_in_effectiveness_report: true
include_in_comparable_product_task_count: true
Original file line number Diff line number Diff line change
Expand Up @@ -55,4 +55,5 @@ outcome:
follow_up:
harness_change_needed: false
decision_or_failure_record: Existing ADR updated.
include_in_effectiveness_report: true
include_in_effectiveness_report: true
include_in_comparable_product_task_count: false
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
# 0006. Dogfood Evidence Consistency Gaps Were Not Checked

## Date Observed

2026-06-06

## Failure Type

Harness maintenance gap and repeated agent mistake risk.

## Goal

Dogfood evidence adopted into this kit should not contain stale aggregate
effectiveness language or count placeholder task outcome templates as real
evidence.

## What Happened Or Was Tried

Harness ERP was used as Spring/Maven dogfood evidence. A review found the
evidence was useful but initially not adoptable as-is:

- the aggregate effectiveness report said five comparable product-task runs
were complete while the interpretation still said no completed product-task
records existed yet
- the target-local task outcome template had inclusion flags set to true, which
could contaminate future mechanical aggregation

The evidence was corrected in the target repository before adoption, but the
starter kit did not yet have a local check that would catch those two gaps for
future dogfood targets.

## Why It Failed

- `scripts/check_effectiveness_plan.py` validated required report sections and
TODO markers, but did not inspect consistency between completed-outcome claims
and stale no-records language.
- The checker did not inspect task outcome YAML records, so a placeholder or
template record could keep inclusion flags enabled without failing
validation.
- Dogfood adoption criteria were implicit in review judgment instead of written
as a reusable checklist.

## Current Replacement

`scripts/check_effectiveness_plan.py` now validates:

- aggregate effectiveness reports that claim completed product-task outcomes do
not also use stale no-completed-records language
- task outcome templates and placeholder task outcomes are not included in
effectiveness reports or comparable product-task counts

`templates/generic/scripts/check_effectiveness_plan.py` carries the same guard
for target repositories. `docs/checklists/dogfood-evidence-adoption.md`
documents the source tracking, task outcome, failure memory, gate placement,
and claim-boundary criteria for adding dogfood evidence to this kit.

## Detection Or Prevention Check

`tests/test_check_effectiveness_plan.py` covers aggregate completion-language
contradictions, task outcome templates with truthy inclusion flags, and
placeholder task outcomes with truthy inclusion flags. `scripts/check_effectiveness_plan.py`
is the local checker that prevents those evidence-quality gaps from passing.

## Agent Guidance

Before adopting dogfood evidence, run `scripts/check_effectiveness_plan.py` and
review `docs/checklists/dogfood-evidence-adoption.md`. Do not count setup-only
runs as comparable product tasks, do not leave template task outcomes countable,
and do not claim effectiveness improvement from harnessed-only evidence.
4 changes: 4 additions & 0 deletions docs/validation.md
Original file line number Diff line number Diff line change
Expand Up @@ -99,6 +99,10 @@ claiming effectiveness improvement:
- [Harness ERP Spring/Maven dogfood benchmark](examples/effectiveness-report-harness-erp-dogfood.md)
for a Spring Boot backend target

Use the
[dogfood evidence adoption checklist](checklists/dogfood-evidence-adoption.md)
before adding another target as validation or effectiveness evidence.

## Example Reports

Use these examples when checking whether a target adoption report is complete:
Expand Down
Loading
Loading