Refactor to put tasks in guide folders, and removes prompts. also add… (#459)
* Refactor to put tasks in guide folders and remove prompts. Also add base app, base app name, and prompt to the eval result, so the dashboard uses them instead of pulling from the deployed task and base_app dirs.
* Run lint and typecheck; update tests.
* Clean up code. Add further optimizations to collection.ts, which uses guideDir. Set each task's first prompt to use the most recent content (not from prompts.md files).
* Sync the tasks (first prompt) with exactly what was in their corresponding -task.md files.
* Make dashboard test and mock results more realistic.
.agents/skills/project-evals/SKILL.md
Lines changed: 11 additions & 14 deletions
@@ -15,23 +15,22 @@ This is the third of three stages in creating guidance:

 **Real-world coding agents see only `guide.md`** — retrieved automatically via the RAG skills system when a developer asks for help. Every other file in a use case directory is eval infrastructure.

-**The eval harness** runs a separate coding agent in a controlled environment to test whether the guidance works. This eval agent receives a prompt from `prompts.md` and has access to `guide.md` via the same RAG system. The harness then runs `grader.ts` against the eval agent's output.
+**The eval harness** runs a separate coding agent in a controlled environment to test whether the guidance works. This eval agent receives the first prompt from `tasks/task.md` and has access to `guide.md` via the same RAG system. The harness then runs `grader.ts` against the eval agent's output.

 None of the following are ever seen by real-world coding agents:

 | File | Role in eval pipeline |
 |---|---|
-| `prompts.md` | Simulated developer prompts fed to the eval agent by the harness |
+| `tasks/task.md` | Simulated developer prompts and base application name fed to the eval agent by the harness |
 | `demo.html` | Reference implementation — grader runs against it to confirm tests pass on correct code |
 | `negative-demo.html` | Anti-example — grader runs against it to confirm tests fail on incorrect code |
 | `expectations.md` | Spec used to generate `grader.ts` |
 | `grader.ts` | Playwright tests run against the eval agent's output |

 ## How the eval files work together

-`prompts.md`, `expectations.md`, and `grader.ts` form a tightly coupled pipeline:
-
-1. **`prompts.md`** — Simulated developer prompts used only by the eval harness. Each prompt should sound like a real developer request, without naming specific APIs or best practices — the eval agent is expected to discover those by reading `guide.md` via RAG. The first prompt is the most important: it drives negative demo generation and is used as the default task.
+`tasks/task.md`, `expectations.md`, and `grader.ts` form a tightly coupled pipeline:
+1. **`tasks/task.md`** — Simulated developer prompts used only by the eval harness. It must start with a YAML frontmatter specifying the `base_app`, followed by a list of prompts. Each prompt should sound like a real developer request, without naming specific APIs or best practices — the eval agent is expected to discover those by reading `guide.md` via RAG. The first prompt is the most important: it is used as the default task.

 2. **`expectations.md`** — The ground truth for what a correct implementation looks like. Each bullet becomes exactly one test in `grader.ts`. Write expectations assuming the eval agent read `guide.md` and implemented it faithfully; they describe the observable output, not the implementation approach.

@@ -67,22 +66,21 @@ This command will automatically:
 1. Generate a `negative-demo.html` based on the guidance.
 2. Generate a `grader.ts` Playwright test that asserts your `expectations.md` against both `demo.html` (should pass) and `negative-demo.html` (should fail).
 3. Test and calibrate the grader by running the test suite.
-4. Generate realistic developer `prompts.md` and a task file to run an agent end-to-end.
-
-The human can manually write or revise the grader or prompts if they wish.
-
-## Writing `prompts.md`

-`prompts.md` contains realistic developer prompts used to run AI agents end-to-end against the guide's grader. Each line is a separate prompt the harness can send to an agent.
+## Writing `tasks/task.md`

-**Format:** Each prompt must be on its own line, prefixed with `- `. No other structure is required.
+`tasks/task.md` contains realistic developer prompts used to run AI agents end-to-end against the guide's grader, prefixed by a YAML frontmatter specifying the base application.

+**Format:**
 ```md
+---
+base_app: daily-grind
+---
 - make my images load faster on the page
 - Optimize the priority of my LCP image 'hero.jpg' and deprioritize the gallery images below the fold.
 ```

-**Critical:** The **first prompt** is the most important — it is what the harness uses for negative demo generation and as the task body in `harness/tasks/<guide-name>-task.md`. The task file body **must always match the first prompt exactly**. If you update the first prompt, update the task file body to match. It must be specific enough to produce a grader-testable result.
+**Critical:** The **first prompt** is the most important. It is used as the default task for the harness, and it must be specific enough to produce a grader-testable result.
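
As a rough illustration of the `tasks/task.md` format described in this hunk, the sketch below reads a task file into its `base_app` and prompt list, then takes the first prompt as the default task. `parseTaskFile` and `ParsedTask` are hypothetical names chosen for this example, not the harness's actual code, and the frontmatter handling assumes only the flat `key: value` form shown in the example rather than full YAML.

```typescript
// Hypothetical helper for illustration only; not taken from the repository.
interface ParsedTask {
  baseApp: string;
  prompts: string[];
}

function parseTaskFile(source: string): ParsedTask {
  const lines = source.split(/\r?\n/);

  // The file is expected to open with a `---` frontmatter delimiter.
  if (lines.length === 0 || lines[0].trim() !== "---") {
    throw new Error("task file must start with YAML frontmatter");
  }
  const end = lines.findIndex((line, i) => i > 0 && line.trim() === "---");
  if (end === -1) {
    throw new Error("frontmatter is not closed");
  }

  // Assumption: only a flat `base_app: <name>` entry needs to be read.
  let baseApp = "";
  for (const line of lines.slice(1, end)) {
    const match = /^base_app:\s*(.+)$/.exec(line.trim());
    if (match) {
      baseApp = match[1].trim();
    }
  }
  if (baseApp === "") {
    throw new Error("frontmatter is missing base_app");
  }

  // Each prompt sits on its own line, prefixed with "- ".
  const prompts = lines
    .slice(end + 1)
    .filter((line) => line.startsWith("- "))
    .map((line) => line.slice(2).trim())
    .filter((prompt) => prompt.length > 0);
  if (prompts.length === 0) {
    throw new Error("task file must contain at least one prompt");
  }

  return { baseApp, prompts };
}

const exampleTask = [
  "---",
  "base_app: daily-grind",
  "---",
  "- make my images load faster on the page",
  "- Optimize the priority of my LCP image 'hero.jpg' and deprioritize the gallery images below the fold.",
].join("\n");

const parsedTask = parseTaskFile(exampleTask);
// The first prompt in the list is the one used as the default task.
const defaultPrompt = parsedTask.prompts[0];
console.log(parsedTask.baseApp, "|", defaultPrompt);
```

Rejecting a file with no frontmatter or no prompts up front is a design choice of this sketch; the real harness may validate differently.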

 **Rules:**
 - DO write prompts as a developer talking to an AI coding assistant — casual, lowercase, sometimes vague.
@@ -105,4 +103,3 @@ The human can manually write or revise the grader or prompts if they wish.
 If `gd dev` fails to calibrate the grader:
 * Read the command output to see which assertions failed.
 * If the grader logic generated by the pipeline is wrong, you may need to tweak the language in `expectations.md` so the generated grader is more accurate, or simply run `gd dev` again (it attempts to fix itself using failure context).
.agents/skills/project-guides/SKILL.md
Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ When a developer asks an AI coding assistant to implement something, the assista
 | `negative-demo.html` | Incorrect implementation used to verify the grader catches failures | ❌ No |
 | `expectations.md` | Source used to generate `grader.ts` | ❌ No |
 | `grader.ts` | Playwright tests run against the eval agent's output | ❌ No |
-| `prompts.md` | Simulated developer prompts used only by the eval harness | ❌ No |
+| `tasks/task.md` | Simulated developer prompts and base application name fed to the eval agent by the harness | ❌ No |

 **Implication for `demo.html`:** Because real agents never see `demo.html`, it does not need to be a polished, production-ready example. It just needs to be a correct, minimal implementation that the grader can pass against. Do not over-engineer it.
@@ -65,19 +64,19 @@ Each guide lives in its own directory (e.g. `guides/performance/batch-analytics-
 | `expectations.md` | SME (human) | Natural-language bulleted list of assertions that must be true if the guidance is followed correctly. Used as input for grader generation. |
 | `negative-demo.html` | Generated (Gemini CLI) | A deliberately incorrect implementation. Must score 0% against the grader. Used for grader calibration. |
 | `grader.ts` | Generated (Gemini CLI) | A Playwright test file that grades any HTML file against the expectations. May include both browser automation checks and static content checks. |
-| `prompts.md` | Generated (Gemini CLI) | Realistic developer prompts (1-2) that an AI coding assistant might receive. Used for agent testing. |
+| `task.md` | Generated (Gemini CLI) | Simulated developer prompts and base_app fed to the eval agent by the harness |

-Additionally, each guide that is ready for evaluation has a **task file** in `harness/tasks/`:
+The **task file** looks like:

 ```yaml
 ---
 base_app: daily-grind
-grader: batch-analytics-events
 ---
-Implement Core Web Vitals monitoring on a web page...
+- Implement Core Web Vitals monitoring on a web page...
+- Alternative prompt...
 ```

-The task file connects a grader (by guide directory name), a base application the agent will modify, and the prompt the agent receives.
+The task file connects a base application the agent will modify, and the prompt the agent receives (first prompt in the list). The grader is implicit (the same directory).
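
One way to picture the implicit grader: once the task file lives inside the guide folder, both it and the grader can be derived from the guide directory alone. The sketch below is a guess at that relationship. `locateTask` and `TaskLocation` are hypothetical names, `guides/example-guide` is a made-up path, and the layout (`tasks/task.md` and `grader.ts` directly under the guide directory) is inferred from the file names in this diff rather than confirmed against the harness.

```typescript
// Hypothetical sketch; the real harness may resolve these paths differently.
interface TaskLocation {
  guideName: string;
  taskFile: string;
  graderFile: string;
}

function locateTask(guideDir: string): TaskLocation {
  // Drop trailing slashes so the last path segment is the guide name.
  const dir = guideDir.replace(/\/+$/, "");
  const guideName = dir.split("/").pop() ?? dir;
  return {
    guideName,
    // Assumed layout: the task file sits in a `tasks/` subfolder of the guide.
    taskFile: `${dir}/tasks/task.md`,
    // The grader needs no explicit reference: it is the guide's own grader.ts.
    graderFile: `${dir}/grader.ts`,
  };
}

const location = locateTask("guides/example-guide/");
console.log(location.guideName, location.taskFile, location.graderFile);
```

Under this assumption, the removed `grader:` frontmatter key carried information that the directory structure already implies.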

 ### Guide maturity stages

@@ -87,8 +86,8 @@ A guide progresses through these stages:
 2. **Incomplete**: Has `guide.md` content but is missing `demo.html` and/or `expectations.md`.
 3. **Needs expectations**: Has guide + demo but no `expectations.md` (or it's empty). Cannot proceed to automated generation without this.
 4. **Needs calibration**: Has all three human-authored files. Ready for `gd dev` to generate `negative-demo.html`, `grader.ts`, and calibrate.
-5. **Needs test**: Grader is calibrated but missing `prompts.md` or a task file. Agent tests haven't been run.
-6. **Eval-ready**: All artifacts exist. The guide is included in `gd eval suite` runs.
+5. **Needs test**: Grader is calibrated but missing `task.md`. Agent tests haven't been run.
+6. **Eval-ready**: All artifacts exist. The guide is included in `gd eval` runs.

 ---

@@ -122,7 +121,7 @@ pnpm link --global && gd setup-completion

 | Command | What it does |
 |---|---|
-| `gd eval` | Run the full evaluation suite (discovers all tasks in `harness/tasks/`) |
+| `gd eval` | Run the full evaluation suite (discovers all tasks in guide folders) |
 | `gd eval [task1] [task2]` | Run specific tasks only |
 | `gd eval --config <custom_config>` | Run with config overrides (`--config my_custom_config.ts`, defaults to `config.ts`, or falls back to defaults in `harness/config.ts`) |