Commit eb420ed

Refactor to put tasks in guide folders, and removes prompts. also add… (#459)
* refactor to put tasks in guide folders, and removes prompts. also adds base app, base app name, and prompt to eval result, to use in dashboard instead of pulling from deployed task and base_app dirs
* run lint, typecheck, update tests
* clean up code. add further optimizations to collection.ts which uses guideDir. set tasks' first prompts to use the most recent content (not from prompts.md files)
* sync the tasks (first prompt) with exactly what was in their corresponding -task.md files
* make dashboard test and mock results more realistic
1 parent e13c43e commit eb420ed

155 files changed

Lines changed: 463 additions & 718 deletions

.agents/skills/project-evals/SKILL.md

Lines changed: 11 additions & 14 deletions
Original file line number | Diff line number | Diff line change
@@ -15,23 +15,22 @@ This is the third of three stages in creating guidance:
1515

1616
**Real-world coding agents see only `guide.md`** — retrieved automatically via the RAG skills system when a developer asks for help. Every other file in a use case directory is eval infrastructure.
1717

18-
**The eval harness** runs a separate coding agent in a controlled environment to test whether the guidance works. This eval agent receives a prompt from `prompts.md` and has access to `guide.md` via the same RAG system. The harness then runs `grader.ts` against the eval agent's output.
18+
**The eval harness** runs a separate coding agent in a controlled environment to test whether the guidance works. This eval agent receives the first prompt from `tasks/task.md` and has access to `guide.md` via the same RAG system. The harness then runs `grader.ts` against the eval agent's output.
1919

2020
None of the following are ever seen by real-world coding agents:
2121

2222
| File | Role in eval pipeline |
2323
|---|---|
24-
| `prompts.md` | Simulated developer prompts fed to the eval agent by the harness |
24+
| `tasks/task.md` | Simulated developer prompts and base application name fed to the eval agent by the harness |
2525
| `demo.html` | Reference implementation — grader runs against it to confirm tests pass on correct code |
2626
| `negative-demo.html` | Anti-example — grader runs against it to confirm tests fail on incorrect code |
2727
| `expectations.md` | Spec used to generate `grader.ts` |
2828
| `grader.ts` | Playwright tests run against the eval agent's output |
2929

3030
## How the eval files work together
3131

32-
`prompts.md`, `expectations.md`, and `grader.ts` form a tightly coupled pipeline:
33-
34-
1. **`prompts.md`** — Simulated developer prompts used only by the eval harness. Each prompt should sound like a real developer request, without naming specific APIs or best practices — the eval agent is expected to discover those by reading `guide.md` via RAG. The first prompt is the most important: it drives negative demo generation and is used as the default task.
32+
`tasks/task.md`, `expectations.md`, and `grader.ts` form a tightly coupled pipeline:
33+
1. **`tasks/task.md`** — Simulated developer prompts used only by the eval harness. It must start with a YAML frontmatter specifying the `base_app`, followed by a list of prompts. Each prompt should sound like a real developer request, without naming specific APIs or best practices — the eval agent is expected to discover those by reading `guide.md` via RAG. The first prompt is the most important: it is used as the default task.
3534

3635
2. **`expectations.md`** — The ground truth for what a correct implementation looks like. Each bullet becomes exactly one test in `grader.ts`. Write expectations assuming the eval agent read `guide.md` and implemented it faithfully; they describe the observable output, not the implementation approach.
3736

@@ -67,22 +66,21 @@ This command will automatically:
6766
1. Generate a `negative-demo.html` based on the guidance.
6867
2. Generate a `grader.ts` Playwright test that asserts your `expectations.md` against both `demo.html` (should pass) and `negative-demo.html` (should fail).
6968
3. Test and calibrate the grader by running the test suite.
70-
4. Generate realistic developer `prompts.md` and a task file to run an agent end-to-end.
71-
72-
The human can manually write or revise the grader or prompts if they wish.
73-
74-
## Writing `prompts.md`
7569

76-
`prompts.md` contains realistic developer prompts used to run AI agents end-to-end against the guide's grader. Each line is a separate prompt the harness can send to an agent.
70+
## Writing `tasks/task.md`
7771

78-
**Format:** Each prompt must be on its own line, prefixed with `- `. No other structure is required.
72+
`tasks/task.md` contains realistic developer prompts used to run AI agents end-to-end against the guide's grader, prefixed by a YAML frontmatter specifying the base application.
7973

74+
**Format:**
8075
```md
76+
---
77+
base_app: daily-grind
78+
---
8179
- make my images load faster on the page
8280
- Optimize the priority of my LCP image 'hero.jpg' and deprioritize the gallery images below the fold.
8381
```
8482

85-
**Critical:** The **first prompt** is the most important — it is what the harness uses for negative demo generation and as the task body in `harness/tasks/<guide-name>-task.md`. The task file body **must always match the first prompt exactly**. If you update the first prompt, update the task file body to match. It must be specific enough to produce a grader-testable result.
83+
**Critical:** The **first prompt** is the most important. It is used as the default task for the harness, and it must be specific enough to produce a grader-testable result.
8684

8785
**Rules:**
8886
- DO write prompts as a developer talking to an AI coding assistant — casual, lowercase, sometimes vague.
@@ -105,4 +103,3 @@ The human can manually write or revise the grader or prompts if they wish.
105103
If `gd dev` fails to calibrate the grader:
106104
* Read the command output to see which assertions failed.
107105
* If the grader logic generated by the pipeline is wrong, you may need to tweak the language in `expectations.md` so the generated grader is more accurate, or simply run `gd dev` again (it attempts to fix itself using failure context).
108-
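The `tasks/task.md` format described in this file (YAML frontmatter with `base_app`, then `- ` prefixed prompts, the first of which is the default task) can be illustrated with a small parser. This is a minimal sketch, not the harness's actual code: `ParsedTask`, `parseTaskFile`, and `defaultPrompt` are hypothetical names, and the frontmatter is read with a simple line scan rather than a YAML library.

```typescript
// Hypothetical result shape; the repository's real types are not shown in this diff.
interface ParsedTask {
  baseApp: string;
  prompts: string[];
}

// Parses the documented layout: '---', 'base_app: <name>', '---', then '- <prompt>' lines.
function parseTaskFile(source: string): ParsedTask {
  const lines = source.split(/\r?\n/);
  if (lines[0]?.trim() !== '---') {
    throw new Error('task.md must start with YAML frontmatter');
  }
  const end = lines.findIndex((line, i) => i > 0 && line.trim() === '---');
  if (end === -1) {
    throw new Error('frontmatter is not terminated by ---');
  }

  // Assumption: base_app is a plain scalar on a single line.
  let baseApp = '';
  for (const line of lines.slice(1, end)) {
    const match = /^base_app:\s*(.+)$/.exec(line.trim());
    if (match) baseApp = match[1].trim();
  }
  if (!baseApp) {
    throw new Error('frontmatter must specify base_app');
  }

  // Every line after the frontmatter that starts with '- ' is one prompt.
  const prompts = lines
    .slice(end + 1)
    .filter((line) => line.startsWith('- '))
    .map((line) => line.slice(2).trim())
    .filter((prompt) => prompt.length > 0);
  if (prompts.length === 0) {
    throw new Error('task.md must list at least one prompt');
  }
  return { baseApp, prompts };
}

// The first prompt is the default task.
function defaultPrompt(task: ParsedTask): string {
  return task.prompts[0];
}
```

Rejecting a file without frontmatter or prompts up front matches the documented requirement that the first prompt always exists and is grader-testable.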

.agents/skills/project-guides/SKILL.md

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ When a developer asks an AI coding assistant to implement something, the assista
2222
| `negative-demo.html` | Incorrect implementation used to verify the grader catches failures | ❌ No |
2323
| `expectations.md` | Source used to generate `grader.ts` | ❌ No |
2424
| `grader.ts` | Playwright tests run against the eval agent's output | ❌ No |
25-
| `prompts.md` | Simulated developer prompts used only by the eval harness | ❌ No |
25+
| `tasks/task.md` | Simulated developer prompts and base application name fed to the eval agent by the harness | ❌ No |
2626

2727
**Implication for `demo.html`:** Because real agents never see `demo.html`, it does not need to be a polished, production-ready example. It just needs to be a correct, minimal implementation that the grader can pass against. Do not over-engineer it.
2828

CONTEXT.md

Lines changed: 14 additions & 16 deletions
@@ -38,7 +38,6 @@ guidance/
3838
config.ts # Central configuration (agent selection, MCP servers, etc.)
3939
run_suite.ts # Suite runner (discovers tasks, runs agents, grades output)
4040
evaluate.ts # Evaluation and reporting
41-
tasks/ # Task files that define eval scenarios
4241
base_apps/ # Base applications that agents modify (e.g. daily-grind)
4342
agents/ # Agent runner scripts (gemini_cli, claude_code, jetski)
4443
lib/ # Shared utilities (isolation, file helpers)
@@ -65,19 +64,19 @@ Each guide lives in its own directory (e.g. `guides/performance/batch-analytics-
6564
| `expectations.md` | SME (human) | Natural-language bulleted list of assertions that must be true if the guidance is followed correctly. Used as input for grader generation. |
6665
| `negative-demo.html` | Generated (Gemini CLI) | A deliberately incorrect implementation. Must score 0% against the grader. Used for grader calibration. |
6766
| `grader.ts` | Generated (Gemini CLI) | A Playwright test file that grades any HTML file against the expectations. May include both browser automation checks and static content checks. |
68-
| `prompts.md` | Generated (Gemini CLI) | Realistic developer prompts (1-2) that an AI coding assistant might receive. Used for agent testing. |
67+
| `task.md` | Generated (Gemini CLI) | Simulated developer prompts and base_app fed to the eval agent by the harness |
6968

70-
Additionally, each guide that is ready for evaluation has a **task file** in `harness/tasks/`:
69+
The **task file** looks like:
7170

7271
```yaml
7372
---
7473
base_app: daily-grind
75-
grader: batch-analytics-events
7674
---
77-
Implement Core Web Vitals monitoring on a web page...
75+
- Implement Core Web Vitals monitoring on a web page...
76+
- Alternative prompt...
7877
```
7978
80-
The task file connects a grader (by guide directory name), a base application the agent will modify, and the prompt the agent receives.
79+
The task file connects a base application the agent will modify, and the prompt the agent receives (first prompt in the list). The grader is implicit (the same directory).
8180
8281
### Guide maturity stages
8382
@@ -87,8 +86,8 @@ A guide progresses through these stages:
8786
2. **Incomplete**: Has `guide.md` content but is missing `demo.html` and/or `expectations.md`.
8887
3. **Needs expectations**: Has guide + demo but no `expectations.md` (or it's empty). Cannot proceed to automated generation without this.
8988
4. **Needs calibration**: Has all three human-authored files. Ready for `gd dev` to generate `negative-demo.html`, `grader.ts`, and calibrate.
90-
5. **Needs test**: Grader is calibrated but missing `prompts.md` or a task file. Agent tests haven't been run.
91-
6. **Eval-ready**: All artifacts exist. The guide is included in `gd eval suite` runs.
89+
5. **Needs test**: Grader is calibrated but missing `task.md`. Agent tests haven't been run.
90+
6. **Eval-ready**: All artifacts exist. The guide is included in `gd eval` runs.
9291

9392
---
9493

@@ -122,7 +121,7 @@ pnpm link --global && gd setup-completion
122121

123122
| Command | What it does |
124123
|---|---|
125-
| `gd eval` | Run the full evaluation suite (discovers all tasks in `harness/tasks/`) |
124+
| `gd eval` | Run the full evaluation suite (discovers all tasks in guide folders) |
126125
| `gd eval [task1] [task2]` | Run specific tasks only |
127126
| `gd eval --config <custom_config>` | Run with config overrides (`--config my_custom_config.ts`, defaults to `config.ts`, or falls back to defaults in `harness/config.ts`) |
128127
| `gd dashboard` | Start the eval results dashboard (eval-view) |
@@ -151,8 +150,7 @@ Runs the grader against both `demo.html` (should pass 100%) and `negative-demo.h
151150

152151
### Step 5: Agent test (runs by default)
153152
After successful calibration:
154-
1. Generates `prompts.md` if missing (via Gemini CLI, using the base app as context)
155-
2. Finds or creates a task file in `harness/tasks/`
153+
1. Generates `task.md` if missing (via Gemini CLI, using the base app as context)
156154
3. Grades the base app as-is (pre-score baseline)
157155
4. Runs the configured agent in both **unguided** (no MCP guide access) and **guided** (with MCP guide access) modes
158156
5. Grades both outputs and prints a comparison showing guide impact
@@ -171,12 +169,12 @@ The eval harness measures whether guides actually improve agent output.
171169

172170
### How a suite run works (`gd eval`)
173171

174-
1. **Build MCP index**: Compiles all guides into the MCP server's searchable index.
175-
2. **Discover tasks**: Scans `harness/tasks/*.md` for task definitions (or uses explicitly configured tasks).
172+
1. **Build Guide Index**: Compiles all guides into a searchable index (RAG).
173+
2. **Discover tasks**: Scans guide directories for `task.md` definitions (or uses explicitly configured tasks).
176174
3. **For each task, for each run** (configurable `numRuns`, default 2):
177175
- Set up an isolated working directory with the base app
178-
- Run the agent in **unguided mode** (no MCP servers)
179-
- Run the agent in **guided mode** (with configured MCP servers)
176+
- Run the agent in **unguided mode** (no guidance)
177+
- Run the agent in **guided mode** (with configured guidance)
180178
- Grade both outputs using the task's grader
181179
4. **Generate reports**: JSON results + HTML report in the output directory.
182180
5. **Upload** (optional): `pnpm upload <suite-name>` pushes results to GCS for the dashboard.
@@ -365,7 +363,7 @@ MCP_API_KEY=...
365363

366364
Suite configuration in `harness/config.ts`:
367365
- `numRuns`: Number of agent runs per task (default: 2)
368-
- `tasks`: Empty array = discover all tasks in `harness/tasks/`. Set explicitly to run a subset.
366+
- `tasks`: Empty array = discover all tasks by scanning guide folders. Set explicitly to run a subset.
369367
- `mcpServersToEnable`: Which MCP servers agents can access (`['modern-web']`, `['google-developer-knowledge']`, or both)
370368
- `serving`: The approach used to serve guidance (`skills_cli`, `skills`, or `mcp`)
371369
- `agent`: Which agent to use (`Agents.GEMINI_CLI`, `Agents.CLAUDE_CODE`, `Agents.JETSKI`)
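The suite flow described in "How a suite run works" (for each task, for each run, an unguided and a guided agent run that are both graded) can be sketched as a nested loop. This is an illustrative sketch under stated assumptions: `runSuite`, `RunAndGrade`, `RunResult`, and `guideImpact` are hypothetical names, the agent and grader are stood in for by one injected callback, and the "guide impact" figure is assumed here to be the difference between mean guided and mean unguided scores.

```typescript
type Mode = 'unguided' | 'guided';

// Hypothetical record of one graded agent run.
interface RunResult {
  task: string;
  run: number;
  mode: Mode;
  score: number;
}

// Stand-in for "run the agent, then grade its output"; returns a score.
type RunAndGrade = (task: string, mode: Mode, run: number) => Promise<number>;

async function runSuite(
  tasks: string[],
  numRuns: number,
  runAndGrade: RunAndGrade,
): Promise<RunResult[]> {
  const results: RunResult[] = [];
  for (const task of tasks) {
    for (let run = 1; run <= numRuns; run++) {
      // Each run exercises the agent without and then with guidance.
      for (const mode of ['unguided', 'guided'] as const) {
        const score = await runAndGrade(task, mode, run);
        results.push({ task, run, mode, score });
      }
    }
  }
  return results;
}

// Assumed metric: mean guided score minus mean unguided score for one task.
function guideImpact(results: RunResult[], task: string): number {
  const mean = (mode: Mode): number => {
    const scores = results
      .filter((r) => r.task === task && r.mode === mode)
      .map((r) => r.score);
    return scores.length === 0 ? 0 : scores.reduce((a, b) => a + b, 0) / scores.length;
  };
  return mean('guided') - mean('unguided');
}
```

Injecting the runner keeps the loop testable without launching a real agent; the real harness also sets up an isolated working directory per run, which this sketch omits.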

README.md

Lines changed: 1 addition & 1 deletion
@@ -70,7 +70,7 @@ gd dev [dir] [options] # auto-generate/calibrate
7070

7171
# Evaluation
7272
gd eval # run the full evaluation suite
73-
gd eval [task1] [task2] # run specific tasks
73+
gd eval [task1] [task2] # run specific tasks (which are the names of guides e.g. `batch-analytics-events`)
7474
gd eval --config <custom_config> # run with config overrides (defaults to config.ts, or harness/config.ts)
7575
gd dashboard # start the evaluation dashboard
7676

bin/gd.ts

Lines changed: 4 additions & 9 deletions
@@ -8,7 +8,8 @@ import omelette from 'omelette';
88
import { pathToFileURL } from 'url';
99
import { cRed, cCyan, cBold, cDim } from '../lib/colors.ts';
1010
import { Serving, mergeSuiteConfig, type SuiteConfig } from '../harness/config.ts';
11-
import { rootDir, guidesDir, tasksDir, baseAppsDir, evalViewDir } from '../lib/paths.ts';
11+
import { rootDir, guidesDir, baseAppsDir, evalViewDir } from '../lib/paths.ts';
12+
import { getTaskMap } from '../lib/guide-validation.ts';
1213

1314
// Load environment variables (Node 20.12+)
1415
try {
@@ -38,12 +39,12 @@ function listGuideDirs(): string[] {
3839
const completion = omelette('gd <command> <arg1> <arg2>');
3940

4041
completion.on('command', ({ reply }) => {
41-
reply(['dev', 'dev-all', 'grade', 'test', 'gen', 'audit', 'eval', 'run', 'dashboard', 'deploy', 'upload', 'baselinestatus', 'setup-completion', 'gen-negative-suite']);
42+
reply(['dev', 'dev-all', 'grade', 'test', 'gen', 'audit', 'eval', 'run', 'dashboard', 'deploy', 'upload', 'baselinestatus', 'setup-completion']);
4243
});
4344

4445
completion.on('arg1', ({ before, reply }) => {
4546
if (before === 'eval') {
46-
const tasks = fs.existsSync(tasksDir) ? fs.readdirSync(tasksDir).filter(f => f.endsWith('.md')).map(f => f.replace('.md', '')) : [];
47+
const tasks = Array.from(getTaskMap().keys());
4748
reply(['suite', ...tasks]);
4849
} else if (before === 'gen') {
4950
reply(['grader', 'negative']);
@@ -154,7 +155,6 @@ ${cBold('Evaluation:')}
154155
${cCyan('run')} <tmpl> <prompt> Run an ad-hoc agent test against a template
155156
${cCyan('deploy')} Deploy the dashboard to GitHub Pages
156157
${cCyan('upload')} <suite> Upload generated evaluation suite to GCS
157-
${cCyan('gen-negative-suite')} Generate resources for negative suite
158158
159159
${cBold('Other:')}
160160
${cCyan('baselinestatus')} <query> Check browser support and Baseline status
@@ -279,11 +279,6 @@ ${cBold('Options:')}
279279
process.exit(code);
280280
}
281281

282-
case 'gen-negative-suite': {
283-
const { generateNegativeSuite } = await import('../guides/negative-suite-gen.ts');
284-
await generateNegativeSuite();
285-
break;
286-
}
287282

288283
default: {
289284
// Legacy fallbacks — guide namespace was flattened
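The completion handler above now derives task names from `getTaskMap()`, whose implementation is not part of the visible diff. The sketch below shows one way such a map could be built by scanning guide folders. It is an assumption-laden illustration: `discoverTasks` is a hypothetical name, the task file is assumed to live at `<guide>/tasks/task.md`, and the task name is assumed to be the guide directory name.

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

// Returns a map from guide directory name to the path of its task file.
// Assumption: guides may be nested under category folders, e.g.
// guides/performance/batch-analytics-events/tasks/task.md.
function discoverTasks(guidesRoot: string): Map<string, string> {
  const tasks = new Map<string, string>();

  const walk = (dir: string): void => {
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
      if (!entry.isDirectory()) continue;
      const child = path.join(dir, entry.name);
      const taskFile = path.join(child, 'tasks', 'task.md');
      if (fs.existsSync(taskFile)) {
        // This directory is a guide with a task; do not descend further.
        tasks.set(entry.name, taskFile);
      } else {
        walk(child);
      }
    }
  };

  walk(guidesRoot);
  return tasks;
}
```

With a map like this, `Array.from(discoverTasks(root).keys())` would yield the names offered for `gd eval` completion, in the same way the diff uses `getTaskMap().keys()`.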

config.ts.example

Lines changed: 2 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,7 @@
1-
import { Agents, Serving } from './harness/config.ts';
2-
import type { SuiteConfig } from './harness/config.ts';
1+
import { Agents, Serving, type SuiteConfig } from './harness/config.ts';
32

43
const customConfig: Partial<SuiteConfig> = {
5-
agent: Agents.CLAUDE_CODE,
4+
agent: Agents.GEMINI_CLI,
65
numRuns: 1,
76
negative: false,
87
serving: Serving.SKILLS_CLI

eval-view/api.js

Lines changed: 29 additions & 6 deletions
@@ -147,17 +147,40 @@ export class ApiClient {
147147

148148
/** Resolves the correct base path for specific run details, parsing legacy logic. */
149149
async getResultInfo(testId, run, testName) {
150-
const [appName, _, runType] = testName.split(' - ');
151-
const actualBaseApp = run.baseApp || appName;
150+
const [taskName, guideName, runType] = testName.split(' - ');
151+
const actualBaseApp = run.baseApp;
152+
let logicalBasePath = `${testId}/${run.runNumber}/${guideName}/${taskName}/${runType}`;
153+
let entryPointPath = await this._findBestEntryPoint(logicalBasePath);
152154

153-
const logicalBasePath = `${testId}/${run.runNumber}/${appName}/${runType}`;
154-
const entryPointPath = await this._findBestEntryPoint(logicalBasePath);
155+
// Fallback for older results stored in a depth-2 folder structure (runDir/taskName/runType)
156+
if (!entryPointPath) {
157+
const legacyPath = `${testId}/${run.runNumber}/${taskName}/${runType}`;
158+
const legacyEntryPoint = await this._findBestEntryPoint(legacyPath);
159+
if (legacyEntryPoint) {
160+
logicalBasePath = legacyPath;
161+
entryPointPath = legacyEntryPoint;
162+
} else {
163+
entryPointPath = `${logicalBasePath}/index.html`; // default fallback
164+
}
165+
}
155166

156167
// Calculate relative sub-path to build the setup apps correlation
157168
const relativePath = entryPointPath.replace(logicalBasePath + '/', '');
158169

170+
// Try run-local base_app first (at the appName level, not inside guided/unguided), fallback to centralized base_apps for older runs
171+
const localBaseAppPath = `${testId}/${run.runNumber}/${guideName}/${taskName}/base_app/${relativePath}`;
172+
let exists = false;
173+
174+
if (this.source === 'remote') {
175+
exists = await this._checkRemoteFileExists(localBaseAppPath);
176+
} else {
177+
exists = await this._checkLocalFileExists(localBaseAppPath);
178+
}
179+
180+
const setupPath = exists ? localBaseAppPath : `base_apps/${actualBaseApp}/${relativePath}`;
181+
159182
return {
160-
setupPath: `base_apps/${actualBaseApp}/${relativePath}`,
183+
setupPath,
161184
resultPath: entryPointPath,
162185
usedBasePath: logicalBasePath
163186
};
@@ -205,7 +228,7 @@ export class ApiClient {
205228
bestCandidate = results.find(result => result !== null);
206229
}
207230

208-
return bestCandidate || `${basePath}/index.html`; // strict default fallback
231+
return bestCandidate;
209232
}
210233

211234
/** Lists relevant metadata files (like raw results or trajectories) for a specific test execution dir. */
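The lookup order introduced in `getResultInfo` (new `guideName/taskName` layout first, then the older depth-2 layout, then a default `index.html` under the new layout) can be expressed as a generic fallback chain. This is a simplified sketch, not the dashboard's code: `resolveEntryPoint`, `FindEntryPoint`, and `ResolvedPaths` are hypothetical names, and the finder is injected so that it returns `null` when nothing is found, as `_findBestEntryPoint` does after this change.

```typescript
// Stand-in for a lookup that returns an entry point path, or null if none exists.
type FindEntryPoint = (basePath: string) => Promise<string | null>;

interface ResolvedPaths {
  basePath: string;
  entryPoint: string;
}

// Tries each candidate base path in order and returns the first hit.
// If none matches, falls back to index.html under the first (preferred) candidate.
async function resolveEntryPoint(
  candidates: string[],
  find: FindEntryPoint,
): Promise<ResolvedPaths> {
  if (candidates.length === 0) {
    throw new Error('at least one candidate base path is required');
  }
  for (const basePath of candidates) {
    const entryPoint = await find(basePath);
    if (entryPoint) {
      return { basePath, entryPoint };
    }
  }
  const basePath = candidates[0];
  return { basePath, entryPoint: `${basePath}/index.html` };
}
```

Returning the matched base path together with the entry point matters because the relative sub-path, and therefore the `base_app` lookup, is computed against whichever layout actually matched.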
