Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 13 additions & 1 deletion bin/gd.ts
Original file line number Diff line number Diff line change
Expand Up @@ -41,7 +41,7 @@ function listGuideDirs(): string[] {
const completion = omelette('gd <command> <arg1> <arg2>');

completion.on('command', ({ reply }) => {
reply(['dev', 'dev-all', 'grade', 'test', 'gen', 'audit', 'eval', 'run', 'dashboard', 'deploy', 'upload', 'baselinestatus', 'setup-completion', 'gen-negative-suite']);
reply(['dev', 'dev-all', 'grade', 'test', 'gen', 'audit', 'eval', 'run', 'dashboard', 'deploy', 'upload', 'baselinestatus', 'setup-completion', 'gen-negative-suite', 'gen-task-suite']);
});

completion.on('arg1', ({ before, reply }) => {
Expand Down Expand Up @@ -81,6 +81,8 @@ const { positionals, values } = parseArgs({
'gen-grader': { type: 'boolean' },
'gen-negative': { type: 'boolean' },
guided: { type: 'boolean' },
'sync-task': { type: 'boolean' },
'no-test': { type: 'boolean' },
verbose: { type: 'boolean' },
usecases: { type: 'boolean' },
},
Expand Down Expand Up @@ -128,6 +130,7 @@ ${cBold('Guide Development:')}
${"Piece-wise options for `dev`:"}
${cDim('--grade')} Run/calibrate grader
${cDim('--test-grader')} Check grader calibration (demo + negative-demo)
${cDim('--sync-task')} Force update task prompt from prompts.md
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can just make the default for gd dev to update existing tasks (just the prompt)? and remove this sync-task option?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we make it the default without a flag, would it be a concern that task descriptions got unintentionally overwritten? or maybe gd dev itself (without any flag) is already intentional enough..

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default, if a dev runs gd dev without the sync-task option, it would still check if there's any outdated tasks. And instead of sync the prompts, it would warn the dev for any outdated tasks

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine to have it overwrite. for now we can keep the prompts consistent between the two locations.

${cDim('--gen-grader')} Generate a new grader script
${cDim('--gen-negative')} Generate negative examples
${cDim('--guided')} Skip calibration, run guided agent test only
Expand All @@ -141,6 +144,7 @@ ${cBold('Evaluation:')}
${cCyan('deploy')} Deploy the dashboard to GitHub Pages
${cCyan('upload')} <suite> Upload generated evaluation suite to GCS
${cCyan('gen-negative-suite')} Generate resources for negative suite
${cCyan('gen-task-suite')} Update regular tasks with latest prompts

${cBold('Other:')}
${cCyan('baselinestatus')} <query> Check browser support and Baseline status
Expand Down Expand Up @@ -188,6 +192,8 @@ ${cBold('Options:')}
const success = await devGuide(dir, {
guidedOnly: !!values.guided,
verbose: !!values.verbose,
syncTask: !!values['sync-task'],
test: !values['no-test'],
});
process.exit(success ? 0 : 1);
}
Expand Down Expand Up @@ -263,6 +269,12 @@ ${cBold('Options:')}
break;
}

case 'gen-task-suite': {
const { generateTaskSuite } = await import('../guides/task-suite-gen.ts');
await generateTaskSuite();
break;
}

default: {
// Legacy fallbacks — guide namespace was flattened
if (command === 'guide') {
Expand Down
12 changes: 11 additions & 1 deletion guides/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -108,7 +108,7 @@ gd dev <path/to/guide_dir>
This runs the following pipeline after the grader calibrates successfully:

1. **Generate `prompts.md`** if missing — uses Gemini CLI to create a set of developer-facing prompts derived from the guide
2. **Find or create a task file** in `harness/tasks/` — scans existing tasks for a matching `grader:` field, or creates `<guideName>-task.md` using the first prompt from `prompts.md` (defaults to `daily-grind` base app)
2. **Find or create a task file** in `harness/tasks/` — scans existing tasks for a matching `grader:` field, or creates `<guideName>-task.md` using the first prompt from `prompts.md` (defaults to `daily-grind` base app). If the task file already exists but its prompt has drifted from `prompts.md`, `gd dev` will warn you. Run with `--sync-task` to force it to synchronize.
3. **Grade the base app as-is** (pre-score) — establishes a baseline before any agent runs
4. **Run the agent** in both `unguided` (no guide access) and `guided` (with MCP guide access) modes against the base app
5. **Grade both outputs** and print a comparison:
Expand All @@ -125,6 +125,16 @@ The agent and base app are selected from the [harness config](../harness/config.

The generated task file is automatically included in future `gd eval suite` runs — the suite discovers all task files in `harness/tasks/` by default.

### Synchronizing All Regular Tasks

If you update multiple `prompts.md` files or want to ensure all regular tasks are in sync with their respective guides, you can run:

```bash
gd gen-task-suite
```

This script scans for "eval-ready" guides, reads their `prompts.md`, and updates the corresponding `<guideName>-task.md` in `harness/tasks/` while preserving any custom `base_app` configuration in the task file.

### Negative Suite

To verify that guides improve agent performance starting from a "bad" implementation, you can run a **Negative Suite**.
Expand Down
74 changes: 61 additions & 13 deletions guides/dev-guide.ts
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,7 @@ export interface DevGuideOptions {
maxRetries?: number; // default: 2
test?: boolean; // default: true — run agent test after calibration
guidedOnly?: boolean; // skip calibration and only run the guided agent test
syncTask?: boolean; // update task with latest prompt from prompts.md
verbose?: boolean;
}

Expand Down Expand Up @@ -298,9 +299,28 @@ export async function devGuide(targetDirRaw: string, options: DevGuideOptions =
}
}

if (!existingTask && fs.existsSync(promptsPath)) {
const taskInfo = createTask(targetDir, currentInv.name);
taskMap.set(currentInv.name, taskInfo);
if (fs.existsSync(promptsPath)) {
const latestPrompt = getLatestPrompt(targetDir, currentInv.name);

if (existingTask) {
if (existingTask.prompt.trim() !== latestPrompt.trim()) {
if (options.syncTask) {
console.log(cYellow(`\nSyncing prompt for ${currentInv.name}-task.md...`));
const taskInfo = createTask(targetDir, currentInv.name);
taskMap.set(currentInv.name, taskInfo);
} else {
console.log(cYellow(`\n\u26a0\ufe0f Task prompt is outdated!`));
console.log(` Current: "${cDim(existingTask.prompt.substring(0, 100))}${existingTask.prompt.length > 100 ? '...' : ''}"`);
console.log(` Latest: "${cDim(latestPrompt.substring(0, 100))}${latestPrompt.length > 100 ? '...' : ''}"`);
console.log(` Run with ${cBold('--sync-task')} to update.`);
}
} else {
console.log(cGreen(`\n\u2705 Task prompt is up-to-date`));
}
} else {
const taskInfo = createTask(targetDir, currentInv.name);
taskMap.set(currentInv.name, taskInfo);
}
}
}

Expand Down Expand Up @@ -399,24 +419,50 @@ Only create the ${PROMPTS_FILE} file. Do not modify any other files.`;
}
}

function createTask(targetDir: string, guideName: string): TaskInfo {
const promptsContent = readFileSafe(path.join(targetDir, PROMPTS_FILE));
const firstLine = promptsContent.split('\n').find(l => l.trim().startsWith('- '));
const prompt = firstLine ? firstLine.replace(/^-\s*/, '').trim() : `Implement the guidance from ${guideName}`;
export function getLatestPrompt(targetDir: string, guideName: string): string {
const promptsPath = path.join(targetDir, PROMPTS_FILE);
if (fs.existsSync(promptsPath)) {
const promptsContent = readFileSafe(promptsPath);
const firstLine = promptsContent.split('\n').find(l => l.trim().startsWith('- '));
if (firstLine) {
return firstLine.replace(/^-\s*/, '').trim();
}
} else {
console.warn(cYellow(` ⚠️ Missing ${PROMPTS_FILE} for ${guideName}, using default prompt.`));
}
return `Implement the guidance from ${guideName}`;
}

export function createTask(targetDir: string, guideName: string): TaskInfo {
const prompt = getLatestPrompt(targetDir, guideName);

const taskName = `${guideName}-task`;
const taskFilePath = path.join(TASKS_DIR, `${taskName}.md`);

// Preserve existing base_app if task file already exists
let baseApp = 'daily-grind';
if (fs.existsSync(taskFilePath)) {
const rawContent = readFileSafe(taskFilePath);
if (rawContent) {
const { data } = matter(rawContent);
if (data?.base_app) {
baseApp = data.base_app;
}
}
}

const taskContent = `---
base_app: daily-grind
base_app: ${baseApp}
grader: ${guideName}
---
${prompt}
`;

fs.mkdirSync(TASKS_DIR, { recursive: true });
fs.writeFileSync(path.join(TASKS_DIR, `${taskName}.md`), taskContent);
console.log(cGreen(`✅ Created task: harness/tasks/${taskName}.md`));
fs.writeFileSync(taskFilePath, taskContent);
console.log(cGreen(`✅ Created/Updated task: harness/tasks/${taskName}.md (base_app: ${baseApp})`));

return { taskName, baseApp: 'daily-grind', prompt };
return { taskName, baseApp, prompt };
}

async function runAgentTest(targetDir: string, guideName: string, taskMap: Map<string, TaskInfo>, guidedOnly = false): Promise<void> {
Expand Down Expand Up @@ -828,11 +874,13 @@ if (import.meta.url.startsWith('file:') && process.argv[1] === fileURLToPath(imp
const isTest = !args.includes('--no-test');

if (!dir) {
console.error('Usage: node --experimental-strip-types guides/dev-guide.ts <path/to/guide> [--no-test]');
console.error('Usage: node --experimental-strip-types guides/dev-guide.ts <path/to/guide> [--no-test] [--sync-task]');
process.exit(1);
}

devGuide(dir, { test: isTest }).then(success => {
const syncTask = args.includes('--sync-task');

devGuide(dir, { test: isTest, syncTask }).then(success => {
process.exit(success ? 0 : 1);
}).catch(err => {
console.error(err);
Expand Down
33 changes: 33 additions & 0 deletions guides/task-suite-gen.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
import path from 'path';
import { fileURLToPath } from 'url';

const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

// @ts-ignore - dev-guide.ts might not have types in this setup but node handles it
import { scanAllGuides, classifyGuide, getTaskMap, createTask } from './dev-guide.ts';


export async function generateTaskSuite() {
console.log('Scanning guides...');
const taskMap = getTaskMap();
const allGuides = scanAllGuides(taskMap);

const evalReadyGuides = allGuides.filter(inv => classifyGuide(inv) === 'eval-ready');

if (evalReadyGuides.length === 0) {
console.log('No eval-ready guides found.');
return;
}

console.log(`Found ${evalReadyGuides.length} eval-ready guides.`);

let updatedCount = 0;

for (const inv of evalReadyGuides) {
createTask(inv.dir, inv.name);
updatedCount++;
}

console.log(`\nTask suite generation complete! Updated/Synced ${updatedCount} tasks.`);
}
2 changes: 1 addition & 1 deletion harness/tasks/animate-scrollbar-color-on-scroll-task.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,4 +2,4 @@
base_app: daily-grind
grader: animate-scrollbar-color-on-scroll
---
hey can u make the scrollbar color change as i scroll down the page? like it should start as one color and shift to another as you get to the bottom of the coffee site. make it look smooth.
hey can u make the scrollbar color change as i scroll down the page? like it should start as one color and shift to another as you get to the bottom. make it look smooth.
2 changes: 1 addition & 1 deletion harness/tasks/deprioritize-background-fetches-task.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,4 +2,4 @@
base_app: empty-app
grader: deprioritize-background-fetches
---
Create an extremely minimal web page with a single button that triggers two concurrent fetch requests: one request to '/api/data' for mission-critical data that must be loaded as quickly as possible, and another to '/api/analytics' that POSTs a `{click: 1}` payload. Write the page to index.html.
Create an extremely minimal web page with a single button that triggers two concurrent fetch requests: one request to '/api/data' for mission-critical data that must be loaded as quickly as possible, and another to '/api/analytics' that POSTs a `{click: 1}` payload.
2 changes: 1 addition & 1 deletion harness/tasks/improve-next-page-load-performance-task.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,4 +2,4 @@
base_app: daily-grind
grader: improve-next-page-load-performance
---
Improve the speed of this website
Improve the speed of my website
4 changes: 1 addition & 3 deletions harness/tasks/optimize-image-priority-task.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,4 @@
base_app: empty-app
grader: optimize-image-priority
---
Create an extremely minimal product landing page optimized for a main hero image 'hero-lcp.jpg' which is the largest contentful element. The page also contains a product image gallery where the first image is visible but the second image 'gallery-alt.jpg' is currently hidden behind a toggle. There is also a secondary 'mega-menu-promo.jpg' image that is part of a navigation menu and initially hidden. Finally, include a 'footer-logo.png' much further down the page below the fold.

MANDATORY: Write the page to index.html and ensure that all image sources exactly match the filenames provided. Do not bother downloading stock images, just use the filenames as the src attributes, it's ok if they don't exist.
Create an extremely minimal product landing page that features a main hero image 'hero-lcp.jpg' which is the largest contentful element. The page also contains a product image gallery where the first image is visible but the second image 'gallery-alt.jpg' is currently hidden behind a toggle. There is also a secondary 'mega-menu-promo.jpg' image that is part of a navigation menu and initially hidden. Finally, include a 'footer-logo.png' much further down the page below the fold.
2 changes: 1 addition & 1 deletion harness/tasks/optimize-preload-priority-task.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,4 +2,4 @@
base_app: empty-app
grader: optimize-preload-priority
---
Create an extremely minimal video landing page that optimizes for LCP. This includes a video poster image 'poster.jpg' (the LCP element), a custom web font 'brand-font.woff2' that is critical for the header rendering, and a secondary font 'secondary-font.woff2' for less critical UI elements. Write the page to index.html.
Create an extremely minimal video landing page that optimizes for LCP. This includes a video poster image 'poster.jpg' (the LCP element), a custom web font 'brand-font.woff2' that is critical for the header rendering, and a secondary font 'secondary-font.woff2' for less critical UI elements.
2 changes: 1 addition & 1 deletion harness/tasks/optimize-script-priority-task.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,4 +2,4 @@
base_app: empty-app
grader: optimize-script-priority
---
Create an extremely minimal web page for a dashboard. It requires a critical interactivity script at '/js/app.js' that should be loaded asynchronously. It also includes an older '/js/legacy-widgets.js' script that is normally parser-blocking. Finally, include an analytics script '/js/tracker.js'. Write the page to index.html.
Create an extremely minimal web page for a dashboard. It requires a critical interactivity script at '/js/app.js' that should be loaded asynchronously. It also includes an older '/js/legacy-widgets.js' script that is normally parser-blocking. Finally, include an analytics script '/js/tracker.js'.