GoogleChrome
diff --git a/.agents/skills/project-evals/SKILL.md b/.agents/skills/project-evals/SKILL.md
Lines changed: 11 additions & 14 deletions
diff --git a/.agents/skills/project-guides/SKILL.md b/.agents/skills/project-guides/SKILL.md
Lines changed: 1 addition & 1 deletion
diff --git a/CONTEXT.md b/CONTEXT.md
Lines changed: 14 additions & 16 deletions
diff --git a/README.md b/README.md
Lines changed: 1 addition & 1 deletion
diff --git a/bin/gd.ts b/bin/gd.ts
Lines changed: 4 additions & 9 deletions
diff --git a/config.ts.example b/config.ts.example
Lines changed: 2 additions & 3 deletions
diff --git a/eval-view/api.js b/eval-view/api.js
Lines changed: 29 additions & 6 deletions
@@ -15,23 +15,22 @@ This is the third of three stages in creating guidance:
 
 **Real-world coding agents see only `guide.md`** — retrieved automatically via the RAG skills system when a developer asks for help. Every other file in a use case directory is eval infrastructure.
 
-**The eval harness** runs a separate coding agent in a controlled environment to test whether the guidance works. This eval agent receives a prompt from `prompts.md` and has access to `guide.md` via the same RAG system. The harness then runs `grader.ts` against the eval agent's output.
+**The eval harness** runs a separate coding agent in a controlled environment to test whether the guidance works. This eval agent receives the first prompt from `tasks/task.md` and has access to `guide.md` via the same RAG system. The harness then runs `grader.ts` against the eval agent's output.
 
 None of the following are ever seen by real-world coding agents:
 
 | File | Role in eval pipeline |
 |---|---|
-| `prompts.md` | Simulated developer prompts fed to the eval agent by the harness |
+| `tasks/task.md` | Simulated developer prompts and base application name fed to the eval agent by the harness |
 | `demo.html` | Reference implementation — grader runs against it to confirm tests pass on correct code |
 | `negative-demo.html` | Anti-example — grader runs against it to confirm tests fail on incorrect code |
 | `expectations.md` | Spec used to generate `grader.ts` |
 | `grader.ts` | Playwright tests run against the eval agent's output |
 
 ## How the eval files work together
 
-`prompts.md`, `expectations.md`, and `grader.ts` form a tightly coupled pipeline:
-
-1. **`prompts.md`** — Simulated developer prompts used only by the eval harness. Each prompt should sound like a real developer request, without naming specific APIs or best practices — the eval agent is expected to discover those by reading `guide.md` via RAG. The first prompt is the most important: it drives negative demo generation and is used as the default task.
+`tasks/task.md`, `expectations.md`, and `grader.ts` form a tightly coupled pipeline:
+1. **`tasks/task.md`** — Simulated developer prompts used only by the eval harness. It must start with a YAML frontmatter specifying the `base_app`, followed by a list of prompts. Each prompt should sound like a real developer request, without naming specific APIs or best practices — the eval agent is expected to discover those by reading `guide.md` via RAG. The first prompt is the most important: it is used as the default task.
 
 2. **`expectations.md`** — The ground truth for what a correct implementation looks like. Each bullet becomes exactly one test in `grader.ts`. Write expectations assuming the eval agent read `guide.md` and implemented it faithfully; they describe the observable output, not the implementation approach.
 
@@ -67,22 +66,21 @@ This command will automatically:
 1. Generate a `negative-demo.html` based on the guidance.
 2. Generate a `grader.ts` Playwright test that asserts your `expectations.md` against both `demo.html` (should pass) and `negative-demo.html` (should fail).
 3. Test and calibrate the grader by running the test suite.
-4. Generate realistic developer `prompts.md` and a task file to run an agent end-to-end.
-
-The human can manually write or revise the grader or prompts if they wish.
-
-## Writing `prompts.md`
 
-`prompts.md` contains realistic developer prompts used to run AI agents end-to-end against the guide's grader. Each line is a separate prompt the harness can send to an agent.
+## Writing `tasks/task.md`
 
-**Format:** Each prompt must be on its own line, prefixed with `- `. No other structure is required.
+`tasks/task.md` contains realistic developer prompts used to run AI agents end-to-end against the guide's grader, prefixed by a YAML frontmatter specifying the base application.
 
+**Format:**
 ```md
+---
+base_app: daily-grind
+---
 - make my images load faster on the page
 - Optimize the priority of my LCP image 'hero.jpg' and deprioritize the gallery images below the fold.
 ```
 
-**Critical:** The **first prompt** is the most important — it is what the harness uses for negative demo generation and as the task body in `harness/tasks/<guide-name>-task.md`. The task file body **must always match the first prompt exactly**. If you update the first prompt, update the task file body to match. It must be specific enough to produce a grader-testable result.
+**Critical:** The **first prompt** is the most important. It is used as the default task for the harness, and it must be specific enough to produce a grader-testable result.
 
 **Rules:**
 - DO write prompts as a developer talking to an AI coding assistant — casual, lowercase, sometimes vague.
@@ -105,4 +103,3 @@ The human can manually write or revise the grader or prompts if they wish.
 If `gd dev` fails to calibrate the grader:
 * Read the command output to see which assertions failed.
 * If the grader logic generated by the pipeline is wrong, you may need to tweak the language in `expectations.md` so the generated grader is more accurate, or simply run `gd dev` again (it attempts to fix itself using failure context).
-
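The `tasks/task.md` format described in this file (YAML frontmatter with `base_app`, then one prompt per `- ` line, with the first prompt as the default task) can be sketched as a small parser. This is a minimal illustration only: `parseTaskFile` and the `TaskFile` shape are hypothetical names, not the harness's actual loader, and the frontmatter is read with a regular expression rather than a real YAML parser.

```typescript
// Hypothetical shape for a parsed task file; not taken from the harness.
interface TaskFile {
  baseApp: string;
  prompts: string[];
}

// Assumption: frontmatter is delimited by `---` lines and only `base_app` matters here.
function parseTaskFile(source: string): TaskFile {
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?([\s\S]*)$/.exec(source);
  if (!match) {
    throw new Error('task.md must start with YAML frontmatter');
  }
  const frontmatter = match[1] ?? '';
  const body = match[2] ?? '';

  const baseAppLine = /^base_app:\s*(.+)$/m.exec(frontmatter);
  if (!baseAppLine) {
    throw new Error('frontmatter must specify base_app');
  }

  // Each prompt sits on its own line, prefixed with "- ".
  const prompts = body
    .split(/\r?\n/)
    .filter((line) => line.startsWith('- '))
    .map((line) => line.slice(2).trim());
  if (prompts.length === 0) {
    throw new Error('task.md must list at least one prompt');
  }

  return { baseApp: (baseAppLine[1] ?? '').trim(), prompts };
}

// The first prompt is the one used as the default task.
function defaultPrompt(task: TaskFile): string {
  return task.prompts[0] ?? '';
}
```

Rejecting a file with no frontmatter or no prompts up front mirrors the rule that the first prompt must always exist and be usable as the default task.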
@@ -22,7 +22,7 @@ When a developer asks an AI coding assistant to implement something, the assista
 | `negative-demo.html` | Incorrect implementation used to verify the grader catches failures | ❌ No |
 | `expectations.md` | Source used to generate `grader.ts` | ❌ No |
 | `grader.ts` | Playwright tests run against the eval agent's output | ❌ No |
-| `prompts.md` | Simulated developer prompts used only by the eval harness | ❌ No |
+| `tasks/task.md` | Simulated developer prompts and base application name fed to the eval agent by the harness | ❌ No |
 
 **Implication for `demo.html`:** Because real agents never see `demo.html`, it does not need to be a polished, production-ready example. It just needs to be a correct, minimal implementation that the grader can pass against. Do not over-engineer it.
 
 
@@ -38,7 +38,6 @@ guidance/
     config.ts                 # Central configuration (agent selection, MCP servers, etc.)
     run_suite.ts              # Suite runner (discovers tasks, runs agents, grades output)
     evaluate.ts               # Evaluation and reporting
-    tasks/                    # Task files that define eval scenarios
     base_apps/                # Base applications that agents modify (e.g. daily-grind)
     agents/                   # Agent runner scripts (gemini_cli, claude_code, jetski)
     lib/                      # Shared utilities (isolation, file helpers)
@@ -65,19 +64,19 @@ Each guide lives in its own directory (e.g. `guides/performance/batch-analytics-
 | `expectations.md` | SME (human) | Natural-language bulleted list of assertions that must be true if the guidance is followed correctly. Used as input for grader generation. |
 | `negative-demo.html` | Generated (Gemini CLI) | A deliberately incorrect implementation. Must score 0% against the grader. Used for grader calibration. |
 | `grader.ts` | Generated (Gemini CLI) | A Playwright test file that grades any HTML file against the expectations. May include both browser automation checks and static content checks. |
-| `prompts.md` | Generated (Gemini CLI) | Realistic developer prompts (1-2) that an AI coding assistant might receive. Used for agent testing. |
+| `task.md` | Generated (Gemini CLI) | Simulated developer prompts and `base_app` fed to the eval agent by the harness. |
 
-Additionally, each guide that is ready for evaluation has a **task file** in `harness/tasks/`:
+The **task file** looks like:
 
 ```yaml
 ---
 base_app: daily-grind
-grader: batch-analytics-events
 ---
-Implement Core Web Vitals monitoring on a web page...
+- Implement Core Web Vitals monitoring on a web page...
+- Alternative prompt...
 ```
 
-The task file connects a grader (by guide directory name), a base application the agent will modify, and the prompt the agent receives.
+The task file connects the base application the agent will modify with the prompt the agent receives (the first prompt in the list). The grader is implicit: it is the one in the same guide directory.
 
 ### Guide maturity stages
 
@@ -87,8 +86,8 @@ A guide progresses through these stages:
 2. **Incomplete**: Has `guide.md` content but is missing `demo.html` and/or `expectations.md`.
 3. **Needs expectations**: Has guide + demo but no `expectations.md` (or it's empty). Cannot proceed to automated generation without this.
 4. **Needs calibration**: Has all three human-authored files. Ready for `gd dev` to generate `negative-demo.html`, `grader.ts`, and calibrate.
-5. **Needs test**: Grader is calibrated but missing `prompts.md` or a task file. Agent tests haven't been run.
-6. **Eval-ready**: All artifacts exist. The guide is included in `gd eval suite` runs.
+5. **Needs test**: Grader is calibrated but missing `task.md`. Agent tests haven't been run.
+6. **Eval-ready**: All artifacts exist. The guide is included in `gd eval` runs.
 
 ---
 
@@ -122,7 +121,7 @@ pnpm link --global && gd setup-completion
 
 | Command | What it does |
 |---|---|
-| `gd eval` | Run the full evaluation suite (discovers all tasks in `harness/tasks/`) |
+| `gd eval` | Run the full evaluation suite (discovers all tasks in guide folders) |
 | `gd eval [task1] [task2]` | Run specific tasks only |
 | `gd eval --config <custom_config>` | Run with config overrides (`--config my_custom_config.ts`, defaults to `config.ts`, or falls back to defaults in `harness/config.ts`) |
 | `gd dashboard` | Start the eval results dashboard (eval-view) |
@@ -151,8 +150,7 @@ Runs the grader against both `demo.html` (should pass 100%) and `negative-demo.h
 
 ### Step 5: Agent test (runs by default)
 After successful calibration:
-1. Generates `prompts.md` if missing (via Gemini CLI, using the base app as context)
-2. Finds or creates a task file in `harness/tasks/`
+1. Generates `task.md` if missing (via Gemini CLI, using the base app as context)
 3. Grades the base app as-is (pre-score baseline)
 4. Runs the configured agent in both **unguided** (no MCP guide access) and **guided** (with MCP guide access) modes
 5. Grades both outputs and prints a comparison showing guide impact
@@ -171,12 +169,12 @@ The eval harness measures whether guides actually improve agent output.
 
 ### How a suite run works (`gd eval`)
 
-1. **Build MCP index**: Compiles all guides into the MCP server's searchable index.
-2. **Discover tasks**: Scans `harness/tasks/*.md` for task definitions (or uses explicitly configured tasks).
+1. **Build guide index**: Compiles all guides into a searchable index (RAG).
+2. **Discover tasks**: Scans guide directories for `task.md` definitions (or uses explicitly configured tasks).
 3. **For each task, for each run** (configurable `numRuns`, default 2):
    - Set up an isolated working directory with the base app
-   - Run the agent in **unguided mode** (no MCP servers)
-   - Run the agent in **guided mode** (with configured MCP servers)
+   - Run the agent in **unguided mode** (no guidance)
+   - Run the agent in **guided mode** (with configured guidance)
    - Grade both outputs using the task's grader
 4. **Generate reports**: JSON results + HTML report in the output directory.
 5. **Upload** (optional): `pnpm upload <suite-name>` pushes results to GCS for the dashboard.
@@ -365,7 +363,7 @@ MCP_API_KEY=...
 
 Suite configuration in `harness/config.ts`:
 - `numRuns`: Number of agent runs per task (default: 2)
-- `tasks`: Empty array = discover all tasks in `harness/tasks/`. Set explicitly to run a subset.
+- `tasks`: Empty array = discover all tasks by scanning guide folders. Set explicitly to run a subset.
 - `mcpServersToEnable`: Which MCP servers agents can access (`['modern-web']`, `['google-developer-knowledge']`, or both)
 - `serving`: The approach used to serve guidance (`skills_cli`, `skills`, or `mcp`)
 - `agent`: Which agent to use (`Agents.GEMINI_CLI`, `Agents.CLAUDE_CODE`, `Agents.JETSKI`)
 
@@ -70,7 +70,7 @@ gd dev [dir] [options]        # auto-generate/calibrate
 
 # Evaluation
 gd eval                       # run the full evaluation suite
-gd eval [task1] [task2]       # run specific tasks
+gd eval [task1] [task2]       # run specific tasks (task names are guide names, e.g. `batch-analytics-events`)
 gd eval --config <custom_config>       # run with config overrides (defaults to config.ts, or harness/config.ts)
 gd dashboard                  # start the evaluation dashboard
 
 
@@ -8,7 +8,8 @@ import omelette from 'omelette';
 import { pathToFileURL } from 'url';
 import { cRed, cCyan, cBold, cDim } from '../lib/colors.ts';
 import { Serving, mergeSuiteConfig, type SuiteConfig } from '../harness/config.ts';
-import { rootDir, guidesDir, tasksDir, baseAppsDir, evalViewDir } from '../lib/paths.ts';
+import { rootDir, guidesDir, baseAppsDir, evalViewDir } from '../lib/paths.ts';
+import { getTaskMap } from '../lib/guide-validation.ts';
 
 // Load environment variables (Node 20.12+)
 try {
@@ -38,12 +39,12 @@ function listGuideDirs(): string[] {
 const completion = omelette('gd <command> <arg1> <arg2>');
 
 completion.on('command', ({ reply }) => {
-  reply(['dev', 'dev-all', 'grade', 'test', 'gen', 'audit', 'eval', 'run', 'dashboard', 'deploy', 'upload', 'baselinestatus', 'setup-completion', 'gen-negative-suite']);
+  reply(['dev', 'dev-all', 'grade', 'test', 'gen', 'audit', 'eval', 'run', 'dashboard', 'deploy', 'upload', 'baselinestatus', 'setup-completion']);
 });
 
 completion.on('arg1', ({ before, reply }) => {
   if (before === 'eval') {
-    const tasks = fs.existsSync(tasksDir) ? fs.readdirSync(tasksDir).filter(f => f.endsWith('.md')).map(f => f.replace('.md', '')) : [];
+    const tasks = Array.from(getTaskMap().keys());
     reply(['suite', ...tasks]);
   } else if (before === 'gen') {
     reply(['grader', 'negative']);
@@ -154,7 +155,6 @@ ${cBold('Evaluation:')}
   ${cCyan('run')} <tmpl> <prompt>    Run an ad-hoc agent test against a template
   ${cCyan('deploy')}                 Deploy the dashboard to GitHub Pages
   ${cCyan('upload')} <suite>         Upload generated evaluation suite to GCS
-  ${cCyan('gen-negative-suite')}     Generate resources for negative suite
 
 ${cBold('Other:')}
   ${cCyan('baselinestatus')} <query>      Check browser support and Baseline status
@@ -279,11 +279,6 @@ ${cBold('Options:')}
       process.exit(code);
     }
 
-    case 'gen-negative-suite': {
-      const { generateNegativeSuite } = await import('../guides/negative-suite-gen.ts');
-      await generateNegativeSuite();
-      break;
-    }
 
     default: {
       // Legacy fallbacks — guide namespace was flattened
 
@@ -1,8 +1,7 @@
-import { Agents, Serving } from './harness/config.ts';
-import type { SuiteConfig } from './harness/config.ts';
+import { Agents, Serving, type SuiteConfig } from './harness/config.ts';
 
 const customConfig: Partial<SuiteConfig> = {
-  agent: Agents.CLAUDE_CODE,
+  agent: Agents.GEMINI_CLI,
   numRuns: 1,
   negative: false,
   serving: Serving.SKILLS_CLI
 
@@ -147,17 +147,40 @@ export class ApiClient {
 
     /** Resolves the correct base path for specific run details, parsing legacy logic. */
     async getResultInfo(testId, run, testName) {
-        const [appName, _, runType] = testName.split(' - ');
-        const actualBaseApp = run.baseApp || appName;
+        const [taskName, guideName, runType] = testName.split(' - ');
+        const actualBaseApp = run.baseApp;
+        let logicalBasePath = `${testId}/${run.runNumber}/${guideName}/${taskName}/${runType}`;
+        let entryPointPath = await this._findBestEntryPoint(logicalBasePath);
 
-        const logicalBasePath = `${testId}/${run.runNumber}/${appName}/${runType}`;
-        const entryPointPath = await this._findBestEntryPoint(logicalBasePath);
+        // Fallback for older results stored in a depth-2 folder structure (runDir/taskName/runType)
+        if (!entryPointPath) {
+            const legacyPath = `${testId}/${run.runNumber}/${taskName}/${runType}`;
+            const legacyEntryPoint = await this._findBestEntryPoint(legacyPath);
+            if (legacyEntryPoint) {
+                logicalBasePath = legacyPath;
+                entryPointPath = legacyEntryPoint;
+            } else {
+                entryPointPath = `${logicalBasePath}/index.html`; // default fallback
+            }
+        }
 
         // Calculate relative sub-path to build the setup apps correlation
         const relativePath = entryPointPath.replace(logicalBasePath + '/', '');
 
+        // Try the run-local base_app first (at the task level, not inside guided/unguided); fall back to centralized base_apps for older runs
+        const localBaseAppPath = `${testId}/${run.runNumber}/${guideName}/${taskName}/base_app/${relativePath}`;
+        let exists = false;
+
+        if (this.source === 'remote') {
+            exists = await this._checkRemoteFileExists(localBaseAppPath);
+        } else {
+            exists = await this._checkLocalFileExists(localBaseAppPath);
+        }
+
+        const setupPath = exists ? localBaseAppPath : `base_apps/${actualBaseApp}/${relativePath}`;
+
         return {
-            setupPath: `base_apps/${actualBaseApp}/${relativePath}`,
+            setupPath,
             resultPath: entryPointPath,
             usedBasePath: logicalBasePath
         };
@@ -205,7 +228,7 @@ export class ApiClient {
             bestCandidate = results.find(result => result !== null);
         }
 
-        return bestCandidate || `${basePath}/index.html`; // strict default fallback
+        return bestCandidate;
     }
 
     /** Lists relevant metadata files (like raw results or trajectories) for a specific test execution dir. */
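The lookup order introduced in `getResultInfo` (new `guideName/taskName/runType` layout first, then the older depth-2 layout, then a default `index.html`) can be sketched in isolation as follows. This is an illustration under assumptions: `resolveEntryPoint` and `FindEntryPoint` are hypothetical names, and the lookup is synchronous for brevity, whereas the method in the diff awaits `_findBestEntryPoint`.

```typescript
// Hypothetical stand-in for _findBestEntryPoint: returns undefined when nothing is found.
type FindEntryPoint = (basePath: string) => string | undefined;

interface ResolvedPaths {
  basePath: string;
  entryPoint: string;
}

function resolveEntryPoint(
  testId: string,
  runNumber: number,
  testName: string,
  findEntryPoint: FindEntryPoint,
): ResolvedPaths {
  // Test names are assumed to look like "<task> - <guide> - <runType>", as in the diff.
  const [taskName, guideName, runType] = testName.split(' - ');
  const currentBase = `${testId}/${runNumber}/${guideName}/${taskName}/${runType}`;
  const legacyBase = `${testId}/${runNumber}/${taskName}/${runType}`;

  // Prefer the current layout, then the legacy depth-2 layout.
  for (const basePath of [currentBase, legacyBase]) {
    const entryPoint = findEntryPoint(basePath);
    if (entryPoint) {
      return { basePath, entryPoint };
    }
  }

  // Neither layout has an entry point: default to index.html in the current layout.
  return { basePath: currentBase, entryPoint: `${currentBase}/index.html` };
}
```

Because `_findBestEntryPoint` no longer returns a default path itself, the caller can tell "not found" apart from "found `index.html`", which is what makes the legacy fallback possible.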