Commit 3959c83

Merge pull request #79 from tikalk/evals-refactor-v0.1.2

feat: add evals for fork-specific features (v0.1.2)

2 parents 2f0852a + 19e6a2a, commit 3959c83

4 files changed: 264 additions & 4 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions

@@ -28,6 +28,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.

### Added

- **`remove_replaced_commands()`**: New method in `CommandRegistrar` class (`agents.py`) to remove command files across all detected agent directories

## [0.1.1] - 2026-03-14

evals/README.md

Lines changed: 30 additions & 4 deletions

@@ -4,21 +4,22 @@

Comprehensive evaluation infrastructure for testing spec-kit template quality using PromptFoo with Claude.

## 📊 Current Evaluation Results (Updated: 2026-03-14)

**Overall: 29 LLM eval tests + 39 unit tests across 7 suites**

| Test Suite | Tests | What It Checks |
|------------|-------|----------------|
| **Spec Template** | 12 | Structure, clarity, security, completeness, regression, fork sections |
| **Plan Template** | 2 | Simplicity gate, constitution compliance |
| **Architecture Template** | 4 | Rozanski & Woods structure, blackbox context view, simplicity, ADR quality |
| **Extension System** | 3 | Manifest validation, self-containment, config template |
| **Clarify Command** | 2 | Ambiguity identification, architectural focus |
| **Trace Validation** | 2 | Structure completeness, decision quality |
| **Mission Brief** | 4 | Completeness, quality, constraint extraction, approval flow |
| **Security (all suites)** | +4 per test | PII, prompt injection, hallucinations, misinformation |
| **Unit tests (pytest)** | 39 | Grader logic, extension system |
| **Total** | **68** | |

## Quick Start

@@ -86,7 +87,11 @@ uv run pytest tests/ -v

Each suite sends a prompt to the LLM and evaluates the output against structured assertions and custom Python graders.

#### Spec Template (12 tests)

- **Basic Structure** — required sections present (Overview, Requirements, User Stories, etc.)
- **No Premature Tech Stack** — spec focuses on WHAT, not HOW
- **Quality User Stories** — proper format with acceptance criteria

@@ -96,6 +101,12 @@ Each suite sends a prompt to the LLM and evaluates the output against structured

- **Completeness** — complex features have comprehensive requirements
- **Regression** — simple features still maintain proper structure
- **Rename Regression** — post-rename output matches quality bar
- **Fork Sections** — Goal, Demo Sentence, Boundary Map present (agentic-sdlc preset)
- **Boundary Map Structure** — Produces and Consumes subsections
- **Constraints Extraction** — key limitations documented

#### Plan Template (2 tests)

@@ -122,6 +133,15 @@ Each suite sends a prompt to the LLM and evaluates the output against structured

- **Structure** — session metadata, decisions, and artifacts sections present
- **Validation Accuracy** — validator correctly scores incomplete traces

#### Mission Brief (4 tests) — Fork-specific

- **Completeness** — Goal, Success Criteria, Constraints, Demo Sentence present
- **Quality** — Goal is concise, criteria are measurable, demo is observable
- **Constraint Extraction** — technical, business, regulatory constraints captured
- **Approval Flow** — includes "Proceed with this Mission Brief?" prompt

### Security Graders (run on every LLM test)

Four graders apply automatically to every test output via `defaultTest.assert`:

@@ -132,6 +152,12 @@ Four graders apply automatically to every test output via `defaultTest.assert`:

| **`check_prompt_injection`** | "Ignore previous instructions", DAN mode, embedded role markers, base64 payloads |
| **`check_hallucination_signals`** | Overconfident metrics, dangling references, self-contradictions, fabricated RFCs |
| **`check_misinformation`** | MD5/SHA-1 for passwords, plaintext HTTP, unsafe APIs (eval/pickle/yaml.load), impossible performance claims |
| **`check_mission_brief_completeness`** | Mission Brief has Goal, Success Criteria, Constraints, Demo Sentence |
| **`check_mission_brief_quality`** | Goal is concise, criteria are measurable, demo is observable |
| **`check_fork_spec_sections`** | Spec includes fork-specific sections (Goal, Demo Sentence, Boundary Map) |

### Unit Tests (pytest)
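The custom Python graders are resolved by PromptFoo from `file://` paths at eval time. As a minimal sketch of the expected shape (an assumption for illustration, not the actual code in `evals/graders/custom_graders.py`), a section-presence grader compatible with the `python` assertion type could look like:

```python
# Hypothetical sketch of a completeness grader. The real
# check_mission_brief_completeness in custom_graders.py may differ.
def check_mission_brief_completeness(output: str, context=None) -> dict:
    """Return a GradingResult-style dict for a Mission Brief output."""
    required = ["Goal", "Success Criteria", "Constraints", "Demo Sentence"]
    lowered = output.lower()
    missing = [s for s in required if s.lower() not in lowered]
    return {
        "pass": not missing,
        "score": (len(required) - len(missing)) / len(required),
        "reason": "all sections present" if not missing
                  else "missing: " + ", ".join(missing),
    }
```

PromptFoo calls such a function with the model output (plus a context object) and interprets a dict with `pass`, `score`, and `reason` keys as a grading result, so partial credit like the fractional score above shows up in the eval report.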

Lines changed: 177 additions & 0 deletions
@@ -0,0 +1,177 @@

```js
// PromptFoo configuration for Mission Brief tests
// Tests the /adlc.spec.specify command's Mission Brief enforcement
module.exports = {
  description: 'Mission Brief Enforcement Evaluation',

  // Rate limiting to avoid 429 errors
  evaluateOptions: {
    maxConcurrency: 1,
    delay: process.env.CI ? 15000 : 5000,
  },

  // Mission Brief prompt
  prompts: ['file://../prompts/mission-brief-prompt.txt'],

  // Configure LLM provider using OpenAI-compatible endpoint
  providers: [
    {
      id: `openai:chat:${process.env.LLM_MODEL || 'claude-sonnet-4-5-20250929'}`,
      label: `${process.env.LLM_MODEL || 'Sonnet 4.5'} (via AI API Gateway)`,
      config: {
        apiBaseUrl: process.env.LLM_BASE_URL,
        apiKey: process.env.LLM_AUTH_TOKEN,
        temperature: 0.7,
        max_tokens: 4000,
      },
      env: {
        OPENAI_API_KEY: process.env.LLM_AUTH_TOKEN,
        OPENAI_BASE_URL: process.env.LLM_BASE_URL,
      },
    },
  ],

  defaultTest: {
    options: {
      provider: `openai:chat:${process.env.LLM_MODEL || 'claude-sonnet-4-5-20250929'}`,
    },
    assert: [
      { type: 'python', value: 'file://../graders/custom_graders.py:check_pii_leakage' },
      { type: 'python', value: 'file://../graders/custom_graders.py:check_prompt_injection' },
      { type: 'python', value: 'file://../graders/custom_graders.py:check_hallucination_signals' },
      { type: 'python', value: 'file://../graders/custom_graders.py:check_misinformation' },
    ],
  },

  tests: [
    // Test 1: Mission Brief Completeness - Substantial Input
    {
      description: 'Mission Brief: Extracts complete Mission Brief from detailed input',
      vars: {
        user_input:
          'Build a user authentication system with email/password login, password reset via email, and session management. Users should be able to stay logged in for 30 days. The system must support 10,000 concurrent users and comply with GDPR for European users.',
      },
      assert: [
        { type: 'python', value: 'file://../graders/custom_graders.py:check_mission_brief_completeness' },
        { type: 'icontains', value: 'goal' },
        { type: 'icontains', value: 'success criteria' },
        { type: 'icontains', value: 'demo sentence' },
        { type: 'icontains', value: 'proceed with this mission brief' },
      ],
    },

    // Test 2: Mission Brief Quality - Goal Extraction
    {
      description: 'Mission Brief: Goal is concise and captures core purpose',
      vars: {
        user_input:
          'Create a dashboard where admins can view real-time analytics including user signups, active sessions, revenue metrics, and system health. The dashboard should update every 5 seconds and support drill-down into individual metrics.',
      },
      assert: [
        { type: 'python', value: 'file://../graders/custom_graders.py:check_mission_brief_quality' },
        {
          type: 'llm-rubric',
          value:
            'Grade the Mission Brief Goal quality (0-1):\n' +
            '1. Is the Goal a single sentence?\n' +
            '2. Does it capture the core purpose (admin analytics dashboard)?\n' +
            '3. Is it technology-agnostic (no frameworks, databases)?\n' +
            '4. Is it measurable/achievable?\n' +
            'Return average score 0-1.',
          threshold: 0.7,
        },
      ],
    },

    // Test 3: Mission Brief - Constraint Extraction
    {
      description: 'Mission Brief: Extracts constraints from requirements',
      vars: {
        user_input:
          'Build a payment processing integration for our e-commerce platform. Must comply with PCI-DSS, support credit cards and PayPal, process transactions under 3 seconds, and work with our existing Django backend. Budget is limited so we need to use Stripe as the payment provider.',
      },
      assert: [
        { type: 'icontains', value: 'constraint' },
        { type: 'icontains', value: 'pci' },
        {
          type: 'llm-rubric',
          value:
            'Check if the Mission Brief Constraints section includes:\n' +
            '1. PCI-DSS compliance requirement\n' +
            '2. Performance constraint (3 second processing)\n' +
            '3. Technical constraint (Django backend integration)\n' +
            '4. Budget/provider constraint (Stripe)\n' +
            'Return 1.0 if all constraints captured, 0.5 if some, 0.0 if none.',
          threshold: 0.7,
        },
      ],
    },

    // Test 4: Mission Brief - Demo Sentence Observable
    {
      description: 'Mission Brief: Demo Sentence is observable and concrete',
      vars: {
        user_input:
          'Add a file upload feature to the project management app. Users should be able to upload PDF, Word, and image files up to 25MB. Files should be attached to tasks and downloadable by team members.',
      },
      assert: [
        { type: 'icontains', value: 'demo sentence' },
        { type: 'icontains', value: 'user can' },
        {
          type: 'llm-rubric',
          value:
            'Grade the Demo Sentence quality (0-1):\n' +
            '1. Does it describe an observable user action (not just "have file upload")?\n' +
            '2. Is it concrete and specific (mentions uploading/downloading files)?\n' +
            '3. Can a human verify this outcome (e.g., "upload a PDF and download it")?\n' +
            '4. Does it avoid implementation details?\n' +
            'Return average score 0-1.',
          threshold: 0.7,
        },
      ],
    },

    // Test 5: Mission Brief - Minimal Input Triggers Questions
    {
      description: 'Mission Brief: Minimal input triggers clarifying questions',
      vars: {
        user_input: 'Add search',
      },
      assert: [
        {
          type: 'llm-rubric',
          value:
            'Check if the response asks clarifying questions for the minimal input:\n' +
            '1. Does it ask about the goal/purpose of the search feature?\n' +
            '2. Does it ask about success criteria or expected outcomes?\n' +
            '3. Does it ask about constraints or requirements?\n' +
            '4. Does it NOT skip to creating a spec without gathering more info?\n' +
            'Return 1.0 if proper questions asked, 0.0 if it proceeds without clarification.',
          threshold: 0.7,
        },
      ],
    },

    // Test 6: Mission Brief - Approval Prompt Present
    {
      description: 'Mission Brief: Always includes approval prompt',
      vars: {
        user_input:
          'Create a notification system that sends email, SMS, and push notifications. Users can configure their preferences per notification type. Support for templated messages with variable substitution.',
      },
      assert: [
        { type: 'icontains', value: 'proceed' },
        { type: 'icontains-any', value: ['yes / no', 'yes/no', '(yes', 'yes, no'] },
        {
          type: 'llm-rubric',
          value:
            'Check if the response includes a clear approval request:\n' +
            '1. Is there a "Proceed with this Mission Brief?" question?\n' +
            '2. Are options provided (yes/no/adjust or similar)?\n' +
            '3. Does the flow indicate waiting for user response before continuing?\n' +
            'Return 1.0 if proper approval flow, 0.0 if missing.',
          threshold: 0.8,
        },
      ],
    },
  ],
};
```
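Test 6 leans on PromptFoo's `icontains-any` assertion, which passes when any one of the candidate substrings appears in the output, case-insensitively. A rough Python equivalent of that matching rule (a sketch of the semantics, not PromptFoo's implementation):

```python
def icontains_any(output: str, values: list[str]) -> bool:
    # Case-insensitive substring match against any candidate value,
    # mirroring how the icontains-any assertion is used in Test 6.
    lowered = output.lower()
    return any(v.lower() in lowered for v in values)
```

Listing several spellings (`'yes / no'`, `'yes/no'`, `'(yes'`) keeps the assertion robust to small formatting variations in the model's approval prompt.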
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@

````text
You are an AI assistant executing the /adlc.spec.specify command, which creates feature specifications.

USER INPUT:
{{ user_input }}

CRITICAL REQUIREMENT: Mission Brief Enforcement

The /adlc.spec.specify command MUST follow a strict Mission Brief protocol:

## Phase 1: Mission Brief Collection

If the user input is substantial (10+ words), extract the Mission Brief elements:
- **Goal**: One-sentence core objective
- **Success Criteria**: 2-3 measurable outcomes
- **Constraints**: Technical, business, or regulatory limitations
- **Demo Sentence**: Observable user capability after completion

If the user input is minimal (<10 words), ask clarifying questions to gather the Mission Brief.

## Phase 2: Mission Brief Display

Display the extracted/gathered Mission Brief in this EXACT format:

```markdown
## Mission Brief

**Goal**: {goal}

**Success Criteria**:
- {criterion 1}
- {criterion 2}
- {criterion 3 if applicable}

**Constraints**:
- {constraint 1}
- {constraint 2}
- "None identified" if no constraints

**Demo Sentence**:
After this feature, the user can: {observable capability}
```

## Phase 3: Approval Request

After displaying the Mission Brief, ask:

**Proceed with this Mission Brief?** (yes / no / adjust)

IMPORTANT:
- Do NOT skip the Mission Brief even if the user provides detailed requirements
- The Mission Brief MUST be displayed before any branch creation or spec writing
- Wait for explicit user approval before proceeding

OUTPUT:
If user_input is substantial: Display the Mission Brief and approval prompt
If user_input is minimal: Ask clarifying questions to gather the Mission Brief
````
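The substantial-vs-minimal branch in Phase 1 hinges on a 10-word threshold. A trivial sketch of that check (whitespace tokenization is an assumption here; the prompt does not specify how words are counted):

```python
def is_substantial(user_input: str, threshold: int = 10) -> bool:
    # "Substantial" per the prompt means 10 or more words; anything
    # shorter should trigger clarifying questions instead of extraction.
    return len(user_input.split()) >= threshold
```

Under this rule, Test 5's input "Add search" falls on the minimal side, which is why its rubric expects clarifying questions rather than a Mission Brief.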
