diff --git a/evals/README.md b/evals/README.md
Lines changed: 5 additions & 4 deletions
diff --git a/evals/configs/promptfooconfig-spec.js b/evals/configs/promptfooconfig-spec.js
Lines changed: 1 addition & 24 deletions
diff --git a/evals/configs/promptfooconfig.js b/evals/configs/promptfooconfig.js
Lines changed: 0 additions & 24 deletions
diff --git a/evals/docs/EVAL.md b/evals/docs/EVAL.md
Lines changed: 4 additions & 5 deletions
diff --git a/templates/plan-template-build.md b/templates/plan-template-build.md
Lines changed: 0 additions & 56 deletions
diff --git a/templates/spec-template-build.md b/templates/spec-template-build.md
Lines changed: 0 additions & 50 deletions
@@ -6,19 +6,19 @@ Comprehensive evaluation infrastructure for testing spec-kit template quality us
 
 ## 📊 Current Evaluation Results (Updated: 2026-02-18)
 
-**Overall: 23 LLM eval tests + 39 unit tests across 6 suites** ✅
+**Overall: 22 LLM eval tests + 39 unit tests across 6 suites** ✅
 
 | Test Suite | Tests | What It Checks |
 |------------|-------|----------------|
-| **Spec Template** | 10 | Structure, clarity, security, completeness, regression |
+| **Spec Template** | 9 | Structure, clarity, security, completeness, regression |
 | **Plan Template** | 2 | Simplicity gate, constitution compliance |
 | **Architecture Template** | 4 | Rozanski & Woods structure, blackbox context view, simplicity, ADR quality |
 | **Extension System** | 3 | Manifest validation, self-containment, config template |
 | **Clarify Command** | 2 | Ambiguity identification, architectural focus |
 | **Trace Validation** | 2 | Structure completeness, decision quality |
 | **Security (all suites)** | +4 per test | PII, prompt injection, hallucinations, misinformation |
 | **Unit tests (pytest)** | 39 | Grader logic, extension system |
-| **Total** | **63+** | |
+| **Total** | **61** | |
 
 ## Quick Start
 
@@ -96,7 +96,8 @@ Each suite sends a prompt to the LLM and evaluates the output against structured
 - **Completeness** — complex features have comprehensive requirements
 - **Regression** — simple features still maintain proper structure
 - **Rename Regression** — post-rename output matches quality bar
-- **Build-mode Spec** — build-mode template generates appropriate output
+
+#### Plan Template (2 tests)
 
 #### Plan Template (2 tests)
 - **Simplicity Gate** — simple apps have ≤3 projects (Constitution Article VII)
 
@@ -198,28 +198,5 @@ module.exports = {
         },
       ],
     },
-
-    // Test 12: Build-mode Spec Quality
-    {
-      description: 'Spec Template: Build-mode produces lean, focused output',
-      vars: {
-        user_input:
-          'Build a simple health check endpoint that returns server status, uptime, and database connectivity. Build mode - minimal spec.',
-      },
-      assert: [
-        { type: 'icontains', value: 'requirement' },
-        {
-          type: 'llm-rubric',
-          value:
-            'Grade if this is appropriately lean for a simple health check feature (0-1):\n' +
-            '1. Is it concise (not overly verbose for a health check endpoint)?\n' +
-            '2. Does it include core functional requirements (status, uptime, db connectivity)?\n' +
-            '3. Does it have success criteria?\n' +
-            '4. Does it AVOID unnecessary complexity for such a simple feature?\n' +
-            'Return average score 0-1.',
-          threshold: 0.7,
-        },
-      ],
-    },
-  ],
+  },
 };
@@ -210,30 +210,6 @@ module.exports = {
       ],
     },
 
-    // Test 12: Build-mode Spec Quality
-    {
-      description: 'Spec Template: Build-mode produces lean, focused output',
-      prompt: 'file://../prompts/spec-prompt.txt',
-      vars: {
-        user_input:
-          'Build a simple health check endpoint that returns server status, uptime, and database connectivity. Build mode - minimal spec.',
-      },
-      assert: [
-        { type: 'icontains', value: 'requirement' },
-        {
-          type: 'llm-rubric',
-          value:
-            'Grade if this is appropriately lean for a simple health check feature (0-1):\n' +
-            '1. Is it concise (not overly verbose for a health check endpoint)?\n' +
-            '2. Does it include core functional requirements (status, uptime, db connectivity)?\n' +
-            '3. Does it have success criteria?\n' +
-            '4. Does it AVOID unnecessary complexity for such a simple feature?\n' +
-            'Return average score 0-1.',
-          threshold: 0.7,
-        },
-      ],
-    },
-
     // ========================================
     // PLAN TEMPLATE TESTS (4 tests)
     // ========================================
 
@@ -8,11 +8,11 @@ The annotation evals are a **multi-layered evaluation framework** for testing th
 
 ## 1. Automated Testing (PromptFoo)
 
-**23 LLM eval tests** across 6 suites, plus **4 security graders** that run on every test automatically.
+**22 LLM eval tests** across 6 suites, plus **4 security graders** that run on every test automatically.
 
 ### Test Suites
 
-#### Spec Template Tests (10 tests)
+#### Spec Template Tests (9 tests)
 - **Basic Structure**: Validates required sections (Overview, Requirements, User Stories, etc.)
 - **No Premature Tech Stack**: Ensures spec focuses on WHAT, not HOW
 - **Quality User Stories**: Checks for proper format and acceptance criteria
@@ -22,7 +22,6 @@ The annotation evals are a **multi-layered evaluation framework** for testing th
 - **Completeness**: Comprehensive requirements for complex features
 - **Regression**: Even simple features maintain proper structure
 - **Rename Regression**: Post-rename output matches quality bar
-- **Build-mode Spec**: Build-mode template generates appropriate output
 
 #### Plan Template Tests (2 tests)
 - **Simplicity Gate**: Simple apps should have ≤3 projects (Constitution Article VII)
@@ -61,7 +60,7 @@ Four graders run on **every LLM output** across all 23 tests via `defaultTest.as
 ### Running Automated Tests
 
 ```bash
-# Run all 23 LLM eval tests
+# Run all 22 LLM eval tests
 ./evals/scripts/run-promptfoo-eval.sh
 
 # Run with JSON output
@@ -146,7 +145,7 @@ Located in `evals/annotation-tool/`, this is a **FastHTML-based web interface**
 ```
 1. Generate Specs/Plans/Arch docs (using prompt templates)
    ↓
-2. PromptFoo Tests (23 LLM tests + 4 security graders on each)
+2. PromptFoo Tests (22 LLM tests + 4 security graders on each)
    ↓
 3. Unit Tests (pytest — fast, no API key needed)
    ↓