Commit 409b08b

clean: Remove build mode references from evals and tests
Complete build/spec mode removal from evals framework and test suite:

Eval Framework Changes:
- Remove Test 12 (Build-mode Spec Quality) from promptfooconfig.js and promptfooconfig-spec.js
- Update test counts: 23→22 LLM eval tests, 10→9 Spec Template tests
- Update README and EVAL.md test documentation to reflect new counts
- Remove "Build-mode Spec" bullet point from test lists

Test Changes:
- Delete tests/test_mode_evolution.py file (tests obsolete build/spec mode functionality)
- Tests were checking for mode detection, template differences that no longer exist

Template Cleanup:
- Delete spec-template-build.md (lean, minimal output template)
- Delete plan-template-build.md (simplified planning template)
- These templates are no longer used after build mode removal

Impact:
- Eval tests now test spec-driven workflow only
- Test suite reduced from 63 to 61 tests (23→22 LLM + 39 unit)
- All build mode test artifacts removed
1 parent 74caf4f commit 409b08b

7 files changed

Lines changed: 10 additions & 323 deletions

evals/README.md

Lines changed: 5 additions & 4 deletions
@@ -6,19 +6,19 @@ Comprehensive evaluation infrastructure for testing spec-kit template quality us
 
 ## 📊 Current Evaluation Results (Updated: 2026-02-18)
 
-**Overall: 23 LLM eval tests + 39 unit tests across 6 suites**
+**Overall: 22 LLM eval tests + 39 unit tests across 6 suites**
 
 | Test Suite | Tests | What It Checks |
 |------------|-------|----------------|
-| **Spec Template** | 10 | Structure, clarity, security, completeness, regression |
+| **Spec Template** | 9 | Structure, clarity, security, completeness, regression |
 | **Plan Template** | 2 | Simplicity gate, constitution compliance |
 | **Architecture Template** | 4 | Rozanski & Woods structure, blackbox context view, simplicity, ADR quality |
 | **Extension System** | 3 | Manifest validation, self-containment, config template |
 | **Clarify Command** | 2 | Ambiguity identification, architectural focus |
 | **Trace Validation** | 2 | Structure completeness, decision quality |
 | **Security (all suites)** | +4 per test | PII, prompt injection, hallucinations, misinformation |
 | **Unit tests (pytest)** | 39 | Grader logic, extension system |
-| **Total** | **63+** | |
+| **Total** | **61** | |
 
 ## Quick Start
 
@@ -96,7 +96,8 @@ Each suite sends a prompt to the LLM and evaluates the output against structured
 - **Completeness** — complex features have comprehensive requirements
 - **Regression** — simple features still maintain proper structure
 - **Rename Regression** — post-rename output matches quality bar
-- **Build-mode Spec** — build-mode template generates appropriate output
+
+#### Plan Template (2 tests)
 
 #### Plan Template (2 tests)
 - **Simplicity Gate** — simple apps have ≤3 projects (Constitution Article VII)
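
The per-suite counts in the updated README table can be cross-checked against the new headline numbers with a short sketch. The object keys and the `sumCounts` helper below are illustrative names, not identifiers from the repository; only the numbers come from the table.

```javascript
// Suite sizes copied from the updated README table.
// Key names are hypothetical labels chosen for this sketch.
const llmSuites = {
  specTemplate: 9,
  planTemplate: 2,
  architectureTemplate: 4,
  extensionSystem: 3,
  clarifyCommand: 2,
  traceValidation: 2,
};

// Unit test count (pytest) from the same table.
const unitTests = 39;

// Add up the values of a { suiteName: count } object.
function sumCounts(suites) {
  return Object.values(suites).reduce((total, n) => total + n, 0);
}

const llmTotal = sumCounts(llmSuites);
const grandTotal = llmTotal + unitTests;

console.log(`LLM eval tests: ${llmTotal}, total: ${grandTotal}`);
// prints "LLM eval tests: 22, total: 61"
```

The six suites sum to 22 LLM eval tests, and adding the 39 unit tests gives the new total of 61. With the pre-change Spec Template count of 10, the same sum gives 23 LLM tests and 62 overall, whereas the old table row read 63+.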

evals/configs/promptfooconfig-spec.js

Lines changed: 1 addition & 24 deletions
@@ -198,28 +198,5 @@ module.exports = {
 },
 ],
 },
-
-// Test 12: Build-mode Spec Quality
-{
-description: 'Spec Template: Build-mode produces lean, focused output',
-vars: {
-user_input:
-'Build a simple health check endpoint that returns server status, uptime, and database connectivity. Build mode - minimal spec.',
-},
-assert: [
-{ type: 'icontains', value: 'requirement' },
-{
-type: 'llm-rubric',
-value:
-'Grade if this is appropriately lean for a simple health check feature (0-1):\n' +
-'1. Is it concise (not overly verbose for a health check endpoint)?\n' +
-'2. Does it include core functional requirements (status, uptime, db connectivity)?\n' +
-'3. Does it have success criteria?\n' +
-'4. Does it AVOID unnecessary complexity for such a simple feature?\n' +
-'Return average score 0-1.',
-threshold: 0.7,
-},
-],
-},
-],
+},
 };

evals/configs/promptfooconfig.js

Lines changed: 0 additions & 24 deletions
@@ -210,30 +210,6 @@ module.exports = {
 ],
 },
 
-// Test 12: Build-mode Spec Quality
-{
-description: 'Spec Template: Build-mode produces lean, focused output',
-prompt: 'file://../prompts/spec-prompt.txt',
-vars: {
-user_input:
-'Build a simple health check endpoint that returns server status, uptime, and database connectivity. Build mode - minimal spec.',
-},
-assert: [
-{ type: 'icontains', value: 'requirement' },
-{
-type: 'llm-rubric',
-value:
-'Grade if this is appropriately lean for a simple health check feature (0-1):\n' +
-'1. Is it concise (not overly verbose for a health check endpoint)?\n' +
-'2. Does it include core functional requirements (status, uptime, db connectivity)?\n' +
-'3. Does it have success criteria?\n' +
-'4. Does it AVOID unnecessary complexity for such a simple feature?\n' +
-'Return average score 0-1.',
-threshold: 0.7,
-},
-],
-},
-
 // ========================================
 // PLAN TEMPLATE TESTS (4 tests)
 // ========================================

evals/docs/EVAL.md

Lines changed: 4 additions & 5 deletions
@@ -8,11 +8,11 @@ The annotation evals are a **multi-layered evaluation framework** for testing th
 
 ## 1. Automated Testing (PromptFoo)
 
-**23 LLM eval tests** across 6 suites, plus **4 security graders** that run on every test automatically.
+**22 LLM eval tests** across 6 suites, plus **4 security graders** that run on every test automatically.
 
 ### Test Suites
 
-#### Spec Template Tests (10 tests)
+#### Spec Template Tests (9 tests)
 - **Basic Structure**: Validates required sections (Overview, Requirements, User Stories, etc.)
 - **No Premature Tech Stack**: Ensures spec focuses on WHAT, not HOW
 - **Quality User Stories**: Checks for proper format and acceptance criteria
@@ -22,7 +22,6 @@ The annotation evals are a **multi-layered evaluation framework** for testing th
 - **Completeness**: Comprehensive requirements for complex features
 - **Regression**: Even simple features maintain proper structure
 - **Rename Regression**: Post-rename output matches quality bar
-- **Build-mode Spec**: Build-mode template generates appropriate output
 
 #### Plan Template Tests (2 tests)
 - **Simplicity Gate**: Simple apps should have ≤3 projects (Constitution Article VII)
@@ -61,7 +60,7 @@ Four graders run on **every LLM output** across all 23 tests via `defaultTest.as
 ### Running Automated Tests
 
 ```bash
-# Run all 23 LLM eval tests
+# Run all 22 LLM eval tests
 ./evals/scripts/run-promptfoo-eval.sh
 
 # Run with JSON output
@@ -146,7 +145,7 @@ Located in `evals/annotation-tool/`, this is a **FastHTML-based web interface**
 ```
 1. Generate Specs/Plans/Arch docs (using prompt templates)
 
-2. PromptFoo Tests (23 LLM tests + 4 security graders on each)
+2. PromptFoo Tests (22 LLM tests + 4 security graders on each)
 
 3. Unit Tests (pytest — fast, no API key needed)

templates/plan-template-build.md

Lines changed: 0 additions & 56 deletions
This file was deleted.

templates/spec-template-build.md

Lines changed: 0 additions & 50 deletions
This file was deleted.
