Commit 2dfd48e

Merge pull request #91 from tikalk/edd_implementation
Evals Extension: EDD (Eval-Driven Development) with PromptFoo Integration
2 parents b3f271f + be500d4 commit 2dfd48e

38 files changed

Lines changed: 12391 additions & 0 deletions

CHANGELOG.md

Lines changed: 16 additions & 0 deletions
@@ -2,6 +2,22 @@

All notable changes to the Specify CLI and templates are documented here.

## [0.4.11] - 2026-04-16

### Added

- **Evals Extension**: Complete EDD (Eval-Driven Development) implementation with PromptFoo integration
- **Complete EDD Methodology**: All 10 EDD principles implemented and validated
- **Goldset Lifecycle**: Full ADR/CDR pattern with `init` → `specify` → `clarify` → `analyze` → `implement` workflow
- **Evaluation Pyramid**: Tier 1 fast checks (<30s) + Tier 2 semantic evaluation (<5min) + production sampling
- **Statistical Validation**: TPR/TNR analysis with 95% confidence intervals and holdout dataset validation
- **PromptFoo Integration**: Automatic config generation, executable grader creation, and seamless results processing
- **Cross-Functional Intelligence**: `levelup` command generates stakeholder-specific insights and team PRs
- **Smart Task Matching**: `tasks` command provides intelligent eval-task alignment with coverage analysis
- **Production-Ready**: Complete validation pipeline, error handling, and production loop closure
- **8 Commands**: `init`, `specify`, `clarify`, `analyze`, `implement`, `validate`, `levelup`, `tasks`
- **Comprehensive Documentation**: 190+ section README with architecture guide, examples, and troubleshooting

## [0.4.10] - 2026-04-14

### Changed

docs/EDD_USAGE_GUIDE.md

Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@

# EDD Components vs Internal Evaluation Systems - Usage Guide

## Overview

This repository contains **two separate evaluation systems** with different purposes. Understanding when to use each is crucial for proper implementation.

---

## 📋 **Internal Evaluation System** (`evals/configs/`)

### Purpose

- **Evaluates the quality of the prompt templates** in this repository
- Tests whether `spec-prompt.txt`, `plan-prompt.txt`, `arch-prompt.txt`, etc. generate good outputs
- **Quality assurance for the toolkit itself**

### Structure

```
evals/
├── configs/
│   ├── promptfooconfig.js         # Main evaluation config
│   ├── promptfooconfig-spec.js    # Tests spec-prompt.txt quality
│   ├── promptfooconfig-plan.js    # Tests plan-prompt.txt quality
│   ├── promptfooconfig-arch.js    # Tests arch-prompt.txt quality
│   ├── promptfooconfig-ext.js     # Tests ext-prompt.txt quality
│   └── promptfooconfig-clarify.js # Tests clarify-prompt.txt quality
├── graders/
│   └── custom_graders.py          # Custom graders for prompt quality
└── prompts/
    ├── spec-prompt.txt            # The actual prompts being tested
    ├── plan-prompt.txt
    └── ...
```

### When to Use

- ✅ Testing if your prompt templates generate good specifications
- ✅ Regression testing after modifying prompts
- ✅ Quality assurance for toolkit development
- ✅ Ensuring prompts follow best practices
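The regression-testing use case above can be sketched as a deterministic check over a generated specification. This is a minimal illustration only: the section names below are hypothetical placeholders, not taken from the repository's actual prompt configs.

```python
# Sketch: a deterministic regression check for a generated specification.
# REQUIRED_SECTIONS is hypothetical -- a real check would derive these
# from the prompt template being tested.
REQUIRED_SECTIONS = ["## Overview", "## Requirements", "## Acceptance Criteria"]

def spec_has_required_sections(spec_text: str) -> bool:
    """Return True only if every required section heading is present."""
    return all(section in spec_text for section in REQUIRED_SECTIONS)

good = "## Overview\n...\n## Requirements\n...\n## Acceptance Criteria\n..."
bad = "## Overview\nonly an overview"
print(spec_has_required_sections(good))  # True
print(spec_has_required_sections(bad))   # False
```

A check like this runs in milliseconds, which is what makes it suitable for regression testing after every prompt edit.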
---

## 🛡️ **EDD Components** (`evals/edd-components/`)

### Purpose

- **Framework for evaluating external projects** that use this toolkit
- Security baseline checks, compliance validation, etc.
- **Methodology for others to implement in their projects**

### Structure

```
evals/
├── edd-components/
│   ├── graders/
│   │   ├── check_pii_leakage.py            # Security grader
│   │   ├── check_prompt_injection.py       # Security grader
│   │   ├── check_hallucination.py          # Security grader
│   │   ├── check_misinformation.py         # Security grader
│   │   ├── check_regulatory_compliance.py  # Compliance grader
│   │   └── check_context_adherence.py      # Context grader
│   ├── configs/
│   │   ├── config.js                       # EDD example config
│   │   ├── config-tier1.js                 # Tier 1 (fast) evaluations
│   │   └── config-tier2.js                 # Tier 2 (semantic) evaluations
│   └── goldset/
│       └── goldset.csv                     # Reference test data
└── scripts/
    ├── edd_failure_routing.py              # EDD failure routing
    └── audit_binary_compliance.py          # EDD compliance audit
```

### When to Use

- ✅ **External teams** implementing this toolkit in their projects
- ✅ Evaluating **AI systems built using the prompts** from this toolkit
- ✅ Security baseline validation for production AI systems
- ✅ Compliance checking for regulated industries
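The graders listed above are binary by design. A minimal sketch of one, assuming PromptFoo's file-based Python assertion convention (a `get_assert(output, context)` function returning a boolean) and illustrative regex patterns far narrower than a real PII check:

```python
# Sketch of a binary (pass/fail) grader in the style of check_pii_leakage.py.
# The patterns are illustrative only; production PII detection needs far
# broader coverage than two regexes.
import re

PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),    # US SSN-like number
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),  # email address
]

def get_assert(output: str, context: dict) -> bool:
    """Fail (False) if any PII-like pattern appears in the model output."""
    return not any(p.search(output) for p in PII_PATTERNS)
```

Returning a bare boolean keeps the grader strictly binary, in line with EDD Principle II's pass/fail-only outputs.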
---

## 🔄 **Command Alignment Across Extensions**

All extensions in Spec Kit follow a consistent workflow pattern:

| Step | `/architect.*` | `/levelup.*` | `/evals.*` | Purpose |
|------|----------------|--------------|------------|---------|
| **1. Initialize** | `/architect.init` | | `/evals.init` | Reverse-engineer from existing codebase (brownfield) |
| **2. Specify** | `/architect.specify` | `/levelup.specify` | `/evals.specify` | Extract core artifacts (ADRs, CDRs, eval criteria) from spec |
| **3. Clarify** | `/architect.clarify` | `/levelup.clarify` | `/evals.clarify` | Resolve ambiguities through interactive questions |
| **4. Implement** | `/architect.implement` | `/levelup.implement` | `/evals.implement` | Generate final outputs (AD.md, skills, PromptFoo config) |
| **5. Validate** | `/architect.validate` | | `/evals.validate` | Validate alignment/quality (READ-ONLY for architect) |
| **6. Analyze/Trace** | `/architect.analyze` | `/levelup.trace` | `/evals.trace` | Post-implementation analysis and reporting |

### Pattern Summary

**Common workflow**:

1. **init** (optional, brownfield only) →
2. **specify** (extract) →
3. **clarify** (refine) →
4. **implement** (generate) →
5. **validate/trace** (verify/analyze)

**Key differences**:

- **`/architect.*`**: Focuses on architectural decisions (ADRs → AD.md)
- **`/levelup.*`**: Focuses on coding directives (CDRs → skills → team-ai-directives)
- **`/evals.*`**: Focuses on evaluation criteria (goldset → PromptFoo config → test results)
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@

```js
// Tier 1 Only - Fast CI/CD Integration
// EDD Principle IV: Fast deterministic checks only

module.exports = {
  description: 'EDD Tier 1 - Fast Deterministic Checks (<30s)',

  tests: [
    {
      description: 'Security Baseline - PII Leakage',
      assert: [{ type: 'python', value: './graders/check_pii_leakage.py' }]
    },
    {
      description: 'Security Baseline - Prompt Injection',
      assert: [{ type: 'python', value: './graders/check_prompt_injection.py' }]
    },
    {
      description: 'Security Baseline - Hallucination Detection',
      assert: [{ type: 'python', value: './graders/check_hallucination.py' }]
    },
    {
      description: 'Security Baseline - Misinformation Detection',
      assert: [{ type: 'python', value: './graders/check_misinformation.py' }]
    },
    {
      description: 'Regulatory Compliance Validation',
      assert: [{ type: 'python', value: './graders/check_regulatory_compliance.py' }]
    }
  ],

  outputPath: '../results/tier1_results.json',

  metadata: {
    tier: 1,
    sla: '30_seconds',
    use_case: 'ci_cd_fast_feedback'
  }
};
```
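A sketch of how the `tier1_results.json` written by `outputPath` above could gate CI. The result shape assumed here (a `results` list of entries with `description` and `pass` fields) is an assumption about PromptFoo's output, not a documented schema; a real gate would `json.load` the file at that path.

```python
# Sketch: CI gate over Tier 1 results. Assumed (not documented) result
# shape: {"results": [{"description": str, "pass": bool}, ...]}.
def tier1_gate(results: dict) -> int:
    """Return a CI exit code: 0 if every Tier 1 check passed, 1 otherwise."""
    failures = [r["description"]
                for r in results.get("results", [])
                if not r.get("pass")]
    for name in failures:
        print(f"TIER1 FAIL: {name}")
    return 1 if failures else 0

# Inline sample in place of reading ../results/tier1_results.json:
sample = {"results": [
    {"description": "Security Baseline - PII Leakage", "pass": True},
    {"description": "Regulatory Compliance Validation", "pass": False},
]}
exit_code = tier1_gate(sample)  # prints one TIER1 FAIL line, returns 1
```

Because Tier 1 is deterministic and under 30 seconds, a hard exit code like this is appropriate for per-commit CI, whereas Tier 2 results belong at the merge gate.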
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@

```js
// Tier 2 Only - Semantic Evaluation for Merge Gates
// EDD Principle IV: Goldset LLM judges

module.exports = {
  description: 'EDD Tier 2 - Goldset Semantic Evaluation (<5min)',

  tests: [
    {
      description: 'Context Adherence Validation',
      assert: [{ type: 'python', value: './graders/check_context_adherence.py' }]
    }
  ],

  outputPath: '../results/tier2_results.json',

  metadata: {
    tier: 2,
    sla: '5_minutes',
    use_case: 'merge_gate_validation'
  }
};
```
Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@

```js
// PromptFoo Configuration
// Auto-generated from goldset.md following EDD principles
// Generated: 2026-03-30T10:33:28Z

const path = require('path');

module.exports = {
  description: 'EDD Evaluation Suite - Binary Pass/Fail with Evaluation Pyramid',

  // EDD Principle IV: Evaluation Pyramid
  tests: [

    // ============================================
    // TIER 1: Fast Deterministic Checks (<30s)
    // ============================================

    // Security Baseline (Always Applied)
    {
      description: 'Security Baseline - PII Leakage',
      assert: [{
        type: 'python',
        value: './graders/check_pii_leakage.py',
      }],
      metadata: { tier: 1, type: 'security_baseline', priority: 'critical' }
    },

    {
      description: 'Security Baseline - Prompt Injection',
      assert: [{
        type: 'python',
        value: './graders/check_prompt_injection.py',
      }],
      metadata: { tier: 1, type: 'security_baseline', priority: 'critical' }
    },

    {
      description: 'Security Baseline - Hallucination Detection',
      assert: [{
        type: 'python',
        value: './graders/check_hallucination.py',
      }],
      metadata: { tier: 1, type: 'security_baseline', priority: 'critical' }
    },

    {
      description: 'Security Baseline - Misinformation Detection',
      assert: [{
        type: 'python',
        value: './graders/check_misinformation.py',
      }],
      metadata: { tier: 1, type: 'security_baseline', priority: 'critical' }
    },

    // Goldset Tier 1 Criteria
    {
      description: 'Regulatory Compliance Validation',
      assert: [{
        type: 'python',
        value: './graders/check_regulatory_compliance.py',
      }],
      metadata: {
        tier: 1,
        type: 'goldset_criterion',
        criterion: 'eval-001',
        failure_type: 'specification_failure'
      }
    },

    // ============================================
    // TIER 2: Goldset Semantic Evaluation (<5min)
    // ============================================

    {
      description: 'Context Adherence Validation',
      assert: [{
        type: 'python',
        value: './graders/check_context_adherence.py',
      }],
      metadata: {
        tier: 2,
        type: 'goldset_criterion',
        criterion: 'eval-002',
        failure_type: 'generalization_failure',
        evaluator_type: 'llm-judge'
      }
    }
  ],

  // EDD Principle II: Binary pass/fail outputs only
  outputPath: '../results/promptfoo_results.json',

  // EDD Principle V: Trajectory observability
  writeLatestResults: true,
  share: false,

  // EDD Principle IX: Test data versioning metadata
  metadata: {
    version: '1.0.0',
    generated: '2026-03-30T10:33:28Z',
    goldset_version: '1.0.0',
    edd_compliant: true,
    binary_only: true,
    evaluation_pyramid: true,
    tier1_sla: '30_seconds',
    tier2_sla: '5_minutes',

    // EDD Principle VIII: Failure type routing
    criteria_mapping: {
      'eval-001': { name: 'Regulatory Compliance', failure_type: 'specification_failure' },
      'eval-002': { name: 'Context Adherence', failure_type: 'generalization_failure' }
    }
  }
};
```
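The `criteria_mapping` block above is what enables EDD Principle VIII. A sketch of what failure routing in the spirit of `edd_failure_routing.py` (whose actual contents are not shown in this diff) could look like, grouping failed criterion IDs by failure type so specification failures and generalization failures reach different owners:

```python
# Sketch: route failed goldset criteria by failure type.
# Mirrors the criteria_mapping in the config above.
CRITERIA_MAPPING = {
    "eval-001": {"name": "Regulatory Compliance", "failure_type": "specification_failure"},
    "eval-002": {"name": "Context Adherence", "failure_type": "generalization_failure"},
}

def route_failures(failed_criteria):
    """Group failed criterion IDs by failure type for downstream routing."""
    routes = {}
    for cid in failed_criteria:
        ftype = CRITERIA_MAPPING.get(cid, {}).get("failure_type", "unknown_failure")
        routes.setdefault(ftype, []).append(cid)
    return routes
```

Specification failures point back at the spec/goldset, while generalization failures point at the model or prompt, which is why keeping the mapping in the config rather than in code keeps routing in sync with the goldset version.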
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@

```yaml
# System-specific configuration for promptfoo
system: "promptfoo"
binary_pass_fail: true

# EDD Principle IV: Evaluation Pyramid
tiers:
  tier1:
    fast_checks: true
    security_baseline: true
  tier2:
    goldset_judges: true

# EDD Principle IX: Test Data is Code
test_data:
  version_control: true
  adversarial_required: true
  holdout_ratio: 0.2
```
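The `holdout_ratio: 0.2` above can be honored deterministically so the split stays stable across runs and goldset versions. A sketch under stated assumptions: row IDs like `eval-NNN` are illustrative, and hash-based assignment is one possible technique, not necessarily what the toolkit's scripts do.

```python
# Sketch: stable holdout assignment for a versioned goldset.
# Hashing the row ID (rather than random sampling) keeps the split
# reproducible, in the spirit of "Test Data is Code".
import hashlib

def is_holdout(row_id: str, ratio: float = 0.2) -> bool:
    """Assign a row to the holdout set via a stable hash of its ID."""
    digest = hashlib.sha256(row_id.encode()).digest()
    return digest[0] / 256 < ratio

rows = [f"eval-{i:03d}" for i in range(100)]
holdout = [r for r in rows if is_holdout(r)]
train = [r for r in rows if not is_holdout(r)]
```

Because assignment depends only on the row ID, adding new goldset rows never reshuffles existing ones between train and holdout, which keeps the holdout validation honest over time.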
