
Commit bb52e4e

Merge pull request #92 from tikalk/edd_deepeval_option
EDD DeepEval option alongside PromptFoo
2 parents fcc755a + afa518c commit bb52e4e

20 files changed

Lines changed: 4210 additions & 690 deletions

CHANGELOG.md

Lines changed: 28 additions & 4 deletions
```diff
@@ -4,12 +4,29 @@ All notable changes to the Specify CLI and templates are documented here.
 
 ## [Unreleased]
 
+## [0.5.11] - 2026-04-23
+
+### Added
+
+- **DeepEval Integration**: Full support for DeepEval as alternative evaluation framework
+  - Custom metric class generation with DeepEval v3.x API support
+  - Automatic version compatibility validation (DeepEval >=3.0.0 required)
+  - System detection to choose between PromptFoo and DeepEval
+- **Atomic Generation Order**: Prevents import errors in generated configurations
+  - Graders generated before config (normal Python imports work)
+  - Validation step with rollback on failure
+  - Clear error messages for missing dependencies
+
 ### Changed
 
-- **Hook-based architecture loading**: Replaced hardcoded AD.md/adr.md file loading in preset commands with hook-based architecture
-  - Architecture context now loaded via `before_specify`/`before_analyze`/`before_clarify` hooks
-  - Removed direct file path references from `adlc.spec.analyze.md` and `adlc.spec.clarify.md`
-  - Aligns with extension hook system for better extensibility
+- **Command Naming**: Renamed `evals.trace` to `evals.analyze` for clarity
+- **Command Structure**: Standardized command interface across all evals commands
+
+### Fixed
+
+- **Import Errors**: Resolved chicken-and-egg problem in DeepEval config generation
+- **Version Compatibility**: Added clear error messages for DeepEval v2.x users with upgrade instructions
+- **Documentation**: Clarified threshold parameter usage in EDD binary evaluation mode
 
 ## [0.5.10] - 2026-04-20
 
@@ -21,6 +38,13 @@ All notable changes to the Specify CLI and templates are documented here.
 - Added `get_team_directives_path()` helper to resolve path from init-options or extensions dir
 - Added `install` parameter to `sync_team_ai_directives()` for explicit control
 
+### Changed
+
+- **Hook-based architecture loading**: Replaced hardcoded AD.md/adr.md file loading in preset commands with hook-based architecture
+  - Architecture context now loaded via `before_specify`/`before_analyze`/`before_clarify` hooks
+  - Removed direct file path references from `adlc.spec.analyze.md` and `adlc.spec.clarify.md`
+  - Aligns with extension hook system for better extensibility
+
 ### Fixed
 
 - **team-ai-directives duplicate installation**: Removed duplicate `sync_team_ai_directives()` call
```
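The atomic generation order added in 0.5.11 can be pictured with a short sketch. Every name here (`generate_all`, `validate_imports`, the staging scheme) is illustrative rather than the extension's actual API:

```python
# Sketch of all-or-nothing generation: graders first, then config,
# staged in a temp dir and committed only after validation passes.
import shutil
import tempfile
from pathlib import Path

def generate_all(generators, out_dir: Path) -> None:
    staging = Path(tempfile.mkdtemp(prefix="evals-gen-"))
    try:
        for generate in generators:   # ordered: graders before config
            generate(staging)
        validate_imports(staging)     # raises if the config's imports break
    except Exception:
        shutil.rmtree(staging)        # rollback: never leave partial output
        raise
    if out_dir.exists():
        shutil.rmtree(out_dir)
    shutil.move(str(staging), str(out_dir))

def validate_imports(staging: Path) -> None:
    """Best-effort check that every generated module imports cleanly."""
    import importlib.util
    import sys
    sys.path.insert(0, str(staging))  # so a generated config can import the graders
    try:
        for py in sorted(staging.glob("*.py")):
            spec = importlib.util.spec_from_file_location(py.stem, py)
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)
    finally:
        sys.path.pop(0)
```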

docs/EDD_USAGE_GUIDE.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -87,7 +87,7 @@ All extensions in Spec Kit follow a consistent workflow pattern:
 | **3. Clarify** | `/architect.clarify` | `/levelup.clarify` | `/evals.clarify` | Resolve ambiguities through interactive questions |
 | **4. Implement** | `/architect.implement` | `/levelup.implement` | `/evals.implement` | Generate final outputs (AD.md, skills, PromptFoo config) |
 | **5. Validate** | `/architect.validate` || `/evals.validate` | Validate alignment/quality (READ-ONLY for architect) |
-| **6. Analyze/Trace** | `/architect.analyze` | `/levelup.trace` | `/evals.trace` | Post-implementation analysis and reporting |
+| **6. Analyze/Trace** | `/architect.analyze` | `/levelup.trace` | `/evals.analyze` | Post-implementation analysis and reporting |
 
 ### Pattern Summary
 
```

evals/RUN_EVALUATORS.md

Lines changed: 145 additions & 0 deletions
(new file)
# Running Evaluators

Generic evaluator execution framework: it auto-discovers graders, runs them against the goldset, and saves results.

## Quick Start

```bash
# Basic run with defaults
./run_evaluators.py

# Custom paths
./run_evaluators.py \
  --goldset evals/deepeval/goldset.json \
  --graders evals/deepeval/graders \
  --results evals/results

# With grader mapping
./run_evaluators.py --mapping evals/grader_mapping.json

# Custom pass threshold
./run_evaluators.py --pass-threshold 0.9
```

## Expected Project Structure

```
project/
├── evals/
│   ├── goldset.json            # Test examples
│   ├── graders/                # Evaluator functions
│   │   ├── check_*.py
│   │   └── ...
│   ├── results/                # Generated results (auto-created)
│   └── grader_mapping.json     # Optional: criterion → grader mapping
└── run_evaluators.py
```
## Goldset Format

Two supported formats:

### Format 1: Structured evaluations

```json
{
  "version": "1.0",
  "evaluations": [
    {
      "id": "eval-001",
      "name": "PII Detection",
      "examples": [
        {
          "input": "My email is user@example.com",
          "expected_pass": false,
          "type": "fail_case"
        }
      ]
    }
  ]
}
```

### Format 2: Flat examples

```json
{
  "version": "1.0",
  "examples": [
    {
      "input": "Test input",
      "expected_pass": true
    }
  ]
}
```
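Both formats can be reduced to the same flat list of (criterion ID, example) pairs before grading. A possible normalization, assuming only the field names shown above (`load_examples` is an illustrative helper, not the script's actual API):

```python
# Normalize either goldset format into (criterion_id, example) pairs.
import json
from pathlib import Path

def load_examples(path: str):
    data = json.loads(Path(path).read_text())
    if "evaluations" in data:  # Format 1: structured
        return [
            (ev["id"], ex)
            for ev in data["evaluations"]
            for ex in ev.get("examples", [])
        ]
    return [(None, ex) for ex in data.get("examples", [])]  # Format 2: flat
```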
## Grader Functions

Supports multiple signatures:

```python
# Signature 1: PromptFoo style
def grade(output: str, context: dict) -> dict:
    return {"pass": True, "score": 1.0, "reason": "..."}

# Signature 2: Simple
def evaluate(input: str) -> bool:
    return True

# Signature 3: Comparison
def evaluate(input: str, expected: str) -> dict:
    return {"pass": input == expected}
```
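For example, a minimal Signature 1 grader for the PII criterion might look like the sketch below. The filename matches `grader_mapping.json`; the regex and messages are illustrative assumptions, not the shipped grader:

```python
# Hypothetical graders/check_pii_leakage.py: a minimal Signature 1 grader.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def grade(output: str, context: dict) -> dict:
    """Fail any output that leaks an email address (illustrative check only)."""
    leaked = EMAIL_RE.findall(output)
    return {
        "pass": not leaked,
        "score": 1.0 if not leaked else 0.0,  # binary score, per EDD
        "reason": f"Found PII: {leaked}" if leaked else "No PII detected",
    }
```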

## Results Format

```json
{
  "execution_date": "2026-04-20T...",
  "goldset_version": "1.0",
  "summary": {
    "total_evaluated": 50,
    "total_passed": 45,
    "total_failed": 3,
    "total_errors": 2,
    "overall_accuracy": 0.9
  },
  "criteria_results": {
    "eval-001": {
      "passed": 8,
      "failed": 2,
      "accuracy": 0.8,
      "example_results": [...]
    }
  }
}
```

In this sample, `overall_accuracy` corresponds to `total_passed / total_evaluated` (45/50 = 0.9).
## Exit Codes

- `0`: Success (accuracy >= threshold)
- `1`: Failure (accuracy < threshold)
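In script terms the contract might reduce to something like this sketch (`finish` and its arguments are illustrative, not the script's actual internals):

```python
# Sketch of the exit-code contract; `summary` follows the results format above.
import sys

def finish(summary: dict, pass_threshold: float) -> None:
    # Exit 0 when overall accuracy meets the threshold, 1 otherwise.
    sys.exit(0 if summary["overall_accuracy"] >= pass_threshold else 1)
```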
## Integration with Analyze

After running evaluators:

```bash
# Run evaluators
./run_evaluators.py

# Analyze results
specify analyze evals/results/
```
## Grader Mapping

Optional `grader_mapping.json` maps criterion IDs to grader module names:

```json
{
  "eval-001": "check_pii_leakage",
  "eval-002": "check_security"
}
```

Without a mapping, the criterion ID is used as the grader name.
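One plausible implementation of that lookup, assuming graders live as modules in the graders directory (`resolve_grader` is an illustrative helper, not the script's actual API):

```python
# Resolve a criterion ID to a grader module, honoring the optional mapping.
import importlib.util
import json
from pathlib import Path

def resolve_grader(criterion_id: str, graders_dir: Path, mapping_file: Path | None = None):
    mapping = json.loads(mapping_file.read_text()) if mapping_file else {}
    module_name = mapping.get(criterion_id, criterion_id)  # fallback: the ID itself
    module_path = graders_dir / f"{module_name}.py"
    spec = importlib.util.spec_from_file_location(module_name, module_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # module now exposes grade()/evaluate()
    return module
```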

evals/grader_mapping.json

Lines changed: 11 additions & 0 deletions
(new file)

```json
{
  "_comment": "Maps criterion IDs to grader module names",
  "_example": "If criterion ID is 'eval-001' and grader file is 'check_pii.py', map to 'check_pii'",

  "eval-001": "check_pii_leakage",
  "eval-002": "check_misinformation",
  "eval-003": "check_hallucination",
  "eval-004": "check_https_validation",
  "eval-005": "check_prompt_injection",
  "eval-006": "check_unit_conversion"
}
```

extensions/evals/CHANGELOG.md

Lines changed: 47 additions & 0 deletions
## [Unreleased]

### Added

- **DeepEval Integration**: Full DeepEval support as alternative to PromptFoo
  - Custom metric class generation with DeepEval v3.x API (`LLMTestCase`, async `a_measure`)
  - DeepEval-specific configuration templates and validation scripts
  - Automatic dependency management with version compatibility checks (DeepEval >=3.0.0)
  - System detection logic to choose PromptFoo vs DeepEval based on configuration
- **Atomic Generation Order**: Prevents chicken-and-egg import problems
  - Graders generated first, then tests, then config (normal Python imports)
  - Configuration validation after generation (import checks, metric instantiation)
  - Rollback on failure (all-or-nothing file generation)
  - Clear error messages for missing dependencies or validation failures
- **Version Compatibility Validation**: DeepEval v2.x vs v3.x detection
  - Runtime version check with detailed upgrade instructions (sketched below)
  - Breaking changes documentation (TestCase → LLMTestCase, context type changes)
  - Graceful error handling with actionable fix commands
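The runtime version gate described above might look like this sketch; the function name and message text are illustrative, and the breaking-changes list comes from the entries above:

```python
# Hedged sketch of the DeepEval >=3.0.0 runtime check.
from importlib.metadata import version

def require_deepeval_v3() -> None:
    installed = version("deepeval")
    if int(installed.split(".")[0]) < 3:
        raise RuntimeError(
            f"DeepEval {installed} detected, but >=3.0.0 is required.\n"
            "Upgrade: pip install --upgrade 'deepeval>=3.0.0'\n"
            "Breaking changes: TestCase -> LLMTestCase; context type changes."
        )
```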

### Changed

- **Command Naming**: Renamed `trace` command to `analyze` for clarity
  - Better alignment with goldset analysis phase terminology
  - Consistent with bottom-up error analysis workflow
- **Command Structure**: Evals extension now follows the same logical pattern as other extensions
  - Standardized command interface across all extension commands
  - Improved consistency with framework conventions

### Fixed

- **DeepEval Import Issues**: Resolved chicken-and-egg problem in generated config.py
  - Config now generated AFTER graders exist (not before)
  - Validation step ensures all imports work before commit
  - Rollback mechanism prevents partial/broken state
- **Version Compatibility**: Added DeepEval >=3.0.0 requirement with clear error messages
  - Users with v2.x get upgrade instructions with breaking changes list
  - Import validation catches missing v3.x API components
- **EDD Principle II Clarity**: Added comments explaining threshold parameter in DeepEval metrics (see the sketch below)
  - Clarified that threshold is a DeepEval API requirement, not a scoring system
  - Documented that `edd_compliant=True` enforces strict binary 1.0/0.0 output
  - Function docstring explains binary-only behavior despite threshold presence
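To make the threshold comments concrete, here is a minimal sketch of a binary metric on the DeepEval v3.x API. `EDDBinaryMetric` and the `edd_compliant` flag mirror the names in these entries; the actually generated classes may differ:

```python
# Sketch: an EDD-compliant binary metric on DeepEval v3.x.
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class EDDBinaryMetric(BaseMetric):
    def __init__(self, grader_fn, threshold: float = 0.5, edd_compliant: bool = True):
        # threshold is required by the DeepEval API, not a scoring system:
        # with edd_compliant=True the score is always exactly 1.0 or 0.0.
        self.grader_fn = grader_fn
        self.threshold = threshold
        self.edd_compliant = edd_compliant

    def measure(self, test_case: LLMTestCase) -> float:
        result = self.grader_fn(test_case.actual_output, {})
        self.score = 1.0 if result["pass"] else 0.0  # strictly binary
        self.reason = result.get("reason", "")
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success
```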
### Documentation

- **DeepEval Integration Guide**: Complete documentation for DeepEval setup and usage
  - Metric class structure and template usage
  - Configuration validation and testing procedures
  - Migration notes for v2.x to v3.x breaking changes
- **Atomic Generation Process**: Documented generation order and validation workflow
  - Step-by-step execution outline with rollback behavior
  - Verification checklist for both PromptFoo and DeepEval systems
  - Error handling and troubleshooting guidance

### Planned

- Web UI for goldset management and annotation queues
- Advanced statistical analysis with effect size calculations
