@@ -7,6 +7,53 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Added
+- **DeepEval Integration**: Full DeepEval support as an alternative to PromptFoo
+  - Custom metric class generation with DeepEval v3.x API (`LLMTestCase`, async `a_measure`)
+  - DeepEval-specific configuration templates and validation scripts
+  - Automatic dependency management with version compatibility checks (DeepEval >=3.0.0)
+  - System detection logic to choose PromptFoo vs DeepEval based on configuration
+- **Atomic Generation Order**: Prevent chicken-and-egg import problems
+  - Graders generated first, then tests, then config (normal Python imports)
+  - Configuration validation after generation (import checks, metric instantiation)
+  - Rollback on failure (all-or-nothing file generation)
+  - Clear error messages for missing dependencies or validation failures
+- **Version Compatibility Validation**: DeepEval v2.x vs v3.x detection
+  - Runtime version check with detailed upgrade instructions
+  - Breaking changes documentation (TestCase → LLMTestCase, context type changes)
+  - Graceful error handling with actionable fix commands
+
+### Changed
+- **Command Naming**: Renamed `trace` command to `analyze` for clarity
+  - Better alignment with goldset analysis phase terminology
+  - Consistent with bottom-up error analysis workflow
+- **Command Structure**: Evals extension now follows the same logical pattern as other extensions
+  - Standardized command interface across all extension commands
+  - Improved consistency with framework conventions
+
+### Fixed
+- **DeepEval Import Issues**: Resolved chicken-and-egg problem in generated config.py
+  - Config is now generated after graders exist (not before)
+  - Validation step ensures all imports work before commit
+  - Rollback mechanism prevents partial/broken state
+- **Version Compatibility**: Added DeepEval >=3.0.0 requirement with clear error messages
+  - Users with v2.x get upgrade instructions with a list of breaking changes
+  - Import validation catches missing v3.x API components
+- **EDD Principle II Clarity**: Added comments explaining the threshold parameter in DeepEval metrics
+  - Clarified that threshold is a DeepEval API requirement, not a scoring system
+  - Documented that `edd_compliant=True` enforces strict binary 1.0/0.0 output
+  - Function docstring explains binary-only behavior despite threshold presence
+
+### Documentation
+- **DeepEval Integration Guide**: Complete documentation for DeepEval setup and usage
+  - Metric class structure and template usage
+  - Configuration validation and testing procedures
+  - Migration notes for v2.x to v3.x breaking changes
+- **Atomic Generation Process**: Documented generation order and validation workflow
+  - Step-by-step execution outline with rollback behavior
+  - Verification checklist for both PromptFoo and DeepEval systems
+  - Error handling and troubleshooting guidance
+
 ### Planned
 - Web UI for goldset management and annotation queues
 - Advanced statistical analysis with effect size calculations
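
The "Atomic Generation Order" entry above describes writing graders first, then tests, then config, validating the result, and rolling back on failure. A minimal sketch of that pattern follows. It is not the project's implementation: the function name `generate_atomically`, the `steps` list of `(relative_path, render)` pairs, and the optional `validate` callable are all hypothetical names chosen for illustration, and the sketch assumes none of the target files exist beforehand.

```python
import tempfile
from pathlib import Path


def generate_atomically(target_dir, steps, validate=None):
    """Write files in the given order; delete them all if anything fails.

    steps:    ordered list of (relative_path, render) pairs, where render is
              a zero-argument callable returning the file's text (hypothetical
              interface for this sketch).
    validate: optional callable taking the target Path; it should raise to
              signal that the generated files are unusable.
    """
    target = Path(target_dir)
    written = []
    try:
        for rel_path, render in steps:
            path = target / rel_path
            content = render()  # may raise before anything touches disk
            path.parent.mkdir(parents=True, exist_ok=True)
            written.append(path)  # record first so a failed write is cleaned up
            path.write_text(content)
        if validate is not None:
            validate(target)  # e.g. import checks on the generated config
    except Exception:
        # Rollback: remove files created so far, newest first.
        # Directories created along the way are left in place.
        for path in reversed(written):
            path.unlink(missing_ok=True)
        raise
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        steps = [
            ("graders/check.py", lambda: "def grade(output):\n    return 1.0\n"),
            ("tests/test_check.py", lambda: "from graders.check import grade\n"),
            ("config.py", lambda: "from graders.check import grade\n"),
        ]
        files = generate_atomically(tmp, steps)
        print([p.name for p in files])
```

Listing config last means its imports of grader modules resolve against files that already exist, which is the ordering the changelog credits with removing the chicken-and-egg import problem.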