All notable changes to AgentEval are documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
0.2.0 — 2026-04-24
- Six new modules extending the evaluation surface beyond core metrics:
  - `agenteval-contracts`: contract testing for agent responses
  - `agenteval-statistics`: statistical analysis of eval runs
  - `agenteval-chaos`: chaos engineering (fault injection, resilience evaluation)
  - `agenteval-replay`: deterministic replay of captured agent interactions
  - `agenteval-mutation`: mutation testing for evaluation robustness
  - `agenteval-fingerprint`: capability fingerprinting of models under test
- Cost metrics and root-cause analysis helpers
- Smoke test coverage for `agenteval-langchain4j` (11 tests) and `agenteval-spring-ai` (12 tests)
- Chaos module tests for `LatencyInjector`, `SchemaMutationInjector`, and `ResilienceEvaluator` (22 tests)
- `AUDIT.md`: a full audit report of the library with severity-ranked findings
- Gradle root build removed; Gradle is now scoped to `agenteval-gradle-plugin` only (the module that must be Gradle-native for `publishPlugins` to the Gradle Plugin Portal). Maven is the authoritative build for all 22 other modules, so the two module lists can no longer drift
- `DatasetVersionerTest` no longer relies on `Thread.sleep` for timestamp ordering; it explicitly sets file modification times for deterministic assertions
- `agenteval-bom` now documents why build-tooling modules are intentionally omitted
- Bumped dependency versions via Dependabot: Jackson (to 2.21.x via BOM), Logback 1.5.32, Spring AI 1.1.4, LangGraph4j (latest), Mockito 5.23.0, and GitHub Actions (`actions/checkout@v6`, `actions/setup-node@v6`, `actions/upload-artifact@v7`, `actions/upload-pages-artifact@v5`, `actions/deploy-pages@v5`, `actions/stale@v10`, `dorny/test-reporter@v3`)
- Test fixtures now use neutral API key strings (`fake-key-for-tests`, `fake-ant-key-for-tests`) instead of `sk-test` / `sk-ant-test` so credential scanners do not match on shape
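The `DatasetVersionerTest` change above replaces sleep-based ordering with explicit timestamps. A minimal sketch of that general technique, using only the JDK, might look like the following. The class and helper names (`DeterministicMtimeExample`, `createWithMtime`, `isOlder`) are hypothetical and are not taken from the AgentEval test suite.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.time.Instant;

// Hypothetical illustration; not AgentEval's actual test code.
public class DeterministicMtimeExample {

    /** Creates a fresh temporary directory, wrapping the checked exception. */
    static Path newTempDir() {
        try {
            return Files.createTempDirectory("mtime-example");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a file and pins its modification time instead of sleeping between writes. */
    static Path createWithMtime(Path dir, String name, Instant mtime) {
        try {
            Path file = Files.createFile(dir.resolve(name));
            Files.setLastModifiedTime(file, FileTime.from(mtime));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns true when a was last modified strictly before b. */
    static boolean isOlder(Path a, Path b) {
        try {
            return Files.getLastModifiedTime(a).compareTo(Files.getLastModifiedTime(b)) < 0;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = newTempDir();
        Instant base = Instant.parse("2026-01-01T00:00:00Z");
        // A full minute apart, so coarse filesystem timestamp granularity cannot reorder them.
        Path v1 = createWithMtime(dir, "v1.json", base);
        Path v2 = createWithMtime(dir, "v2.json", base.plusSeconds(60));
        System.out.println("v1 older than v2: " + isOlder(v1, v2));
    }
}
```

Pinning the timestamps removes the dependency on wall-clock time and scheduler behavior, which is what makes the ordering assertion repeatable.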
- `org.byteveda.agenteval.metrics.llm.PromptTemplate` is deprecated; use `org.byteveda.agenteval.core.template.PromptTemplate` instead. Scheduled for removal in `1.0.0`.
- `SemanticSimilarityMetric.cosineSimilarity(List, List)` is deprecated; use `VectorMath.cosineSimilarity` instead. Scheduled for removal in `1.0.0`.
- MDX parsing errors in the documentation site, plus a PR build check to catch future regressions (#68, #69)
- `JunitXmlReporter` now configures `DocumentBuilderFactory` with full XXE defenses (`disallow-doctype-decl`, external entity/DTD disabling, `setXIncludeAware(false)`, `setExpandEntityReferences(false)`)
- `YamlDatasetLoader` now caps alias expansion (≤50), nesting depth (≤50), and code points (≤3 MiB) and disallows duplicate/recursive keys, as defense in depth on top of SnakeYAML 2.x's default `SafeConstructor`
- SpotBugs suppressions narrowed from broad regex patterns (`~...datasets.json.Json.*`, `~...datasets.version..*`) to explicit `<Or><Class .../></Or>` enumerations so new classes in those packages surface genuine findings instead of being blanket-suppressed
- XXE hardening in `JunitXmlReporter` (`agenteval-reporting`)
- YAML resource-exhaustion hardening in `YamlDatasetLoader` (`agenteval-datasets`)
- `.gitignore` now covers common secret patterns (`.env*`, `*.jks`, `*.keystore`, `*.p12`, `credentials.json`)
0.1.0 — 2026-03-29
Initial public release. Includes:
- Core evaluation engine: `AgentTestCase`, `EvalMetric` SPI, `EvalScore` (records normalized to `[0.0, 1.0]` with threshold validation), virtual-thread parallel evaluation with bounded concurrency, progress callbacks with ETA
- 27 evaluation metrics across five categories:
  - Response (9): `AnswerRelevancy`, `Faithfulness`, `Hallucination`, `Correctness`, `SemanticSimilarity`, `Coherence`, `Conciseness`, `Toxicity`, `Bias`
  - RAG (3): `ContextualRelevancy`, `ContextualPrecision`, `ContextualRecall`
  - Agent (9): `TaskCompletion`, `ToolSelectionAccuracy`, `ToolArgumentCorrectness`, `ToolResultUtilization`, `PlanQuality`, `PlanAdherence`, `RetrievalCompleteness`, `StepLevelErrorLocalization`, `TrajectoryOptimality`
  - Conversation (4): `ConversationCoherence`, `ContextRetention`, `TopicDriftDetection`, `ConversationResolution`
  - Utility (2): `CostNormalized`, `LatencyNormalized`
- 7 LLM-as-judge providers: OpenAI, Anthropic, Ollama, Google Gemini, Azure OpenAI, Amazon Bedrock (SigV4-signed), and a Custom HTTP provider compatible with vLLM / LiteLLM / LocalAI
- Multi-model judge consensus: `ConsensusStrategy` with `MAJORITY`, `AVERAGE`, `WEIGHTED_AVERAGE`, and `UNANIMOUS` modes; virtual-thread fan-out
- JUnit 5 integration: `@AgentTest`, `@Metric`, custom assertions
- Datasets: JSON / JSONL / CSV / YAML loaders, synthetic dataset generation (from-documents, variations, adversarial), golden-set versioning with git metadata via `DatasetVersioner`
- Reporting: Console, JUnit XML, JSON, HTML (single-file, self-contained), Markdown (GFM), snapshot testing with baseline/compare/update modes, regression comparison, benchmark mode with variant comparison
- Framework integrations: Spring AI (auto-configuration + advisor interceptor), LangChain4j (chat model capture + content retriever capture), LangGraph4j (graph execution capture), MCP (tool call capture)
- Red teaming: `RedTeamSuite`, `AttackTemplateLibrary` with 20 attack templates, `AttackEvaluator`
- Build plugins: Maven plugin (`EvaluateMojo` at the `verify` phase), Gradle plugin (published to Gradle Plugin Portal), GitHub Actions composite action with PR commenter (marker-based update)
- IntelliJ IDEA plugin: `AgentEvalToolWindow`, metric gutter icons, live report file watcher
- Configuration: programmatic (`AgentEvalConfig.builder()`) and file-based (`agenteval.yaml`); environment variables `AGENTEVAL_JUDGE_PROVIDER`, `AGENTEVAL_JUDGE_MODEL`
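To make the four consensus modes listed under multi-model judge consensus more concrete, here is a rough sketch of how such modes could combine per-judge scores into a pass/fail verdict. The semantics shown (for example, `MAJORITY` meaning more than half of the judges meet the threshold) are assumptions, and `ConsensusSketch`, `Strategy`, `JudgeScore`, and `passes` are hypothetical names, not AgentEval's `ConsensusStrategy` API.

```java
import java.util.List;

// Hypothetical illustration; the real ConsensusStrategy may define these modes differently.
public class ConsensusSketch {

    enum Strategy { MAJORITY, AVERAGE, WEIGHTED_AVERAGE, UNANIMOUS }

    /** One judge's score in [0.0, 1.0] plus a non-negative weight. */
    record JudgeScore(double value, double weight) {}

    /** Combines judge scores under the given strategy and compares against the threshold. */
    static boolean passes(Strategy strategy, List<JudgeScore> scores, double threshold) {
        if (scores.isEmpty()) {
            throw new IllegalArgumentException("at least one judge score is required");
        }
        return switch (strategy) {
            // More than half of the judges individually meet the threshold.
            case MAJORITY -> scores.stream()
                    .filter(j -> j.value() >= threshold).count() * 2 > scores.size();
            // Every judge individually meets the threshold.
            case UNANIMOUS -> scores.stream().allMatch(j -> j.value() >= threshold);
            // The unweighted mean meets the threshold.
            case AVERAGE -> scores.stream()
                    .mapToDouble(JudgeScore::value).average().orElseThrow() >= threshold;
            // The weight-normalized mean meets the threshold.
            case WEIGHTED_AVERAGE -> {
                double totalWeight = scores.stream().mapToDouble(JudgeScore::weight).sum();
                if (totalWeight <= 0) {
                    throw new IllegalArgumentException("total weight must be positive");
                }
                double weighted = scores.stream()
                        .mapToDouble(j -> j.value() * j.weight()).sum();
                yield weighted / totalWeight >= threshold;
            }
        };
    }

    public static void main(String[] args) {
        List<JudgeScore> scores = List.of(
                new JudgeScore(0.9, 1.0), new JudgeScore(0.8, 1.0), new JudgeScore(0.2, 3.0));
        for (Strategy s : Strategy.values()) {
            System.out.println(s + " -> " + passes(s, scores, 0.7));
        }
    }
}
```

The example input shows why the choice of mode matters: two of three judges pass, so a majority rule accepts the result, while a heavily weighted dissenting judge pulls the averaged modes below the threshold.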