feat: Add AI online evaluations (Judge, Evaluator, IRunner)#301
mattrmc1 wants to merge 7 commits into
…rthy/AIC-2660/aievals-core-judge
…pling normalization
Cursor Bugbot has reviewed your changes using default effort and found 3 potential issues.
There are 6 total unresolved issues (including 3 from previous reviews).
Reviewed by Cursor Bugbot for commit aa959ce.
{
    var defaultValue = LdAiJudgeConfigDefault.Disabled;
    var ldValue = _client.JsonVariation(judgeEntry.Key, context, defaultValue.ToLdValue());
    var judgeConfig = BuildJudgeConfig(judgeEntry.Key, ldValue, context, defaultValue, null);
Judge prompts omit caller variables
High Severity
When building an Evaluator from a completion or agent config, each judge is loaded via BuildJudgeConfig with variables set to null, so only LaunchDarkly context (ldctx) is merged into judge messages. Caller-supplied prompt variables passed into CompletionConfig / AgentConfig are not applied to embedded judge configs, unlike a direct JudgeConfig call.
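One way to picture the missing step is a helper that merges the caller's prompt variables with the LaunchDarkly context before judge messages are interpolated. This is a minimal sketch, not the SDK's code: `JudgeVariableSketch` and `MergeVariables` are hypothetical names, and the choice to let `ldctx` take precedence over a caller variable of the same name is an assumption.

```csharp
using System.Collections.Generic;

public static class JudgeVariableSketch
{
    // Hypothetical helper (not part of the SDK): combines caller-supplied
    // prompt variables with the context entry so judge messages see both.
    public static Dictionary<string, object> MergeVariables(
        IReadOnlyDictionary<string, object> callerVariables,
        object ldContext)
    {
        var merged = new Dictionary<string, object>();

        // Passing null here reproduces the reported behavior:
        // only the context entry ends up in the result.
        if (callerVariables != null)
        {
            foreach (var pair in callerVariables)
            {
                merged[pair.Key] = pair.Value;
            }
        }

        // Assumption: the context is written last, so a caller variable
        // that happens to be named "ldctx" cannot shadow it.
        merged["ldctx"] = ldContext;
        return merged;
    }
}
```

Under this sketch, forwarding the caller's variables instead of `null` would make embedded judge prompts behave like a direct JudgeConfig call.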
…judge config, debug log for disabled judges


Summary
Adds AI online evaluations to LaunchDarkly.ServerSdk.Ai. A caller that supplies a runnerFactory to LdAiClient gets automatic judge evaluation wired into every CompletionConfig and AgentConfig: each returned config carries an Evaluator that can score model output against the judges declared in the flag's judgeConfiguration. When no runnerFactory is provided (or no judges are configured), configs receive a noop Evaluator so callers never need null checks.

Implements the AIEVALS and AIRUNNER specs (sections 1.1–1.4). createJudge (AIEVALS 1.2) is intentionally omitted per .NET/Java convention: judges are created internally by the SDK, not by user code.

New public types

LdAiClient changes

When runnerFactory is non-null, ConfigFactory.BuildEvaluator iterates the flag's judgeConfiguration, evaluates each judge key as a flag variation, creates a Judge + IRunner pair per enabled judge, and attaches the resulting Evaluator to the config. Disabled judges, null runners, and initialization exceptions are logged and skipped, so no single judge failure prevents the others from being built.

LdAiConfig base class

All config types (LdAiCompletionConfig, LdAiAgentConfig, LdAiJudgeConfig) now carry an Evaluator property via the base class. LdAiJudgeConfig always receives Evaluator.Noop() (judges don't evaluate themselves).

JudgeResult changes

Updated to match AIEVALS 1.3.1 defaults:

- MetricKey: now optional (default null)
- Score: now optional (default 0.0)
- Sampled: default changed from true to false
- Success: default changed from true to false
- ErrorMessage: new field
- Reasoning: new field

Null safety (ref: java-core#175 discussion)

BuildEvaluator handles every failure mode without throwing:

- runnerFactory == null or empty judgeConfiguration → Evaluator.Noop()
- Null runner → warn + skip judge
- Missing evaluationMetricKey → JudgeResult(success: false, errorMessage: ...)
- Score outside the [0, 1] range → JudgeResult(success: false, errorMessage: ...)
- samplingRate of NaN/Infinity/negative/> 1.0 → normalized to safe bounds

The Judge constructor still uses ArgumentNullException guards, but these are never hit in practice because BuildEvaluator validates inputs before construction.

Migration

None required. The LdAiClient constructor gains an optional runnerFactory parameter (default null), so existing callers are unaffected. JudgeResult default changes are source-compatible (all parameters are now optional with safe defaults). No members removed or renamed.

Test plan

- dotnet test pkgs/sdk/server-ai/test/LaunchDarkly.ServerSdk.Ai.Tests.csproj --framework net8.0 passes
- JudgeTest (435 lines) covers: successful evaluation with score/reasoning extraction, sampling skip path, samplingRate edge cases (NaN, negative, > 1.0), runner exception handling, missing evaluationMetricKey validation, out-of-range score rejection with errorMessage, evaluateMessages formatting, null/empty message handling
- EvaluatorTest (210 lines) covers: noop returns empty list, multi-judge execution with per-judge sampling, missing judge key logs warning and skips, noop does not log warnings
- LdAiCompletionConfigTest (189 lines added) covers: Evaluator attached when runnerFactory provided, noop Evaluator when no runner factory, noop when judgeConfiguration is empty, disabled judge skipped, null runner skipped
- LdAiJudgeConfigTest covers: JudgeResult ErrorMessage and Reasoning fields, judge config always receives noop Evaluator
- LdAiConfigTrackerTest covers: TrackJudgeResult with new optional fields

Note
Medium Risk
New public API and optional extra LLM calls per request when runnerFactory is set; behavior is backward compatible via optional constructor arg and noop evaluators.

Overview

Adds online judge evaluation to the server AI SDK: optional runnerFactory on LdAiClient wires an Evaluator onto completion and agent configs (including default/fallback build paths). Without a factory or judges, configs get Evaluator.Noop() so Evaluator is always non-null.

Introduces IRunner, RunnerResult, Judge (structured score/reasoning via JSON schema, sampling, metrics wrapping), and Evaluator (runs configured judges; does not emit TrackJudgeResult). ConfigFactory.BuildEvaluator resolves each judge key as a flag variation, skips disabled/null-runner/failed inits, and filters the judge list to built judges only.

JudgeResult gains ErrorMessage and Reasoning, with safer defaults (Sampled/Success now default false). Large test coverage for judge/evaluator behavior and client wiring.

Reviewed by Cursor Bugbot for commit 66e2c31. Bugbot is set up for automated code reviews on this repo.
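The skip-and-continue behavior described for ConfigFactory.BuildEvaluator can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `ISketchRunner`, `SketchRunner`, `BuildEvaluatorSketch`, and `BuildJudges` are hypothetical stand-ins, the flag evaluation is reduced to an `isEnabled` callback, and the result is just the list of judge keys that were built.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the SDK's runner abstraction.
public interface ISketchRunner { }
public sealed class SketchRunner : ISketchRunner { }

public static class BuildEvaluatorSketch
{
    // Returns the keys of judges that were built successfully.
    // An empty list plays the role of the noop evaluator.
    public static List<string> BuildJudges(
        IEnumerable<string> judgeKeys,
        Func<string, bool> isEnabled,
        Func<string, ISketchRunner> runnerFactory,
        Action<string> log)
    {
        var built = new List<string>();
        if (runnerFactory == null || judgeKeys == null)
        {
            return built; // no factory or no judges: nothing to build
        }

        foreach (var key in judgeKeys)
        {
            try
            {
                if (!isEnabled(key))
                {
                    log($"judge '{key}' is disabled; skipping");
                    continue;
                }

                var runner = runnerFactory(key);
                if (runner == null)
                {
                    log($"runner factory returned null for judge '{key}'; skipping");
                    continue;
                }

                built.Add(key);
            }
            catch (Exception ex)
            {
                // One failing judge must not prevent the others from being built.
                log($"failed to initialize judge '{key}': {ex.Message}");
            }
        }

        return built;
    }
}
```

Wrapping each iteration in its own try/catch, rather than the whole loop, is what keeps a single bad judge from discarding the judges that come after it.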