feat: Add AI online evaluations (Judge, Evaluator, IRunner) by mattrmc1 · Pull Request #301 · launchdarkly/dotnet-core

mattrmc1 · 2026-06-30T18:28:25Z

Summary

Adds AI online evaluations to LaunchDarkly.ServerSdk.Ai. A caller that supplies a runnerFactory to LdAiClient gets automatic judge evaluation wired into every CompletionConfig and AgentConfig — each returned config carries an Evaluator that can score model output against the judges declared in the flag's judgeConfiguration. When no runnerFactory is provided (or no judges are configured), configs receive a noop Evaluator so callers never need null checks.

Implements the AIEVALS and AIRUNNER specs (sections 1.1–1.4). createJudge (AIEVALS 1.2) is intentionally omitted per .NET/Java convention — judges are created internally by the SDK, not by user code.

New public types

// Provider-facing runner interface (AIRUNNER 1.2)
public interface IRunner
{
    Task<RunnerResult> RunAsync(string input,
        IReadOnlyDictionary<string, object> outputType = null);
}

// Runner return type (AIRUNNER 1.3)
public sealed record RunnerResult(
    string Content,
    AiMetrics Metrics,
    object Raw = null,
    IReadOnlyDictionary<string, object> Parsed = null);

// Evaluation orchestrator (AIEVALS 1.4)
public sealed class Evaluator
{
    public static Evaluator Noop();
    public Task<IReadOnlyList<JudgeResult>> EvaluateAsync(string input, string output);
}

// Single-judge executor (AIEVALS 1.1)
public sealed class Judge
{
    public LdAiJudgeConfig Config { get; }
    public IRunner Runner { get; }
    public Task<JudgeResult> EvaluateAsync(string input, string output, double? samplingRate = null);
    public Task<JudgeResult> EvaluateMessagesAsync(
        IReadOnlyList<LdAiConfigTypes.Message> messages,
        RunnerResult runnerResult, double? samplingRate = null);
}
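
As an illustration of the provider-facing contract, a minimal IRunner implementation might look like the sketch below. The IRunner and RunnerResult declarations are copied from the signatures above so the snippet compiles on its own. AiMetrics is a placeholder stand-in (its real members are not shown in this description), and CannedRunner with its fixed response is hypothetical.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Placeholder for the SDK's AiMetrics type; its real members are not shown here.
public sealed record AiMetrics(bool Success);

// Copied from the signatures listed above so that the sketch is self-contained.
public sealed record RunnerResult(
    string Content,
    AiMetrics Metrics,
    object Raw = null,
    IReadOnlyDictionary<string, object> Parsed = null);

public interface IRunner
{
    Task<RunnerResult> RunAsync(string input,
        IReadOnlyDictionary<string, object> outputType = null);
}

// Hypothetical runner that returns a fixed response instead of calling a model provider.
public sealed class CannedRunner : IRunner
{
    private readonly string _content;

    public CannedRunner(string content) => _content = content;

    public Task<RunnerResult> RunAsync(string input,
        IReadOnlyDictionary<string, object> outputType = null)
    {
        // A real runner would send `input` to a provider and take `outputType`
        // into account; this stub ignores both and echoes the canned content.
        var result = new RunnerResult(_content, new AiMetrics(Success: true));
        return Task.FromResult(result);
    }
}
```

A stub like this is mainly useful in tests, where a judge needs deterministic model output.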

`LdAiClient` changes

// New optional parameter
public LdAiClient(ILaunchDarklyClient client,
    Func<LdAiJudgeConfig, IRunner> runnerFactory = null);

When runnerFactory is non-null, ConfigFactory.BuildEvaluator iterates the flag's judgeConfiguration, evaluates each judge key as a flag variation, creates a Judge + IRunner pair per enabled judge, and attaches the resulting Evaluator to the config. Disabled judges, null runners, and initialization exceptions are logged and skipped — no single judge failure prevents the others from being built.
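
The skip-and-continue behavior described above can be sketched roughly as follows. This is not the SDK's ConfigFactory.BuildEvaluator: all types here (JudgeConfigStub, IRunnerStub, NoopRunnerStub, JudgeStub, EvaluatorBuilderSketch) are hypothetical stand-ins, the flag evaluation step is omitted, and logging is reduced to a callback.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins; the SDK's real config, runner, and judge types carry more state.
public sealed record JudgeConfigStub(string Key, bool Enabled);
public interface IRunnerStub { }
public sealed class NoopRunnerStub : IRunnerStub { }
public sealed record JudgeStub(JudgeConfigStub Config, IRunnerStub Runner);

public static class EvaluatorBuilderSketch
{
    // Builds one judge per enabled config; problems are reported through `log` and skipped.
    public static IReadOnlyList<JudgeStub> BuildJudges(
        IEnumerable<JudgeConfigStub> judgeConfigs,
        Func<JudgeConfigStub, IRunnerStub> runnerFactory,
        Action<string> log)
    {
        var judges = new List<JudgeStub>();
        if (runnerFactory == null || judgeConfigs == null)
        {
            return judges; // corresponds to the Evaluator.Noop() case
        }

        foreach (var config in judgeConfigs)
        {
            try
            {
                if (!config.Enabled)
                {
                    log($"judge '{config.Key}' is disabled; skipping");
                    continue;
                }

                var runner = runnerFactory(config);
                if (runner == null)
                {
                    log($"runner factory returned null for judge '{config.Key}'; skipping");
                    continue;
                }

                judges.Add(new JudgeStub(config, runner));
            }
            catch (Exception e)
            {
                // One failing judge must not prevent the others from being built.
                log($"failed to initialize judge '{config?.Key}': {e.Message}");
            }
        }

        return judges;
    }
}
```

Wrapping each iteration in its own try/catch is what keeps a throwing factory from aborting the whole loop.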

`LdAiConfig` base class

All config types (LdAiCompletionConfig, LdAiAgentConfig, LdAiJudgeConfig) now carry an Evaluator property via the base class. LdAiJudgeConfig always receives Evaluator.Noop() (judges don't evaluate themselves).
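
To show why a noop Evaluator removes the need for null checks, here is a minimal null-object sketch. EvaluatorSketch and JudgeResultStub are hypothetical stand-ins rather than the SDK's types, judges are modeled as plain delegates, and running them sequentially is an assumption; the description does not say how the real Evaluator schedules its judges.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Placeholder result type; the SDK's JudgeResult has more fields.
public sealed record JudgeResultStub(string MetricKey, double Score);

public sealed class EvaluatorSketch
{
    private readonly IReadOnlyList<Func<string, string, Task<JudgeResultStub>>> _judges;

    private EvaluatorSketch(IReadOnlyList<Func<string, string, Task<JudgeResultStub>>> judges)
        => _judges = judges;

    // Null-object instance: evaluating it does no work and yields an empty list.
    public static EvaluatorSketch Noop()
        => new EvaluatorSketch(Array.Empty<Func<string, string, Task<JudgeResultStub>>>());

    // Hypothetical factory for an evaluator backed by one or more judge delegates.
    public static EvaluatorSketch From(params Func<string, string, Task<JudgeResultStub>>[] judges)
        => new EvaluatorSketch(judges);

    public async Task<IReadOnlyList<JudgeResultStub>> EvaluateAsync(string input, string output)
    {
        // Assumption: judges run one after another, in declaration order.
        var results = new List<JudgeResultStub>(_judges.Count);
        foreach (var judge in _judges)
        {
            results.Add(await judge(input, output));
        }
        return results;
    }
}
```

With this shape, calling code can always await EvaluateAsync on the config's evaluator and iterate the result, whether or not any judges were configured.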

`JudgeResult` changes

Updated to match AIEVALS 1.3.1 defaults:

| Field | Before | After |
| --- | --- | --- |
| `MetricKey` | required | optional (default `null`) |
| `Score` | required | optional (default `0.0`) |
| `Sampled` | default `true` | default `false` |
| `Success` | default `true` | default `false` |
| `ErrorMessage` | (none) | new field |
| `Reasoning` | (none) | new field |
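
For readers who want to see the new defaults in code form, the sketch below mirrors the "After" column. JudgeResultSketch is a hypothetical stand-in: the parameter order, the property types, and whether the real JudgeResult is a class or a record are all assumptions.

```csharp
// Hypothetical mirror of the "After" defaults; not the SDK's JudgeResult.
public sealed class JudgeResultSketch
{
    public string MetricKey { get; }
    public double Score { get; }
    public bool Sampled { get; }
    public bool Success { get; }
    public string ErrorMessage { get; }
    public string Reasoning { get; }

    // Every parameter is optional, so a failure result can be built from
    // only the fields that matter, e.g. success and errorMessage.
    public JudgeResultSketch(
        string metricKey = null,
        double score = 0.0,
        bool sampled = false,
        bool success = false,
        string errorMessage = null,
        string reasoning = null)
    {
        MetricKey = metricKey;
        Score = score;
        Sampled = sampled;
        Success = success;
        ErrorMessage = errorMessage;
        Reasoning = reasoning;
    }
}
```

Defaulting Sampled and Success to false means a result that was never filled in reads as "not sampled, not successful" rather than as a passing evaluation.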

Null safety (ref: java-core#175 discussion)

BuildEvaluator handles every failure mode without throwing:

runnerFactory == null or empty judgeConfiguration → Evaluator.Noop()
Runner factory returns null → warn + skip judge
Judge config disabled → warn + skip judge
Any exception during judge init → warn + skip judge
Missing evaluationMetricKey → JudgeResult(success: false, errorMessage: ...)
Score out of [0, 1] range → JudgeResult(success: false, errorMessage: ...)
Sampling rate NaN/Infinity/negative/> 1.0 → normalized to safe bounds

The Judge constructor still guards its arguments with ArgumentNullException, but those guards should not be hit in practice because BuildEvaluator validates inputs before construction.
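
The last two entries in the list above (score range and sampling rate) could be handled along these lines. The description does not spell out what "safe bounds" means, so the mapping used here is an assumption: NaN becomes 0.0 and every other value is clamped into [0, 1]. JudgeGuardsSketch and both method names are hypothetical.

```csharp
using System;

public static class JudgeGuardsSketch
{
    // Assumed mapping: NaN -> 0.0; all other values (including infinities)
    // are clamped into the closed range [0, 1].
    public static double NormalizeSamplingRate(double rate)
    {
        if (double.IsNaN(rate))
        {
            return 0.0;
        }
        return Math.Min(Math.Max(rate, 0.0), 1.0);
    }

    // Returns null when the score is acceptable, otherwise an error message
    // suitable for a failed result's errorMessage field.
    public static string ValidateScore(double score)
    {
        if (double.IsNaN(score) || score < 0.0 || score > 1.0)
        {
            return $"score {score} is outside the range [0, 1]";
        }
        return null;
    }
}
```

Returning a message instead of throwing matches the stated goal that these failure modes produce a failed JudgeResult rather than an exception.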

Migration

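None required. The LdAiClient constructor gains an optional runnerFactory parameter (default null), so existing callers are unaffected. The JudgeResult default changes are source-compatible (all parameters are now optional with safe defaults), although code that relied on the previous defaults of true for Sampled and Success will now get false unless it sets them explicitly. No members were removed or renamed.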

Test plan

dotnet test pkgs/sdk/server-ai/test/LaunchDarkly.ServerSdk.Ai.Tests.csproj --framework net8.0 passes
JudgeTest (435 lines) covers: successful evaluation with score/reasoning extraction, sampling skip path, samplingRate edge cases (NaN, negative, > 1.0), runner exception handling, missing evaluationMetricKey validation, out-of-range score rejection with errorMessage, evaluateMessages formatting, null/empty message handling
EvaluatorTest (210 lines) covers: noop returns empty list, multi-judge execution with per-judge sampling, missing judge key logs warning and skips, noop does not log warnings
LdAiCompletionConfigTest (189 lines added) covers: Evaluator attached when runnerFactory provided, noop Evaluator when no runner factory, noop when judgeConfiguration is empty, disabled judge skipped, null runner skipped
LdAiJudgeConfigTest covers: JudgeResult ErrorMessage and Reasoning fields, judge config always receives noop Evaluator
LdAiConfigTrackerTest covers: TrackJudgeResult with new optional fields

Note

Medium Risk
New public API and optional extra LLM calls per request when runnerFactory is set; behavior is backward compatible via optional constructor arg and noop evaluators.

Overview
Adds online judge evaluation to the server AI SDK: optional runnerFactory on LdAiClient wires an Evaluator onto completion and agent configs (including default/fallback build paths). Without a factory or judges, configs get Evaluator.Noop() so Evaluator is always non-null.

Introduces IRunner, RunnerResult, Judge (structured score/reasoning via JSON schema, sampling, metrics wrapping), and Evaluator (runs configured judges; does not emit TrackJudgeResult). ConfigFactory.BuildEvaluator resolves each judge key as a flag variation, skips disabled/null-runner/failed inits, and filters the judge list to built judges only.

JudgeResult gains ErrorMessage and Reasoning, with safer defaults (Sampled/Success now default false). Adds extensive test coverage for judge/evaluator behavior and client wiring.

Reviewed by Cursor Bugbot for commit 66e2c31.


cursor

Cursor Bugbot has reviewed your changes using default effort and found 3 potential issues.

There are 6 total unresolved issues (including 3 from previous reviews).


Reviewed by Cursor Bugbot for commit aa959ce.

cursor · 2026-06-30T21:25:44Z

+            {
+                var defaultValue = LdAiJudgeConfigDefault.Disabled;
+                var ldValue = _client.JsonVariation(judgeEntry.Key, context, defaultValue.ToLdValue());
+                var judgeConfig = BuildJudgeConfig(judgeEntry.Key, ldValue, context, defaultValue, null);


Judge prompts omit caller variables

High Severity

When building an Evaluator from a completion or agent config, each judge is loaded via BuildJudgeConfig with variables set to null, so only LaunchDarkly context (ldctx) is merged into judge messages. Caller-supplied prompt variables passed into CompletionConfig / AgentConfig are not applied to embedded judge configs, unlike a direct JudgeConfig call.

Reviewed by Cursor Bugbot for commit aa959ce.


mattrmc1 added 3 commits June 15, 2026 11:21

[AIC-2660] Implement AIEVALS (first pass)

a625933

Merge branch 'main' of github.com:launchdarkly/dotnet-core into mmcca…

db62ad3

…rthy/AIC-2660/aievals-core-judge

fix: AIEVALS spec compliance — JudgeResult defaults, null safety, sam…

a680fb4

…pling normalization

mattrmc1 changed the title ~~Mmccarthy/aic 2660/aievals core judge~~ feat: Add AI online evaluations (Judge, Evaluator, IRunner) Jun 30, 2026

fix: set errorMessage on out-of-range score for JS/Python parity

f920e9d

mattrmc1 marked this pull request as ready for review June 30, 2026 20:23

mattrmc1 requested a review from a team as a code owner June 30, 2026 20:23

cursor Bot reviewed Jun 30, 2026


Comment thread pkgs/sdk/server-ai/src/Config/ConfigFactory.cs Outdated

Comment thread pkgs/sdk/server-ai/src/Evals/Judge.cs

Comment thread pkgs/sdk/server-ai/src/Evals/Evaluator.cs

mattrmc1 requested review from jsonbailey and tanderson-ld June 30, 2026 20:47

Merge branch 'main' into mmccarthy/AIC-2660/aievals-core-judge

aa959ce

cursor Bot reviewed Jun 30, 2026


mattrmc1 and others added 2 commits June 30, 2026 16:39

Merge branch 'main' into mmccarthy/AIC-2660/aievals-core-judge

b814097

fix: robust judge score parsing, fallback evaluator wiring, filtered …

66e2c31

…judge config, debug log for disabled judges


mattrmc1 wants to merge 7 commits into main from mmccarthy/AIC-2660/aievals-core-judge
