feat: Add AI online evaluations (Judge, Evaluator, IRunner) #301

Open
mattrmc1 wants to merge 7 commits into main from mmccarthy/AIC-2660/aievals-core-judge
@mattrmc1 mattrmc1 commented Jun 30, 2026

Contributor

Summary

Adds AI online evaluations to LaunchDarkly.ServerSdk.Ai. A caller that supplies a runnerFactory to LdAiClient gets automatic judge evaluation wired into every CompletionConfig and AgentConfig — each returned config carries an Evaluator that can score model output against the judges declared in the flag's judgeConfiguration. When no runnerFactory is provided (or no judges are configured), configs receive a noop Evaluator so callers never need null checks.

Implements the AIEVALS and AIRUNNER specs (sections 1.1–1.4). createJudge (AIEVALS 1.2) is intentionally omitted per .NET/Java convention — judges are created internally by the SDK, not by user code.

New public types

// Provider-facing runner interface (AIRUNNER 1.2)
public interface IRunner
{
    Task<RunnerResult> RunAsync(string input,
        IReadOnlyDictionary<string, object> outputType = null);
}

// Runner return type (AIRUNNER 1.3)
public sealed record RunnerResult(
    string Content,
    AiMetrics Metrics,
    object Raw = null,
    IReadOnlyDictionary<string, object> Parsed = null);

// Evaluation orchestrator (AIEVALS 1.4)
public sealed class Evaluator
{
    public static Evaluator Noop();
    public Task<IReadOnlyList<JudgeResult>> EvaluateAsync(string input, string output);
}

// Single-judge executor (AIEVALS 1.1)
public sealed class Judge
{
    public LdAiJudgeConfig Config { get; }
    public IRunner Runner { get; }
    public Task<JudgeResult> EvaluateAsync(string input, string output, double? samplingRate = null);
    public Task<JudgeResult> EvaluateMessagesAsync(
        IReadOnlyList<LdAiConfigTypes.Message> messages,
        RunnerResult runnerResult, double? samplingRate = null);
}
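
As an illustration of the provider-facing side, here is a minimal sketch of a test-double runner. It is not part of this PR. The types in `RunnerSketch` are local stand-ins that copy the signatures listed above so the block compiles on its own. `FixedScoreRunner` is a hypothetical name, the empty `AiMetrics` is a placeholder for the real type, and the `score` and `reasoning` keys in `Parsed` are assumptions about what a judge reads.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

namespace RunnerSketch
{
    // Placeholder: the members of the SDK's AiMetrics are not shown in this PR.
    public sealed class AiMetrics { }

    // Stand-in that mirrors the RunnerResult signature listed above.
    public sealed class RunnerResult
    {
        public RunnerResult(string content, AiMetrics metrics,
            object raw = null, IReadOnlyDictionary<string, object> parsed = null)
        {
            Content = content;
            Metrics = metrics;
            Raw = raw;
            Parsed = parsed;
        }

        public string Content { get; }
        public AiMetrics Metrics { get; }
        public object Raw { get; }
        public IReadOnlyDictionary<string, object> Parsed { get; }
    }

    // Stand-in that mirrors the IRunner signature listed above.
    public interface IRunner
    {
        Task<RunnerResult> RunAsync(string input,
            IReadOnlyDictionary<string, object> outputType = null);
    }

    // Hypothetical runner that returns a canned score instead of calling a model.
    public sealed class FixedScoreRunner : IRunner
    {
        private readonly double _score;

        public FixedScoreRunner(double score)
        {
            _score = score;
        }

        public Task<RunnerResult> RunAsync(string input,
            IReadOnlyDictionary<string, object> outputType = null)
        {
            // Assumed keys: the PR says judges extract a score and reasoning,
            // but does not name the fields.
            var parsed = new Dictionary<string, object>
            {
                ["score"] = _score,
                ["reasoning"] = "canned response"
            };
            return Task.FromResult(
                new RunnerResult("{}", new AiMetrics(), null, parsed));
        }
    }
}
```

A runner like this would let judge behavior be exercised in unit tests without network calls. A real provider runner would call its model and fill `Metrics` from the response.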

LdAiClient changes

// New optional parameter
public LdAiClient(ILaunchDarklyClient client,
    Func<LdAiJudgeConfig, IRunner> runnerFactory = null);

When runnerFactory is non-null, ConfigFactory.BuildEvaluator iterates the flag's judgeConfiguration, evaluates each judge key as a flag variation, creates a Judge + IRunner pair per enabled judge, and attaches the resulting Evaluator to the config. Disabled judges, null runners, and initialization exceptions are logged and skipped — no single judge failure prevents the others from being built.
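
The skip-and-continue behavior described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in ConfigFactory.BuildEvaluator. `JudgeBuilderSketch` and `BuildAll` are hypothetical names, and flag evaluation, runner creation, and logging are reduced to delegates supplied by the caller.

```csharp
using System;
using System.Collections.Generic;

public static class JudgeBuilderSketch
{
    // Builds one judge per config. A disabled config, a null build result,
    // or an exception causes a warning and a skip, never a failure of the
    // whole loop.
    public static List<TJudge> BuildAll<TConfig, TJudge>(
        IEnumerable<TConfig> configs,
        Func<TConfig, bool> isEnabled,
        Func<TConfig, TJudge> build,
        Action<string> warn) where TJudge : class
    {
        var built = new List<TJudge>();
        foreach (var config in configs)
        {
            try
            {
                if (!isEnabled(config))
                {
                    warn("judge disabled, skipping");
                    continue;
                }

                var judge = build(config);
                if (judge == null)
                {
                    warn("runner factory returned null, skipping");
                    continue;
                }

                built.Add(judge);
            }
            catch (Exception e)
            {
                // One failing judge must not prevent the others from building.
                warn("judge initialization failed, skipping: " + e.Message);
            }
        }
        return built;
    }
}
```

Wrapping the whole loop body in the try block means an exception from any step is handled the same way as a disabled or null result: logged, then skipped.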

LdAiConfig base class

All config types (LdAiCompletionConfig, LdAiAgentConfig, LdAiJudgeConfig) now carry an Evaluator property via the base class. LdAiJudgeConfig always receives Evaluator.Noop() (judges don't evaluate themselves).
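
The noop evaluator is an instance of the null-object pattern. A minimal sketch of the idea follows, with `SketchEvaluator` as a hypothetical stand-in for the SDK's Evaluator. Judges are modeled as plain delegates that return a score, and sharing a single noop instance is an assumption rather than something this PR states.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

namespace NoopSketch
{
    public sealed class SketchEvaluator
    {
        // Assumption: one shared instance with an empty judge list.
        private static readonly SketchEvaluator NoopInstance =
            new SketchEvaluator(new List<Func<string, string, Task<double>>>());

        private readonly IReadOnlyList<Func<string, string, Task<double>>> _judges;

        public SketchEvaluator(IReadOnlyList<Func<string, string, Task<double>>> judges)
        {
            _judges = judges ?? throw new ArgumentNullException(nameof(judges));
        }

        public static SketchEvaluator Noop()
        {
            return NoopInstance;
        }

        // Runs every judge in order. With no judges this returns an empty
        // list, so callers can invoke it unconditionally.
        public async Task<IReadOnlyList<double>> EvaluateAsync(string input, string output)
        {
            var results = new List<double>();
            foreach (var judge in _judges)
            {
                results.Add(await judge(input, output));
            }
            return results;
        }
    }
}
```

Because the noop instance is an ordinary evaluator with nothing to run, calling code has one path instead of a null check plus a call.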

JudgeResult changes

Updated to match AIEVALS 1.3.1 defaults:

| Field | Before | After |
| --- | --- | --- |
| MetricKey | required | optional (default null) |
| Score | required | optional (default 0.0) |
| Sampled | default true | default false |
| Success | default true | default false |
| ErrorMessage | n/a | new field |
| Reasoning | n/a | new field |

Null safety (ref: java-core#175 discussion)

BuildEvaluator handles every failure mode without throwing:

  • runnerFactory == null or empty judgeConfiguration → Evaluator.Noop()
  • Runner factory returns null → warn + skip judge
  • Judge config disabled → warn + skip judge
  • Any exception during judge init → warn + skip judge
  • Missing evaluationMetricKey → JudgeResult(success: false, errorMessage: ...)
  • Score out of [0, 1] range → JudgeResult(success: false, errorMessage: ...)
  • Sampling rate NaN/Infinity/negative/> 1.0 → normalized to safe bounds
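
The last two bullets could be sketched as below. This is a minimal illustration, not the SDK's implementation: `EvalGuards`, `NormalizeSamplingRate`, and `IsValidScore` are hypothetical names. The PR says only that bad sampling rates are "normalized to safe bounds", so treating a missing rate as 1.0 and NaN as 0.0 is an assumption.

```csharp
using System;

public static class EvalGuards
{
    // Clamps a sampling rate into [0, 1].
    // Assumptions: null means "always sample" and NaN means "never sample".
    public static double NormalizeSamplingRate(double? rate)
    {
        if (rate == null) return 1.0;

        var value = rate.Value;
        if (double.IsNaN(value)) return 0.0;
        if (value < 0.0) return 0.0; // also covers negative infinity
        if (value > 1.0) return 1.0; // also covers positive infinity
        return value;
    }

    // A score is accepted only when it is a real number in [0, 1].
    public static bool IsValidScore(double score)
    {
        return !double.IsNaN(score) && score >= 0.0 && score <= 1.0;
    }
}
```

The NaN check comes first because every ordered comparison with NaN is false, so NaN would otherwise fall through both range checks and be returned unchanged.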

The Judge constructor still guards its arguments with ArgumentNullException, but these guards are never hit in practice because BuildEvaluator validates inputs before construction.

Migration

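None required to compile. The LdAiClient constructor gains an optional runnerFactory parameter (default null), so existing call sites build unchanged. The JudgeResult changes are source-compatible because every parameter is now optional, but two defaults flip: code that constructed a JudgeResult without setting Sampled or Success previously got true and now gets false, so callers relying on the old defaults should pass those values explicitly. No members removed or renamed.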

Test plan

  • dotnet test pkgs/sdk/server-ai/test/LaunchDarkly.ServerSdk.Ai.Tests.csproj --framework net8.0 passes
  • JudgeTest (435 lines) covers: successful evaluation with score/reasoning extraction, sampling skip path, samplingRate edge cases (NaN, negative, > 1.0), runner exception handling, missing evaluationMetricKey validation, out-of-range score rejection with errorMessage, evaluateMessages formatting, null/empty message handling
  • EvaluatorTest (210 lines) covers: noop returns empty list, multi-judge execution with per-judge sampling, missing judge key logs warning and skips, noop does not log warnings
  • LdAiCompletionConfigTest (189 lines added) covers: Evaluator attached when runnerFactory provided, noop Evaluator when no runner factory, noop when judgeConfiguration is empty, disabled judge skipped, null runner skipped
  • LdAiJudgeConfigTest covers: JudgeResult ErrorMessage and Reasoning fields, judge config always receives noop Evaluator
  • LdAiConfigTrackerTest covers: TrackJudgeResult with new optional fields

Note

Medium Risk
New public API and optional extra LLM calls per request when runnerFactory is set; behavior is backward compatible via optional constructor arg and noop evaluators.

Overview
Adds online judge evaluation to the server AI SDK: optional runnerFactory on LdAiClient wires an Evaluator onto completion and agent configs (including default/fallback build paths). Without a factory or judges, configs get Evaluator.Noop() so Evaluator is always non-null.

Introduces IRunner, RunnerResult, Judge (structured score/reasoning via JSON schema, sampling, metrics wrapping), and Evaluator (runs configured judges; does not emit TrackJudgeResult). ConfigFactory.BuildEvaluator resolves each judge key as a flag variation, skips disabled/null-runner/failed inits, and filters the judge list to built judges only.

JudgeResult gains ErrorMessage and Reasoning, with safer defaults (Sampled/Success now default false). Large test coverage for judge/evaluator behavior and client wiring.

Reviewed by Cursor Bugbot for commit 66e2c31. Bugbot is set up for automated code reviews on this repo. Configure here.

@mattrmc1 mattrmc1 changed the title from "Mmccarthy/aic 2660/aievals core judge" to "feat: Add AI online evaluations (Judge, Evaluator, IRunner)" Jun 30, 2026
@mattrmc1 mattrmc1 marked this pull request as ready for review June 30, 2026 20:23
@mattrmc1 mattrmc1 requested a review from a team as a code owner June 30, 2026 20:23
Comment thread pkgs/sdk/server-ai/src/Config/ConfigFactory.cs Outdated
Comment thread pkgs/sdk/server-ai/src/Evals/Judge.cs
Comment thread pkgs/sdk/server-ai/src/Evals/Evaluator.cs

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 3 potential issues.

There are 6 total unresolved issues (including 3 from previous reviews).

Reviewed by Cursor Bugbot for commit aa959ce. Configure here.

{
    var defaultValue = LdAiJudgeConfigDefault.Disabled;
    var ldValue = _client.JsonVariation(judgeEntry.Key, context, defaultValue.ToLdValue());
    var judgeConfig = BuildJudgeConfig(judgeEntry.Key, ldValue, context, defaultValue, null);

Judge prompts omit caller variables

High Severity

When building an Evaluator from a completion or agent config, each judge is loaded via BuildJudgeConfig with variables set to null, so only LaunchDarkly context (ldctx) is merged into judge messages. Caller-supplied prompt variables passed into CompletionConfig / AgentConfig are not applied to embedded judge configs, unlike a direct JudgeConfig call.

Reviewed by Cursor Bugbot for commit aa959ce. Configure here.

Comment thread pkgs/sdk/server-ai/src/Config/ConfigFactory.cs
Comment thread pkgs/sdk/server-ai/src/Evals/Judge.cs