feat: add Runner, RunnerResult, Judge, and Evaluator by mattrmc1 · Pull Request #180 · launchdarkly/java-core

mattrmc1 · 2026-06-23T20:27:16Z

Summary

Adds the AIEVALS types — Runner, RunnerResult, Judge, and Evaluator — and wires Evaluator.noop() into all config types. Callers can now implement a Runner to wrap any model provider, construct a Judge to evaluate AI outputs against a judge prompt with structured {score, reasoning} output, and coordinate multiple judges through an Evaluator.

New types

public interface Runner {
  RunnerResult run(String input, Map<String, Object> outputType) throws Exception;
  default RunnerResult run(String input) throws Exception {
    return run(input, null);
  }
}

Wraps a model provider SDK. outputType carries a JSON-Schema-like map when structured output is needed. Single-arg overload delegates with outputType = null.
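A minimal sketch of the overload pattern described above, assuming nothing beyond the two signatures. SketchRunner, EchoRunner, and SchemaExamples are hypothetical names, and run returns a String instead of RunnerResult so the example compiles without the SDK. The schema map's shape ({type, properties, required}, with only score required) is an illustrative JSON-Schema-like guess, not necessarily the map Judge actually sends.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Runner; returns String instead of RunnerResult
// so the sketch compiles without the SDK.
interface SketchRunner {
  String run(String input, Map<String, Object> outputType) throws Exception;

  // Single-arg overload: no structured output requested, so outputType is null.
  default String run(String input) throws Exception {
    return run(input, null);
  }
}

// Hypothetical provider wrapper. A real one would call a model SDK and pass
// outputType through as the provider's structured-output setting.
class EchoRunner implements SketchRunner {
  @Override
  public String run(String input, Map<String, Object> outputType) {
    if (outputType == null) {
      return "text:" + input;
    }
    return "structured(" + outputType.get("type") + "):" + input;
  }
}

// Hypothetical JSON-Schema-like map for a {score, reasoning} object,
// with only score required.
class SchemaExamples {
  static Map<String, Object> scoreSchema() {
    Map<String, Object> score = new HashMap<>();
    score.put("type", "number");
    Map<String, Object> reasoning = new HashMap<>();
    reasoning.put("type", "string");

    Map<String, Object> properties = new HashMap<>();
    properties.put("score", score);
    properties.put("reasoning", reasoning);

    Map<String, Object> schema = new HashMap<>();
    schema.put("type", "object");
    schema.put("properties", properties);
    schema.put("required", Collections.singletonList("score"));
    return schema;
  }
}
```

Keeping the delegation in a default method means provider wrappers only implement the two-argument form; a real wrapper would presumably decode the provider's structured reply into RunnerResult.parsed.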

RunnerResult.builder(String content, AIMetrics metrics)
    .raw(Object raw)
    .parsed(Map<String, Object> parsed)
    .build();

Immutable result of a Runner invocation. parsed is defensively copied and returned as unmodifiable.
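The immutability guarantee can be sketched as copy-then-wrap in the constructor. SketchRunnerResult is a hypothetical simplification: metrics is typed as Object in place of AIMetrics, and passing a null parsed map through as null is an assumption.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for RunnerResult.
final class SketchRunnerResult {
  private final String content;
  private final Object metrics;   // stand-in for AIMetrics
  private final Object raw;
  private final Map<String, Object> parsed;

  private SketchRunnerResult(Builder b) {
    this.content = b.content;
    this.metrics = b.metrics;
    this.raw = b.raw;
    // Copy first so later changes to the caller's map are not visible,
    // then wrap so getParsed() callers cannot mutate the result.
    this.parsed = b.parsed == null
        ? null
        : Collections.unmodifiableMap(new HashMap<>(b.parsed));
  }

  static Builder builder(String content, Object metrics) {
    return new Builder(content, metrics);
  }

  String getContent() { return content; }
  Object getMetrics() { return metrics; }
  Object getRaw() { return raw; }
  Map<String, Object> getParsed() { return parsed; }

  static final class Builder {
    private final String content;
    private final Object metrics;
    private Object raw;
    private Map<String, Object> parsed;

    private Builder(String content, Object metrics) {
      this.content = content;
      this.metrics = metrics;
    }

    Builder raw(Object raw) { this.raw = raw; return this; }
    Builder parsed(Map<String, Object> parsed) { this.parsed = parsed; return this; }
    SketchRunnerResult build() { return new SketchRunnerResult(this); }
  }
}
```

Wrapping alone would not be enough: an unmodifiable view still reflects changes made through the caller's original reference, which is why the copy comes first.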

public Judge(AIJudgeConfig config, Runner runner, LDLogger logger);

JudgeResult evaluate(String input, String output);
JudgeResult evaluate(String input, String output, double samplingRate);
JudgeResult evaluateMessages(List<Message> messages, RunnerResult response);
JudgeResult evaluateMessages(List<Message> messages, RunnerResult response, double samplingRate);

Evaluates AI output by invoking a runner with a formatted evaluation prompt and parsing the structured response. The sampling gate runs first: if the evaluation is not sampled (a rate of 0 always skips, a rate of 1 always runs), it returns sampled=false immediately without calling the runner. Each evaluation creates a fresh tracker via config.createTracker(). The response is parsed for score (a Number in [0.0, 1.0]) and reasoning (a String, optional). Runner exceptions are caught and returned as JudgeResult(success=false): judge failures are results, not exceptions. Judge does not call trackJudgeResult.
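A sketch of the evaluate flow described above: sampling gate, runner call, score validation, and exception capture. JudgeSketch and SketchJudgeResult are hypothetical; the runner is modeled as a Callable returning the parsed map, prompt formatting and tracker calls are omitted, and the comparison random.nextDouble() < samplingRate is an assumption chosen so that 0 always skips and 1 always runs.

```java
import java.util.Map;
import java.util.Random;
import java.util.concurrent.Callable;

// Hypothetical, simplified stand-in for JudgeResult.
final class SketchJudgeResult {
  final boolean sampled;
  final boolean success;
  final Double score;
  final String reasoning;
  final String errorMessage;

  SketchJudgeResult(boolean sampled, boolean success, Double score,
                    String reasoning, String errorMessage) {
    this.sampled = sampled;
    this.success = success;
    this.score = score;
    this.reasoning = reasoning;
    this.errorMessage = errorMessage;
  }

  static SketchJudgeResult failure(String message) {
    return new SketchJudgeResult(true, false, null, null, message);
  }
}

final class JudgeSketch {
  // nextDouble() is in [0.0, 1.0), so a rate of 0 never runs and 1 always runs.
  static boolean shouldSample(double samplingRate, Random random) {
    return random.nextDouble() < samplingRate;
  }

  // The runner is modeled as a Callable that yields the parsed {score, reasoning} map.
  static SketchJudgeResult evaluate(Callable<Map<String, Object>> runner,
                                    double samplingRate, Random random) {
    if (!shouldSample(samplingRate, random)) {
      // Skipped: the runner is never called. Field values here are assumptions.
      return new SketchJudgeResult(false, false, null, null, null);
    }
    try {
      Map<String, Object> parsed = runner.call();
      if (parsed == null) {
        return SketchJudgeResult.failure("missing parsed output");
      }
      Object rawScore = parsed.get("score");
      if (!(rawScore instanceof Number)) {
        return SketchJudgeResult.failure("score is missing or not a number");
      }
      double score = ((Number) rawScore).doubleValue();
      // Reject NaN, infinities, and anything outside [0.0, 1.0].
      if (Double.isNaN(score) || Double.isInfinite(score) || score < 0.0 || score > 1.0) {
        return SketchJudgeResult.failure("score out of range: " + score);
      }
      Object rawReasoning = parsed.get("reasoning");  // optional
      String reasoning = rawReasoning instanceof String ? (String) rawReasoning : null;
      return new SketchJudgeResult(true, true, score, reasoning, null);
    } catch (Exception ex) {
      // Runner failures become results, not exceptions.
      return SketchJudgeResult.failure(ex.getMessage());
    }
  }
}
```

The explicit isNaN check matters: NaN fails both score < 0.0 and score > 1.0, so a plain range check would let it through.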

public static Evaluator noop();
public Evaluator(Map<String, Judge> judges, JudgeConfiguration judgeConfiguration, LDLogger logger);

CompletableFuture<List<JudgeResult>> evaluate(String input, String output);

Coordinates sequential execution of judges. Judges that cannot be found are skipped with a warning, and the returned future is already complete. Evaluator.noop() returns a singleton whose evaluate immediately returns an empty list. For v1.0, all configs receive Evaluator.noop().
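A sketch of sequential coordination with skip-on-missing and an already-completed future. EvaluatorSketch is hypothetical: judges are BiFunction<String, String, String> stand-ins for Judge, the JudgeConfiguration is reduced to an ordered list of judge keys, and java.util.logging stands in for LDLogger.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.BiFunction;
import java.util.logging.Logger;

final class EvaluatorSketch {
  private static final Logger LOG = Logger.getLogger(EvaluatorSketch.class.getName());

  // Shared no-op instance: evaluate() always completes with an empty list.
  private static final EvaluatorSketch NOOP =
      new EvaluatorSketch(Collections.emptyMap(), Collections.emptyList());

  private final Map<String, BiFunction<String, String, String>> judges;
  // Hypothetical stand-in for JudgeConfiguration: judge keys to run, in order.
  private final List<String> judgeKeys;

  EvaluatorSketch(Map<String, BiFunction<String, String, String>> judges, List<String> judgeKeys) {
    // Defensive copies; null inputs are tolerated rather than rejected.
    this.judges = judges == null ? Collections.emptyMap() : new LinkedHashMap<>(judges);
    this.judgeKeys = judgeKeys == null ? Collections.emptyList() : new ArrayList<>(judgeKeys);
  }

  static EvaluatorSketch noop() {
    return NOOP;
  }

  CompletableFuture<List<String>> evaluate(String input, String output) {
    List<String> results = new ArrayList<>();
    for (String key : judgeKeys) {  // sequential, in configured order
      BiFunction<String, String, String> judge = judges.get(key);
      if (judge == null) {
        LOG.warning("Judge not found, skipping: " + key);
        continue;
      }
      results.add(judge.apply(input, output));
    }
    // All work ran synchronously, so the returned future is already complete.
    return CompletableFuture.completedFuture(results);
  }
}
```

Because the judges run on the caller's thread, CompletableFuture.completedFuture is sufficient here; returning a future keeps the signature open to asynchronous execution later without claiming it today.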

Config type changes

AIConfig base class gains an Evaluator field and a getEvaluator() accessor. AICompletionConfig and AIAgentConfig constructors accept an Evaluator. AIJudgeConfig always wires Evaluator.noop() internally, because judges do not evaluate themselves.
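The config wiring amounts to a constructor parameter on the base class that judge configs fill in themselves. All class names here are hypothetical stand-ins for Evaluator, AIConfig, AICompletionConfig, and AIJudgeConfig, reduced to the evaluator plumbing.

```java
// Hypothetical stand-in for Evaluator: only object identity matters here.
final class ConfigEvaluator {
  private static final ConfigEvaluator NOOP = new ConfigEvaluator();

  static ConfigEvaluator noop() {
    return NOOP;
  }
}

// Hypothetical stand-in for the AIConfig base class.
abstract class SketchAIConfig {
  private final String key;
  private final ConfigEvaluator evaluator;

  protected SketchAIConfig(String key, ConfigEvaluator evaluator) {
    this.key = key;
    this.evaluator = evaluator;
  }

  String getKey() { return key; }
  ConfigEvaluator getEvaluator() { return evaluator; }
}

// Completion (and, analogously, agent) configs take the evaluator from the caller.
final class SketchCompletionConfig extends SketchAIConfig {
  SketchCompletionConfig(String key, ConfigEvaluator evaluator) {
    super(key, evaluator);
  }
}

// Judge configs expose no evaluator parameter and always pass the no-op instance.
final class SketchJudgeConfig extends SketchAIConfig {
  SketchJudgeConfig(String key) {
    super(key, ConfigEvaluator.noop());
  }
}
```

Omitting the parameter from the judge constructor, rather than ignoring a passed-in value, makes the "judges do not evaluate themselves" rule impossible to violate from the call site.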

Test plan

./gradlew :lib:sdk:server-ai:test passes
JudgeTest — successful evaluation, score boundary validation (0 and 1), reasoning optional, runner exception handling (caught not rethrown), null/missing parsed output, score out of range, sampling rates (0 always skips, 1 always runs), message formatting, getter accessors
EvaluatorTest — noop returns empty list, noop singleton identity, single/multiple judge execution, missing judge skipped, evaluator does not call trackJudgeResult, returned future is already complete
RunnerResultTest — builder field assignment, immutability, defensive copy of parsed map

Note

Medium Risk
New public SDK types and constructor changes on internal config builders; runtime judge paths invoke user-supplied Runners and affect metrics tracking, though client retrieval still uses noop evaluators.

Overview
Adds AI evaluation plumbing to server-ai: a Runner abstraction for provider calls, immutable RunnerResult, Judge for scored {score, reasoning} evals via structured output, and Evaluator to run configured judges sequentially (with sampling and skip-on-missing).

Config surface: AIConfig now carries an Evaluator and exposes getEvaluator(). Completion and agent configs take an evaluator at construction; LDAIClientImpl always passes Evaluator.noop() for v1.0. Judge configs hard-wire noop internally.

Behavior notes: Judge applies sampling before calling the runner, validates parsed scores in [0, 1], wraps runner work in trackMetricsOf, and returns failures as JudgeResult rather than throwing. Evaluator does not call trackJudgeResult; callers own tracking.

Reviewed by Cursor Bugbot for commit 792d33c. Bugbot is set up for automated code reviews on this repo.
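A sketch of a trackMetricsOf-style wrapper, reflecting two commit titles in this PR's history: stop the clock before running the metrics extractor, and record duration when the extractor throws. EventTracker, its string events, and the record-error-then-rethrow behavior on operation failure are assumptions for illustration, not the SDK's tracker API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.function.Function;

// Hypothetical tracker that records events as strings so the ordering is easy to inspect.
final class EventTracker {
  final List<String> events = new ArrayList<>();
  long lastDurationMs = -1;

  <T> T trackMetricsOf(Callable<T> operation, Function<T, Integer> tokenExtractor) throws Exception {
    long start = System.nanoTime();
    T result;
    try {
      result = operation.call();
    } catch (Exception e) {
      // Operation failed: record duration and error, then rethrow (an assumption here).
      recordDuration(start);
      events.add("error");
      throw e;
    }
    // Stop the clock and record duration before running the extractor, so extraction
    // time is not counted and the duration survives an extractor exception.
    recordDuration(start);
    Integer tokens = tokenExtractor.apply(result);
    if (tokens != null) {
      events.add("tokens:" + tokens);
    }
    events.add("success");
    return result;
  }

  private void recordDuration(long startNanos) {
    lastDurationMs = (System.nanoTime() - startNanos) / 1_000_000L;
    events.add("duration");
  }
}
```

Recording the duration before invoking the extractor satisfies both commit goals with a single ordering choice, rather than needing a finally block around the extractor.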

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

cursor · 2026-06-30T21:49:02Z

+          .success(false)
+          .errorMessage(ex.getMessage())
+          .build();
+    }


Exception path omits judge identity

Medium Severity

When Judge.evaluate catches a runner or tracker failure, the returned JudgeResult sets sampled, success, and errorMessage but not judgeConfigKey or metricKey. Other sampled failure paths populate those fields. Callers and Evaluator runs that handle multiple judges cannot reliably tell which judge failed except by list index.
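One way to address this finding is to start every sampled result from a builder that already carries the judge's identity, so the catch block cannot omit it. IdentifiedJudge and IdentifiedJudgeResult are hypothetical; the builder method names mirror the field names in the comment but are not confirmed SDK signatures, and prompt formatting, score validation, and tracking are omitted.

```java
import java.util.concurrent.Callable;

// Hypothetical stand-in for JudgeResult with the fields named in the review.
final class IdentifiedJudgeResult {
  final boolean sampled;
  final boolean success;
  final String errorMessage;
  final String judgeConfigKey;
  final String metricKey;

  private IdentifiedJudgeResult(Builder b) {
    this.sampled = b.sampled;
    this.success = b.success;
    this.errorMessage = b.errorMessage;
    this.judgeConfigKey = b.judgeConfigKey;
    this.metricKey = b.metricKey;
  }

  static Builder builder() {
    return new Builder();
  }

  static final class Builder {
    private boolean sampled;
    private boolean success;
    private String errorMessage;
    private String judgeConfigKey;
    private String metricKey;

    Builder sampled(boolean v) { this.sampled = v; return this; }
    Builder success(boolean v) { this.success = v; return this; }
    Builder errorMessage(String v) { this.errorMessage = v; return this; }
    Builder judgeConfigKey(String v) { this.judgeConfigKey = v; return this; }
    Builder metricKey(String v) { this.metricKey = v; return this; }
    IdentifiedJudgeResult build() { return new IdentifiedJudgeResult(this); }
  }
}

// Hypothetical judge showing only the identity pattern.
final class IdentifiedJudge {
  private final String judgeConfigKey;
  private final String metricKey;

  IdentifiedJudge(String judgeConfigKey, String metricKey) {
    this.judgeConfigKey = judgeConfigKey;
    this.metricKey = metricKey;
  }

  // Single place that stamps identity onto every sampled result.
  private IdentifiedJudgeResult.Builder sampledResult() {
    return IdentifiedJudgeResult.builder()
        .sampled(true)
        .judgeConfigKey(judgeConfigKey)
        .metricKey(metricKey);
  }

  IdentifiedJudgeResult evaluate(Callable<Double> runner) {
    try {
      Double score = runner.call();
      return sampledResult().success(score != null).build();
    } catch (Exception ex) {
      // The exception path now carries judgeConfigKey and metricKey too.
      return sampledResult().success(false).errorMessage(ex.getMessage()).build();
    }
  }
}
```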

mattrmc1 added 3 commits June 22, 2026 15:08

[AIC-2664] Impl trackers (first pass)

7c4dbde

fix: default tracker version to 1 and remove version clamp from token decode

a0c8784

feat: add Runner, RunnerResult, Judge, and Evaluator

1a7e1f6

mattrmc1 changed the base branch from main to mmccarthy/AIC-2664/ai-config-tracker-overhaul June 23, 2026 21:13

mattrmc1 marked this pull request as ready for review June 23, 2026 21:13

mattrmc1 requested a review from a team as a code owner June 23, 2026 21:13

mattrmc1 marked this pull request as draft June 23, 2026 21:14

cursor Bot reviewed Jun 23, 2026

Comment thread lib/sdk/server-ai/src/main/java/com/launchdarkly/sdk/server/ai/Judge.java Outdated

Comment thread lib/sdk/server-ai/src/main/java/com/launchdarkly/sdk/server/ai/Evaluator.java

mattrmc1 added 16 commits June 23, 2026 16:40

guard against null AIMetrics

bed4ca2

fix: guard against blank metricKey and infinite/invalid score

2b47c86

fix: MAX_TOKEN_BYTES -> MAX_TOKEN_LENGTH

4ef3de2

fix: guard against empty runId and configKey

1be0a1e

fix: Add warning comment to createTracker public call

8e81ea0

fix: use trim + isEmpty to support java 8

e81e2f5

fix: stop trackMetricsOf clock before running metrics extractor

c21fdd7

fix: record operation duration when trackMetricsOf extractor throws

4c96dca

fix: downgrade null-arg track logs from warn to debug per spec

4da5478

Merge branch 'mmccarthy/AIC-2664/ai-config-tracker-overhaul' of github.com:launchdarkly/java-core into mmccarthy/AIC-2665/java-ai-sdk-v-1-0-aievals

a94b2bf

fix: remove unnecessary NoOpAIConfigTracker

394a044

Merge branch 'mmccarthy/AIC-2664/ai-config-tracker-overhaul' of github.com:launchdarkly/java-core into mmccarthy/AIC-2665/java-ai-sdk-v-1-0-aievals

6c80aed

fix: remove resumption-token length cap

5381bf4

Merge branch 'mmccarthy/AIC-2664/ai-config-tracker-overhaul' of github.com:launchdarkly/java-core into mmccarthy/AIC-2665/java-ai-sdk-v-1-0-aievals

1355033

fix: guard against NaN scores

add48f9

fix: defensively copy judges map in Evaluator constructor

1bd6777

mattrmc1 marked this pull request as ready for review June 24, 2026 21:26

mattrmc1 added 4 commits June 24, 2026 16:34

fix: use Java 8-compatible map/list construction in Judge

9a8143e

fix: Add security note to LDAIConfigTracker.getResumptionToken()

3aa5d08

fix: Add security note to MetricSummary.getResumptionToken()

121b140

Merge branch 'mmccarthy/AIC-2664/ai-config-tracker-overhaul' of github.com:launchdarkly/java-core into mmccarthy/AIC-2665/java-ai-sdk-v-1-0-aievals

faa4981

cursor Bot reviewed Jun 24, 2026

Comment thread lib/sdk/server-ai/src/main/java/com/launchdarkly/sdk/server/ai/Judge.java

fix: remove reasoning from Judge schema required fields

f42de0b

Co-authored-by: Cursor <cursoragent@cursor.com>

Base automatically changed from mmccarthy/AIC-2664/ai-config-tracker-overhaul to main June 25, 2026 16:05

Merge branch 'main' of github.com:launchdarkly/java-core into mmccarthy/AIC-2665/java-ai-sdk-v-1-0-aievals

59835e3

mattrmc1 requested review from jsonbailey and tanderson-ld June 29, 2026 16:08

mattrmc1 and others added 2 commits June 29, 2026 14:00

Merge branch 'main' into mmccarthy/AIC-2665/java-ai-sdk-v-1-0-aievals

25346b3

fix: replace requireNonNull with graceful null-tolerance in Judge and Evaluator constructors

26a61b4

mattrmc1 mentioned this pull request Jun 30, 2026

feat: add manual judge evaluation (Judge, Evaluator, createJudge) (AIC-2665) #175

Closed

3 tasks

fix: align sampling operator and skipped result fields with spec

ef762f5

cursor Bot reviewed Jun 30, 2026

Comment thread lib/sdk/server-ai/src/main/java/com/launchdarkly/sdk/server/ai/Judge.java Outdated

Comment thread lib/sdk/server-ai/src/main/java/com/launchdarkly/sdk/server/ai/Judge.java

fix: align schema with JS SDK, guard null logger, fix zero-rate skip

a6a1ca1

cursor Bot reviewed Jun 30, 2026

Comment thread lib/sdk/server-ai/src/main/java/com/launchdarkly/sdk/server/ai/Judge.java Outdated

fix: catch all exceptions in Judge.evaluate() to prevent blank metricKey from escaping

792d33c

cursor Bot reviewed Jun 30, 2026

mattrmc1 wants to merge 30 commits into main from mmccarthy/AIC-2665/java-ai-sdk-v-1-0-aievals

mattrmc1 commented Jun 23, 2026 • edited by cursor Bot
