feat: add Runner, RunnerResult, Judge, and Evaluator#180

Merged
mattrmc1 merged 33 commits into main from mmccarthy/AIC-2665/java-ai-sdk-v-1-0-aievals
Jul 1, 2026

Conversation

@mattrmc1 mattrmc1 commented Jun 23, 2026

Contributor

Summary

Adds the AIEVALS types (Runner, RunnerResult, Judge, and Evaluator) and wires Evaluator.noop() into all config types. Callers can now implement a Runner to wrap any model provider, construct a Judge to evaluate AI outputs against a judge prompt with structured {score, reasoning} output, and coordinate multiple judges through an Evaluator.

New types

public interface Runner {
  RunnerResult run(String input, Map<String, Object> outputType) throws Exception;
  default RunnerResult run(String input) throws Exception { return run(input, null); }
}

Wraps a model provider SDK. outputType carries a JSON-Schema-like map when structured output is needed. Single-arg overload delegates with outputType = null.
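
As a rough illustration of how a caller might implement this interface, here is a minimal sketch. SketchRunner and SketchRunnerResult are hypothetical stand-ins for the SDK types (only a content field is modeled), and EchoRunner is an invented provider wrapper that echoes its input instead of calling a real model.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the SDK's RunnerResult; only the content field is modeled.
final class SketchRunnerResult {
  private final String content;

  SketchRunnerResult(String content) {
    this.content = content;
  }

  String getContent() {
    return content;
  }
}

// Mirrors the Runner shape from the description above (stand-in, not the SDK type).
interface SketchRunner {
  SketchRunnerResult run(String input, Map<String, Object> outputType) throws Exception;

  // Single-arg overload delegates with outputType = null.
  default SketchRunnerResult run(String input) throws Exception {
    return run(input, null);
  }
}

// Invented provider wrapper: reports which mode was requested and echoes the input.
final class EchoRunner implements SketchRunner {
  @Override
  public SketchRunnerResult run(String input, Map<String, Object> outputType) {
    String mode = (outputType == null) ? "text" : "structured";
    return new SketchRunnerResult(mode + ":" + input);
  }
}

final class RunnerSketchDemo {
  // Helper that hides the checked exception declared on the interface.
  static String contentOf(SketchRunner runner, String input) {
    try {
      return runner.run(input).getContent();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    EchoRunner runner = new EchoRunner();
    System.out.println(contentOf(runner, "hello")); // prints "text:hello"

    Map<String, Object> schema = new HashMap<>();
    schema.put("type", "object");
    System.out.println(runner.run("hello", schema).getContent()); // prints "structured:hello"
  }
}
```

A real implementation would presumably call the provider SDK inside the two-argument run and translate outputType into that provider's structured-output request.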

RunnerResult.builder(String content, AIMetrics metrics)
    .raw(Object raw)
    .parsed(Map<String, Object> parsed)
    .build();

Immutable result of a Runner invocation. parsed is defensively copied and returned as unmodifiable.
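
The defensive-copy behavior can be sketched roughly as follows. SketchResult is a hypothetical stand-in, not the SDK class: it omits the AIMetrics argument, and the null handling for parsed is an assumption.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for RunnerResult; AIMetrics is left out to keep the sketch self-contained.
final class SketchResult {
  private final String content;
  private final Object raw;
  private final Map<String, Object> parsed;

  private SketchResult(Builder b) {
    this.content = b.content;
    this.raw = b.raw;
    this.parsed = b.parsed;
  }

  static Builder builder(String content) {
    return new Builder(content);
  }

  String getContent() {
    return content;
  }

  Object getRaw() {
    return raw;
  }

  // Returns the unmodifiable copy made by the builder (or null if none was supplied).
  Map<String, Object> getParsed() {
    return parsed;
  }

  static final class Builder {
    private final String content;
    private Object raw;
    private Map<String, Object> parsed;

    private Builder(String content) {
      this.content = content;
    }

    Builder raw(Object raw) {
      this.raw = raw;
      return this;
    }

    Builder parsed(Map<String, Object> parsed) {
      // Copy first so later changes to the caller's map do not leak in,
      // then wrap so the map cannot be mutated through the getter.
      this.parsed = (parsed == null)
          ? null
          : Collections.unmodifiableMap(new HashMap<>(parsed));
      return this;
    }

    SketchResult build() {
      return new SketchResult(this);
    }
  }
}
```

With this shape, mutating the source map after build() has no effect on the result, and calling put on getParsed() throws UnsupportedOperationException.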

public Judge(AIJudgeConfig config, Runner runner, LDLogger logger);

JudgeResult evaluate(String input, String output);
JudgeResult evaluate(String input, String output, double samplingRate);
JudgeResult evaluateMessages(List<Message> messages, RunnerResult response);
JudgeResult evaluateMessages(List<Message> messages, RunnerResult response, double samplingRate);

Evaluates AI output by invoking a runner with a formatted evaluation prompt and parsing the structured response. Sampling gate runs first — below the rate, returns sampled=false immediately. Creates a fresh tracker per evaluation via config.createTracker(). Parses score (Number, [0.0, 1.0]) and reasoning (String, optional). Runner exceptions are caught and returned as JudgeResult(success=false) — judge failures are results, not exceptions. Does not call trackJudgeResult.
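
The flow described above (sampling gate, runner call, score validation, failures returned as results) might look something like the sketch below. All names are hypothetical stand-ins: the runner is modeled as a Callable returning the parsed map, the random source is injected, and the comparison used for the sampling gate and the field values of a skipped result are assumptions rather than the SDK's behavior. Tracker creation and prompt formatting are omitted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.function.DoubleSupplier;

// Hypothetical stand-in for JudgeResult.
final class SketchJudgeResult {
  final boolean sampled;
  final boolean success;
  final Double score;
  final String reasoning;

  SketchJudgeResult(boolean sampled, boolean success, Double score, String reasoning) {
    this.sampled = sampled;
    this.success = success;
    this.score = score;
    this.reasoning = reasoning;
  }
}

final class JudgeSketch {
  // Builds a parsed-output map for the examples; null arguments are left out of the map.
  static Map<String, Object> output(Object score, Object reasoning) {
    Map<String, Object> m = new HashMap<>();
    if (score != null) m.put("score", score);
    if (reasoning != null) m.put("reasoning", reasoning);
    return m;
  }

  private static SketchJudgeResult failure() {
    return new SketchJudgeResult(true, false, null, null);
  }

  static SketchJudgeResult evaluate(
      Callable<Map<String, Object>> runner, double samplingRate, DoubleSupplier random) {
    // Sampling gate runs first. Assumed comparison: with a draw in [0, 1),
    // rate 0 always skips and rate 1 always runs.
    if (random.getAsDouble() >= samplingRate) {
      return new SketchJudgeResult(false, false, null, null);
    }
    try {
      Map<String, Object> parsed = runner.call();
      Object score = (parsed == null) ? null : parsed.get("score");
      if (!(score instanceof Number)) {
        return failure(); // null or missing parsed output
      }
      double value = ((Number) score).doubleValue();
      if (Double.isNaN(value) || value < 0.0 || value > 1.0) {
        return failure(); // score outside [0.0, 1.0]
      }
      Object reasoning = parsed.get("reasoning");
      String text = (reasoning instanceof String) ? (String) reasoning : null; // optional
      return new SketchJudgeResult(true, true, value, text);
    } catch (Exception e) {
      // Runner failures become results instead of propagating.
      return failure();
    }
  }
}
```

For example, JudgeSketch.evaluate(() -> JudgeSketch.output(0.5, "fine"), 1.0, () -> 0.3) yields a successful result with score 0.5, while a runner that throws yields success=false without an exception reaching the caller.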

public static Evaluator noop();
public Evaluator(Map<String, Judge> judges, JudgeConfiguration judgeConfiguration, LDLogger logger);

CompletableFuture<List<JudgeResult>> evaluate(String input, String output);

Coordinates sequential execution of judges. Missing judges skipped with a warning. Evaluator.noop() returns a singleton whose evaluate immediately returns an empty list. For v1.0, all configs receive Evaluator.noop().
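
A minimal sketch of the sequential coordination, under simplifying assumptions: judges are modeled as BiFunction values that return a String, JudgeConfiguration is reduced to an ordered list of judge keys, and the warning goes to System.err instead of an LDLogger. None of these names come from the SDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.BiFunction;

// Hypothetical stand-in for Evaluator.
final class EvaluatorSketch {
  private final Map<String, BiFunction<String, String, String>> judges;
  private final List<String> configuredKeys; // assumed shape of the judge configuration

  EvaluatorSketch(
      Map<String, BiFunction<String, String, String>> judges, List<String> configuredKeys) {
    this.judges = judges;
    this.configuredKeys = configuredKeys;
  }

  CompletableFuture<List<String>> evaluate(String input, String output) {
    List<String> results = new ArrayList<>();
    // Judges run one after another on the calling thread.
    for (String key : configuredKeys) {
      BiFunction<String, String, String> judge = judges.get(key);
      if (judge == null) {
        System.err.println("warning: no judge registered for key '" + key + "', skipping");
        continue;
      }
      results.add(judge.apply(input, output));
    }
    // All work is done by now, so the returned future is already complete.
    return CompletableFuture.completedFuture(Collections.unmodifiableList(results));
  }

  public static void main(String[] args) {
    Map<String, BiFunction<String, String, String>> judges = new LinkedHashMap<>();
    judges.put("relevance", (in, out) -> "relevance:" + out);
    EvaluatorSketch evaluator =
        new EvaluatorSketch(judges, Arrays.asList("relevance", "missing"));
    System.out.println(evaluator.evaluate("q", "a").join()); // prints "[relevance:a]"
  }
}
```

Returning an already completed future keeps the async-shaped signature while the work itself stays synchronous.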

Config type changes

AIConfig base class gains an Evaluator field and getEvaluator() accessor. AICompletionConfig and AIAgentConfig constructors accept an Evaluator. AIJudgeConfig always wires Evaluator.noop() internally — judges do not evaluate themselves.

Test plan

  • ./gradlew :lib:sdk:server-ai:test passes
  • JudgeTest — successful evaluation, score boundary validation (0 and 1), reasoning optional, runner exception handling (caught not rethrown), null/missing parsed output, score out of range, sampling rates (0 always skips, 1 always runs), message formatting, getter accessors
  • EvaluatorTest — noop returns empty list, noop singleton identity, single/multiple judge execution, missing judge skipped, evaluator does not call trackJudgeResult, returned future is already complete
  • RunnerResultTest — builder field assignment, immutability, defensive copy of parsed map

Note

Medium Risk
New public API and config constructor changes in server-ai; runtime behavior stays noop until custom evaluators are wired, but Judge paths will invoke external runners when used.

Overview
Introduces the AI evaluation stack for server-ai: a pluggable Runner / RunnerResult pair for model calls (including optional structured outputType), a Judge that runs judge prompts via the runner and parses {score, reasoning} with sampling and defensive failure handling, and an Evaluator that runs configured judges sequentially and returns a completed CompletableFuture.

AIConfig and completion/agent configs now carry an Evaluator exposed via getEvaluator(); LDAIClientImpl wires Evaluator.noop() for all retrieved configs in v1.0 (judge configs always noop internally). Unit tests cover judge scoring, sampling, evaluator orchestration, and RunnerResult immutability.

Reviewed by Cursor Bugbot for commit aa028c1.

@mattrmc1 mattrmc1 changed the base branch from main to mmccarthy/AIC-2664/ai-config-tracker-overhaul June 23, 2026 21:13
@mattrmc1 mattrmc1 marked this pull request as ready for review June 23, 2026 21:13
@mattrmc1 mattrmc1 requested a review from a team as a code owner June 23, 2026 21:13
@mattrmc1 mattrmc1 marked this pull request as draft June 23, 2026 21:14
Comment thread lib/sdk/server-ai/src/main/java/com/launchdarkly/sdk/server/ai/Judge.java Outdated
Comment thread lib/sdk/server-ai/src/main/java/com/launchdarkly/sdk/server/ai/Evaluator.java Outdated
@mattrmc1 mattrmc1 marked this pull request as ready for review June 24, 2026 21:26
Co-authored-by: Cursor <cursoragent@cursor.com>
Base automatically changed from mmccarthy/AIC-2664/ai-config-tracker-overhaul to main June 25, 2026 16:05
Comment thread lib/sdk/server-ai/src/main/java/com/launchdarkly/sdk/server/ai/Judge.java Outdated

@tanderson-ld tanderson-ld left a comment

Contributor

I'm just reviewing from the java-core perspective, not from the AI product perspective.

   * @return a completed future holding the list of judge results; never {@code null}
   */
  public CompletableFuture<List<JudgeResult>> evaluate(String input, String output) {
    if (isNoop) {

From my experience with noop pattern, there usually isn't a noop check and instead the noop just doesn't do anything when invoked. Can you derive its noop-ness from the presence of judges instead?
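
One possible reading of this suggestion, as a rough sketch rather than the code in this PR: the noop instance is just an evaluator built with no judges, so evaluate needs no isNoop branch. FlaglessEvaluator is a hypothetical name, and judges are modeled as BiFunction values that return a String.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.BiFunction;

// Hypothetical evaluator whose noop-ness comes from having no judges.
final class FlaglessEvaluator {
  private static final FlaglessEvaluator NOOP =
      new FlaglessEvaluator(Collections.emptyMap());

  private final Map<String, BiFunction<String, String, String>> judges;

  FlaglessEvaluator(Map<String, BiFunction<String, String, String>> judges) {
    this.judges = judges;
  }

  // Singleton with an empty judge map; no separate flag is stored.
  static FlaglessEvaluator noop() {
    return NOOP;
  }

  CompletableFuture<List<String>> evaluate(String input, String output) {
    List<String> results = new ArrayList<>();
    // With no judges the loop body never runs and the result is an empty list.
    for (BiFunction<String, String, String> judge : judges.values()) {
      results.add(judge.apply(input, output));
    }
    return CompletableFuture.completedFuture(results);
  }
}
```

The trade-off in this version is that the noop path allocates a new empty list per call instead of returning early.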

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 99f8c19.

@mattrmc1 mattrmc1 merged commit a32c4fa into main Jul 1, 2026
24 checks passed
@mattrmc1 mattrmc1 deleted the mmccarthy/AIC-2665/java-ai-sdk-v-1-0-aievals branch July 1, 2026 17:18