feat: add manual judge evaluation (Judge, Evaluator, createJudge) (AIC-2665) by ctawiah · Pull Request #175 · launchdarkly/java-core

ctawiah · 2026-06-11T02:42:23Z

Requirements

I have added test coverage for new or changed functionality
I have followed the repository's pull request submission guidelines
I have validated my changes against all supported platform versions

Related issues

Jira: AIC-2665 — Step 5: AIEVALS (Judge, Evaluator & CreateJudge, manual-only)
Epic: AIC-2629
Spec: AIEVALS

Stacked on #174 (AIC-2664). Review/merge that first; the diff here is against feat/ai-sdk-tracker.

Describe the solution you've provided

Implements the manual-only evaluation path for AI Configs. v1.0 does not auto-invoke judges on completion/agent calls; the caller drives evaluation: createJudge() → judge.evaluate(...) → track the result yourself.

Runner SPI + RunnerResult — caller-supplied model invocation. RunnerResult carries content, the run Metrics, and parsed structured output ({score, reasoning}). Provider-specific runners ship post-1.0.
Judge — sampling is decided before invoking the model; input is formatted as MESSAGE HISTORY:\n{input}\n\nRESPONSE TO EVALUATE:\n{output}; the runner is invoked via the tracker's trackMetricsOf so invocation metrics are recorded; score (0.0–1.0, out-of-range → failure) and reasoning are parsed. The judge returns a JudgeResult but does not call trackJudgeResult — recording is the caller's responsibility. evaluateMessages renders <role>: <content> history and delegates to evaluate. Sampling rate is normalized (NaN/Infinity → 1.0, negative → 0.0, >1 → 1.0).
Evaluator — runs a set of judges with per-judge fault isolation (a failing/timing-out judge yields a failed JudgeResult; others are preserved in order) and a per-judge timeout so a hung judge can't stall the chain. noop() returns an empty list with no warnings. Thread-safe; uses a short-lived executor per evaluate call.
LDAIClient.createJudge — fires only $ld:ai:usage:create-judge, resolves the judge config through the internal evaluate path (so no $ld:ai:usage:judge-config event), and returns null if the config is disabled or no runner is supplied.
README documents the manual-only flow and the auto-attach descope.
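
As a rough illustration of the Judge rules listed above (rate normalization, the evaluation-input format, `<role>: <content>` history rendering, and the score range check), here is a minimal sketch. It is not the code in this PR: the class name `JudgeInputSketch`, the method signatures, the newline separator between history entries, and the treatment of negative infinity as "negative" are all assumptions made for the example.

```java
import java.util.List;

/** Illustrative only; names and signatures are hypothetical, not the SDK's. */
final class JudgeInputSketch {

  /** Clamps a sampling rate into [0.0, 1.0]; NaN is treated as "always sample". */
  static double normalizeSampleRate(double rate) {
    if (Double.isNaN(rate)) {
      return 1.0;
    }
    if (rate < 0.0) {
      return 0.0; // assumption: -Infinity is handled as a negative rate
    }
    if (rate > 1.0) {
      return 1.0; // also covers +Infinity
    }
    return rate;
  }

  /** Builds the judge prompt in the format quoted in the description. */
  static String buildEvaluationInput(String input, String output) {
    return "MESSAGE HISTORY:\n" + input + "\n\nRESPONSE TO EVALUATE:\n" + output;
  }

  /** Renders {role, content} pairs as "<role>: <content>", one per line (separator assumed). */
  static String renderHistory(List<String[]> messages) {
    StringBuilder sb = new StringBuilder();
    for (String[] message : messages) {
      if (sb.length() > 0) {
        sb.append('\n');
      }
      sb.append(message[0]).append(": ").append(message[1]);
    }
    return sb.toString();
  }

  /** A score outside 0.0 to 1.0 (or NaN) counts as a failed evaluation. */
  static boolean isValidScore(double score) {
    return score >= 0.0 && score <= 1.0; // NaN fails both comparisons
  }
}
```

Keeping these as pure functions would make the cases named under JudgeTest (input formatting, out-of-range score, sample-rate normalization) checkable without a runner.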

The public API surface is synchronous, consistent with the rest of this server SDK; the concurrency needed for the per-judge timeout is internal to Evaluator.

Tests

JudgeTest — scoring/metric key, input formatting, zero-sampling skip (runner not invoked), missing metric key, out-of-range score, missing reasoning, runner throw, runner failure metrics, evaluateMessages rendering, sample-rate normalization.
EvaluatorTest — noop() empty, order preservation, fault isolation, timeout isolation, completion-order independence.
LDAIClientImplTest — createJudge fires only create-judge (not judge-config), returns a Judge when enabled, null when disabled, null when no runner.

Describe alternatives you've considered

A CompletableFuture-based async API was considered but rejected for consistency with the synchronous server SDK surface. Automatic sample-rate-driven judge auto-attachment and provider runners are intentionally deferred past v1.0 (aligns with the .NET descope).

Additional context

JudgeResult was added in #174 (AIC-2664) and is reused here.

Note

Medium Risk
New public API and usage telemetry paths; judge runs depend on caller-supplied runners and correct manual tracking, but no changes to core flag evaluation or auth.

Overview
Adds manual-only AI response evaluation to the server AI SDK: callers use LDAIClient.createJudge() with a custom Runner, run Judge.evaluate() (or evaluateMessages()), and record scores via trackJudgeResult themselves—no auto-attachment on completion/agent calls in v1.0.

New public Runner / RunnerResult SPI for model calls; Judge applies sampling, formats judge input, invokes the runner through the config tracker for invocation metrics, and parses {score, reasoning}. Internal Evaluator runs multiple judges concurrently with per-judge timeouts and fault isolation. createJudge emits only $ld:ai:usage:create-judge (not judge-config) and returns null when disabled or no runner. README documents tracking and the manual judge flow.

Reviewed by Cursor Bugbot for commit f6d4a4c.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 83a342f.

cursor · 2026-06-11T03:07:13Z

+  private JudgeResult awaitResult(Judge judge, Future<JudgeResult> future) {
+    String key = judge.getAIConfig().getKey();
+    try {
+      return future.get(perJudgeTimeout.toMillis(), TimeUnit.MILLISECONDS);


Sequential waits break judge timeouts

High Severity

Evaluator starts all judges concurrently but awaits each Future in list order with a full perJudgeTimeout on every get. That timeout is measured from each get call, not from when the judge task started, so later judges can run far longer than the configured cap and evaluate can take up to judges.size() × perJudgeTimeout when multiple judges hang.
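
One possible way to address this, sketched under the assumption that all judge tasks are submitted at roughly the same time, is to compute a single deadline up front and give each `get` only the time that remains. The class `DeadlineAwaitSketch`, the `awaitAll` helper, and the placeholder result strings are hypothetical and stand in for the Evaluator's real `JudgeResult` handling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Illustrative only; not the Evaluator from this PR. */
final class DeadlineAwaitSketch {

  /**
   * Awaits futures in list order against one shared deadline, so the total wait is bounded
   * by timeoutMillis rather than by futures.size() * timeoutMillis.
   */
  static List<String> awaitAll(List<Future<String>> futures, long timeoutMillis) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    List<String> results = new ArrayList<>();
    for (Future<String> future : futures) {
      // Time left until the shared deadline; never negative.
      long remaining = Math.max(0L, deadline - System.nanoTime());
      try {
        results.add(future.get(remaining, TimeUnit.NANOSECONDS));
      } catch (TimeoutException e) {
        future.cancel(true); // interrupt the hung task so it stops holding a thread
        results.add("timeout");
      } catch (ExecutionException e) {
        results.add("failed"); // the task threw; other results are unaffected
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        results.add("interrupted");
      }
    }
    return results;
  }
}
```

A future that has already completed still returns its value when `remaining` is zero, so a fast judge listed after a hung one is not reported as timed out.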

jsonbailey · 2026-06-11T15:56:04Z

+
+      String evaluationInput = buildEvaluationInput(input, output);
+      RunnerResult response = tracker.trackMetricsOf(RunnerResult::getMetrics,
+          () -> runner.run(evaluationInput));


This may come later but we typically use structured outputs and would need to define the output shape for the run.

jsonbailey · 2026-06-11T15:59:16Z

+      LDContext context,
+      AIJudgeConfigDefault defaultValue,
+      Map<String, Object> variables,
+      Runner runner,


It might be worth talking about the plans for the experimental features of the SDK if you haven't already. The create judge method should be marked experimental. In node and python we don't accept a runner but build it internally so this is a change in the SDK. Not sure if this is temporary and will be addressed later.

…C-2665)

Implements the AIEVALS manual-only evaluation path:
- Runner SPI and RunnerResult for caller-supplied model invocation
- Judge: sampling decided before invocation, well-known input format, score/reasoning parsing with range validation, invocation tracked via trackMetricsOf (does not emit trackJudgeResult; caller's responsibility)
- Evaluator: per-judge fault isolation and per-judge timeout, order-preserving results, noop() returns an empty list; sampling-rate normalization on Judge
- LDAIClient.createJudge: fires only $ld:ai:usage:create-judge, resolves the judge config via the internal evaluate path, returns null when disabled or when no runner is supplied

Automatic judge auto-attachment and provider runners are deferred past v1.0. README documents the manual-only flow and the auto-attach descope.

Co-authored-by: Cursor <cursoragent@cursor.com>

tanderson-ld · 2026-06-30T13:02:43Z

Do you really want to throw exceptions with Objects.requireNonNull? As long as these indicate a LaunchDarkly developer error this could be ok. If there are any paths where a customer may not exercise the code during their development testing and it only gets triggered later once it has shipped, that could be a problem.

We generally try to avoid this and prefer reasonable defaults if possible.
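
To make the suggestion concrete, here is a small sketch of the "reasonable default" style as an alternative to `Objects.requireNonNull`. The class `NullToleranceSketch`, the helper names, and the 30-second default are hypothetical and chosen only for illustration; they are not taken from this PR.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.List;

/** Illustrative only; helper names and the default value are assumptions. */
final class NullToleranceSketch {
  static final Duration DEFAULT_TIMEOUT = Duration.ofSeconds(30); // hypothetical default

  /** Falls back to a default instead of throwing when the caller passes null or a non-positive value. */
  static Duration timeoutOrDefault(Duration requested) {
    if (requested == null || requested.isZero() || requested.isNegative()) {
      return DEFAULT_TIMEOUT;
    }
    return requested;
  }

  /** Treats a null list as "nothing to do" rather than as a programming error. */
  static <T> List<T> listOrEmpty(List<T> items) {
    return items == null ? Collections.<T>emptyList() : items;
  }
}
```

The trade-off is that a silent default can hide a caller mistake, so logging a warning when the fallback is taken may be worth considering.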

tanderson-ld · 2026-06-30T13:07:56Z

+      return Collections.emptyList();
+    }
+
+    ExecutorService pool = Executors.newFixedThreadPool(judges.size());


Double check that there aren't performance implications here, it may take a bit to get those threads allocated and ready to do work. You may benefit from an already warmed up thread pool instead of making one each call. Also consider what happens if there are many parallel evaluations and no threads can be given due to system constraints.

Iirc, in some of our SDKs we make executors for various types of work when we make the SDK instance and then pass those executors around to share the threads and avoid unbounded thread allocation situations.
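
A sketch of the shared-executor idea, assuming a pool that is created once (for example alongside the SDK instance), pre-started, and reused by every evaluation. `SharedPoolSketch`, `runAll`, and the choice to report failures as null are hypothetical; the only real APIs used are `ThreadPoolExecutor` and its `prestartAllCoreThreads()` method from `java.util.concurrent`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Illustrative only; a long-lived, bounded pool shared across calls. */
final class SharedPoolSketch implements AutoCloseable {
  private final ThreadPoolExecutor pool;

  SharedPoolSketch(int threads) {
    // Fixed-size pool with an unbounded queue: extra work waits instead of spawning threads.
    pool = new ThreadPoolExecutor(
        threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    pool.prestartAllCoreThreads(); // warm up now so the first evaluation pays no startup cost
  }

  /** Runs all tasks on the shared pool and returns results in submission order (null on failure). */
  <T> List<T> runAll(List<Callable<T>> tasks) {
    List<Future<T>> futures = new ArrayList<>();
    for (Callable<T> task : tasks) {
      futures.add(pool.submit(task));
    }
    List<T> results = new ArrayList<>();
    for (Future<T> future : futures) {
      try {
        results.add(future.get());
      } catch (ExecutionException e) {
        results.add(null);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        results.add(null);
      }
    }
    return results;
  }

  /** Number of threads currently in the pool. */
  int liveThreads() {
    return pool.getPoolSize();
  }

  @Override
  public void close() {
    pool.shutdownNow();
  }
}
```

With a bounded pool, tasks can sit in the queue before they start, so any timeout measured from submission would include queue time; that interaction would need to be decided deliberately.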

tanderson-ld · 2026-06-30T13:12:53Z

+  public JudgeResult evaluate(String input, String output, double samplingRate) {
+    double effectiveRate = normalizeSampleRate(samplingRate);
+    String key = config.getKey();
+    LDAIConfigTracker tracker = config.createTracker();


Does this tracker need to survive multiple evaluate calls? It is fairly rare in SDKs to see a tracker factory invoked here.

mattrmc1 · 2026-06-30T20:34:02Z

Closing this pull request. We are purposefully omitting createJudge. The notes made in this PR have been addressed in their respective PRs.

See: #180

ctawiah marked this pull request as ready for review June 11, 2026 03:05

ctawiah requested a review from a team as a code owner June 11, 2026 03:05

ctawiah requested review from jsonbailey, mattrmc1 and tanderson-ld June 11, 2026 03:05

cursor Bot reviewed Jun 11, 2026

jsonbailey reviewed Jun 11, 2026

ctawiah force-pushed the feat/ai-sdk-tracker branch from 19d0f4f to 2ca9fc8 Compare June 11, 2026 21:29

ctawiah and others added 2 commits June 11, 2026 17:30

refactor: make Evaluator package-private (not public API in v1.0)

f6d4a4c

Co-authored-by: Cursor <cursoragent@cursor.com>

ctawiah force-pushed the feat/ai-sdk-evals branch from 83a342f to f6d4a4c Compare June 11, 2026 21:30

tanderson-ld reviewed Jun 30, 2026

mattrmc1 mentioned this pull request Jun 30, 2026

feat: Add AI online evaluations (Judge, Evaluator, IRunner) launchdarkly/dotnet-core#301

Open

6 tasks

mattrmc1 closed this Jun 30, 2026

ctawiah wants to merge 2 commits into feat/ai-sdk-tracker from feat/ai-sdk-evals
