feat: add manual judge evaluation (Judge, Evaluator, createJudge) (AIC-2665)#175

Closed
ctawiah wants to merge 2 commits into feat/ai-sdk-tracker from feat/ai-sdk-evals

Conversation

@ctawiah ctawiah commented Jun 11, 2026

Contributor

Requirements

  • I have added test coverage for new or changed functionality
  • I have followed the repository's pull request submission guidelines
  • I have validated my changes against all supported platform versions

Related issues

Stacked on #174 (AIC-2664). Review/merge that first; the diff here is against feat/ai-sdk-tracker.

Describe the solution you've provided

Implements the manual-only evaluation path for AI Configs. v1.0 does not auto-invoke judges on completion/agent calls; the caller drives evaluation: createJudge() → judge.evaluate(...) → track the result yourself.

  • Runner SPI + RunnerResult — caller-supplied model invocation. RunnerResult carries content, the run Metrics, and parsed structured output ({score, reasoning}). Provider-specific runners ship post-1.0.
  • Judge — sampling is decided before invoking the model; input is formatted as MESSAGE HISTORY:\n{input}\n\nRESPONSE TO EVALUATE:\n{output}; the runner is invoked via the tracker's trackMetricsOf so invocation metrics are recorded; score (0.0–1.0, out-of-range → failure) and reasoning are parsed. The judge returns a JudgeResult but does not call trackJudgeResult — recording is the caller's responsibility. evaluateMessages renders <role>: <content> history and delegates to evaluate. Sampling rate is normalized (NaN/Infinity → 1.0, negative → 0.0, >1 → 1.0).
  • Evaluator — runs a set of judges with per-judge fault isolation (a failing/timing-out judge yields a failed JudgeResult; others are preserved in order) and a per-judge timeout so a hung judge can't stall the chain. noop() returns an empty list with no warnings. Thread-safe; uses a short-lived executor per evaluate call.
  • LDAIClient.createJudge — fires only $ld:ai:usage:create-judge, resolves the judge config through the internal evaluate path (so no $ld:ai:usage:judge-config event), and returns null if the config is disabled or no runner is supplied.
  • README documents the manual-only flow and the auto-attach descope.
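The Judge rules listed above (sample-rate normalization, the evaluation input format, and score range validation) can be sketched as follows. This is a minimal illustration, not the PR's code: the method names normalizeSampleRate and buildEvaluationInput appear in the diff excerpts in this thread, but their bodies here are assumptions, and JudgeInputSketch and validateScore are hypothetical names. In particular, treating negative infinity as 1.0 is one reading of "NaN/Infinity → 1.0".

```java
import java.util.OptionalDouble;

/** Hypothetical holder class; only the rules come from the PR description. */
final class JudgeInputSketch {
    private JudgeInputSketch() {}

    /** NaN/Infinity -> 1.0, negative -> 0.0, greater than 1 -> 1.0 (assumed body). */
    static double normalizeSampleRate(double rate) {
        if (Double.isNaN(rate) || Double.isInfinite(rate)) {
            return 1.0;
        }
        if (rate < 0.0) {
            return 0.0;
        }
        return Math.min(rate, 1.0);
    }

    /** Builds the judge prompt in the format quoted in the description. */
    static String buildEvaluationInput(String input, String output) {
        return "MESSAGE HISTORY:\n" + input + "\n\nRESPONSE TO EVALUATE:\n" + output;
    }

    /** Returns the score if it lies in [0.0, 1.0]; empty signals a failed result. */
    static OptionalDouble validateScore(double score) {
        if (Double.isNaN(score) || score < 0.0 || score > 1.0) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(score);
    }
}
```

With a normalized rate of 0.0 the runner would never be invoked, which matches the zero-sampling skip case listed under Tests.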

The API surface is synchronous, consistent with the rest of this server SDK; the concurrency needed for per-judge timeouts is internal to Evaluator.

Tests

  • JudgeTest: scoring/metric key, input formatting, zero-sampling skip (runner not invoked), missing metric key, out-of-range score, missing reasoning, runner throw, runner failure metrics, evaluateMessages rendering, sample-rate normalization.
  • EvaluatorTest: noop() empty, order preservation, fault isolation, timeout isolation, completion-order independence.
  • LDAIClientImplTest: createJudge fires only create-judge (not judge-config), returns a Judge when enabled, null when disabled, null when no runner.

Describe alternatives you've considered

A CompletableFuture-based async API was considered but rejected for consistency with the synchronous server SDK surface. Automatic sample-rate-driven judge auto-attachment and provider runners are intentionally deferred past v1.0 (aligns with the .NET descope).

Additional context

JudgeResult was added in #174 (AIC-2664) and is reused here.


Note

Medium Risk
New public API and usage telemetry paths; judge runs depend on caller-supplied runners and correct manual tracking, but no changes to core flag evaluation or auth.

Overview
Adds manual-only AI response evaluation to the server AI SDK: callers use LDAIClient.createJudge() with a custom Runner, run Judge.evaluate() (or evaluateMessages()), and record scores via trackJudgeResult themselves—no auto-attachment on completion/agent calls in v1.0.

New public Runner / RunnerResult SPI for model calls; Judge applies sampling, formats judge input, invokes the runner through the config tracker for invocation metrics, and parses {score, reasoning}. Internal Evaluator runs multiple judges concurrently with per-judge timeouts and fault isolation. createJudge emits only $ld:ai:usage:create-judge (not judge-config) and returns null when disabled or no runner. README documents tracking and the manual judge flow.

Reviewed by Cursor Bugbot for commit f6d4a4c.

@ctawiah ctawiah marked this pull request as ready for review June 11, 2026 03:05
@ctawiah ctawiah requested a review from a team as a code owner June 11, 2026 03:05

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 83a342f. Configure here.

private JudgeResult awaitResult(Judge judge, Future<JudgeResult> future) {
    String key = judge.getAIConfig().getKey();
    try {
        return future.get(perJudgeTimeout.toMillis(), TimeUnit.MILLISECONDS);


Sequential waits break judge timeouts

High Severity

Evaluator starts all judges concurrently but awaits each Future in list order with a full perJudgeTimeout on every get. That timeout is measured from each get call, not from when the judge task started, so later judges can run far longer than the configured cap and evaluate can take up to judges.size() × perJudgeTimeout when multiple judges hang.
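One way to address this is to compute a single deadline when the tasks are submitted and give each get only the time remaining until that deadline. The sketch below assumes all judges start at roughly the same moment (one thread per task, as in the PR's fixed pool). DeadlineAwaitSketch, runAll, and the "TIMEOUT"/"FAILED" strings are hypothetical stand-ins for the Evaluator and its failed JudgeResult values.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for the Evaluator; strings replace JudgeResult. */
final class DeadlineAwaitSketch {
    private DeadlineAwaitSketch() {}

    static List<String> runAll(List<Callable<String>> tasks, Duration perTaskTimeout) {
        List<String> results = new ArrayList<>();
        if (tasks.isEmpty()) {
            return results;
        }
        ExecutorService pool = Executors.newFixedThreadPool(tasks.size());
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (Callable<String> task : tasks) {
                futures.add(pool.submit(task));
            }
            // One shared deadline, fixed right after submission, bounds every task.
            long deadline = System.nanoTime() + perTaskTimeout.toNanos();
            for (Future<String> future : futures) {
                // Wait only for whatever time is left, never a fresh full timeout.
                long remaining = Math.max(0L, deadline - System.nanoTime());
                try {
                    results.add(future.get(remaining, TimeUnit.NANOSECONDS));
                } catch (TimeoutException e) {
                    future.cancel(true);
                    results.add("TIMEOUT");
                } catch (ExecutionException e) {
                    results.add("FAILED");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    future.cancel(true);
                    results.add("FAILED");
                }
            }
            return results;
        } finally {
            pool.shutdownNow(); // interrupt anything still running
        }
    }
}
```

With this shape the whole call is bounded by roughly one perTaskTimeout regardless of how many tasks hang, and results still come back in submission order. A task that has already finished is returned even when the remaining time is zero.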


String evaluationInput = buildEvaluationInput(input, output);
RunnerResult response = tracker.trackMetricsOf(RunnerResult::getMetrics,
        () -> runner.run(evaluationInput));

Contributor

This may come later but we typically use structured outputs and would need to define the output shape for the run.

LDContext context,
AIJudgeConfigDefault defaultValue,
Map<String, Object> variables,
Runner runner,

Contributor

It might be worth talking about the plans for the experimental features of the SDK if you haven't already. The create judge method should be marked experimental. In node and python we don't accept a runner but build it internally so this is a change in the SDK. Not sure if this is temporary and will be addressed later.

@ctawiah ctawiah force-pushed the feat/ai-sdk-tracker branch from 19d0f4f to 2ca9fc8 Compare June 11, 2026 21:29
ctawiah and others added 2 commits June 11, 2026 17:30
…C-2665)

Implements the AIEVALS manual-only evaluation path:

- Runner SPI and RunnerResult for caller-supplied model invocation
- Judge: sampling decided before invocation, well-known input format,
  score/reasoning parsing with range validation, invocation tracked via
  trackMetricsOf (does not emit trackJudgeResult; caller's responsibility)
- Evaluator: per-judge fault isolation and per-judge timeout, order-preserving
  results, noop() returns an empty list; sampling-rate normalization on Judge
- LDAIClient.createJudge: fires only $ld:ai:usage:create-judge, resolves the
  judge config via the internal evaluate path, returns null when disabled or
  when no runner is supplied

Automatic judge auto-attachment and provider runners are deferred past v1.0.
README documents the manual-only flow and the auto-attach descope.

Co-authored-by: Cursor <cursoragent@cursor.com>
@ctawiah ctawiah force-pushed the feat/ai-sdk-evals branch from 83a342f to f6d4a4c Compare June 11, 2026 21:30

Contributor

Do you really want to throw exceptions with Objects.requireNonNull? As long as these indicate a LaunchDarkly developer error this could be ok. If there are any paths where a customer may not exercise the code during their development testing and it only gets triggered later once it has shipped, that could be a problem.

We generally try to avoid this and prefer reasonable defaults if possible.
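As an illustration of the "reasonable defaults" preference, null arguments can be mapped to safe values instead of being rejected with Objects.requireNonNull. The class and method names below are hypothetical, and which arguments the PR actually guards is not shown in this thread.

```java
import java.util.Collections;
import java.util.Map;

/** Hypothetical helpers: substitute a default instead of throwing on null. */
final class NullHandlingSketch {
    private NullHandlingSketch() {}

    /** A missing variables map is treated as "no variables". */
    static Map<String, Object> variablesOrEmpty(Map<String, Object> variables) {
        return variables == null ? Collections.emptyMap() : variables;
    }

    /** A missing input or output string is treated as empty text. */
    static String textOrEmpty(String text) {
        return text == null ? "" : text;
    }
}
```

This keeps a rarely exercised code path from throwing for the first time in production, at the cost of masking caller mistakes; logging a warning when the default is applied is one way to balance the two.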

return Collections.emptyList();
}

ExecutorService pool = Executors.newFixedThreadPool(judges.size());

Contributor

Double check that there aren't performance implications here, it may take a bit to get those threads allocated and ready to do work. You may benefit from an already warmed up thread pool instead of making one each call. Also consider what happens if there are many parallel evaluations and no threads can be given due to system constraints.

Iirc, in some of our SDKs we make executors for various types of work when we make the SDK instance and then pass those executors around to share the threads and avoid unbounded thread allocation situations.
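A sketch of that shared-executor approach: a bounded pool is created once (for example alongside the SDK instance) and reused for every evaluation, rather than building a new pool per call. SharedPoolSketch, its constructor parameter, and invokeAllOrFallback are hypothetical names; nothing here reflects the SDK's actual wiring.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical long-lived pool, created once and shared across evaluate calls. */
final class SharedPoolSketch implements AutoCloseable {
    private final ExecutorService pool;

    SharedPoolSketch(int maxThreads) {
        // Bounded thread count; extra work waits in the queue instead of spawning threads.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                maxThreads, maxThreads, 30L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>(),
                runnable -> {
                    Thread thread = new Thread(runnable, "judge-eval");
                    thread.setDaemon(true); // do not keep the JVM alive
                    return thread;
                });
        executor.allowCoreThreadTimeOut(true); // release idle threads after 30 seconds
        this.pool = executor;
    }

    /** Runs all tasks on the shared pool; a failing task yields the fallback value. */
    <T> List<T> invokeAllOrFallback(List<Callable<T>> tasks, T fallback) {
        List<Future<T>> futures = new ArrayList<>(tasks.size());
        for (Callable<T> task : tasks) {
            futures.add(pool.submit(task));
        }
        List<T> results = new ArrayList<>(tasks.size());
        for (Future<T> future : futures) {
            try {
                results.add(future.get());
            } catch (ExecutionException e) {
                results.add(fallback);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                results.add(fallback);
            }
        }
        return results;
    }

    @Override
    public void close() {
        pool.shutdownNow();
    }
}
```

One trade-off to keep in mind: with a bounded pool, tasks may sit in the queue under load, so a timeout measured from submission would include queue wait as well as run time.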

public JudgeResult evaluate(String input, String output, double samplingRate) {
    double effectiveRate = normalizeSampleRate(samplingRate);
    String key = config.getKey();
    LDAIConfigTracker tracker = config.createTracker();

Contributor

Does this tracker need to survive multiple evaluate calls? It is fairly rare in SDKs to see a tracker factory invoked here.

@mattrmc1

Contributor

Closing this pull request. We are purposefully omitting createJudge. The notes made in this PR have been addressed in their respective PRs.

See: #180

@mattrmc1 mattrmc1 closed this Jun 30, 2026

4 participants