Add RFC 0007: Scorer Presets with Customization and Persistence by Nehanth · Pull Request #13 · mlflow/rfcs

Nehanth · 2026-04-28T14:50:23Z

Summary

This RFC proposes a Preset class for MLflow that packages a named, immutable collection of scorers for common evaluation patterns. MLflow ships three built-in presets (Rag, Agent, ConversationalAgent) as starting points. Users can create custom presets, persist them to the MLflow server, and share them across teams and sessions.

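import mlflow
# Preset and get_preset are the APIs proposed by this RFC;
# eval_dataset and my_custom_scorer are assumed to be defined elsewhere.
from mlflow.genai.scorers import Agent, Fluency, Preset, Safety, get_preset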

# Use a built-in preset
mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[Agent()],
)

# Create and persist a custom preset
my_preset = Preset("my_team_eval", scorers=[Safety(), Fluency(), my_custom_scorer])
my_preset.register()

# Load it later
preset = get_preset(name="my_team_eval")
mlflow.genai.evaluate(data=eval_dataset, scorers=[preset])

Based on mlflow/mlflow#21445.

Co-Authored-By: Claude noreply@anthropic.com

Proposes a Preset class that packages a named collection of scorers for common evaluation patterns (RAG, agent, conversational-agent, safety, quality). Presets can be passed directly in the scorers list alongside individual scorers, with automatic deduplication. Based on mlflow/mlflow#21445. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>

Each built-in preset is now a subclass (Agent, Rag, ConversationalAgent, SafetyPreset, Quality) that creates fresh scorer instances on each call. Eliminates shared mutable state and enables future preset-specific configuration. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>

mprahl · 2026-05-05T21:16:20Z

+        return len(self._scorers)
+
+    def __add__(self, other):
+        if isinstance(other, (Preset, list)):


Does it make sense to deduplicate scorers when combining presets especially since some have overlap? Edit: I see this is addressed below. It might be good to mention it here too.

Yes, I've added deduplication in __add__ so duplicates are removed when presets are combined.

mprahl · 2026-05-05T21:19:31Z

+`validate_scorers()` deduplicates by scorer type after flattening:
+
+```python
+def _deduplicate_scorers(scorers: list[Scorer]) -> list[Scorer]:


We should keep Preset deduplicated whenever scorers are added instead of relying on validate_scorers to do it IMO.

I think we should have both: __add__ (where I also added deduplication) handles deduplication when a user combines presets or scorers using the + operator, and validate_scorers() handles it when multiple presets are passed directly in the scorers list.
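The second path described above, where presets sit directly in the scorers list, amounts to flattening the list before evaluation. A minimal sketch, assuming stand-in `Scorer` and `Preset` classes rather than MLflow's real ones; `flatten_scorers` is a hypothetical helper name, not the RFC's `validate_scorers()`:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorer:
    """Stand-in for an MLflow scorer; only the name matters here."""
    name: str


class Preset:
    """Stand-in preset: a named, ordered bundle of scorers."""

    def __init__(self, name, scorers):
        self.name = name
        self._scorers = tuple(scorers)

    def __iter__(self):
        return iter(self._scorers)


def flatten_scorers(items):
    """Expand every Preset in `items` into its scorers, preserving order."""
    flat = []
    for item in items:
        if isinstance(item, Preset):
            flat.extend(item)
        else:
            flat.append(item)
    return flat


agent = Preset("agent", [Scorer("safety"), Scorer("tool_call_correctness")])
flat = flatten_scorers([agent, Scorer("fluency")])
print([s.name for s in flat])  # ['safety', 'tool_call_correctness', 'fluency']
```

After this step the evaluation loop only ever sees individual scorers, which is why deduplication would have to run on the flattened list.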

mprahl · 2026-05-05T21:22:12Z

@@ -0,0 +1,682 @@
+---


This RFC has too many implementation details. The RFC should be scoped to requirements, APIs, database schema changes, etc. This is delving into PR level implementation.

Can we keep this scoped to the desired UX, the proposed groupings, and the class definition?

Trimmed it down!

mprahl · 2026-05-05T21:26:39Z

+- **Type checking.** `isinstance(preset, Agent)` works — code can distinguish which preset is being used.
+- **Custom control flow.** Each preset can override methods for preset-specific validation or behavior.
+
+### Deduplication


How does deduplication account for scorers with different configurations (e.g. ConversationalGuidelines with separate configs)?

Deduplication works by (type(scorer), scorer.name), so if you use different names for your
ConversationalGuidelines scorers, they'll be kept. If two scorers have the same type and name, one gets removed.
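The keying rule in this reply can be sketched as follows. The scorer classes are stand-ins, the default name is invented for the example, and keeping the first occurrence is an assumption (the reply only says that one of the two is removed):

```python
class Scorer:
    """Stand-in for an MLflow scorer (assumption: real scorers expose `.name`)."""

    def __init__(self, name=None):
        # Default the name to the lowercased class name, for this sketch only.
        self.name = name or type(self).__name__.lower()


class Safety(Scorer):
    pass


class ConversationalGuidelines(Scorer):
    pass


def deduplicate(scorers):
    """Keep the first scorer seen for each (type, name) key, preserving order."""
    seen = set()
    unique = []
    for scorer in scorers:
        key = (type(scorer), scorer.name)
        if key not in seen:
            seen.add(key)
            unique.append(scorer)
    return unique


scorers = [
    Safety(),
    ConversationalGuidelines(name="tone"),
    ConversationalGuidelines(name="brevity"),  # same type, different name: kept
    Safety(),  # same type and name as the first entry: dropped
]
unique = deduplicate(scorers)
print([s.name for s in unique])  # ['safety', 'tone', 'brevity']
```

Note that two scorers with the same type and name but different configurations would still collapse into one under this key.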

mprahl · 2026-05-05T21:28:33Z

+
+MLflow ships five built-in preset subclasses. Each call creates fresh scorer instances.
+
+> **Note:** **`TaskSuccess`** is a new scorer proposed in [mlflow/mlflow#22972](https://github.com/mlflow/mlflow/issues/22972). It evaluates whether an agent successfully accomplished the user's task without requiring ground truth data — unlike `Correctness`, which requires an `expectations` column. This scorer would be added to `Agent`, `ConversationalAgent`, and `Quality`. This work can be part of this RFC or be a future addition after this RFC is completed.


Is this relevant for this RFC? It seems out of scope.

You're right, it's out of scope for this RFC. I'll remove it. I did create a separate issue for it (mlflow#22972); once it's implemented, it would be added to the Agent, ConversationalAgent, and Quality presets.

mprahl · 2026-05-05T21:30:08Z

+
+### Built-in Presets as Subclasses
+
+Each built-in preset is a subclass of `Preset` that hardcodes its scorer list. This means each call creates **fresh scorer instances** (no shared mutable singletons) and opens the door for preset-specific configuration and control flow in the future.


I don't really see a benefit to these presets if you can't configure the scorers. If it's merely to categorize different scorers, there are better ways to do this such as documentation or a function like get_agent_scorers() which returns the classes for discoverability.

The reason I’m going in this direction is that, after this RFC is approved, subclassing opens up more than just categorization. For example, each preset could accept different judge models (e.g., Agent(model="openai:/gpt-4o")), different inference parameters, or other preset-specific configurations.

If this RFC gets approved, I can also update it to use the function-based approach I proposed in the alternatives section, if that’s preferred.

I agree with Matt about customization capability. The class-based approach is more ergonomic to me, but I think it should come with good customizability and persistence.

mprahl · 2026-05-05T21:32:56Z

+    def __init__(self):
+        super().__init__("rag", [
+            RetrievalRelevance(),
+            RetrievalSufficiency(),


Doesn't this require expected_response or expected_facts?

Yes, RetrievalSufficiency does require ground truth, so I'm removing it from this preset. We want presets to work out of the box without requiring ground truth for easy evaluation.

- Add deduplication to Preset __init__ and __add__
- Remove TaskSuccess from presets (out of scope)
- Remove RetrievalSufficiency from Rag preset (requires ground truth)
- Trim implementation details per reviewer feedback

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>

mprahl · 2026-05-18T20:13:43Z

+    def __len__(self):
+        return len(self._scorers)
+
+    def __add__(self, other):


Do you think it'd be less surprising to implement __or__ for "set" union behavior since you are deduplicating?

Yeah, just did that!

mprahl

I’m not convinced this should be a feature in its current form. Without a way to configure or cleanly override scorers inside a preset, the value seems limited to boilerplate reduction. At the same time, if built-in preset membership changes across releases, users will get silent behavior and cost changes on upgrade. That combination makes presets feel more like unstable convenience sugar than a strong API surface.

B-Step62

Thank you for the RFC submission @Nehanth!

The idea of bundling multiple scorers into a shareable preset sounds very useful to me. My main feedback is to support persistence of presets in the MLflow server and provide a great customization UX. I can see that built-in presets are useful for helping new users get started and for answering the "which scorer should I start with?" question, but many teams need customization to adopt them for real use cases.

B-Step62 · 2026-05-18T20:27:24Z

+)
+```
+
+```python


I believe we need to persist presets beyond a single Python session. Otherwise L65 is essentially the same as defining a normal Python list of scorers. The team-sharing benefit mentioned below only holds with persistence in a shared server.

B-Step62 · 2026-05-18T20:28:51Z

+        self._scorers = tuple(self._deduplicate(scorers))
+
+    @staticmethod
+    def _deduplicate(scorers):


nit: Instead of deduplication, I would block adding an existing scorer in the preset.
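A minimal sketch of the stricter behavior suggested here: reject a repeated scorer instead of silently dropping it. The `Scorer` class is a stand-in, and both the `(type, name)` key and the use of `ValueError` are assumptions for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorer:
    """Stand-in for an MLflow scorer; only the name matters here."""
    name: str


def validate_no_duplicates(scorers):
    """Return the scorers as a tuple, or raise if any (type, name) repeats."""
    seen = set()
    for scorer in scorers:
        key = (type(scorer), scorer.name)
        if key in seen:
            raise ValueError(f"Duplicate scorer in preset: {scorer.name!r}")
        seen.add(key)
    return tuple(scorers)


ok = validate_no_duplicates([Scorer("safety"), Scorer("fluency")])

try:
    validate_no_duplicates([Scorer("safety"), Scorer("safety")])
    rejected = False
except ValueError:
    rejected = True
```

Raising makes the conflict visible at construction time, so the user decides which configuration to keep.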

B-Step62 · 2026-05-18T20:30:58Z

+
+### Built-in Presets as Subclasses
+
+Each built-in preset is a subclass of `Preset` that hardcodes its scorer list. This means each call creates **fresh scorer instances** (no shared mutable singletons) and opens the door for preset-specific configuration and control flow in the future.


I agree with Matt about customization capability. The class-based approach is more ergonomic to me, but I think it should come with good customizability and persistence.

B-Step62 · 2026-05-18T20:32:36Z

+class SafetyPreset(Preset):
+    def __init__(self):
+        super().__init__("safety", [
+            Safety(),
+            ConversationalSafety(),
+        ])
+
+class Quality(Preset):
+    def __init__(self):
+        super().__init__("quality", [
+            RelevanceToQuery(),
+            Fluency(),
+            Completeness(),
+        ])


The RAG/Agent/Conversation presets look great; the other two look less useful or intuitive to me. The Safety one just bundles two scorers with 'safety' in their names. "Quality" sounds a bit too vague IMO; technically most scorers aim to evaluate some aspect of agent quality. Shall we start with the first three plus a great customization story?

etirelli · 2026-05-19T19:42:10Z

I fully agree with @mprahl on this:

if built-in preset membership changes across releases, users will get silent behavior and cost changes on upgrade. That combination makes presets feel more like unstable convenience sugar than a strong API surface.

If this is not handled properly, this feature would end up being a burden for us to maintain, and create friction for users on upgrades, rather than being a benefit.
One way to solve this issue is to implement persistence as suggested by @B-Step62, as that would empower users to control how they use and manage presets across versions. In that case, a function-based implementation would be more flexible and efficient than one based on a class hierarchy.

subclassing opens up more than just categorization. For example, each preset could accept different judge models (e.g, Agent(model="openai:/gpt-4o")), different inference parameters, or other preset-specific configurations.

A function based implementation does not prevent such customizations.
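To make the point concrete, here is a hedged sketch of a function-based preset that still accepts a judge model. `Scorer` is a stand-in class, the scorer names are illustrative rather than the RFC's actual Agent membership, and this `get_agent_scorers` signature is hypothetical (the earlier review comment imagined it returning classes, not instances):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Scorer:
    """Stand-in for an MLflow scorer with an optional judge model."""
    name: str
    model: Optional[str] = None


def get_agent_scorers(model: Optional[str] = None) -> list:
    """Build fresh scorer instances for agent evaluation on every call."""
    names = ["safety", "relevance_to_query", "tool_call_correctness"]
    return [Scorer(name, model) for name in names]


default_scorers = get_agent_scorers()
custom_scorers = get_agent_scorers(model="openai:/gpt-4o")
print([s.model for s in custom_scorers])
```

Because the function returns a plain list, the result can be passed to an evaluation call or extended with extra scorers without any new type.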

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>

Nehanth · 2026-06-01T15:31:28Z

Sorry for the delay in addressing comments.

Thanks for the feedback @mprahl @B-Step62 @etirelli. I updated the RFC to address the concern around this being a burden to maintain.

As @B-Step62 suggested, I dropped the Safety and Quality presets and kept Rag, Agent, and ConversationalAgent as starting points for new users. I also added customization and persistence so teams can create custom presets, save them to the MLflow server, and share them across sessions. This addresses the version stability concern since persisted presets are user-controlled and don't change on upgrade. Please take a look.

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>

jwm4

This RFC is on the right track. The core idea of bundling scorers into named, shareable presets is solid.

My main feedback is that the persistence story needs more user-facing specification (see my comment on that section), and that the RFC still leans heavily into implementation detail (private method bodies, deduplication internals, __or__/__ror__ implementations). An RFC should nail down the requirements and the user-facing API. Implementation choices like how deduplication is coded or how subclass constructors work are better left to the PR.

jwm4 · 2026-06-11T15:53:41Z

+
+### Persistence
+
+Presets can be registered to the MLflow server so teams can share them across sessions. This leverages the existing scorer registration infrastructure.


Persistence looks very useful to me. The API surface (register() / get_preset()) is clear, but a few user-facing behavioral questions should be answered at the RFC level:

Custom scorer portability. The example shows my_custom_scorer in a persisted preset. When a teammate calls get_preset("my_team_eval"), what do they get back? Do they need the same custom scorer code available in their environment? What happens if they don't? This is central to the team-sharing story.

Scope/namespace. The experiment_id parameter suggests presets are scoped to experiments, but the default behavior (no experiment_id) isn't specified. Are presets workspace-global by default? If I register to experiment A, can someone working in experiment B see it?

Discovery. There's get_preset(name=...) for retrieving a known preset, but no way to ask "what presets are available here?" Something like list_presets() seems like a natural companion, especially for the team-sharing use case.

These don't need implementation details, just a description of the intended user experience.

Added, I've addressed this from Line 232

jwm4 · 2026-06-11T16:00:50Z

+    def __init__(self):
+        super().__init__()
+        # Add team-specific scorers
+        self._scorers = self._scorers + (Fluency(), my_compliance_scorer)


This implementation reassigns self._scorers after super().__init__() has already run _validate_no_duplicates, so the added scorers bypass the duplicate check. Consider having the constructor accept extra scorers so the full list goes through validation, e.g.:

class MyAgent(Agent):
    def __init__(self):
        super().__init__(extra_scorers=[Fluency(), my_compliance_scorer])

Alternatively, this level of implementation detail could be dropped from the RFC entirely. The RFC really should be focused more on the specification and not the code.

Dropped the implementation detail.
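For readers following the thread, the constructor-based fix suggested above can be sketched end to end. Everything here is a stand-in: the `Scorer` class, the two built-in scorer names, and the name-only duplicate check are assumptions made to keep the example short:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorer:
    """Stand-in for an MLflow scorer."""
    name: str


class Preset:
    def __init__(self, name, scorers):
        self.name = name
        # Every scorer, built-in or extra, passes through the same check.
        self._scorers = self._validate_no_duplicates(scorers)

    @staticmethod
    def _validate_no_duplicates(scorers):
        names = [s.name for s in scorers]
        if len(names) != len(set(names)):
            raise ValueError(f"Duplicate scorers: {names}")
        return tuple(scorers)

    def __len__(self):
        return len(self._scorers)


class Agent(Preset):
    def __init__(self, extra_scorers=()):
        built_in = [Scorer("safety"), Scorer("relevance_to_query")]
        super().__init__("agent", [*built_in, *extra_scorers])


class MyAgent(Agent):
    def __init__(self):
        super().__init__(extra_scorers=[Scorer("fluency"), Scorer("compliance")])


print(len(MyAgent()))  # 4
```

Passing the extras up to the base constructor means a subclass cannot bypass validation by reassigning private state.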

jwm4 · 2026-06-11T16:04:03Z

+    def __or__(self, other):
+        if isinstance(other, (Preset, list)):
+            combined = list(self) + list(other)
+            return self._deduplicate(combined)


_deduplicate returns a plain list, so Agent() | Rag() gives you a list, not a Preset. That means you can't chain (Agent() | Rag() | [my_scorer]), can't call register() on the result, and the return type silently changes from Preset to list. This should probably return a new Preset.

Alternatively, this level of implementation detail could be dropped from the RFC, with just the requirement that | combines presets with deduplication and produces a Preset.

Dropped the implementation detail.
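The requirement stated in this thread, that `|` yields a `Preset` so the result stays chainable, can be illustrated with a small sketch. The classes are stand-ins, and the combined preset's name is an invented convention. The RFC later dropped preset combining altogether, so this only illustrates the review point:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorer:
    """Stand-in for an MLflow scorer; equal when names match."""
    name: str


class Preset:
    def __init__(self, name, scorers):
        self.name = name
        # dict.fromkeys drops repeats while keeping first-seen order.
        self._scorers = tuple(dict.fromkeys(scorers))

    def __iter__(self):
        return iter(self._scorers)

    def __or__(self, other):
        if isinstance(other, Preset):
            label = other.name
        elif isinstance(other, list):
            label = "custom"
        else:
            return NotImplemented
        # Returning a Preset (not a list) keeps the result chainable.
        return Preset(f"{self.name}+{label}", [*self, *other])


agent = Preset("agent", [Scorer("safety"), Scorer("tool_call_correctness")])
rag = Preset("rag", [Scorer("safety"), Scorer("retrieval_relevance")])
combined = agent | rag | [Scorer("fluency")]
print(type(combined).__name__, [s.name for s in combined])
```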

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>

Nehanth · 2026-06-16T16:18:55Z

This RFC is on the right track. The core idea of bundling scorers into named, shareable presets is solid.

My main feedback is that the persistence story needs more user-facing specification (see my comment on that section), and that the RFC still leans heavily into implementation detail (private method bodies, deduplication internals, __or__/__ror__ implementations). An RFC should nail down the requirements and the user-facing API. Implementation choices like how deduplication is coded or how subclass constructors work are better left to the PR.

Thanks @jwm4 for the feedback. I've addressed your comments in the RFC, please take a look.

jwm4

The revision addresses my feedback well. The implementation detail has been replaced with clear API contracts, and the persistence section now clearly specifies the user-facing behavior. To me, this looks ready to move forward.

mprahl · 2026-07-06T15:09:46Z

+
+Every team building agent evaluation follows this same pattern. This creates three problems (from the [original issue](https://github.com/mlflow/mlflow/issues/21445)):
+
+1. **No built-in grouping.** `get_all_scorers()` returns all 19 default-constructible scorers. Users evaluating a RAG pipeline get `ToolCallCorrectness`; users evaluating an agent get `RetrievalGroundedness`. Each unnecessary scorer wastes an LLM API call.


How would you feel about deferring the built-in groups and focusing more on allowing custom presets to be persisted and on optimizing sharing? I feel like the built-in groups are somewhat subjective and the real benefit comes from an organization's specific presets.

Based on earlier feedback from @B-Step62, I already narrowed it down to just 3 (Rag, Agent, ConversationalAgent): these are just intro presets for new users getting started. Aside from that, the RFC is focused on customization and persistence as you suggested.

mprahl · 2026-07-06T15:15:49Z

+
+**Persistence behavior:**
+
+- **Scope.** Presets are scoped to experiments, consistent with how scorer registration already works in MLflow. This prevents name collisions across teams and ensures presets are organized alongside the experiments they evaluate. If no `experiment_id` is provided, the active experiment is used.


I see value in allowing this to be scoped at the workspace level as well. At the very least, we should make it easy with the MLflow Python client to copy a preset to another experiment, since an experiment typically maps to a single agent.

Added copy_preset(name, from_experiment_id, to_experiment_id) so users can easily share presets across experiments.

nit: I guess we don't need from_experiment_id?

Also, one random thought: there are actually more entities that would be nice to share across experiments (e.g. dataset, scorer itself, prompt, traces). I feel we could benefit from an abstract copy or clone interface on the entity class so we don't need to add many copy_xyz APIs.

mprahl · 2026-07-06T15:16:24Z

+my_preset.register()
+
+# Or register to a specific experiment
+my_preset.register(experiment_id="123")


Should we add a concept of versioning?

Yes, presets support versioning. Calling register() again creates a new version. Users can load a specific version via get_preset(name, version=N) or default to the latest.
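The versioning semantics described in this reply can be sketched with an in-memory store. This is an illustration only: the store class, the record type, storing scorer names instead of scorer objects, and versions starting at 1 are all assumptions, not the RFC's server implementation:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PresetVersion:
    """One immutable, registered version of a preset."""
    name: str
    version: int
    scorer_names: tuple


class InMemoryPresetStore:
    def __init__(self):
        self._versions = {}  # preset name -> list of PresetVersion, oldest first

    def register(self, name, scorer_names):
        """Registering an existing name appends a new version."""
        history = self._versions.setdefault(name, [])
        record = PresetVersion(name, len(history) + 1, tuple(scorer_names))
        history.append(record)
        return record

    def get_preset(self, name, version=None):
        """Return the requested version, or the latest when version is None."""
        history = self._versions.get(name)
        if not history:
            raise KeyError(f"No preset named {name!r}")
        if version is None:
            return history[-1]
        if not 1 <= version <= len(history):
            raise KeyError(f"Preset {name!r} has no version {version}")
        return history[version - 1]


store = InMemoryPresetStore()
store.register("my_team_eval", ["safety"])
store.register("my_team_eval", ["safety", "fluency"])
latest = store.get_preset("my_team_eval")
first = store.get_preset("my_team_eval", version=1)
print(latest.version, first.scorer_names)  # 2 ('safety',)
```

Pinning a version in evaluation code is what protects a team from a teammate's later `register()` call changing results.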

mprahl · 2026-07-06T15:17:48Z

+from mlflow.genai.scorers import get_preset
+
+# Load from the active experiment
+preset = get_preset(name="my_team_agent")


What is the proposed return type of preset? It seems like we need server-side metadata of where the preset came from at least.

get_preset() returns a Preset object with server-side metadata attached (experiment ID, version, created timestamp).

mprahl · 2026-07-06T15:18:07Z

+my_preset.register()
+
+# Or register to a specific experiment
+my_preset.register(experiment_id="123")


Can I update a preset?

Yes. Calling register() with the same name creates a new version.

mprahl · 2026-07-06T15:19:32Z

+
+When presets are combined using `|`, the same scorer type can appear more than once. For example, `Agent()` and `Rag()` both contain `Safety()` and `RelevanceToQuery()`. Running the same scorer twice wastes LLM API calls and produces duplicate result columns.
+
+Deduplication happens in two places:


I'm not sure if this is safe to do since you could potentially have multiple scorers in the preset but each having different configurations. I think we can either drop the ability to combine presets or make the deduplication smarter.

Good point. I dropped preset combining (|) entirely. Users create custom presets from scratch with the exact scorers they want.

mprahl · 2026-07-06T15:20:08Z

+
+- **Scope.** Presets are scoped to experiments, consistent with how scorer registration already works in MLflow. This prevents name collisions across teams and ensures presets are organized alongside the experiments they evaluate. If no `experiment_id` is provided, the active experiment is used.
+- **Custom scorer portability.** If a preset contains custom scorers, those scorers must be registered first. When a teammate loads the preset, the custom scorers are resolved from the registry. If a custom scorer is not registered, `preset.register()` will raise an error.
+- **Discovery.** `list_presets()` returns all registered presets for the current experiment, allowing teams to discover what presets are available. This follows the same pattern as `list_scorers()`.


Could you share more of the API shape you are thinking of for get, list, and update?

Added the full CRUD API: register(), get_preset(), list_presets(), delete_preset(), copy_preset(). See the updated Persistence section.
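As a rough picture of how these operations fit together under experiment scoping, here is an in-memory sketch. The method names follow the list above, but the signatures, the `StoredPreset` record, and the explicit `experiment_id` arguments are assumptions; versioning is left out to keep it short:

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class StoredPreset:
    """A preset plus the server-side metadata attached on registration."""
    name: str
    experiment_id: str
    scorer_names: tuple
    created_at: datetime


class PresetStore:
    def __init__(self):
        self._by_key = {}  # (experiment_id, name) -> StoredPreset

    def register(self, name, scorer_names, experiment_id):
        record = StoredPreset(
            name, experiment_id, tuple(scorer_names), datetime.now(timezone.utc)
        )
        self._by_key[(experiment_id, name)] = record
        return record

    def get_preset(self, name, experiment_id):
        return self._by_key[(experiment_id, name)]

    def list_presets(self, experiment_id):
        """Names of presets visible in one experiment only."""
        return sorted(n for (exp, n) in self._by_key if exp == experiment_id)

    def delete_preset(self, name, experiment_id):
        del self._by_key[(experiment_id, name)]

    def copy_preset(self, name, from_experiment_id, to_experiment_id):
        source = self.get_preset(name, from_experiment_id)
        return self.register(source.name, source.scorer_names, to_experiment_id)


store = PresetStore()
store.register("my_team_agent", ["safety", "fluency"], experiment_id="123")
visible_before = store.list_presets("456")
store.copy_preset("my_team_agent", from_experiment_id="123", to_experiment_id="456")
store.delete_preset("my_team_agent", experiment_id="123")
visible_after = store.list_presets("456")
print(visible_before, visible_after)  # [] ['my_team_agent']
```

The copy is an independent record, so deleting the original in experiment 123 leaves the copy in 456 untouched.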

mprahl · 2026-07-06T15:20:29Z

+my_preset.register()
+
+# Or register to a specific experiment
+my_preset.register(experiment_id="123")


Can I delete a preset?

Yes, added delete_preset(name).

mprahl · 2026-07-06T15:21:10Z

+
+`evaluate()` itself does not change. By the time scorers reach the evaluation loop, they are all individual `Scorer` instances.
+
+### Built-in Preset Summary


As mentioned previously, I think we can defer this section for now and focus on the persistence and versioning.

Same as above — keeping the three built-ins as starting points per @B-Step62's feedback, but the focus of the RFC is on the persistence and customization story.

Yeah, I think it's nice to have several built-in presets for educational purposes for this feature.

mprahl

This seems like a great feature! Thanks for the proposal. I left a few comments for clarification.

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>

Nehanth · 2026-07-10T18:06:41Z

Thanks @mprahl for the detailed feedback. I've updated the RFC to address all your comments:

- Dropped preset combining (|) since it's unsafe with different scorer configurations
- Added full CRUD API: register(), get_preset(), list_presets(), delete_preset(), copy_preset()
- Added versioning support
- Added copy_preset() for cross-experiment sharing
- get_preset() returns a Preset with server-side metadata (experiment ID, version, timestamp)

Please take a look.

B-Step62

Overall looks good, sounds like a pretty useful feature!

#13 (comment) is not a blocker for starting the implementation. We can discuss UI in parallel to merging backend PRs.

B-Step62 · 2026-07-10T18:28:35Z

+my_preset.register()
+
+# Later, another team member loads it
+from mlflow.genai.scorers import get_preset


Suggested change

-from mlflow.genai.scorers import get_preset
+from mlflow.genai.scorers import get_scorer_preset

nit: to make API name self-explanatory

B-Step62 · 2026-07-10T18:31:39Z

+copy_preset(name="my_team_agent", from_experiment_id="123", to_experiment_id="456")
+
+# Use in evaluation
+result = mlflow.genai.evaluate(data=eval_dataset, scorers=[preset])


Can we do scorers=[preset.scorers] to keep the API interface unchanged rather than make it handle the new preset type?

B-Step62 · 2026-07-10T18:36:21Z

+
+**Persistence behavior:**
+
+- **Scope.** Presets are scoped to experiments, consistent with how scorer registration already works in MLflow. This prevents name collisions across teams and ensures presets are organized alongside the experiments they evaluate. If no `experiment_id` is provided, the active experiment is used.


nit: I guess we don't need from_experiment_id?

Also, one random thought: there are actually more entities that would be nice to share across experiments (e.g. dataset, scorer itself, prompt, traces). I feel we could benefit from an abstract copy or clone interface on the entity class so we don't need to add many copy_xyz APIs.

B-Step62 · 2026-07-10T18:37:14Z

+
+`evaluate()` itself does not change. By the time scorers reach the evaluation loop, they are all individual `Scorer` instances.
+
+### Built-in Preset Summary


Yeah, I think it's nice to have several built-in presets for educational purposes for this feature.

B-Step62 · 2026-07-10T18:40:21Z

@@ -0,0 +1,282 @@
+---


[Not a blocker for starting implementation] Can we also discuss the UI surface for presets? Currently MLflow has a Judge tab that lists all registered scorers.

It would be nice if presets could also be listed and managed (CRUD) from the UI, with the three built-in presets listed as a quickstart. That would also help the discoverability of the feature a lot.

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>

Nehanth · 2026-07-10T18:56:45Z

Thanks @B-Step62 for the feedback. I've updated the RFC with your suggestions — renamed to get_scorer_preset, changed to scorers=list(preset.scorers) to keep the evaluate interface unchanged, and updated copy to preset.copy(to_experiment_id). For the UI surface, would you suggest creating a separate RFC to discuss that?

B-Step62 · 2026-07-10T19:00:23Z

Thanks for the quick turnaround @Nehanth!

I prefer adding the UI discussion to this RFC because our stance is that UI/UX is an essential part of the feature. That said, I don't want to slow this down, so feel free to start the backend implementation once @mprahl signs off.

Btw, a small thing about scorers=list(preset.scorers): is there any reason we use a tuple for the return object? The list casting seems unnecessary otherwise.

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>

Nehanth · 2026-07-10T19:20:03Z

Thanks for the quick turnaround @Nehanth!

I prefer adding the UI discussion to this RFC because our stance is that UI/UX is an essential part of the feature. That said, I don't want to slow this down, so feel free to start the backend implementation once @mprahl signs off.

Btw, a small thing about scorers=list(preset.scorers): is there any reason we use a tuple for the return object? The list casting seems unnecessary otherwise.

Changed scorers to return a list. As for the UI, I'll get started on it and add a UI section to the RFC.
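The outcome of this exchange, a `scorers` property that returns a list so the evaluate interface stays unchanged, can be sketched as below. `Scorer`, `Preset`, and `evaluate` are stand-ins, and returning a fresh copy on each access is an assumption made to keep the preset itself immutable:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorer:
    """Stand-in for an MLflow scorer."""
    name: str


class Preset:
    def __init__(self, name, scorers):
        self.name = name
        self._scorers = tuple(scorers)  # stored immutably

    @property
    def scorers(self):
        """Return a fresh list so callers can extend it without touching the preset."""
        return list(self._scorers)


def evaluate(data, scorers):
    """Stand-in for an evaluate() that only understands individual scorers."""
    if not all(isinstance(s, Scorer) for s in scorers):
        raise TypeError("scorers must be individual Scorer instances")
    return [s.name for s in scorers]


preset = Preset("my_team_agent", [Scorer("safety"), Scorer("fluency")])
ran = evaluate(data=[], scorers=preset.scorers + [Scorer("compliance")])
print(ran)  # ['safety', 'fluency', 'compliance']
```

With a list return type, mixing a preset with ad hoc scorers is ordinary list concatenation and needs no cast.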

Nehanth mentioned this pull request Apr 28, 2026

[FR] Scorer presets for common evaluation patterns (agent, RAG, safety) mlflow/mlflow#21445

Open

2 tasks

Nehanth and others added 3 commits April 28, 2026 11:04

Fix naming consistency throughout RFC

00db80f

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>

Remove _BUILTIN_PRESETS and list_presets from main proposal

f42a950

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>

mprahl reviewed May 5, 2026


Nehanth and others added 2 commits May 6, 2026 16:48

Add back validate_scorers code block for review context

22f64f8

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>

Nehanth requested a review from mprahl May 8, 2026 15:37

mprahl reviewed May 18, 2026


mprahl requested changes May 18, 2026


B-Step62 reviewed May 18, 2026


Nehanth and others added 2 commits June 1, 2026 10:30

Switch from __add__ to __or__ for set union semantics

287c3cc

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>

Update RFC: add persistence, customization, drop Safety/Quality presets

9710c79

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>

Nehanth force-pushed the rfc-0007-scorer-presets branch from 8e31d00 to 9710c79 Compare June 1, 2026 14:59

Nehanth and others added 2 commits June 1, 2026 11:08

Clean up alternatives and open questions

040ecb7

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>

Update date, simplify open questions

4c6388c

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>

Nehanth requested review from B-Step62 and mprahl June 1, 2026 15:32

Trim RFC: simplify design rationale and alternatives

6ab572e

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>

Nehanth changed the title ~~Add RFC 0007: Scorer Presets for Common Evaluation Patterns~~ Add RFC 0007: Scorer Presets with Customization and Persistence Jun 8, 2026

jwm4 mentioned this pull request Jun 8, 2026

plugin authn/authz rfc #19

Open

5 tasks

jwm4 mentioned this pull request Jun 8, 2026

RFC-0008 and RFC-0009: Skill Registry and Harness Integration #10

Closed

jwm4 suggested changes Jun 11, 2026


Address Bill's review: persistence UX, trim implementation details

5591ffa

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>

Nehanth force-pushed the rfc-0007-scorer-presets branch from 50e6011 to 5591ffa Compare June 16, 2026 16:15

Nehanth requested a review from jwm4 June 16, 2026 16:19

jwm4 approved these changes Jun 17, 2026


mprahl reviewed Jul 6, 2026


mprahl requested changes Jul 6, 2026


Drop preset combining, add full CRUD API, versioning, copy

6230d54

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>

Nehanth requested a review from mprahl July 10, 2026 18:14

B-Step62 approved these changes Jul 10, 2026


Address B-Step62's review: rename API, simplify evaluate interface

08619d7

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>

Change scorers property from tuple to list, remove list() casting

58c2f72

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>


