Add RFC 0007: Scorer Presets for Common Evaluation Patterns#13
Conversation
Proposes a Preset class that packages a named collection of scorers for common evaluation patterns (RAG, agent, conversational-agent, safety, quality). Presets can be passed directly in the scorers list alongside individual scorers, with automatic deduplication. Based on mlflow/mlflow#21445. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Each built-in preset is now a subclass (Agent, Rag, ConversationalAgent, SafetyPreset, Quality) that creates fresh scorer instances on each call. Eliminates shared mutable state and enables future preset-specific configuration. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
| return len(self._scorers) | ||
|
|
||
| def __add__(self, other): | ||
| if isinstance(other, (Preset, list)): |
There was a problem hiding this comment.
Does it make sense to deduplicate scorers when combining presets especially since some have overlap? Edit: I see this is addressed below. It might be good to mention it here too.
There was a problem hiding this comment.
Yes, I've added deduplication in add so duplicates are removed when presets are combined.
| `validate_scorers()` deduplicates by scorer type after flattening: | ||
|
|
||
| ```python | ||
| def _deduplicate_scorers(scorers: list[Scorer]) -> list[Scorer]: |
There was a problem hiding this comment.
We should keep Preset deduplicated whenever scorers are added instead of relying on validate_scorers to do it IMO.
There was a problem hiding this comment.
I think we should have both, add (I added the deduplication there as well) handles deduplication when a user combines presets or scorers using the + operator, and validate_scorers() handles it when multiple presets are passed directly in the scorers list.
| @@ -0,0 +1,682 @@ | |||
| --- | |||
There was a problem hiding this comment.
This RFC has too many implementation details. The RFC should be scoped to requirements, APIs, database schema changes, etc. This is delving into PR level implementation.
Can we keep this scoped to the desired UX, the proposed groupings, and the class definition?
| - **Type checking.** `isinstance(preset, Agent)` works — code can distinguish which preset is being used. | ||
| - **Custom control flow.** Each preset can override methods for preset-specific validation or behavior. | ||
|
|
||
| ### Deduplication |
There was a problem hiding this comment.
How does deduplication account for scorers with different configurations (e.g. ConversationalGuidelines with separate configs)?
There was a problem hiding this comment.
Deduplication works by (type(scorer), scorer.name), so if you use different names for your
ConversationalGuidelines scorers, they'll be kept. If two scorers have the same type and name, one gets removed.
|
|
||
| MLflow ships five built-in preset subclasses. Each call creates fresh scorer instances. | ||
|
|
||
| > **Note:** **`TaskSuccess`** is a new scorer proposed in [mlflow/mlflow#22972](https://github.com/mlflow/mlflow/issues/22972). It evaluates whether an agent successfully accomplished the user's task without requiring ground truth data — unlike `Correctness`, which requires an `expectations` column. This scorer would be added to `Agent`, `ConversationalAgent`, and `Quality`. This work can be part of this RFC or be a future addition after this RFC is completed. |
There was a problem hiding this comment.
Is this relevant for this RFC? It seems out of scope.
There was a problem hiding this comment.
You're right, it's out of scope for this RFC. I'll remove it. I did create a separate issue for it mlflow#22972, once it's implemented it would be added to the Agent, ConversationalAgent, and Quality presets.comm
|
|
||
| ### Built-in Presets as Subclasses | ||
|
|
||
| Each built-in preset is a subclass of `Preset` that hardcodes its scorer list. This means each call creates **fresh scorer instances** (no shared mutable singletons) and opens the door for preset-specific configuration and control flow in the future. |
There was a problem hiding this comment.
I don't really see a benefit to these presets if you can't configure the scorers. If it's merely to categorize different scorers, there are better ways to do this such as documentation or a function like get_agent_scorers() which returns the classes for discoverability.
There was a problem hiding this comment.
The reason I’m going in this direction is because, after this RFC is approved, subclassing opens up more than just categorization. For example, each preset could accept different judge models (e.g, Agent(model="openai:/gpt-4o")), different inference parameters, or other preset-specific configurations.
If this RFC gets approved, I can also update it to use the function-based approach I proposed in the alternatives section, if that’s preferred.
There was a problem hiding this comment.
I agree with Matt about customization capability. The class-based approach is more ergonomic to me but I think it should come with a good customizability and persistence.
| def __init__(self): | ||
| super().__init__("rag", [ | ||
| RetrievalRelevance(), | ||
| RetrievalSufficiency(), |
There was a problem hiding this comment.
Doesn't this require expected_response or expected_facts?
There was a problem hiding this comment.
Yes, RetrievalSufficiency does require ground truth, so I'm removing it from this preset. We want presets to work out of the box without requiring ground truth for easy evaluation.
- Add deduplication to Preset __init__ and __add__ - Remove TaskSuccess from presets (out of scope) - Remove RetrievalSufficiency from Rag preset (requires ground truth) - Trim implementation details per reviewer feedback Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
|
Hey @mprahl, thanks for taking the time to review my RFC proposal. I've updated the RFC per you feedback commented back one some of the comments you made. |
| def __len__(self): | ||
| return len(self._scorers) | ||
|
|
||
| def __add__(self, other): |
There was a problem hiding this comment.
Do you think it'd be less surprising to implement __or__ for "set" union behavior since you are deduplicating?
mprahl
left a comment
There was a problem hiding this comment.
I’m not convinced this should be a feature in its current form. Without a way to configure or cleanly override scorers inside a preset, the value seems limited to boilerplate reduction. At the same time, if built-in preset membership changes across releases, users will get silent behavior and cost changes on upgrade. That combination makes presets feel more like unstable convenience sugar than a strong API surface.
B-Step62
left a comment
There was a problem hiding this comment.
Thank you for the RFC submission @Nehanth!
The idea of bundling multiple scorers into sharable preset sounds very useful to me. My main feedback is to support persistence of presets in MLflow server and provide great customization UX. I can see built-in presets are useful for new users to getting started and solves "which scorer should I start with?" question, but many teams need customization for adopting them for the real use cases.
| ) | ||
| ``` | ||
|
|
||
| ```python |
There was a problem hiding this comment.
I believe we need to persist presets more than a single python session. Otherwise L65 is essentially same as defining a normal python list of scorers. Team sharing benefit mentioned below only holds with persistence in a shared server.
| self._scorers = tuple(self._deduplicate(scorers)) | ||
|
|
||
| @staticmethod | ||
| def _deduplicate(scorers): |
There was a problem hiding this comment.
nit: Instead of deduplication, I would block adding an existing scorer in the preset.
|
|
||
| ### Built-in Presets as Subclasses | ||
|
|
||
| Each built-in preset is a subclass of `Preset` that hardcodes its scorer list. This means each call creates **fresh scorer instances** (no shared mutable singletons) and opens the door for preset-specific configuration and control flow in the future. |
There was a problem hiding this comment.
I agree with Matt about customization capability. The class-based approach is more ergonomic to me but I think it should come with a good customizability and persistence.
| class SafetyPreset(Preset): | ||
| def __init__(self): | ||
| super().__init__("safety", [ | ||
| Safety(), | ||
| ConversationalSafety(), | ||
| ]) | ||
|
|
||
| class Quality(Preset): | ||
| def __init__(self): | ||
| super().__init__("quality", [ | ||
| RelevanceToQuery(), | ||
| Fluency(), | ||
| Completeness(), | ||
| ]) |
There was a problem hiding this comment.
RAG/Agent/Conversation presets look great, the other two look less useful or intuitive to me. Safety one just bundles two scorers with 'safety' in its name. "Quality" sounds a bit too vague imo, technically most of scorers aims to evaluate some aspect of the agent quality. Shall we start with the first three + great customization story?
|
I fully agree with @mprahl on this:
If this is not handled properly, this feature would end up being a burden for us to maintain, and create friction for users on upgrades, rather than being a benefit.
A function based implementation does not prevent such customizations. |
Summary
This RFC proposes a
Presetclass for MLflow that packages a named, immutable collection of scorers for common evaluation patterns. MLflow ships five built-in presets (RAG,AGENT,CONVERSATIONAL_AGENT,SAFETY,QUALITY), and users can define their own.Presets can be passed directly in the
scorerslist alongside individual scorers:Duplicate scorers across presets are automatically deduplicated during validation.
Based on mlflow/mlflow#21445.
Also introduces a new scorer proposal:
TaskSuccess— a ground-truth-free scorer for evaluating agent result quality.Co-Authored-By: Claude noreply@anthropic.com