diff --git a/rfcs/0007-scorer-presets/0007-scorer-presets.md b/rfcs/0007-scorer-presets/0007-scorer-presets.md new file mode 100644 index 0000000..04a9c83 --- /dev/null +++ b/rfcs/0007-scorer-presets/0007-scorer-presets.md @@ -0,0 +1,456 @@ +--- + +## start_date: 2026-04-23 + +mlflow_issue: [https://github.com/mlflow/mlflow/issues/21445](https://github.com/mlflow/mlflow/issues/21445) +rfc_pr: + +# Scorer Presets for Common Evaluation Patterns + + +| Author(s) | Nehanth | +| ---------------------- | ----------- | +| **Date Last Modified** | 2026-04-28 | +| **AI Assistant(s)** | Claude Code | + + +# Summary + +> **Note:** This RFC is based on [mlflow/mlflow#21445](https://github.com/mlflow/mlflow/issues/21445). The motivation, proposed presets, and API examples are derived from that issue, with additional design details and implementation specifics added here. + +MLflow provides 21 built-in scorers for evaluating GenAI outputs, but users have no way to select a coherent subset for a specific evaluation pattern. Today, evaluating an agent requires importing and instantiating 9+ individual scorer classes -- boilerplate that gets copy-pasted across teams and templates. + +This RFC proposes a `Preset` class that packages a named collection of scorers. MLflow ships built-in preset subclasses for common evaluation patterns (`Rag`, `Agent`, `ConversationalAgent`, `SafetyPreset`, `Quality`), and users can define their own. Presets can be passed directly in the `scorers` list alongside individual scorers, with automatic deduplication when presets overlap. + +# Basic Example + +```python +import mlflow +from mlflow.genai.scorers import Agent + +# Use a built-in preset directly -- each call creates fresh scorer instances +result = mlflow.genai.evaluate( + data=eval_dataset, + predict_fn=predict_fn, + scorers=[Agent()], +) +``` + +```python +# Mix presets and individual scorers +from mlflow.genai.scorers import Agent, Guidelines + +result = mlflow.genai.evaluate( + data=eval_dataset, + predict_fn=predict_fn, + scorers=[Agent(), Guidelines(name="tone", guidelines=["Respond professionally"])], +) +``` + +```python +# Combine presets -- duplicates are resolved automatically +from mlflow.genai.scorers import Agent, SafetyPreset + +# Both contain Safety(); it runs once, not twice +result = mlflow.genai.evaluate( + data=eval_dataset, + scorers=[Agent(), SafetyPreset()], +) +``` + +```python +# Define a custom preset +from mlflow.genai.scorers import Preset, Safety, Fluency + +my_preset = Preset("my_team_eval", scorers=[Safety(), Fluency(), my_custom_scorer]) + +result = mlflow.genai.evaluate( + data=eval_dataset, + scorers=[my_preset, another_scorer], +) +``` + +## Motivation + +### The Problem + +As described in [the original issue](https://github.com/mlflow/mlflow/issues/21445), the Databricks agent app template [evaluate_agent.py](https://github.com/databricks/app-templates/blob/main/agent-openai-agents-sdk/agent_server/evaluate_agent.py) imports and instantiates 9 separate scorers to evaluate a conversational agent: + +```python +from mlflow.genai.scorers import ( + Completeness, + ConversationalSafety, + ConversationCompleteness, + Fluency, + KnowledgeRetention, + RelevanceToQuery, + Safety, + ToolCallCorrectness, + UserFrustration, +) + +mlflow.genai.evaluate( + data=simulator, + predict_fn=predict_fn, + scorers=[ + Completeness(), + ConversationCompleteness(), + ConversationalSafety(), + KnowledgeRetention(), + UserFrustration(), + Fluency(), + RelevanceToQuery(), + Safety(), + ToolCallCorrectness(), + ], +) +``` + +Every team building agent evaluation follows this same pattern. This creates three problems (from the [original issue](https://github.com/mlflow/mlflow/issues/21445)): + +1. **No built-in grouping.** `get_all_scorers()` returns all 19 default-constructible scorers. Users evaluating a RAG pipeline get `ToolCallCorrectness`; users evaluating an agent get `RetrievalGroundedness`. Each unnecessary scorer wastes an LLM API call. +2. **21 scorers to choose from.** Users must read documentation for each scorer to determine relevance. Session-level scorers (e.g., `KnowledgeRetention`) silently produce no results when passed to single-turn evaluation. +3. **Copy-paste problem.** The same scorer lists get duplicated across templates, notebooks, and tutorials. When new scorers are added, existing lists don't pick them up. + +### Who Benefits + +- **New users** get a curated starting point without reading all 21 scorer docs +- **Teams** can define and share custom presets, ensuring consistent evaluation across projects +- **Template authors** replace hardcoded scorer lists with a single preset +- **MLflow maintainers** gain a single place to update when new scorers are added + +### Out of Scope + +- **Parameterized presets.** Passing `model` or `inference_params` to all scorers in a preset. Users can iterate over the preset's scorers instead. +- **Third-party scorer presets.** Integrating presets for DeepEval, RAGAS, or TruLens scorers. +- **Preset registration/storage in the tracking server.** Presets are code-side only. + +## Detailed Design + +### The `Preset` Class + +A `Preset` is a named, iterable container of scorers. It is **not** a `Scorer` subclass -- it is a grouping mechanism that gets flattened into individual scorers at validation time. + +```python +class Preset: + """A named, immutable collection of scorers for a common evaluation pattern. + + Presets can be passed in the ``scorers`` list alongside individual + scorers. They are flattened and deduplicated during validation, + so the evaluation loop only ever sees individual ``Scorer`` instances. + + Args: + name: A descriptive name for this preset. + scorers: The list of scorer instances in this preset. + """ + + def __init__(self, name: str, scorers: list[Scorer]): + self._name = name + self._scorers = tuple(self._deduplicate(scorers)) + + @staticmethod + def _deduplicate(scorers): + seen = set() + result = [] + for scorer in scorers: + key = (type(scorer), scorer.name) + if key not in seen: + seen.add(key) + result.append(scorer) + return result + + @property + def name(self) -> str: + return self._name + + @property + def scorers(self) -> tuple: + return self._scorers + + def __iter__(self): + return iter(self._scorers) + + def __len__(self): + return len(self._scorers) + + def __add__(self, other): + if isinstance(other, (Preset, list)): + combined = list(self) + list(other) + return self._deduplicate(combined) + return NotImplemented + + def __radd__(self, other): + if isinstance(other, list): + combined = other + list(self) + return self._deduplicate(combined) + return NotImplemented + + def __repr__(self): + scorer_names = [type(s).__name__ for s in self._scorers] + return f"Preset('{self._name}', [{', '.join(scorer_names)}])" +``` + +**Key design decisions:** + +- **Immutable and deduplicated.** Scorers are stored as a tuple and exposed via a read-only property. Deduplication happens in `__init__` and `__add__` using `(type, name)` as the key, so scorers of the same class with different names are preserved (e.g., two `Guidelines` with different rules). +- **Not a `Scorer` subclass.** A preset doesn't produce feedback -- it's a container. The evaluation loop assumes one scorer = one result column. Making `Preset` a scorer would require changes throughout the pipeline (aggregation, telemetry, serialization). +- **Iterable.** Supports `__iter__`, `__len__`, and `__add__`/`__radd__` so it composes naturally: `Agent() + [my_scorer]`, `[my_scorer] + Agent()`, or `Agent() + SafetyPreset()`. +- **Stores instances, not classes.** Users pass already-configured scorer instances. + +### Built-in Presets as Subclasses + +Each built-in preset is a subclass of `Preset` that hardcodes its scorer list. This means each call creates **fresh scorer instances** (no shared mutable singletons) and opens the door for preset-specific configuration and control flow in the future. + +```python +class Agent(Preset): + def __init__(self): + super().__init__("agent", [ + ToolCallCorrectness(), + ToolCallEfficiency(), + RelevanceToQuery(), + Safety(), + Completeness(), + ]) + +class Rag(Preset): + def __init__(self): + super().__init__("rag", [ + RetrievalRelevance(), + RetrievalGroundedness(), + RelevanceToQuery(), + Safety(), + Completeness(), + ]) + +class ConversationalAgent(Preset): + def __init__(self): + super().__init__("conversational-agent", [ + ToolCallCorrectness(), + ToolCallEfficiency(), + RelevanceToQuery(), + Safety(), + Completeness(), + UserFrustration(), + ConversationCompleteness(), + ConversationalSafety(), + ConversationalToolCallEfficiency(), + KnowledgeRetention(), + ]) + +class SafetyPreset(Preset): + def __init__(self): + super().__init__("safety", [ + Safety(), + ConversationalSafety(), + ]) + +class Quality(Preset): + def __init__(self): + super().__init__("quality", [ + RelevanceToQuery(), + Fluency(), + Completeness(), + ]) +``` + +**Why subclasses over instances:** + +- **Fresh instances every time.** `Agent()` creates new scorer instances on each call. No shared mutable state — the singleton problem is eliminated entirely. +- **Preset-specific configuration.** Each preset can accept its own parameters in the future (e.g., `Agent(model="openai:/gpt-4o")` to set the judge model for all scorers). +- **Type checking.** `isinstance(preset, Agent)` works — code can distinguish which preset is being used. +- **Custom control flow.** Each preset can override methods for preset-specific validation or behavior. + +### Deduplication + +When multiple presets are combined, the same scorer type can appear more than once. For example, `Agent()` and `SafetyPreset()` both contain `Safety()`. Running the same scorer twice wastes LLM API calls and produces duplicate result columns. + +Deduplication happens in two places: + +- **In the `Preset` class** — both `__init__` and `__add__` deduplicate using `(type(scorer), scorer.name)` as the key, so the preset is always clean whenever scorers are added or combined. +- **In `validate_scorers()`** — when multiple presets are passed directly in a list (e.g., `scorers=[Agent(), SafetyPreset()]`) without using `+`, `__add__` is never called. `validate_scorers()` flattens and deduplicates as a safety net: + +```python +def validate_scorers(scorers: list[Any]) -> list[Scorer]: + from mlflow.genai.scorers.presets import Preset + + # 1. Flatten presets into individual scorers + flat = [] + for item in scorers: + if isinstance(item, Preset): + flat.extend(item) + else: + flat.append(item) + + # 2. Deduplicate by (type, name) + flat = Preset._deduplicate(flat) + + # 3. Existing validation on the flattened list + ... +``` + +Scorers of the same class with different names are preserved (e.g., two `Guidelines` with different rules). Only true duplicates — same class and same name — are removed. + +`evaluate()` itself does not change. By the time scorers reach the evaluation loop, they are all individual `Scorer` instances. + +### Built-in Preset Summary + +MLflow ships five built-in preset subclasses. Each call creates fresh scorer instances. + +| Preset | Scorers | Use Case | +| ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------- | +| `Rag()` | RetrievalRelevance, RetrievalGroundedness, RelevanceToQuery, Safety, Completeness | Retrieval-augmented generation pipelines | +| `Agent()` | ToolCallCorrectness, ToolCallEfficiency, RelevanceToQuery, Safety, Completeness | Single-turn tool-calling agents | +| `ConversationalAgent()`| All of `Agent` + UserFrustration, ConversationCompleteness, ConversationalSafety, ConversationalToolCallEfficiency, KnowledgeRetention | Multi-turn conversational agents | +| `SafetyPreset()` | Safety, ConversationalSafety | Safety-focused evaluation (composable with other presets) | +| `Quality()` | RelevanceToQuery, Fluency, Completeness | Architecture-independent output quality | + + +#### Design Rationale + +- **Safety is in `Rag` and `Agent`** because these presets aim to be complete starting points. Most users want safety checks without composing two presets. +- **Fluency is excluded from `Agent`** because agent evaluation emphasizes tool usage and task completion. Users who need it can compose: `Agent() + [Fluency()]`. +- **`ConversationalAgent` excludes `ConversationalRoleAdherence`** because it requires a defined persona in the system prompt, which not all agents have. +- **`RetrievalSufficiency` is excluded from `Rag`** because it requires `expected_response` or `expected_facts` (ground truth). Users who have expectations data can add it manually: `Rag() + [RetrievalSufficiency()]`. +- **`Correctness` is excluded from all presets** because it requires `expectations` (ground truth) data. Users who have ground truth can add it manually: `Quality() + [Correctness()]`. +- **`Guidelines` and `ConversationalGuidelines` are excluded from all presets** because both require a `guidelines` constructor argument. + +## Drawbacks + +1. **New class in the API.** Adds `Preset` to the public surface. Mitigation: it's a simple container with no complex behavior. +2. **Opinionated defaults.** Not everyone will agree on which scorers belong in which preset. Mitigation: presets are extensible via `+`, and users can define their own. +3. **Implicit behavior changes on upgrade.** A new scorer added to a built-in preset means different evaluation results after upgrading. Consistent with how `get_all_scorers()` already behaves. + +# Alternatives + +### 1. `get_preset()` function (no class) + +Instead of a `Preset` class, provide a simple function that returns a plain list: + +```python +from typing import Literal + +from mlflow.exceptions import MlflowException +from mlflow.genai.scorers.builtin_scorers import ( + Completeness, + ConversationalSafety, + ConversationalToolCallEfficiency, + ConversationCompleteness, + Correctness, + Fluency, + KnowledgeRetention, + RelevanceToQuery, + RetrievalGroundedness, + RetrievalRelevance, + RetrievalSufficiency, + Safety, + ToolCallCorrectness, + ToolCallEfficiency, + UserFrustration, +) + +_PRESETS: dict[str, list[type]] = { + "rag": [ + RetrievalRelevance, + RetrievalSufficiency, + RetrievalGroundedness, + RelevanceToQuery, + Safety, + Completeness, + ], + "agent": [ + ToolCallCorrectness, + ToolCallEfficiency, + RelevanceToQuery, + Safety, + Completeness, + ], + "conversational-agent": [ + ToolCallCorrectness, + ToolCallEfficiency, + RelevanceToQuery, + Safety, + Completeness, + UserFrustration, + ConversationCompleteness, + ConversationalSafety, + ConversationalToolCallEfficiency, + KnowledgeRetention, + ], + "safety": [ + Safety, + ConversationalSafety, + ], + "quality": [ + RelevanceToQuery, + Fluency, + Completeness, + Correctness, + ], +} + +_VALID_PRESET_NAMES = ", ".join(sorted(_PRESETS.keys())) +PresetName = Literal["rag", "agent", "conversational-agent", "safety", "quality"] + + +def get_preset(name: PresetName) -> list: + if name not in _PRESETS: + raise MlflowException.invalid_parameter_value( + f"Unknown preset '{name}'. Valid presets are: {_VALID_PRESET_NAMES}" + ) + return [scorer_class() for scorer_class in _PRESETS[name]] + + +def list_presets() -> dict[str, list[str]]: + return { + name: [cls.__name__ for cls in classes] + for name, classes in _PRESETS.items() + } +``` + +Usage: + +```python +from mlflow.genai.scorers import get_preset + +# Simple usage +result = mlflow.genai.evaluate(scorers=get_preset("agent")) + +# Extending a preset +scorers = get_preset("agent") + [Guidelines(name="tone", guidelines=["Be professional"])] +result = mlflow.genai.evaluate(scorers=scorers) +``` + +**Pros:** Simpler (~30 lines). No validation changes needed. Returns fresh instances each call (no mutable singleton concern). `Literal` type gives IDE autocompletion. Going from function to class later is non-breaking. + +**Cons:** No user-defined presets. Composition requires `+` with list concatenation. The preset concept disappears immediately -- it's just a list. No deduplication when combining presets. + +This is a viable first step if the class approach is deemed too heavy. The class can be added later as a non-breaking extension. + +### 2. Tag-based filtering + +Add `categories` to each scorer class and provide `get_scorers(categories=["rag"])`. More flexible but over-engineered for 21 scorers and requires modifying every existing class. + +### 3. Enum-based API + +`ScorerPreset.RAG.get_scorers()`. Type-safe but heavier API surface. The `Literal` type on a function already provides IDE autocompletion. + +### 4. Do nothing + +Users keep copy-pasting scorer lists. Does not scale as the scorer count grows. + +# Adoption Strategy + +This is an **additive, non-breaking change**. Existing code continues to work unchanged. + +- Update documentation and templates to show `Preset` usage alongside the manual import pattern. +- Update the `validate_scorers()` error message to mention presets for discoverability. +- Databricks agent templates can simplify from 9 imports + 9 instantiations to `scorers=[ConversationalAgent()]`. + +# Open Questions + +1. **Should `ConversationalRoleAdherence` be in `ConversationalAgent`?** Currently excluded because it requires a defined persona. **Open for discussion.** +2. **Should `Correctness` be in `Agent` or `Rag`?** Currently excluded from all presets because it requires `expectations` data. **Open for discussion.** +3. **Should there be an `All` preset?** `get_all_scorers()` already serves this role. **Recommendation:** Do not add. +4. **Deduplication key.** Should deduplication use `type(scorer)` alone, or `(type(scorer), scorer.name)`? The latter preserves multiple instances of the same class with different names (e.g., two `Guidelines` with different rules). +5. **Future: parameterized presets?** e.g., `Agent(model="openai:/gpt-4o")` to set the judge model for all scorers in the preset. Can be a future addition. +