RFC: Trace Views by forrestmurray-db · Pull Request #17 · mlflow/rfcs

forrestmurray-db · 2026-05-12T17:13:39Z

Summary

Proposes trace views for MLflow: named, reusable configurations that filter and label MLflow traces so SMEs, debuggers, and judges can focus on the parts of a trace that matter for their task.

What this RFC covers

Four critical user journeys — SME labeling, developer prepping for a labeling session, developer debugging an agent failure, judge author scoring against a view — with screenshots and workflow GIFs from the implementation prototype
The creation model: AI generates a view by default; the developer refines it via the in-UI editor or creates one from scratch by drag-selecting spans
Data model: TraceView + SpanRange + SpanSelector
Phased adoption: a slim PR with the entity, REST, Python client, CLI, view selector and editor (Phase 1); AI generation and judge integration (Phase 2); scale features (Phase 3)
Alternatives considered: Braintrust Custom Views, LangSmith Messages View, span tags, client-side-only, single-filter-per-view
Open questions: editor scope in v1, where display config lives, selector variables in templates, JSONPath dialect, AI quality, read-time consistency under schema change

Context

Tracking issue: Feature Request: Trace Views — filtered, annotated views of traces with span ranges and deeplinks mlflow#22499
Implementation (Phase 1, slim): forrestmurray-db:impl/trace-views-slim
Prior draft implementation PR: feat: trace views with span ranges, editor, and deeplinks mlflow#22500

Status

Draft — looking for UI/UX alignment first (per @B-Step62's guidance to lead with UX since implementation depends on it), then feedback on the data model and adoption phasing.

Screenshots and GIFs in the RFC are real captures from the working prototype.

RFC proposing trace views: named, reusable configurations that filter and label MLflow traces so SMEs, debuggers, and judges can focus on the parts of a trace that matter for their task. Spans the UX (4 critical user journeys with screenshots and a workflow GIF), the creation model (AI generates by default, developer edits or authors directly), the data model (TraceView + SpanRange), and a phased adoption strategy that maps to the in-flight implementation PR.

Related:
- mlflow/mlflow#22499 (tracking issue)
- forrestmurray-db/mlflow:impl/trace-views-slim (implementation, Phase 1)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Forrest Murray <forrest.murray@databricks.com>

B-Step62 · 2026-05-13T06:53:05Z

+
+**Today:** She runs the session with raw traces. SMEs ask her in Slack what to focus on. She writes a doc; SMEs ignore it. She gives up and labels them herself, or the session ships poor labels and eval datasets degrade.
+
+**With trace views:** She runs the built-in trace-view skill — `trace.summarize()` followed by `summary.create_view()` — over the batch. An LLM analyzes each trace, identifies the agent's milestones (e.g., "Plan → Search → Synthesize"), and persists a `TraceView` with one `SpanRange` per milestone. An alternative approach would be to expose skills only and let users drive them with their own agent or the assistant. MLflow doesn't have an OOB agent harness yet so this is likely to be less complex than exposing a summarize method on the trace itself. In testing I found that a single prompt could produce decent milestones for traces from various agents.


I like the alternative approach more. For context: since MLflow is self-hosted, running a built-in harness within MLflow requires users to configure an API key and a model. This has been a major drop-off point for similar AI-driven products we have released, e.g. issue detection.

With support for custom views and good APIs/skills, I think coding agents can do a decent job.

While developing my draft PR I used only the MLflow assistant plus Claude Code, and it worked well.

I think there are a couple of issues with relying purely on that approach:

Discoverability: Relying on the assistant would mean the feature is only discoverable via docs or LSP. We could remedy this by adding a button that opens the assistant with a pre-filled summarize prompt, but that assumes the user has already set up the assistant.

DX: There is essentially very little control over how to apply this to an experiment-level batch. It's fairly reasonable to say "create milestone trace views for all traces in the experiment", but then Claude will likely take a while to figure out how to do this, and there's no observability over the process, versus:

    traces = search_traces()
    for t in traces:
        try:
            summary = t.summarize()
            t.create_view(summary)
        except Exception:
            log(f'summary for {t.trace_id} failed')

Differentiation: Other platforms don't provide this programmatically via an SDK; mostly they expose it in their assistants. An SDK lets developers build it into their applications, which meets teams where they already are.
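To make the observability point concrete, here is a minimal, runnable sketch of the batch loop above. It assumes nothing about the final API: `summarize` and `create_view` are passed in as plain callables standing in for the proposed `trace.summarize()` / `create_view()` methods, and `BatchReport`, `create_views_for_batch`, and the stub functions are hypothetical names used only for illustration.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Iterable

logger = logging.getLogger("trace_views.batch")


@dataclass
class BatchReport:
    """Per-trace outcome of a batch run (hypothetical helper, not an MLflow type)."""
    succeeded: list[str] = field(default_factory=list)
    failed: dict[str, str] = field(default_factory=dict)  # trace_id -> error message


def create_views_for_batch(
    trace_ids: Iterable[str],
    summarize: Callable[[str], dict],
    create_view: Callable[[str, dict], None],
) -> BatchReport:
    """Summarize each trace and persist a view, recording failures instead of aborting."""
    report = BatchReport()
    for trace_id in trace_ids:
        try:
            summary = summarize(trace_id)
            create_view(trace_id, summary)
        except Exception as exc:  # keep going; one bad trace should not stop the batch
            logger.warning("summary for %s failed: %s", trace_id, exc)
            report.failed[trace_id] = str(exc)
        else:
            report.succeeded.append(trace_id)
    return report


# Stand-ins for the proposed API so the sketch runs on its own.
def fake_summarize(trace_id: str) -> dict:
    if trace_id == "tr-2":
        raise ValueError("no spans")
    return {"milestones": ["Plan", "Search", "Synthesize"]}


created: dict[str, dict] = {}


def fake_create_view(trace_id: str, summary: dict) -> None:
    created[trace_id] = summary


report = create_views_for_batch(["tr-1", "tr-2", "tr-3"], fake_summarize, fake_create_view)
print(report.succeeded, report.failed)
```

Returning a report rather than only logging gives the caller something to retry against, which is the kind of control that is hard to get when an assistant drives the loop.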

B-Step62 · 2026-05-13T06:56:12Z

+
+![Output path selection](images/07-output-path-selection.gif)
+
+It's possible that the traces will have a consistent structure which generalizes across all traces in the experiment. In which case it should be easy to apply to all traces.


It is also possible that the traces within an experiment do not have a very consistent structure. It is common for the same agent to make LLM, tool, and retriever calls in a different order and number from one trace to the next. How does this RFC address that?

B-Step62 · 2026-05-13T07:01:52Z

+
+# Data model and API
+
+A `TraceView` contains an ordered list of `SpanRange`s. Each range identifies a segment of the trace and optionally extracts fields from it.


Braintrust supports custom views through a React function. What is the advantage of limiting the choice to ranges of spans instead? Is there a risk that we cannot support some types of views that competitors can?

I listed that in a later section. Personally, I think that's quite a strong approach, and I seriously considered something like it initially. Here's how I see it:

Pros:

Really flexible and can hook into the feedback APIs to provide custom human-feedback.

Embeddable (I assume?) so it can be served directly in a custom app or replace a "dumb" thumbs-up/thumbs-down UX.

Cons:

Locks consumers into having a DOM-based rendering layer and the dependencies needed to produce it. That actually disqualifies it from being usable as a judge template variable such as {{ trajectory }}, which is just as strong a motivator as SME readability. Maybe there's a way to get really clever about SSR-ing views for judges.

Unsafe/incorrect: Relies on a coding agent to produce error-free JSX (TSX?); it's unclear how that code would be tested and whether it can actually be trusted.

Can't be edited as easily via the UI, e.g. how do I add a JSON field to one of the elements I want to render? Do I have to write some JSX in a text area?

Do you see any other pros that I'm missing? I think the solution is really pretty nice but technically more complex, and the more I think about it, the harder it is to come up with custom things you'd want the UI to actually do here.

Another approach is to have TraceViews support composable widgets that could be created more easily in the UI directly, without the custom-code issues mentioned above.
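For readers following this thread, the judge-template argument can be illustrated with a small sketch: a declarative view can be rendered to plain text with no DOM, which is what a `{{ trajectory }}` variable would need. Everything below is an assumption made for illustration, not the RFC's actual schema: the field names on `SpanRange` and `TraceView`, the simplified `Span` stand-in, the `render_trajectory` helper, and the use of plain top-level keys in place of the selector/JSONPath extraction the RFC leaves as an open question.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Simplified stand-in for a trace span (hypothetical, not MLflow's Span entity)."""
    span_id: str
    name: str
    outputs: dict


@dataclass
class SpanRange:
    """One labeled segment of a trace; field names are assumptions."""
    label: str
    start_span_id: str
    end_span_id: str
    fields: list[str] = field(default_factory=list)  # output keys to extract


@dataclass
class TraceView:
    """A named, ordered list of ranges."""
    name: str
    ranges: list[SpanRange]


def render_trajectory(view: TraceView, spans: list[Span]) -> str:
    """Render a view as plain text, e.g. for a judge prompt variable."""
    position = {s.span_id: i for i, s in enumerate(spans)}
    lines = []
    for rng in view.ranges:
        lines.append(f"## {rng.label}")
        start, end = position[rng.start_span_id], position[rng.end_span_id]
        for span in spans[start : end + 1]:
            # Keep only the fields the range asks for; everything else is hidden.
            extracted = {k: span.outputs[k] for k in rng.fields if k in span.outputs}
            lines.append(f"- {span.name}: {extracted}")
    return "\n".join(lines)


spans = [
    Span("s1", "plan", {"text": "find docs"}),
    Span("s2", "search", {"hits": 3, "raw": "<html>"}),
    Span("s3", "rerank", {"hits": 2}),
    Span("s4", "answer", {"text": "done"}),
]
view = TraceView(
    name="milestones",
    ranges=[
        SpanRange("Plan", "s1", "s1", ["text"]),
        SpanRange("Search", "s2", "s3", ["hits"]),
        SpanRange("Synthesize", "s4", "s4", ["text"]),
    ],
)
trajectory = render_trajectory(view, spans)
print(trajectory)
```

The same `TraceView` object could drive both a UI renderer and this text renderer, which is the property a React-function view would not have without server-side rendering.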

B-Step62 · 2026-05-13T07:06:26Z

+DELETE /mlflow/experiments/{exp_id}/views/{view_id}
+```
+
+**Schema.** A single `trace_views` table holds both scopes, with a `CHECK` constraint that exactly one of `trace_id` / `experiment_id` is set. Ranges are stored as a JSON column on the row to keep schema migrations minimal.


Are all views visible to everyone by default? In a large organization or team, I think there should be some filtering/grouping; otherwise it is hard to find a view among hundreds of them.

I doubt there would be that many. Maybe it's worth getting rid of experiment scoping on traces altogether? Others have project views as well as trace views.
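As a reference point for the scoping discussion, the schema paragraph quoted at the top of this thread (one `trace_views` table, a `CHECK` that exactly one of `trace_id` / `experiment_id` is set, ranges in a JSON column) can be sketched with SQLite. This is only an illustration under stated assumptions: the `view_id`, `name`, and `ranges` column names and the `insert_view` helper are hypothetical, and the RFC's real migration may look different.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE trace_views (
        view_id       TEXT PRIMARY KEY,   -- hypothetical column
        name          TEXT NOT NULL,      -- hypothetical column
        trace_id      TEXT,
        experiment_id TEXT,
        ranges        TEXT NOT NULL,      -- JSON-encoded list of ranges
        -- exactly one of the two scope columns must be set
        CHECK ((trace_id IS NULL) <> (experiment_id IS NULL))
    )
    """
)


def insert_view(view_id, name, ranges, trace_id=None, experiment_id=None):
    """Insert a view row, serializing the ranges to JSON."""
    conn.execute(
        "INSERT INTO trace_views VALUES (?, ?, ?, ?, ?)",
        (view_id, name, trace_id, experiment_id, json.dumps(ranges)),
    )


insert_view("v1", "milestones", [{"label": "Plan"}], experiment_id="exp-1")
insert_view("v2", "debug", [{"label": "Search"}], trace_id="tr-1")

# Setting both scopes violates the CHECK constraint.
try:
    insert_view("v3", "bad", [], trace_id="tr-1", experiment_id="exp-1")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM trace_views").fetchone()[0]
stored = json.loads(
    conn.execute("SELECT ranges FROM trace_views WHERE view_id = 'v1'").fetchone()[0]
)
print(rejected, row_count, stored)
```

Dropping experiment scoping, as floated above, would reduce this to a single required `trace_id` column and remove the need for the constraint.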

B-Step62 · 2026-05-13T07:17:05Z

+
+**3. Single span filter per view.** The first draft used one `SpanFilter` per view. Real traces have multiple phases users want to highlight together — planning, retrieval, generation — so the model evolved to multi-range with `SpanRange[]`. The single-filter version made the common case awkward.
+
+**4. Custom React components (Braintrust Custom Views).** Braintrust lets developers write arbitrary React for trace rendering. Rejected because (a) the surface is too large to land in OSS MLflow with reasonable maintenance cost, (b) React components are not diffable or auditable in the way a declarative view config is, and (c) the upside — total flexibility — isn't worth it when AI-generated declarative views handle the common cases. Storing and executing arbitrary React in Managed MLflow would also require a sandboxing story this RFC doesn't want to design.


I'm not sure these weaknesses hold up:

> the surface is too large to land in OSS MLflow with reasonable maintenance cost

The interface between the custom view definition and MLflow is actually narrower and softer with custom components than with the proposed option. For example, a Braintrust custom view is represented as a function that takes trace/spans/updates. As long as the input is typed as those objects, the platform does not need to honor any other contract. The proposed framework, on the other hand, defines additional interfaces such as span ranges and selectors, which need to be maintained in a backward-compatible way.

> React components are not diffable or auditable in the way a declarative view config is

React components are code, which is easy to manage.

> the upside — total flexibility — isn't worth it when AI-generated declarative views handle the common cases.

I think the upside is big. A basic view like the one in this Braintrust example cannot be represented with the proposed model. Whether the desired view is achievable at all is a hard blocker, so this can be a deal breaker.

> Storing and executing arbitrary React in Managed MLflow would also require a sandboxing story this RFC doesn't want to design.

IMO this is the only con, but I think it is worth building a sandbox solution. It is not a very new problem, and there should be an existing solution.

Didn't see this comment before responding here

forrestmurray-db and others added 2 commits May 12, 2026 13:13

Refine trace views RFC user journeys

5bd7856

B-Step62 reviewed May 13, 2026

forrestmurray-db wants to merge 2 commits into mlflow:main from forrestmurray-db:rfc/trace-views



