RFC: Trace Views #17
RFC proposing trace views: named, reusable configurations that filter and label MLflow traces so SMEs, debuggers, and judges can focus on the parts of a trace that matter for their task. Spans the UX (4 critical user journeys with screenshots and a workflow GIF), the creation model (AI generates by default, developer edits or authors directly), the data model (TraceView + SpanRange), and a phased adoption strategy that maps to the in-flight implementation PR.

Related:
- mlflow/mlflow#22499 (tracking issue)
- forrestmurray-db/mlflow:impl/trace-views-slim (implementation, Phase 1)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Forrest Murray <forrest.murray@databricks.com>
| **Today:** She runs the session with raw traces. SMEs ask her in Slack what to focus on. She writes a doc; SMEs ignore it. She gives up and labels them herself, or the session ships poor labels and eval datasets degrade. | ||
| **With trace views:** She runs the built-in trace-view skill (`trace.summarize()` followed by `summary.create_view()`) over the batch. An LLM analyzes each trace, identifies the agent's milestones (e.g., "Plan → Search → Synthesize"), and persists a `TraceView` with one `SpanRange` per milestone. An alternative approach would be to expose skills only and let users drive them with their own agent or the assistant. MLflow doesn't have an out-of-the-box agent harness yet, so the skills-only approach is likely to be less complex than exposing a summarize method on the trace itself. In testing, I found that a single prompt could produce decent milestones for traces from various agents. |
I like the alternative approach more. To give some context: since MLflow is self-hosted, running a built-in harness within MLflow requires users to configure an API key and model. This has been a major drop-off point for similar AI-driven products we released, e.g. issue detection.
With support for custom views and good APIs/skills, I think coding agents can do a decent job.
For development of my draft PR I used only the MLflow assistant + Claude Code, and it did work well.
I think there are a couple of issues with relying purely on this:
- Discoverability: Relying on the assistant would mean it's only discoverable via docs or LSP. We could remedy this by adding a button to open the assistant with a pre-filled summarize prompt, but that assumes the user has already set up the assistant.
- DX: There's very little control over how to apply this to an experiment-level batch. It's fairly reasonable to say "create milestone trace views for all traces in the experiment", but then Claude will likely take a while to figure out how to do this, and there's no observability over the process, vs.:
    traces = search_traces()
    for t in traces:
        try:
            summary = t.summarize()
            t.create_view(summary)
        except Exception:
            log(f'summary for {t.trace_id} failed')

- Differentiation: Other platforms don't provide this programmatically via an SDK; mostly they expose it in their assistants. Offering it in the SDK lets developers build it into their applications, which meets teams where they already are.
|  | ||
| It's possible that the traces will have a consistent structure that generalizes across all traces in the experiment, in which case the view should be easy to apply to all of them. |
It is also possible that the traces within an experiment do not have a very consistent structure. It is common for the same agent to make LLM, tool, and retriever calls in a different order and number from one trace to the next. How does this RFC address that?
| # Data model and API | ||
| A `TraceView` contains an ordered list of `SpanRange`s. Each range identifies a segment of the trace and optionally extracts fields from it. |
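To make the shape concrete, here is a minimal Python sketch of the two types and of how an ordered list of ranges might be resolved against a trace. The field names (`label`, `start_span`, `end_span`, `extract`), the `apply_view` helper, and the dict-based spans are assumptions for illustration only, not the RFC's actual schema or MLflow's API.

```python
from dataclasses import dataclass, field


@dataclass
class SpanRange:
    # Hypothetical fields: the RFC only says a range identifies a segment
    # of the trace and optionally extracts fields from it.
    label: str        # milestone name shown to the reader, e.g. "Plan"
    start_span: str   # name of the first span in the segment
    end_span: str     # name of the last span in the segment
    extract: list[str] = field(default_factory=list)  # span keys to keep


@dataclass
class TraceView:
    name: str
    ranges: list[SpanRange]  # order is significant


def apply_view(view, spans):
    """Return one (label, rows) pair per range, in the view's order."""
    names = [s["name"] for s in spans]
    out = []
    for r in view.ranges:
        lo = names.index(r.start_span)
        hi = names.index(r.end_span, lo)  # end must not precede start
        segment = spans[lo:hi + 1]
        if r.extract:
            rows = [{k: s.get(k) for k in r.extract} for s in segment]
        else:
            rows = [dict(s) for s in segment]  # no extraction: keep whole span
        out.append((r.label, rows))
    return out


# Toy trace: spans are plain dicts here, not MLflow Span objects.
spans = [
    {"name": "plan", "output": "outline"},
    {"name": "search", "output": "3 hits"},
    {"name": "rerank", "output": "top 1"},
    {"name": "synthesize", "output": "answer"},
]
view = TraceView(
    name="milestones",
    ranges=[
        SpanRange("Plan", "plan", "plan"),
        SpanRange("Search", "search", "rerank", extract=["output"]),
        SpanRange("Synthesize", "synthesize", "synthesize", extract=["output"]),
    ],
)
result = apply_view(view, spans)
```

Because the result is plain data rather than rendered markup, the same structure could in principle feed either a UI or a text template.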
Braintrust supports custom views through a React function. What is the advantage of limiting the choice to ranges of spans instead? Is there a risk that we cannot support some types of views that competitors can?
I listed that in a later section. Personally, I think that's quite a strong approach, and I seriously considered something like it initially. Here's how I see it:
Pros:
- Really flexible and can hook into the feedback APIs to provide custom human-feedback.
- Embeddable (I assume?) so it can be served directly in a custom app or replace a "dumb" thumbs-up/thumbs-down UX.
Cons:
- Locks consumers into having a DOM-based rendering layer and the deps to produce it. That actually disqualifies this from being usable as a judge template var `{{ trajectory }}`, which is just as strong of a motivator as SME readability. Maybe there's a way to get really clever about SSR-ing views for judges.
- Unsafe/incorrect: Relies on a coding agent to produce error-free JSX (TSX?); it's unclear how they test it and unclear whether it's actually trustworthy code.
- Can't be as easily edited via the UI, e.g. how do I add a JSON field to one of the elements I want to render? I have to write some JSX in the text area?
Do you see any other pros that I'm missing? I think the solution is really pretty nice but technically more complex and the more I think about it, it's hard to come up with custom things you'd want the UI to actually do here.
Another approach is to have TraceViews support composable widgets that could be more easily created in the UI directly without custom code issues mentioned above.
| DELETE /mlflow/experiments/{exp_id}/views/{view_id} | ||
| ``` | ||
| **Schema.** A single `trace_views` table holds both scopes, with a `CHECK` constraint that exactly one of `trace_id` / `experiment_id` is set. Ranges are stored as a JSON column on the row to keep schema migrations minimal. |
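A sketch of how the "exactly one scope" rule could look, using in-memory SQLite as a stand-in for whatever backend store applies. Only `trace_views`, `trace_id`, `experiment_id`, and the JSON ranges column come from the text above; the other column names and the `try_insert` helper are assumptions.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE trace_views (
        view_id       TEXT PRIMARY KEY,   -- hypothetical column
        name          TEXT NOT NULL,      -- hypothetical column
        trace_id      TEXT,
        experiment_id TEXT,
        ranges        TEXT NOT NULL,      -- JSON-encoded list of ranges
        -- Exactly one scope is set: the two IS NULL tests must differ.
        CHECK ((trace_id IS NULL) != (experiment_id IS NULL))
    )
    """
)


def try_insert(view_id, trace_id, experiment_id):
    """Insert a row; return False if the CHECK constraint rejects it."""
    ranges = json.dumps([{"label": "Plan"}])
    try:
        conn.execute(
            "INSERT INTO trace_views VALUES (?, ?, ?, ?, ?)",
            (view_id, "milestones", trace_id, experiment_id, ranges),
        )
        return True
    except sqlite3.IntegrityError:
        return False


trace_scoped = try_insert("v1", "tr-123", None)        # accepted
experiment_scoped = try_insert("v2", None, "exp-1")    # accepted
both_scopes = try_insert("v3", "tr-123", "exp-1")      # rejected
no_scope = try_insert("v4", None, None)                # rejected
```

Keeping ranges in one JSON text column means that adding a field to a range changes only the serialized payload, not the table definition.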
Are all views visible to everyone by default? In large organizations or teams, I think there should be some filtering/grouping; otherwise it is hard to find one view among hundreds.
I doubt that there would be that many. Maybe it's worth getting rid of experiment scoping on traces altogether? Others have project views as well as trace views.
| **3. Single span filter per view.** The first draft used one `SpanFilter` per view. Real traces have multiple phases users want to highlight together — planning, retrieval, generation — so the model evolved to multi-range with `SpanRange[]`. The single-filter version made the common case awkward. | ||
| **4. Custom React components (Braintrust Custom Views).** Braintrust lets developers write arbitrary React for trace rendering. Rejected because (a) the surface is too large to land in OSS MLflow with reasonable maintenance cost, (b) React components are not diffable or auditable in the way a declarative view config is, and (c) the upside — total flexibility — isn't worth it when AI-generated declarative views handle the common cases. Storing and executing arbitrary React in Managed MLflow would also require a sandboxing story this RFC doesn't want to design. |
I'm not sure these weaknesses hold.

> the surface is too large to land in OSS MLflow with reasonable maintenance cost

The interface between the custom view definition and MLflow is actually narrower and softer with custom components than with the proposed option. For example, a Braintrust custom view is represented as a function that takes trace/spans/updates. As long as the input is typed as those objects, the platform does not need to meet any other contract. On the other hand, the proposed framework defines additional interfaces, like span range and selector, which need to be maintained in a backward-compatible way.

> React components are not diffable or auditable in the way a declarative view config is

React components are code, which is easy to manage.

> the upside — total flexibility — isn't worth it when AI-generated declarative views handle the common cases.

I think the upside is big. A basic view like the one in this Braintrust example cannot be represented with the proposed model. Whether the desired view is achievable at all is a hard blocker, so it can be a deal breaker.

> Storing and executing arbitrary React in Managed MLflow would also require a sandboxing story this RFC doesn't want to design.

IMO this is the only con, but I think it is worth building a sandbox solution. It is not a very new problem, and there should be an existing solution.
Summary
Proposes trace views for MLflow: named, reusable configurations that filter and label MLflow traces so SMEs, debuggers, and judges can focus on the parts of a trace that matter for their task.
What this RFC covers
`TraceView` + `SpanRange` + `SpanSelectorContext`
Status
Draft — looking for UI/UX alignment first (per @B-Step62's guidance to lead with UX since implementation depends on it), then feedback on the data model and adoption phasing.
Screenshots and GIFs in the RFC are real captures from the working prototype.