Skip to content

Commit 5f04e5d

Browse files
authored
docs: add agent rollout ingestion docs entry point (#499)
1 parent 6b92351 commit 5f04e5d

File tree

6 files changed

+283
-43
lines changed

6 files changed

+283
-43
lines changed
Lines changed: 269 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,269 @@
1+
# Agent Rollout Ingestion
2+
3+
`AgentRolloutSeedSource` turns existing agent rollouts into a seed dataset for synthetic data workflows. It lets you operate locally on rollout artifacts you already have on disk, then normalizes them into rows you can filter, curate, and distill into training or evaluation data.
4+
5+
## Quick Start
6+
7+
Use `AgentRolloutSeedSource` when you want to work from existing agent traces instead of traces captured during a Data Designer generation run.
8+
9+
=== "Claude Code"
10+
11+
Uses `~/.claude/projects` and `*.jsonl` by default.
12+
13+
```python
14+
import data_designer.config as dd
15+
16+
seed_source = dd.AgentRolloutSeedSource(
17+
format=dd.AgentRolloutFormat.CLAUDE_CODE,
18+
)
19+
```
20+
21+
=== "Codex"
22+
23+
Uses `~/.codex/sessions` and `*.jsonl` by default.
24+
25+
```python
26+
import data_designer.config as dd
27+
28+
seed_source = dd.AgentRolloutSeedSource(
29+
format=dd.AgentRolloutFormat.CODEX,
30+
)
31+
```
32+
33+
=== "Hermes Agent"
34+
35+
Uses `~/.hermes/sessions` and `*.json*` by default so CLI session logs and gateway transcripts can coexist.
36+
37+
```python
38+
import data_designer.config as dd
39+
40+
seed_source = dd.AgentRolloutSeedSource(
41+
format=dd.AgentRolloutFormat.HERMES_AGENT,
42+
)
43+
```
44+
45+
=== "ATIF"
46+
47+
ATIF requires an explicit `path`. See Harbor's [ATIF documentation](https://harborframework.com/docs/trajectory-format) for the format specification.
48+
49+
```python
50+
import data_designer.config as dd
51+
52+
seed_source = dd.AgentRolloutSeedSource(
53+
format=dd.AgentRolloutFormat.ATIF,
54+
path="/data/harbor/runs/swe-bench/job-042",
55+
recursive=True,
56+
file_pattern="trajectory*.json",
57+
)
58+
```
59+
60+
You can override `path` and `file_pattern` for any format when your rollout artifacts live outside the built-in defaults.
61+
62+
## Normalized Field Compatibility
63+
64+
All supported rollout formats map into the same seeded row schema. In the table below, `None` means the source artifact does not expose that field directly, and `derived` means Data Designer computes it from normalized `messages`.
65+
66+
| Normalized field | ATIF | Claude Code | Codex | Hermes Agent |
67+
|---|---|---|---|---|
68+
| `trace_id` | `session_id` | `sessionId[:agentId]` | `session_meta.id` or file stem | CLI `session_id` or file stem; gateway file stem |
69+
| `source_kind` | `"atif"` | `"claude_code"` | `"codex"` | `"hermes_agent"` |
70+
| `source_path` | Parsed `.json` path | Parsed `.jsonl` trace path | Parsed `rollout-*.jsonl` path | Parsed CLI `.json` or gateway `.jsonl` path |
71+
| `root_session_id` | `session_id` | `sessionId` or file stem | `trace_id` | `trace_id` |
72+
| `agent_id` | `None` | `agentId` | `None` | `None` |
73+
| `is_sidechain` | `False` | `isSidechain` | `False` | `False` |
74+
| `cwd` | `agent.extra.cwd` | First non-null record `cwd` | `session_meta.cwd` | `None` |
75+
| `project_path` | `extra.project_path` or `cwd` | `projectPath` or `cwd` | `cwd` | `None` |
76+
| `git_branch` | `agent.extra.git_branch` | First non-null record `gitBranch` | `session_meta.git_branch` | `None` |
77+
| `started_at` | Earliest step timestamp | Earliest row timestamp | `session_meta.timestamp` or earliest record timestamp | CLI `session_start`; gateway `created_at` |
78+
| `ended_at` | Latest step timestamp | Latest row timestamp | Latest record timestamp | CLI `last_updated`; gateway `updated_at` |
79+
| `messages` | Normalized steps | Normalized trace rows | Normalized response items | Normalized CLI or gateway rows |
80+
| `source_meta` | ATIF metadata | Claude metadata | Codex metadata | Hermes metadata |
81+
| `message_count` | `derived` | `derived` | `derived` | `derived` |
82+
| `tool_call_count` | `derived` | `derived` | `derived` | `derived` |
83+
| `final_assistant_message` | `derived` | `derived` | `derived` | `derived` |
84+
85+
### Notes
86+
87+
- `trace_id`: Claude Code appends `agentId` when present. Hermes uses either the CLI session ID or the gateway transcript file stem.
88+
- `is_sidechain`: ATIF and Hermes currently normalize this to `False`. Claude Code preserves `isSidechain` directly.
89+
- `messages`: All formats normalize into the same chat-style message schema. See [Message Traces](traces.md) for the shared block structure.
90+
- `source_meta`: This is where format-specific details live, such as ATIF copied-context metadata, Claude summaries, Codex response-item types, or Hermes tool/session metadata.
91+
92+
## Example: Summarize a Random Turn
93+
94+
Because the seeded fields are normalized, you can also build lightweight summarization workflows directly from imported rollouts. This example samples one random normalized message from each trace and summarizes it in a single sentence.
95+
96+
```python
97+
import data_designer.config as dd
98+
from data_designer.interface import DataDesigner
99+
100+
data_designer = DataDesigner()
101+
config_builder = dd.DataDesignerConfigBuilder(
102+
model_configs=[
103+
dd.ModelConfig(
104+
alias="trace-writer",
105+
model="nvidia/nemotron-3-nano-30b-a3b",
106+
provider="nvidia",
107+
)
108+
]
109+
)
110+
111+
config_builder.with_seed_dataset(
112+
dd.AgentRolloutSeedSource(
113+
format=dd.AgentRolloutFormat.CLAUDE_CODE,
114+
)
115+
)
116+
117+
config_builder.add_column(
118+
dd.ExpressionColumnConfig(
119+
name="sampled_turn",
120+
expr="{{ messages | random }}",
121+
)
122+
)
123+
124+
config_builder.add_column(
125+
dd.LLMTextColumnConfig(
126+
name="turn_summary",
127+
model_alias="trace-writer",
128+
prompt="""\
129+
Summarize this randomly sampled rollout turn in one sentence.
130+
The turn may come from the user, assistant, or a tool result.
131+
132+
Trace: {{ trace_id }}
133+
Turn:
134+
{{ sampled_turn }}
135+
""",
136+
)
137+
)
138+
139+
preview = data_designer.preview(config_builder, num_records=3)
140+
preview.display_sample_record()
141+
```
142+
143+
This stays fully declarative: no custom seed reader or preprocessing step is required. Because `sampled_turn` is drawn from the normalized `messages` list, the same config works across all supported rollout formats.
144+
145+
## Example: Turn Tool Interactions into a Review Dataset
146+
147+
You can also explode imported rollouts into a tool-interaction dataset. This example scans normalized `messages`, emits one row per tool call and matching tool response, preserves the trace context up to that response, and then uses a structured column to label the interaction as a success, failure, or unclear outcome.
148+
149+
```python
150+
import data_designer.config as dd
151+
from data_designer.interface import DataDesigner
152+
from pydantic import BaseModel, Field
153+
from typing import Literal
154+
155+
156+
@dd.custom_column_generator(
157+
required_columns=["messages"],
158+
side_effect_columns=["tool_call", "tool_response", "tool_name"],
159+
)
160+
def explode_tool_interactions(row: dict) -> list[dict]:
161+
rows = []
162+
tool_calls_by_id = {}
163+
context_messages = []
164+
165+
for message in row["messages"]:
166+
context_messages.append(message)
167+
168+
for tool_call in message.get("tool_calls") or []:
169+
tool_call_id = tool_call.get("id")
170+
if tool_call_id:
171+
tool_calls_by_id[tool_call_id] = tool_call
172+
173+
if message.get("role") != "tool":
174+
continue
175+
176+
tool_call = tool_calls_by_id.get(
177+
message.get("tool_call_id"),
178+
{
179+
"id": message.get("tool_call_id"),
180+
"type": "function",
181+
"function": {"name": "unknown", "arguments": "{}"},
182+
},
183+
)
184+
tool_name = tool_call.get("function", {}).get("name", "unknown")
185+
186+
rows.append(
187+
{
188+
**row,
189+
"tool_interaction_context": list(context_messages),
190+
"tool_call": tool_call,
191+
"tool_response": message,
192+
"tool_name": tool_name,
193+
}
194+
)
195+
196+
return rows
197+
198+
199+
class ToolInteractionAnalysis(BaseModel):
200+
outcome: Literal["success", "failure", "unclear"] = Field(
201+
description="Whether the tool interaction appears to have succeeded, failed, or is ambiguous."
202+
)
203+
summary: str = Field(
204+
description="One or two sentences summarizing what the tool was asked to do and what the response indicates."
205+
)
206+
207+
208+
data_designer = DataDesigner()
209+
config_builder = dd.DataDesignerConfigBuilder(
210+
model_configs=[
211+
dd.ModelConfig(
212+
alias="tool-analyst",
213+
model="nvidia/nemotron-3-nano-30b-a3b",
214+
provider="nvidia",
215+
)
216+
]
217+
)
218+
219+
config_builder.with_seed_dataset(
220+
dd.AgentRolloutSeedSource(
221+
format=dd.AgentRolloutFormat.CLAUDE_CODE,
222+
)
223+
)
224+
225+
config_builder.add_column(
226+
dd.CustomColumnConfig(
227+
name="tool_interaction_context",
228+
generator_function=explode_tool_interactions,
229+
allow_resize=True,
230+
)
231+
)
232+
233+
config_builder.add_column(
234+
dd.LLMStructuredColumnConfig(
235+
name="tool_interaction_analysis",
236+
model_alias="tool-analyst",
237+
output_format=ToolInteractionAnalysis,
238+
prompt="""\
239+
You are analyzing one tool interaction from an imported agent rollout.
240+
241+
Context up to the tool response:
242+
{{ tool_interaction_context }}
243+
244+
Tool name: {{ tool_name }}
245+
246+
Tool call:
247+
{{ tool_call }}
248+
249+
Tool response:
250+
{{ tool_response }}
251+
252+
Decide whether this interaction is a success, failure, or unclear outcome.
253+
Then summarize what the tool was asked to do and what happened.
254+
Base your answer on the tool call arguments, the tool response, and the immediate context.
255+
""",
256+
)
257+
)
258+
259+
preview = data_designer.preview(config_builder, num_records=5)
260+
preview.display_sample_record()
261+
```
262+
263+
This pattern is useful when you want to curate evaluator or monitoring datasets from real traces. The resize-enabled custom column turns each tool interaction into its own record, and the structured column adds a consistent outcome label plus a grounded summary. Because the logic operates on normalized `tool_calls` and `tool` messages, the same pattern transfers across supported rollout formats. If your traces are long, consider adding a second custom or expression column that windows the context before sending it to a model.
264+
265+
## Related Guides
266+
267+
- For the general seed dataset model, see [Seed Datasets](seed-datasets.md).
268+
- For the normalized `messages` structure used in imported rollouts, see [Message Traces](traces.md).
269+
- For an end-to-end distillation example, see [Agent Rollout Trace Distillation](../recipes/trace_ingestion/agent_rollout_distillation.md).

docs/concepts/seed-datasets.md

Lines changed: 8 additions & 43 deletions
Original file line numberDiff line numberDiff line change
@@ -166,6 +166,9 @@ Path: {{ relative_path }}
166166
- `file_name` — basename of the matched file
167167
- `content` — decoded text contents of the matched file
168168

169+
!!! tip "Custom Filesystem Readers"
170+
If you need custom row construction, fan-out behavior, or expensive hydration logic for any directory-backed seed source, build a custom `FileSystemSeedReader` and pass it via `DataDesigner(seed_readers=[...])`. See the [FileSystemSeedReader Plugins](../plugins/filesystem_seed_reader.md) guide.
171+
169172
!!! note "Encoding"
170173
`encoding="utf-8"` is the default. Set a different Python codec name if your files use another text encoding.
171174

@@ -181,54 +184,16 @@ seed_source = dd.AgentRolloutSeedSource(
181184
config_builder.with_seed_dataset(seed_source)
182185
```
183186

184-
When `path` is omitted, built-in defaults are used for the vendor-native formats:
185-
186-
- **Claude Code**`~/.claude/projects`
187-
- **Codex**`~/.codex/sessions`
188-
- **Hermes Agent**`~/.hermes/sessions`
189-
190-
ATIF rollouts use standalone `.json` trajectory files and require an explicit `path`.
191-
192-
You can override both the path and file pattern:
187+
!!! info "Dedicated guide"
188+
See [Agent Rollout Ingestion](agent-rollout-ingestion.md) for the rollout-specific guide, including:
193189

194-
```python
195-
seed_source = dd.AgentRolloutSeedSource(
196-
format=dd.AgentRolloutFormat.CLAUDE_CODE,
197-
path="my_traces/",
198-
file_pattern="*.jsonl",
199-
)
200-
```
201-
202-
For ATIF trajectories:
203-
204-
```python
205-
seed_source = dd.AgentRolloutSeedSource(
206-
format=dd.AgentRolloutFormat.ATIF,
207-
path="my_atif_traces/",
208-
)
209-
```
210-
211-
`AgentRolloutSeedSource` exposes a rich set of seeded columns:
212-
213-
- `trace_id` — unique identifier for the trace
214-
- `source_kind` — the rollout format (e.g. `"atif"`, `"claude_code"`, `"codex"`, `"hermes_agent"`)
215-
- `source_path` — full path to the source file
216-
- `root_session_id` — top-level session identifier
217-
- `agent_id` — agent identifier (if present)
218-
- `is_sidechain` — whether this trace is a delegated subtask
219-
- `cwd`, `project_path`, `git_branch` — workspace context
220-
- `started_at`, `ended_at` — trace timestamps
221-
- `messages` — the full message history as a list of dicts
222-
- `source_meta` — additional format-specific metadata
223-
- `message_count`, `tool_call_count` — derived summary statistics
224-
- `final_assistant_message` — the last assistant text in the trace
190+
- supported rollout formats and default locations
191+
- format-specific configuration details like `path` and `file_pattern`
192+
- the full normalized seeded-column schema exposed by `AgentRolloutSeedSource`
225193

226194
!!! tip "Trace Distillation"
227195
See the [Agent Rollout Trace Distillation recipe](../recipes/trace_ingestion/agent_rollout_distillation.md) for a complete example that turns agent traces into supervised fine-tuning data.
228196

229-
!!! tip "Custom Filesystem Readers"
230-
If you need custom row construction, fan-out behavior, or expensive hydration logic for any directory-backed seed source, build a custom `FileSystemSeedReader` and pass it via `DataDesigner(seed_readers=[...])`. See the [FileSystemSeedReader Plugins](../plugins/filesystem_seed_reader.md) guide.
231-
232197
## Sampling Strategies
233198

234199
Control how rows are read from the seed dataset.

docs/concepts/traces.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -212,4 +212,5 @@ The `extract_reasoning_content` option is available on all LLM column types:
212212

213213
## See Also
214214

215+
- **[Agent Rollout Ingestion](agent-rollout-ingestion.md)**: Import external agent traces from disk into normalized seed rows
215216
- **[Safety and Limits](mcp/safety-and-limits.md)**: Understand turn limits and timeout behavior

docs/recipes/cards.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -114,6 +114,7 @@ Each recipe is a self-contained example that can be run independently.
114114
---
115115

116116
[:material-book-open-page-variant: View Recipe](trace_ingestion/agent_rollout_distillation.md){ .md-button }
117+
[:material-file-document-outline: Ingestion Guide](../concepts/agent-rollout-ingestion.md){ .md-button }
117118
[Download Code :octicons-download-24:](../assets/recipes/trace_ingestion/agent_rollout_distillation.py){ .md-button download="agent_rollout_distillation.py" }
118119

119120

docs/recipes/trace_ingestion/agent_rollout_distillation.md

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,9 @@ imported trace into a compact task digest, a standalone instruction-response pai
88
judge-scored quality signal you can use for downstream filtering. It supports both full dataset creation and in-memory
99
preview mode via `--preview`.
1010

11+
!!! info "Looking for ingestion details?"
12+
See [Agent Rollout Ingestion](../../concepts/agent-rollout-ingestion.md) for supported formats, default paths, normalized columns, and rollout-specific parsing behavior. This recipe stays focused on the distillation pipeline.
13+
1114
```python
1215
--8<-- "assets/recipes/trace_ingestion/agent_rollout_distillation.py"
1316
```

mkdocs.yml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,7 @@ nav:
99
- Concepts:
1010
- Columns: concepts/columns.md
1111
- Seed Datasets: concepts/seed-datasets.md
12+
- Agent Rollout Ingestion: concepts/agent-rollout-ingestion.md
1213
- Models:
1314
- Default Model Settings: concepts/models/default-model-settings.md
1415
- Configure with the CLI: concepts/models/configure-model-settings-with-the-cli.md

0 commit comments

Comments
 (0)