Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
56 changes: 34 additions & 22 deletions docs/concepts/agent-rollout-ingestion.md
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,18 @@ Use `AgentRolloutSeedSource` when you want to work from existing agent traces in
)
```

=== "Pi Coding Agent"

Uses `~/.pi/agent/sessions` and `*.jsonl` by default. Sessions are tree-structured JSONL files; the active conversation path is resolved automatically.

```python
import data_designer.config as dd

seed_source = dd.AgentRolloutSeedSource(
format=dd.AgentRolloutFormat.PI_CODING_AGENT,
)
```

=== "ATIF"

ATIF requires an explicit `path`. See Harbor's [ATIF documentation](https://harborframework.com/docs/trajectory-format) for the format specification.
Expand All @@ -63,31 +75,31 @@ You can override `path` and `file_pattern` for any format when your rollout arti

All supported rollout formats map into the same seeded row schema. In the table below, `None` means the source artifact does not expose that field directly, and `derived` means Data Designer computes it from normalized `messages`.

| Normalized field | ATIF | Claude Code | Codex | Hermes Agent |
|---|---|---|---|---|
| `trace_id` | `session_id` | `sessionId[:agentId]` | `session_meta.id` or file stem | CLI `session_id` or file stem; gateway file stem |
| `source_kind` | `"atif"` | `"claude_code"` | `"codex"` | `"hermes_agent"` |
| `source_path` | Parsed `.json` path | Parsed `.jsonl` trace path | Parsed `rollout-*.jsonl` path | Parsed CLI `.json` or gateway `.jsonl` path |
| `root_session_id` | `session_id` | `sessionId` or file stem | `trace_id` | `trace_id` |
| `agent_id` | `None` | `agentId` | `None` | `None` |
| `is_sidechain` | `False` | `isSidechain` | `False` | `False` |
| `cwd` | `agent.extra.cwd` | First non-null record `cwd` | `session_meta.cwd` | `None` |
| `project_path` | `extra.project_path` or `cwd` | `projectPath` or `cwd` | `cwd` | `None` |
| `git_branch` | `agent.extra.git_branch` | First non-null record `gitBranch` | `session_meta.git_branch` | `None` |
| `started_at` | Earliest step timestamp | Earliest row timestamp | `session_meta.timestamp` or earliest record timestamp | CLI `session_start`; gateway `created_at` |
| `ended_at` | Latest step timestamp | Latest row timestamp | Latest record timestamp | CLI `last_updated`; gateway `updated_at` |
| `messages` | Normalized steps | Normalized trace rows | Normalized response items | Normalized CLI or gateway rows |
| `source_meta` | ATIF metadata | Claude metadata | Codex metadata | Hermes metadata |
| `message_count` | `derived` | `derived` | `derived` | `derived` |
| `tool_call_count` | `derived` | `derived` | `derived` | `derived` |
| `final_assistant_message` | `derived` | `derived` | `derived` | `derived` |
| Normalized field | ATIF | Claude Code | Codex | Hermes Agent | Pi Coding Agent |
|---|---|---|---|---|---|
| `trace_id` | `session_id` | `sessionId[:agentId]` | `session_meta.id` or file stem | CLI `session_id` or file stem; gateway file stem | Session header `id` |
| `source_kind` | `"atif"` | `"claude_code"` | `"codex"` | `"hermes_agent"` | `"pi_coding_agent"` |
| `source_path` | Parsed `.json` path | Parsed `.jsonl` trace path | Parsed `rollout-*.jsonl` path | Parsed CLI `.json` or gateway `.jsonl` path | Parsed `.jsonl` session path |
| `root_session_id` | `session_id` | `sessionId` or file stem | `trace_id` | `trace_id` | Session header `id` |
| `agent_id` | `None` | `agentId` | `None` | `None` | `None` |
| `is_sidechain` | `False` | `isSidechain` | `False` | `False` | `False` |
| `cwd` | `agent.extra.cwd` | First non-null record `cwd` | `session_meta.cwd` | `None` | Session header `cwd` |
| `project_path` | `extra.project_path` or `cwd` | `projectPath` or `cwd` | `cwd` | `None` | Session header `cwd` |
| `git_branch` | `agent.extra.git_branch` | First non-null record `gitBranch` | `session_meta.git_branch` | `None` | `None` |
| `started_at` | Earliest step timestamp | Earliest row timestamp | `session_meta.timestamp` or earliest record timestamp | CLI `session_start`; gateway `created_at` | Earliest entry timestamp |
| `ended_at` | Latest step timestamp | Latest row timestamp | Latest record timestamp | CLI `last_updated`; gateway `updated_at` | Latest entry timestamp |
| `messages` | Normalized steps | Normalized trace rows | Normalized response items | Normalized CLI or gateway rows | Normalized active-path messages |
| `source_meta` | ATIF metadata | Claude metadata | Codex metadata | Hermes metadata | Pi session metadata |
| `message_count` | `derived` | `derived` | `derived` | `derived` | `derived` |
| `tool_call_count` | `derived` | `derived` | `derived` | `derived` | `derived` |
| `final_assistant_message` | `derived` | `derived` | `derived` | `derived` | `derived` |

### Notes

- `trace_id`: Claude Code appends `agentId` when present. Hermes uses either the CLI session ID or the gateway transcript file stem.
- `is_sidechain`: ATIF and Hermes currently normalize this to `False`. Claude Code preserves `isSidechain` directly.
- `messages`: All formats normalize into the same chat-style message schema. See [Message Traces](traces.md) for the shared block structure.
- `source_meta`: This is where format-specific details live, such as ATIF copied-context metadata, Claude summaries, Codex response-item types, or Hermes tool/session metadata.
- `trace_id`: Claude Code appends `agentId` when present. Hermes uses either the CLI session ID or the gateway transcript file stem. Pi uses the session header `id`.
- `is_sidechain`: ATIF, Hermes, and Pi currently normalize this to `False`. Claude Code preserves `isSidechain` directly.
- `messages`: All formats normalize into the same chat-style message schema. See [Message Traces](traces.md) for the shared block structure. Pi sessions are tree-structured; only the active conversation path (from the last entry back to root) is included.
- `source_meta`: This is where format-specific details live, such as ATIF copied-context metadata, Claude summaries, Codex response-item types, Hermes tool/session metadata, or Pi session version and branch information.

## Example: Summarize a Random Turn

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -176,6 +176,10 @@ def get_hermes_agent_default_path() -> str:
return str(Path("~/.hermes/sessions").expanduser())


def get_pi_coding_agent_default_path() -> str:
return str(Path("~/.pi/agent/sessions").expanduser())


def _validate_filesystem_seed_source_path(value: str | None) -> str | None:
if value is None:
return None
Expand All @@ -200,6 +204,7 @@ class AgentRolloutFormat(StrEnum):
CLAUDE_CODE = "claude_code"
CODEX = "codex"
HERMES_AGENT = "hermes_agent"
PI_CODING_AGENT = "pi_coding_agent"


def get_agent_rollout_format_defaults(fmt: AgentRolloutFormat) -> tuple[str | None, str]:
Expand All @@ -211,6 +216,8 @@ def get_agent_rollout_format_defaults(fmt: AgentRolloutFormat) -> tuple[str | No
return (get_codex_default_path(), "*.jsonl")
if fmt == AgentRolloutFormat.HERMES_AGENT:
return (get_hermes_agent_default_path(), "*.json*")
if fmt == AgentRolloutFormat.PI_CODING_AGENT:
return (get_pi_coding_agent_default_path(), "*.jsonl")
raise ValueError(f"πŸ›‘ Unknown agent rollout format: {fmt!r}")


Expand All @@ -228,7 +235,8 @@ class AgentRolloutSeedSource(FileSystemSeedSource):
"Directory containing agent rollout artifacts. This field is required for ATIF trajectories. "
"When omitted, built-in defaults are used for formats that define one. "
"Claude Code defaults to ~/.claude/projects, Codex defaults to ~/.codex/sessions, "
"and Hermes Agent defaults to ~/.hermes/sessions. "
"Hermes Agent defaults to ~/.hermes/sessions, "
"and Pi Coding Agent defaults to ~/.pi/agent/sessions. "
"Relative paths are resolved from the current working directory when the config is loaded, "
"not from the config file location."
),
Expand All @@ -238,7 +246,7 @@ class AgentRolloutSeedSource(FileSystemSeedSource):
None,
description=(
"Case-sensitive filename pattern used to match agent rollout files. When omitted, "
"ATIF defaults to '*.json', Claude Code and Codex default to '*.jsonl', "
"ATIF defaults to '*.json', Claude Code, Codex, and Pi Coding Agent default to '*.jsonl', "
"and Hermes Agent defaults to '*.json*'."
),
)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@
coerce_optional_str,
load_json_object,
load_jsonl_rows,
normalize_message_content,
normalize_message_role,
require_string,
stringify_json_value,
Expand Down Expand Up @@ -244,7 +245,7 @@ def normalize_hermes_messages(
normalized_messages.append(
build_message(
role="tool",
content=_normalize_message_content(raw_message.get("content")),
content=normalize_message_content(raw_message.get("content")),
tool_call_id=require_string(
raw_message.get("tool_call_id"),
f"Hermes tool message tool_call_id #{message_index} in {file_path}",
Expand All @@ -253,7 +254,7 @@ def normalize_hermes_messages(
)
continue

content = _normalize_message_content(raw_message.get("content"))
content = normalize_message_content(raw_message.get("content"))
reasoning_content = coerce_optional_str(raw_message.get("reasoning"))
tool_calls = normalize_hermes_tool_calls(
raw_message.get("tool_calls"),
Expand Down Expand Up @@ -413,22 +414,6 @@ def _require_message_list(raw_messages: Any, *, file_path: Path, context: str) -
return raw_messages


def _normalize_message_content(content: Any) -> Any:
"""Coerce Hermes message content into the normalized content shape.

Args:
content: Raw Hermes message content.

Returns:
A string or content-block list compatible with ``build_message``.
"""
if content is None:
return ""
if isinstance(content, (str, list)):
return content
return stringify_json_value(content)


def _extract_finish_reasons(raw_messages: list[dict[str, Any]]) -> list[str]:
"""Collect distinct assistant finish reasons in first-seen order.

Expand Down
Loading
Loading