Skip to content

Commit fdd5ebb

Browse files
authored
feat: add Pi Coding Agent rollout seed source (#513) (#514)
Add support for ingesting Pi Coding Agent session artifacts as an agent rollout seed source. Pi sessions are tree-structured JSONL files; the handler resolves the active conversation path by walking from the last entry back to the root via id/parentId links. Key points: - Tree-structured sessions with automatic active-path resolution - Entry-level types: model_change, compaction, branch_summary, custom_message, thinking_level_change - Message roles: user, assistant (inline ToolCall/ThinkingContent/ TextContent blocks), toolResult, bashExecution (synthesized as tool-call pairs), custom, compactionSummary, branchSummary - Extract shared normalize_message_content to utils.py (was duplicated in Hermes handler)
1 parent c27ad62 commit fdd5ebb

File tree

7 files changed

+1345
-42
lines changed

7 files changed

+1345
-42
lines changed

docs/concepts/agent-rollout-ingestion.md

Lines changed: 34 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -42,6 +42,18 @@ Use `AgentRolloutSeedSource` when you want to work from existing agent traces in
4242
)
4343
```
4444

45+
=== "Pi Coding Agent"
46+
47+
Uses `~/.pi/agent/sessions` and `*.jsonl` by default. Sessions are tree-structured JSONL files; the active conversation path is resolved automatically.
48+
49+
```python
50+
import data_designer.config as dd
51+
52+
seed_source = dd.AgentRolloutSeedSource(
53+
format=dd.AgentRolloutFormat.PI_CODING_AGENT,
54+
)
55+
```
56+
4557
=== "ATIF"
4658

4759
ATIF requires an explicit `path`. See Harbor's [ATIF documentation](https://harborframework.com/docs/trajectory-format) for the format specification.
@@ -63,31 +75,31 @@ You can override `path` and `file_pattern` for any format when your rollout arti
6375

6476
All supported rollout formats map into the same seeded row schema. In the table below, `None` means the source artifact does not expose that field directly, and `derived` means Data Designer computes it from normalized `messages`.
6577

66-
| Normalized field | ATIF | Claude Code | Codex | Hermes Agent |
67-
|---|---|---|---|---|
68-
| `trace_id` | `session_id` | `sessionId[:agentId]` | `session_meta.id` or file stem | CLI `session_id` or file stem; gateway file stem |
69-
| `source_kind` | `"atif"` | `"claude_code"` | `"codex"` | `"hermes_agent"` |
70-
| `source_path` | Parsed `.json` path | Parsed `.jsonl` trace path | Parsed `rollout-*.jsonl` path | Parsed CLI `.json` or gateway `.jsonl` path |
71-
| `root_session_id` | `session_id` | `sessionId` or file stem | `trace_id` | `trace_id` |
72-
| `agent_id` | `None` | `agentId` | `None` | `None` |
73-
| `is_sidechain` | `False` | `isSidechain` | `False` | `False` |
74-
| `cwd` | `agent.extra.cwd` | First non-null record `cwd` | `session_meta.cwd` | `None` |
75-
| `project_path` | `extra.project_path` or `cwd` | `projectPath` or `cwd` | `cwd` | `None` |
76-
| `git_branch` | `agent.extra.git_branch` | First non-null record `gitBranch` | `session_meta.git_branch` | `None` |
77-
| `started_at` | Earliest step timestamp | Earliest row timestamp | `session_meta.timestamp` or earliest record timestamp | CLI `session_start`; gateway `created_at` |
78-
| `ended_at` | Latest step timestamp | Latest row timestamp | Latest record timestamp | CLI `last_updated`; gateway `updated_at` |
79-
| `messages` | Normalized steps | Normalized trace rows | Normalized response items | Normalized CLI or gateway rows |
80-
| `source_meta` | ATIF metadata | Claude metadata | Codex metadata | Hermes metadata |
81-
| `message_count` | `derived` | `derived` | `derived` | `derived` |
82-
| `tool_call_count` | `derived` | `derived` | `derived` | `derived` |
83-
| `final_assistant_message` | `derived` | `derived` | `derived` | `derived` |
78+
| Normalized field | ATIF | Claude Code | Codex | Hermes Agent | Pi Coding Agent |
79+
|---|---|---|---|---|---|
80+
| `trace_id` | `session_id` | `sessionId[:agentId]` | `session_meta.id` or file stem | CLI `session_id` or file stem; gateway file stem | Session header `id` |
81+
| `source_kind` | `"atif"` | `"claude_code"` | `"codex"` | `"hermes_agent"` | `"pi_coding_agent"` |
82+
| `source_path` | Parsed `.json` path | Parsed `.jsonl` trace path | Parsed `rollout-*.jsonl` path | Parsed CLI `.json` or gateway `.jsonl` path | Parsed `.jsonl` session path |
83+
| `root_session_id` | `session_id` | `sessionId` or file stem | `trace_id` | `trace_id` | Session header `id` |
84+
| `agent_id` | `None` | `agentId` | `None` | `None` | `None` |
85+
| `is_sidechain` | `False` | `isSidechain` | `False` | `False` | `False` |
86+
| `cwd` | `agent.extra.cwd` | First non-null record `cwd` | `session_meta.cwd` | `None` | Session header `cwd` |
87+
| `project_path` | `extra.project_path` or `cwd` | `projectPath` or `cwd` | `cwd` | `None` | Session header `cwd` |
88+
| `git_branch` | `agent.extra.git_branch` | First non-null record `gitBranch` | `session_meta.git_branch` | `None` | `None` |
89+
| `started_at` | Earliest step timestamp | Earliest row timestamp | `session_meta.timestamp` or earliest record timestamp | CLI `session_start`; gateway `created_at` | Earliest entry timestamp |
90+
| `ended_at` | Latest step timestamp | Latest row timestamp | Latest record timestamp | CLI `last_updated`; gateway `updated_at` | Latest entry timestamp |
91+
| `messages` | Normalized steps | Normalized trace rows | Normalized response items | Normalized CLI or gateway rows | Normalized active-path messages |
92+
| `source_meta` | ATIF metadata | Claude metadata | Codex metadata | Hermes metadata | Pi session metadata |
93+
| `message_count` | `derived` | `derived` | `derived` | `derived` | `derived` |
94+
| `tool_call_count` | `derived` | `derived` | `derived` | `derived` | `derived` |
95+
| `final_assistant_message` | `derived` | `derived` | `derived` | `derived` | `derived` |
8496

8597
### Notes
8698

87-
- `trace_id`: Claude Code appends `agentId` when present. Hermes uses either the CLI session ID or the gateway transcript file stem.
88-
- `is_sidechain`: ATIF and Hermes currently normalize this to `False`. Claude Code preserves `isSidechain` directly.
89-
- `messages`: All formats normalize into the same chat-style message schema. See [Message Traces](traces.md) for the shared block structure.
90-
- `source_meta`: This is where format-specific details live, such as ATIF copied-context metadata, Claude summaries, Codex response-item types, or Hermes tool/session metadata.
99+
- `trace_id`: Claude Code appends `agentId` when present. Hermes uses either the CLI session ID or the gateway transcript file stem. Pi uses the session header `id`.
100+
- `is_sidechain`: ATIF, Hermes, and Pi currently normalize this to `False`. Claude Code preserves `isSidechain` directly.
101+
- `messages`: All formats normalize into the same chat-style message schema. See [Message Traces](traces.md) for the shared block structure. Pi sessions are tree-structured; only the active conversation path (from the last entry back to root) is included.
102+
- `source_meta`: This is where format-specific details live, such as ATIF copied-context metadata, Claude summaries, Codex response-item types, Hermes tool/session metadata, or Pi session version and branch information.
91103

92104
## Example: Summarize a Random Turn
93105

packages/data-designer-config/src/data_designer/config/seed_source.py

Lines changed: 10 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -176,6 +176,10 @@ def get_hermes_agent_default_path() -> str:
176176
return str(Path("~/.hermes/sessions").expanduser())
177177

178178

179+
def get_pi_coding_agent_default_path() -> str:
180+
return str(Path("~/.pi/agent/sessions").expanduser())
181+
182+
179183
def _validate_filesystem_seed_source_path(value: str | None) -> str | None:
180184
if value is None:
181185
return None
@@ -200,6 +204,7 @@ class AgentRolloutFormat(StrEnum):
200204
CLAUDE_CODE = "claude_code"
201205
CODEX = "codex"
202206
HERMES_AGENT = "hermes_agent"
207+
PI_CODING_AGENT = "pi_coding_agent"
203208

204209

205210
def get_agent_rollout_format_defaults(fmt: AgentRolloutFormat) -> tuple[str | None, str]:
@@ -211,6 +216,8 @@ def get_agent_rollout_format_defaults(fmt: AgentRolloutFormat) -> tuple[str | No
211216
return (get_codex_default_path(), "*.jsonl")
212217
if fmt == AgentRolloutFormat.HERMES_AGENT:
213218
return (get_hermes_agent_default_path(), "*.json*")
219+
if fmt == AgentRolloutFormat.PI_CODING_AGENT:
220+
return (get_pi_coding_agent_default_path(), "*.jsonl")
214221
raise ValueError(f"🛑 Unknown agent rollout format: {fmt!r}")
215222

216223

@@ -228,7 +235,8 @@ class AgentRolloutSeedSource(FileSystemSeedSource):
228235
"Directory containing agent rollout artifacts. This field is required for ATIF trajectories. "
229236
"When omitted, built-in defaults are used for formats that define one. "
230237
"Claude Code defaults to ~/.claude/projects, Codex defaults to ~/.codex/sessions, "
231-
"and Hermes Agent defaults to ~/.hermes/sessions. "
238+
"Hermes Agent defaults to ~/.hermes/sessions, "
239+
"and Pi Coding Agent defaults to ~/.pi/agent/sessions. "
232240
"Relative paths are resolved from the current working directory when the config is loaded, "
233241
"not from the config file location."
234242
),
@@ -238,7 +246,7 @@ class AgentRolloutSeedSource(FileSystemSeedSource):
238246
None,
239247
description=(
240248
"Case-sensitive filename pattern used to match agent rollout files. When omitted, "
241-
"ATIF defaults to '*.json', Claude Code and Codex default to '*.jsonl', "
249+
"ATIF defaults to '*.json', Claude Code, Codex, and Pi Coding Agent default to '*.jsonl', "
242250
"and Hermes Agent defaults to '*.json*'."
243251
),
244252
)

packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/hermes_agent.py

Lines changed: 3 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,7 @@
1616
coerce_optional_str,
1717
load_json_object,
1818
load_jsonl_rows,
19+
normalize_message_content,
1920
normalize_message_role,
2021
require_string,
2122
stringify_json_value,
@@ -244,7 +245,7 @@ def normalize_hermes_messages(
244245
normalized_messages.append(
245246
build_message(
246247
role="tool",
247-
content=_normalize_message_content(raw_message.get("content")),
248+
content=normalize_message_content(raw_message.get("content")),
248249
tool_call_id=require_string(
249250
raw_message.get("tool_call_id"),
250251
f"Hermes tool message tool_call_id #{message_index} in {file_path}",
@@ -253,7 +254,7 @@ def normalize_hermes_messages(
253254
)
254255
continue
255256

256-
content = _normalize_message_content(raw_message.get("content"))
257+
content = normalize_message_content(raw_message.get("content"))
257258
reasoning_content = coerce_optional_str(raw_message.get("reasoning"))
258259
tool_calls = normalize_hermes_tool_calls(
259260
raw_message.get("tool_calls"),
@@ -413,22 +414,6 @@ def _require_message_list(raw_messages: Any, *, file_path: Path, context: str) -
413414
return raw_messages
414415

415416

416-
def _normalize_message_content(content: Any) -> Any:
417-
"""Coerce Hermes message content into the normalized content shape.
418-
419-
Args:
420-
content: Raw Hermes message content.
421-
422-
Returns:
423-
A string or content-block list compatible with ``build_message``.
424-
"""
425-
if content is None:
426-
return ""
427-
if isinstance(content, (str, list)):
428-
return content
429-
return stringify_json_value(content)
430-
431-
432417
def _extract_finish_reasons(raw_messages: list[dict[str, Any]]) -> list[str]:
433418
"""Collect distinct assistant finish reasons in first-seen order.
434419

0 commit comments

Comments
 (0)