
Commit 9fbf971

Merge branch 'main' into nmulepati/feat/479-skip-conditional-gen-implementation
2 parents bf6332e + 13cd687 commit 9fbf971

15 files changed

Lines changed: 1620 additions & 63 deletions


.github/workflows/docs-preview.yml

Lines changed: 0 additions & 1 deletion
@@ -6,7 +6,6 @@ on:
     paths:
       - "docs/**"
       - "mkdocs.yml"
-      - "packages/*/src/data_designer/**"
       - ".github/workflows/docs-preview.yml"

 jobs:

README.md

Lines changed: 10 additions & 0 deletions
@@ -22,6 +22,16 @@ Data Designer helps you create synthetic datasets that go beyond simple LLM prom

 ---

+### ⚠️ Security Notice: LiteLLM Supply-Chain Incident (2026-03-24)
+
+On March 24, 2026, malicious versions of `litellm` ([1.82.7 and 1.82.8](https://github.com/BerriAI/litellm/issues/24518)) were published to PyPI containing a credential stealer. The compromised packages were available for [approximately five hours](https://www.okta.com/blog/threat-intelligence/litellm-supply-chain-attack--an-explainer-for-identity-pros/) (10:39 – 16:00 UTC) before being removed.
+
+The only Data Designer releases that could resolve to these versions are **v0.2.2** (Dec 2025) and **v0.2.3** (Jan 2026), which carried a looser `litellm<2` upper bound. These are nearly three months old and have been superseded by eight subsequent releases — both have been yanked from PyPI as a precaution. All other releases (v0.3.0 – v0.5.3) pinned `litellm` to `>=1.73.6,<1.80.12` and were never compatible with 1.82.x. Starting with v0.5.4, `litellm` is no longer a dependency.
+
+To have been impacted through Data Designer, you would need to have had one of these two old versions explicitly pinned *and* run a fresh `pip install` or dependency-cache update that resolved `litellm` during the five-hour window on March 24. If you believe you may be affected, see [BerriAI's incident report](https://github.com/BerriAI/litellm/issues/24518) for remediation steps.
+
+---
+
 ## Quick Start

 ### 1. Install
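
One way to check an environment against the notice in this README hunk is to compare the installed `litellm` version with the two releases it names. The following is a minimal sketch using only the standard library; the helper names `classify_litellm_version` and `installed_litellm_version` are hypothetical and not part of Data Designer. It inspects only the current environment, so it cannot tell whether a compromised version was installed earlier and later replaced.

```python
from __future__ import annotations

from importlib import metadata

# Versions named in the security notice above.
COMPROMISED_LITELLM_VERSIONS = {"1.82.7", "1.82.8"}


def classify_litellm_version(version: str | None) -> str:
    """Return a short status string for a litellm version (hypothetical helper)."""
    if version is None:
        return "not installed"
    if version in COMPROMISED_LITELLM_VERSIONS:
        return "compromised"
    return "not a known compromised version"


def installed_litellm_version() -> str | None:
    """Look up the installed litellm version, or None if the package is absent."""
    try:
        return metadata.version("litellm")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(f"litellm: {classify_litellm_version(installed_litellm_version())}")
```

A "compromised" result would be a reason to follow the remediation steps in the linked incident report; any other result says nothing about past installs.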

docs/concepts/agent-rollout-ingestion.md

Lines changed: 34 additions & 22 deletions
@@ -42,6 +42,18 @@ Use `AgentRolloutSeedSource` when you want to work from existing agent traces in
     )
     ```

+=== "Pi Coding Agent"
+
+    Uses `~/.pi/agent/sessions` and `*.jsonl` by default. Sessions are tree-structured JSONL files; the active conversation path is resolved automatically.
+
+    ```python
+    import data_designer.config as dd
+
+    seed_source = dd.AgentRolloutSeedSource(
+        format=dd.AgentRolloutFormat.PI_CODING_AGENT,
+    )
+    ```
+
 === "ATIF"

     ATIF requires an explicit `path`. See Harbor's [ATIF documentation](https://harborframework.com/docs/trajectory-format) for the format specification.
@@ -63,31 +75,31 @@ You can override `path` and `file_pattern` for any format when your rollout arti

 All supported rollout formats map into the same seeded row schema. In the table below, `None` means the source artifact does not expose that field directly, and `derived` means Data Designer computes it from normalized `messages`.

-| Normalized field | ATIF | Claude Code | Codex | Hermes Agent |
-|---|---|---|---|---|
-| `trace_id` | `session_id` | `sessionId[:agentId]` | `session_meta.id` or file stem | CLI `session_id` or file stem; gateway file stem |
-| `source_kind` | `"atif"` | `"claude_code"` | `"codex"` | `"hermes_agent"` |
-| `source_path` | Parsed `.json` path | Parsed `.jsonl` trace path | Parsed `rollout-*.jsonl` path | Parsed CLI `.json` or gateway `.jsonl` path |
-| `root_session_id` | `session_id` | `sessionId` or file stem | `trace_id` | `trace_id` |
-| `agent_id` | `None` | `agentId` | `None` | `None` |
-| `is_sidechain` | `False` | `isSidechain` | `False` | `False` |
-| `cwd` | `agent.extra.cwd` | First non-null record `cwd` | `session_meta.cwd` | `None` |
-| `project_path` | `extra.project_path` or `cwd` | `projectPath` or `cwd` | `cwd` | `None` |
-| `git_branch` | `agent.extra.git_branch` | First non-null record `gitBranch` | `session_meta.git_branch` | `None` |
-| `started_at` | Earliest step timestamp | Earliest row timestamp | `session_meta.timestamp` or earliest record timestamp | CLI `session_start`; gateway `created_at` |
-| `ended_at` | Latest step timestamp | Latest row timestamp | Latest record timestamp | CLI `last_updated`; gateway `updated_at` |
-| `messages` | Normalized steps | Normalized trace rows | Normalized response items | Normalized CLI or gateway rows |
-| `source_meta` | ATIF metadata | Claude metadata | Codex metadata | Hermes metadata |
-| `message_count` | `derived` | `derived` | `derived` | `derived` |
-| `tool_call_count` | `derived` | `derived` | `derived` | `derived` |
-| `final_assistant_message` | `derived` | `derived` | `derived` | `derived` |
+| Normalized field | ATIF | Claude Code | Codex | Hermes Agent | Pi Coding Agent |
+|---|---|---|---|---|---|
+| `trace_id` | `session_id` | `sessionId[:agentId]` | `session_meta.id` or file stem | CLI `session_id` or file stem; gateway file stem | Session header `id` |
+| `source_kind` | `"atif"` | `"claude_code"` | `"codex"` | `"hermes_agent"` | `"pi_coding_agent"` |
+| `source_path` | Parsed `.json` path | Parsed `.jsonl` trace path | Parsed `rollout-*.jsonl` path | Parsed CLI `.json` or gateway `.jsonl` path | Parsed `.jsonl` session path |
+| `root_session_id` | `session_id` | `sessionId` or file stem | `trace_id` | `trace_id` | Session header `id` |
+| `agent_id` | `None` | `agentId` | `None` | `None` | `None` |
+| `is_sidechain` | `False` | `isSidechain` | `False` | `False` | `False` |
+| `cwd` | `agent.extra.cwd` | First non-null record `cwd` | `session_meta.cwd` | `None` | Session header `cwd` |
+| `project_path` | `extra.project_path` or `cwd` | `projectPath` or `cwd` | `cwd` | `None` | Session header `cwd` |
+| `git_branch` | `agent.extra.git_branch` | First non-null record `gitBranch` | `session_meta.git_branch` | `None` | `None` |
+| `started_at` | Earliest step timestamp | Earliest row timestamp | `session_meta.timestamp` or earliest record timestamp | CLI `session_start`; gateway `created_at` | Earliest entry timestamp |
+| `ended_at` | Latest step timestamp | Latest row timestamp | Latest record timestamp | CLI `last_updated`; gateway `updated_at` | Latest entry timestamp |
+| `messages` | Normalized steps | Normalized trace rows | Normalized response items | Normalized CLI or gateway rows | Normalized active-path messages |
+| `source_meta` | ATIF metadata | Claude metadata | Codex metadata | Hermes metadata | Pi session metadata |
+| `message_count` | `derived` | `derived` | `derived` | `derived` | `derived` |
+| `tool_call_count` | `derived` | `derived` | `derived` | `derived` | `derived` |
+| `final_assistant_message` | `derived` | `derived` | `derived` | `derived` | `derived` |

 ### Notes

-- `trace_id`: Claude Code appends `agentId` when present. Hermes uses either the CLI session ID or the gateway transcript file stem.
-- `is_sidechain`: ATIF and Hermes currently normalize this to `False`. Claude Code preserves `isSidechain` directly.
-- `messages`: All formats normalize into the same chat-style message schema. See [Message Traces](traces.md) for the shared block structure.
-- `source_meta`: This is where format-specific details live, such as ATIF copied-context metadata, Claude summaries, Codex response-item types, or Hermes tool/session metadata.
+- `trace_id`: Claude Code appends `agentId` when present. Hermes uses either the CLI session ID or the gateway transcript file stem. Pi uses the session header `id`.
+- `is_sidechain`: ATIF, Hermes, and Pi currently normalize this to `False`. Claude Code preserves `isSidechain` directly.
+- `messages`: All formats normalize into the same chat-style message schema. See [Message Traces](traces.md) for the shared block structure. Pi sessions are tree-structured; only the active conversation path (from the last entry back to root) is included.
+- `source_meta`: This is where format-specific details live, such as ATIF copied-context metadata, Claude summaries, Codex response-item types, Hermes tool/session metadata, or Pi session version and branch information.

 ## Example: Summarize a Random Turn
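
The updated `messages` note says Pi sessions are tree-structured and that only the active path, from the last entry back to the root, is kept. The sketch below illustrates that idea on a toy session. The field names `id` and `parentId` and the function `resolve_active_path` are assumptions made for illustration; the real Pi session schema and the engine's parser are not shown in this diff.

```python
from __future__ import annotations

import json


def resolve_active_path(jsonl_text: str) -> list[dict]:
    """Walk parent links from the last entry back to the root, then return root-first order.

    Assumes each entry has a unique "id" and a "parentId" that is None for the root.
    These field names are illustrative assumptions, not a confirmed schema.
    """
    entries = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    if not entries:
        return []
    by_id = {entry["id"]: entry for entry in entries}

    path: list[dict] = []
    seen: set[str] = set()
    current = entries[-1]
    while current is not None:
        if current["id"] in seen:  # guard against malformed cyclic links
            raise ValueError(f"cycle detected at entry {current['id']!r}")
        seen.add(current["id"])
        path.append(current)
        parent_id = current.get("parentId")
        current = by_id.get(parent_id) if parent_id is not None else None
    path.reverse()
    return path


# Toy session: "b" is an abandoned branch, "c" -> "d" is the active one.
session = "\n".join(
    json.dumps(entry)
    for entry in [
        {"id": "a", "parentId": None, "text": "root"},
        {"id": "b", "parentId": "a", "text": "abandoned branch"},
        {"id": "c", "parentId": "a", "text": "active branch"},
        {"id": "d", "parentId": "c", "text": "latest entry"},
    ]
)
active_ids = [entry["id"] for entry in resolve_active_path(session)]
print(active_ids)  # ['a', 'c', 'd']
```

Entry `b` is dropped because it is not an ancestor of the last entry, which matches the "active conversation path" wording in the note.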

packages/data-designer-config/src/data_designer/config/column_configs.py

Lines changed: 22 additions & 8 deletions
@@ -44,8 +44,10 @@ class SamplerColumnConfig(SingleColumnConfig):
         conditional_params: Optional dictionary for conditional parameters. The dict keys
             are the conditions that must be met (e.g., "age > 21") for the conditional parameters
             to be used. The values of dict are the parameters to use when the condition is met.
-        convert_to: Optional type conversion to apply after sampling. Must be one of "float", "int", or "str".
-            Useful for converting numerical samples to strings or other types.
+        convert_to: Optional type conversion to apply after sampling. For numerical samplers,
+            must be one of "float", "int", or "str". For datetime and timedelta samplers, accepts
+            a strftime format string (e.g., ``"%Y-%m-%d"``, ``"%m/%d/%Y %H:%M"``). When omitted,
+            datetime/timedelta columns default to ISO-8601 format (e.g., ``2024-01-15T09:30:00``).

     Inherited Attributes:
         name (required): Unique name of the column to be generated.
@@ -70,7 +72,12 @@ class SamplerColumnConfig(SingleColumnConfig):
         description="Optional dictionary for conditional parameters; keys are conditions, values are params to use when met",
     )
     convert_to: str | None = Field(
-        default=None, description="Optional type conversion after sampling: 'float', 'int', or 'str'"
+        default=None,
+        description=(
+            "Optional type conversion after sampling: 'float', 'int', or 'str' for numerical samplers; "
+            "a strftime format string (e.g., '%Y-%m-%d') for datetime/timedelta samplers. "
+            "Datetime/timedelta columns default to ISO-8601 (e.g., 2024-01-15T09:30:00) when omitted."
+        ),
     )
     column_type: Literal["sampler"] = "sampler"
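
The new `convert_to` description contrasts a strftime format string with the ISO-8601 default. The snippet below uses only the standard-library `datetime` to show what those output shapes look like for the example timestamp in the docstring. It does not call Data Designer, and how the sampler applies the format internally is not shown in this diff.

```python
from datetime import datetime

# The example timestamp used in the updated docstring.
sample = datetime(2024, 1, 15, 9, 30, 0)

iso_default = sample.isoformat()               # shape used when convert_to is omitted
date_only = sample.strftime("%Y-%m-%d")        # first example format from the docstring
us_style = sample.strftime("%m/%d/%Y %H:%M")   # second example format from the docstring

print(iso_default)  # 2024-01-15T09:30:00
print(date_only)    # 2024-01-15
print(us_style)     # 01/15/2024 09:30
```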

@@ -178,14 +185,17 @@ def get_column_emoji() -> str:

     @property
     def required_columns(self) -> list[str]:
-        """Get columns referenced in the prompt and system_prompt templates.
+        """Get columns referenced in prompt templates and multi-modal context.

         Returns:
-            List of unique column names referenced in Jinja2 templates.
+            List of unique column names referenced in Jinja2 templates
+                and multi-modal context configurations.
         """
         required_cols = list(extract_keywords_from_jinja2_template(self.prompt))
         if self.system_prompt:
             required_cols.extend(list(extract_keywords_from_jinja2_template(self.system_prompt)))
+        if self.multi_modal_context:
+            required_cols.extend(ctx.column_name for ctx in self.multi_modal_context)
         return list(set(required_cols))

     @property
@@ -586,12 +596,16 @@ def get_column_emoji() -> str:

     @property
     def required_columns(self) -> list[str]:
-        """Get columns referenced in the prompt template.
+        """Get columns referenced in the prompt template and multi-modal context.

         Returns:
-            List of unique column names referenced in Jinja2 templates.
+            List of unique column names referenced in Jinja2 templates
+                and multi-modal context configurations.
         """
-        return list(extract_keywords_from_jinja2_template(self.prompt))
+        required_cols = list(extract_keywords_from_jinja2_template(self.prompt))
+        if self.multi_modal_context:
+            required_cols.extend(ctx.column_name for ctx in self.multi_modal_context)
+        return list(set(required_cols))

     @model_validator(mode="after")
     def assert_prompt_valid_jinja(self) -> Self:
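
Both `required_columns` implementations in this file end with `list(set(required_cols))`, which removes duplicates but does not guarantee any particular order. For comparison, the sketch below shows an order-preserving way to merge two sources of column names. The function `merge_required_columns` is hypothetical and is not part of this commit; it only illustrates the trade-off.

```python
from collections.abc import Iterable


def merge_required_columns(
    template_columns: Iterable[str], context_columns: Iterable[str]
) -> list[str]:
    """Combine two column-name sources, dropping duplicates while keeping first-seen order."""
    # dict keys are unique and preserve insertion order.
    return list(dict.fromkeys([*template_columns, *context_columns]))


merged = merge_required_columns(
    ["description", "image_col"], ["image_col", "reference_image"]
)
print(merged)  # ['description', 'image_col', 'reference_image']
```

With the set-based version the same inputs yield the same three names, but callers should not rely on their order.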

packages/data-designer-config/src/data_designer/config/seed_source.py

Lines changed: 10 additions & 2 deletions
@@ -176,6 +176,10 @@ def get_hermes_agent_default_path() -> str:
     return str(Path("~/.hermes/sessions").expanduser())


+def get_pi_coding_agent_default_path() -> str:
+    return str(Path("~/.pi/agent/sessions").expanduser())
+
+
 def _validate_filesystem_seed_source_path(value: str | None) -> str | None:
     if value is None:
         return None
@@ -200,6 +204,7 @@ class AgentRolloutFormat(StrEnum):
     CLAUDE_CODE = "claude_code"
     CODEX = "codex"
     HERMES_AGENT = "hermes_agent"
+    PI_CODING_AGENT = "pi_coding_agent"


 def get_agent_rollout_format_defaults(fmt: AgentRolloutFormat) -> tuple[str | None, str]:
@@ -211,6 +216,8 @@ def get_agent_rollout_format_defaults(fmt: AgentRolloutFormat) -> tuple[str | No
         return (get_codex_default_path(), "*.jsonl")
     if fmt == AgentRolloutFormat.HERMES_AGENT:
         return (get_hermes_agent_default_path(), "*.json*")
+    if fmt == AgentRolloutFormat.PI_CODING_AGENT:
+        return (get_pi_coding_agent_default_path(), "*.jsonl")
     raise ValueError(f"🛑 Unknown agent rollout format: {fmt!r}")


@@ -228,7 +235,8 @@ class AgentRolloutSeedSource(FileSystemSeedSource):
             "Directory containing agent rollout artifacts. This field is required for ATIF trajectories. "
             "When omitted, built-in defaults are used for formats that define one. "
             "Claude Code defaults to ~/.claude/projects, Codex defaults to ~/.codex/sessions, "
-            "and Hermes Agent defaults to ~/.hermes/sessions. "
+            "Hermes Agent defaults to ~/.hermes/sessions, "
+            "and Pi Coding Agent defaults to ~/.pi/agent/sessions. "
             "Relative paths are resolved from the current working directory when the config is loaded, "
             "not from the config file location."
         ),
@@ -238,7 +246,7 @@ class AgentRolloutSeedSource(FileSystemSeedSource):
         None,
         description=(
             "Case-sensitive filename pattern used to match agent rollout files. When omitted, "
-            "ATIF defaults to '*.json', Claude Code and Codex default to '*.jsonl', "
+            "ATIF defaults to '*.json', Claude Code, Codex, and Pi Coding Agent default to '*.jsonl', "
             "and Hermes Agent defaults to '*.json*'."
         ),
     )
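
The field descriptions in this file list a default path and a case-sensitive filename pattern for each format. The sketch below collects those defaults in one table and shows how the patterns differ in what they match. It is a standalone illustration: the dictionary keys reuse the enum string values from the diff (with `"atif"` taken from the docs table), and using `fnmatch.fnmatchcase` for matching is an assumption, since the library's actual matching code is not shown here.

```python
from __future__ import annotations

from fnmatch import fnmatchcase

# Default (path, file_pattern) pairs as stated in the field descriptions above.
# ATIF has no default path, so its path entry is None.
FORMAT_DEFAULTS: dict[str, tuple[str | None, str]] = {
    "atif": (None, "*.json"),
    "claude_code": ("~/.claude/projects", "*.jsonl"),
    "codex": ("~/.codex/sessions", "*.jsonl"),
    "hermes_agent": ("~/.hermes/sessions", "*.json*"),
    "pi_coding_agent": ("~/.pi/agent/sessions", "*.jsonl"),
}


def matching_files(format_name: str, filenames: list[str]) -> list[str]:
    """Filter filenames with the format's default pattern, case-sensitively."""
    _, pattern = FORMAT_DEFAULTS[format_name]
    return [name for name in filenames if fnmatchcase(name, pattern)]


names = ["session.jsonl", "session.json", "SESSION.JSONL", "notes.txt"]
pi_matches = matching_files("pi_coding_agent", names)
hermes_matches = matching_files("hermes_agent", names)
print(pi_matches)      # ['session.jsonl']
print(hermes_matches)  # ['session.jsonl', 'session.json']
```

The Hermes pattern `*.json*` accepts both `.json` and `.jsonl` files, while the Pi default `*.jsonl` accepts only the latter.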

packages/data-designer-config/tests/config/test_columns.py

Lines changed: 32 additions & 0 deletions
@@ -9,6 +9,7 @@
 from data_designer.config.column_configs import (
     EmbeddingColumnConfig,
     ExpressionColumnConfig,
+    ImageColumnConfig,
     LLMCodeColumnConfig,
     LLMJudgeColumnConfig,
     LLMStructuredColumnConfig,
@@ -26,6 +27,7 @@
     is_plugin_column_type,
 )
 from data_designer.config.errors import InvalidConfigError
+from data_designer.config.models import ImageContext
 from data_designer.config.sampler_params import (
     CategorySamplerParams,
     GaussianSamplerParams,
@@ -122,6 +124,36 @@ def test_llm_text_column_config():
     )


+def test_llm_text_column_config_required_columns_includes_multi_modal_context():
+    config = LLMTextColumnConfig(
+        name="test_llm_text",
+        prompt="Classify this image: {{ description }}",
+        model_alias=stub_model_alias,
+        multi_modal_context=[ImageContext(column_name="image_base64")],
+    )
+    assert set(config.required_columns) == {"description", "image_base64"}
+
+
+def test_llm_text_column_config_required_columns_deduplicates_multi_modal_and_prompt():
+    config = LLMTextColumnConfig(
+        name="test_llm_text",
+        prompt="Classify this: {{ image_col }}",
+        model_alias=stub_model_alias,
+        multi_modal_context=[ImageContext(column_name="image_col")],
+    )
+    assert config.required_columns == ["image_col"]
+
+
+def test_image_column_config_required_columns_includes_multi_modal_context():
+    config = ImageColumnConfig(
+        name="test_image",
+        prompt="Generate based on {{ style }}",
+        model_alias=stub_model_alias,
+        multi_modal_context=[ImageContext(column_name="reference_image")],
+    )
+    assert set(config.required_columns) == {"style", "reference_image"}
+
+
 def test_llm_text_column_config_with_trace_serialization() -> None:
     """Test that with_trace field serializes and deserializes correctly."""
     config = LLMTextColumnConfig(

packages/data-designer-engine/src/data_designer/engine/resources/agent_rollout/hermes_agent.py

Lines changed: 3 additions & 18 deletions
@@ -16,6 +16,7 @@
     coerce_optional_str,
     load_json_object,
     load_jsonl_rows,
+    normalize_message_content,
     normalize_message_role,
     require_string,
     stringify_json_value,
@@ -244,7 +245,7 @@ def normalize_hermes_messages(
             normalized_messages.append(
                 build_message(
                     role="tool",
-                    content=_normalize_message_content(raw_message.get("content")),
+                    content=normalize_message_content(raw_message.get("content")),
                     tool_call_id=require_string(
                         raw_message.get("tool_call_id"),
                         f"Hermes tool message tool_call_id #{message_index} in {file_path}",
@@ -253,7 +254,7 @@ def normalize_hermes_messages(
             )
             continue

-        content = _normalize_message_content(raw_message.get("content"))
+        content = normalize_message_content(raw_message.get("content"))
         reasoning_content = coerce_optional_str(raw_message.get("reasoning"))
         tool_calls = normalize_hermes_tool_calls(
             raw_message.get("tool_calls"),
@@ -413,22 +414,6 @@ def _require_message_list(raw_messages: Any, *, file_path: Path, context: str) -
     return raw_messages


-def _normalize_message_content(content: Any) -> Any:
-    """Coerce Hermes message content into the normalized content shape.
-
-    Args:
-        content: Raw Hermes message content.
-
-    Returns:
-        A string or content-block list compatible with ``build_message``.
-    """
-    if content is None:
-        return ""
-    if isinstance(content, (str, list)):
-        return content
-    return stringify_json_value(content)
-
-
 def _extract_finish_reasons(raw_messages: list[dict[str, Any]]) -> list[str]:
     """Collect distinct assistant finish reasons in first-seen order.
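
This file's diff swaps the private `_normalize_message_content` helper for a shared `normalize_message_content` import. The shared implementation is not part of the visible hunks, so the sketch below reproduces only the logic of the removed private helper. The name `normalize_content_sketch` is hypothetical, and `json.dumps` stands in for `stringify_json_value`, whose behavior is not shown in this diff.

```python
import json
from typing import Any


def normalize_content_sketch(content: Any) -> Any:
    """Mirror the removed helper: None -> "", str/list pass through, anything else is stringified.

    json.dumps is an assumed stand-in for stringify_json_value.
    """
    if content is None:
        return ""
    if isinstance(content, (str, list)):
        return content
    return json.dumps(content)


blocks = [{"type": "text", "text": "hi"}]
print(repr(normalize_content_sketch(None)))        # ''
print(normalize_content_sketch("hello"))           # hello
print(normalize_content_sketch(blocks) is blocks)  # True
print(normalize_content_sketch({"status": "ok"}))  # {"status": "ok"}
```

Strings and content-block lists are returned unchanged, so only unexpected shapes such as dictionaries are converted to text.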
