Commit 9f640ca

docs: add to_config_builder convenience method and concrete use cases
1 parent d1c6c64 commit 9f640ca

1 file changed

Lines changed: 111 additions & 1 deletion

plans/workflow-chaining/workflow-chaining.md

@@ -56,6 +56,24 @@ results["conversations"].load_dataset() # stage 2 output
```python
results["judged"].load_dataset() # final output
```

**Convenience method on results (lightweight, for notebooks):**

For interactive use where a full pipeline is overkill, a `to_config_builder()` method on `DatasetCreationResults` returns a pre-seeded `DataDesignerConfigBuilder`:

```python
# Stage 1
result = dd.create(config_personas, num_records=100)

# Stage 2 - just grab the result and keep going
config_convos = (
    result.to_config_builder(columns=["name", "age", "background"])  # optional column selection
    .add_column(name="conversation", column_type="llm_text", prompt="...")
)
result_2 = dd.create(config_convos, num_records=1000)
```

This is a thin wrapper: it loads the dataset, optionally filters columns, wraps the result in a `DataFrameSeedSource`, and returns a new config builder. There is no tracking, no provenance, and no callbacks; it is just a quick bridge for iteration.
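
The four steps of the wrapper can be sketched without the real library. This is a minimal illustration under stated assumptions: `ResultsSketch`, `SeedSourceSketch`, and `ConfigBuilderSketch` are hypothetical stand-ins for `DatasetCreationResults`, `DataFrameSeedSource`, and `DataDesignerConfigBuilder`, and rows are plain dictionaries rather than a DataFrame.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SeedSourceSketch:
    """Hypothetical stand-in for DataFrameSeedSource: it only holds rows."""
    rows: list


@dataclass
class ConfigBuilderSketch:
    """Hypothetical stand-in for DataDesignerConfigBuilder."""
    seed: Optional[SeedSourceSketch] = None
    columns: list = field(default_factory=list)

    def add_column(self, **spec):
        self.columns.append(spec)
        return self  # return self so calls can be chained


@dataclass
class ResultsSketch:
    """Hypothetical stand-in for DatasetCreationResults."""
    rows: list

    def load_dataset(self):
        return self.rows

    def to_config_builder(self, columns=None):
        rows = self.load_dataset()                 # 1. load the dataset
        if columns is not None:                    # 2. optionally filter columns
            rows = [{c: row[c] for c in columns} for row in rows]
        seed = SeedSourceSketch(rows)              # 3. wrap in a seed source
        return ConfigBuilderSketch(seed=seed)      # 4. return a new builder


result = ResultsSketch([{"name": "Ada", "age": 36, "city": "London"}])
builder = result.to_config_builder(columns=["name", "age"]).add_column(
    name="conversation", column_type="llm_text", prompt="..."
)
```

In this sketch the original result is left untouched and an unknown column name raises `KeyError`; whether the real method should validate column names more gracefully is an open design choice.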

**Auto-chaining from a single config (future):**

The engine detects columns that were previously `allow_resize=True` (or a new marker like `stage_boundary=True`) and auto-splits the DAG into stages. This is a convenience layer on top of the explicit API - not required for v1.
@@ -163,10 +181,102 @@ For users who need programmatic filtering at the seed boundary, a seed reader pl

The engine does not know about pipelines. Each stage is a regular `DatasetBuilder.build()` call.

## Use cases for implementation and testing

These use cases should guide the implementation and serve as the basis for tutorial notebooks.

### 1. Explode: personas to conversations

Generate a small, high-quality set of personas, then produce many conversations from each.

```python
# Stage 1: 100 diverse personas
config_personas = (
    DataDesignerConfigBuilder()
    .add_column(name="name", column_type="sampler", sampler_type="person_name")
    .add_column(name="age", column_type="sampler", sampler_type="uniform_int", params=...)
    .add_column(name="background", column_type="llm_text", prompt="Write a short background for {{ name }}, age {{ age }}.")
)

# Stage 2: 1000 conversations (each persona used ~10 times via seed cycling)
config_convos = (
    DataDesignerConfigBuilder()
    .add_column(name="topic", column_type="llm_text", prompt="Generate a conversation topic for {{ name }}...")
    .add_column(name="conversation", column_type="llm_text", prompt="Write a conversation between {{ name }} and an assistant about {{ topic }}...")
)

pipeline = dd.pipeline()
pipeline.add_stage("personas", config_personas, num_records=100)
pipeline.add_stage("conversations", config_convos, num_records=1000)
results = pipeline.run()
```
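
The "~10 times" figure follows from cycling 100 seed rows until 1000 records are produced. The plan does not spell out the cycling order, so the sketch below assumes simple round-robin reuse; `cycle_seed_rows` is a hypothetical helper, not part of the library.

```python
from collections import Counter
from itertools import cycle, islice


def cycle_seed_rows(seed_rows, num_records):
    """Repeat seed rows in order until num_records rows have been drawn."""
    return list(islice(cycle(seed_rows), num_records))


personas = [f"persona_{i}" for i in range(100)]
assigned = cycle_seed_rows(personas, 1000)

# Count how often each persona was reused as a seed.
uses = Counter(assigned)
```

With 1000 records and 100 personas, every persona is used exactly 10 times; when the counts do not divide evenly, the earliest rows get one extra use under this assumed order.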

### 2. Filter-then-enrich

Generate candidates, filter them with a between-stage callback, then enrich the survivors.

```python
from pathlib import Path

import pandas as pd

config_gen = ...  # generates rows with a quality_score column
config_enrich = ...  # adds detailed analysis columns

def keep_high_quality(stage_output_path: Path) -> Path:
    # Between-stage callback: keep only rows with quality_score > 0.8
    # and return the directory that holds the filtered data.
    df = pd.read_parquet(stage_output_path / "parquet-files")
    df = df[df["quality_score"] > 0.8]
    out = stage_output_path.parent / "filtered"
    out.mkdir(exist_ok=True)
    df.to_parquet(out / "data.parquet")
    return out

pipeline = dd.pipeline()
pipeline.add_stage("candidates", config_gen, num_records=5000)
pipeline.add_stage("enriched", config_enrich, after=keep_high_quality)
results = pipeline.run()
```

### 3. Generate-then-judge with different models

Iterate on the judging config without re-generating the base data.

```python
# Stage 1: generate with a fast model
config_gen = DataDesignerConfigBuilder(model_configs=[fast_model])...

# Stage 2: judge with a stronger model
config_judge = DataDesignerConfigBuilder(model_configs=[strong_model])...

pipeline = dd.pipeline()
pipeline.add_stage("generated", config_gen, num_records=1000)
pipeline.add_stage("judged", config_judge)
results = pipeline.run()

# Later: tweak judging config, resume from stage 1 output
pipeline_v2 = dd.pipeline()
pipeline_v2.add_stage("generated", config_gen, num_records=1000)
pipeline_v2.add_stage("judged", config_judge_v2)
results_v2 = pipeline_v2.run(resume=True)  # skips stage 1
```
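
The plan does not say how `resume=True` decides which stages to skip. One possible approach, sketched here purely as an assumption, is to fingerprint each stage (name, config, record count) and rerun from the first stage whose fingerprint changed. `stage_fingerprint` and `plan_resume` are hypothetical helpers, and configs are modelled as plain dictionaries.

```python
import hashlib
import json


def stage_fingerprint(name, config, num_records):
    """Stable hash of everything that determines a stage's output (assumed)."""
    payload = json.dumps(
        {"name": name, "config": config, "num_records": num_records},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def plan_resume(stages, previous_fingerprints):
    """Return the names of stages that must rerun.

    Once one stage has changed, every later stage reruns too, because its
    seed data may differ.
    """
    to_run = []
    dirty = False
    for name, config, num_records in stages:
        fingerprint = stage_fingerprint(name, config, num_records)
        if dirty or previous_fingerprints.get(name) != fingerprint:
            dirty = True
            to_run.append(name)
    return to_run


v1 = [
    ("generated", {"model": "fast"}, 1000),
    ("judged", {"model": "strong", "rubric": "v1"}, None),
]
saved = {name: stage_fingerprint(name, cfg, n) for name, cfg, n in v1}

# Only the judging config changes in the second run.
v2 = [v1[0], ("judged", {"model": "strong", "rubric": "v2"}, None)]
stages_to_run = plan_resume(v2, saved)
```

Under this scheme the second run skips `generated` and reruns only `judged`, which matches the behaviour described in the comment on `run(resume=True)`.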

### 4. Interactive notebook chaining (lightweight, no pipeline)

Quick iteration using `to_config_builder()`:

```python
result = dd.create(config_personas, num_records=50)
result.load_dataset()  # inspect, looks good

# Chain into next step
config_2 = (
    result.to_config_builder(columns=["name", "background"])
    .add_column(name="question", column_type="llm_text", prompt="...")
)
result_2 = dd.create(config_2, num_records=200)  # explode: 50 -> 200
```

## Implementation phases

-### Phase 1: Pipeline class (can ship independently)
+### Phase 1: Pipeline class and `to_config_builder()` (can ship independently)

- Add `to_config_builder()` on `DatasetCreationResults` and `PreviewResults`.
- Add `Pipeline` class with `add_stage()`, `run()`, between-stage callbacks.
- Add `pipeline-metadata.json` writing.
- Add `dd.pipeline()` factory method on `DataDesigner`.
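
To make the Phase 1 scope concrete, here is a rough sketch of how `add_stage()`, `run()`, between-stage callbacks, and `pipeline-metadata.json` might fit together. Everything in it is an assumption for illustration: `PipelineSketch` is hypothetical, `build(seed_path, out_dir)` stands in for a `DatasetBuilder.build()` call, the metadata schema is invented, JSON files replace parquet, and an `after` callback is assumed to receive the previous stage's output path and return the path that seeds its own stage.

```python
import json
import tempfile
from pathlib import Path


class PipelineSketch:
    """Hypothetical stand-in for the proposed Pipeline class."""

    def __init__(self, root):
        self.root = Path(root)
        self.stages = []

    def add_stage(self, name, build, after=None):
        self.stages.append((name, build, after))
        return self

    def run(self):
        seed_path, results, metadata = None, {}, []
        for name, build, after in self.stages:
            # Between-stage callback: transform the previous output first.
            if after is not None and seed_path is not None:
                seed_path = after(seed_path)
            out_dir = self.root / name
            out_dir.mkdir(parents=True, exist_ok=True)
            build(seed_path, out_dir)
            metadata.append({
                "stage": name,
                "seed": str(seed_path) if seed_path else None,
                "output": str(out_dir),
            })
            results[name] = out_dir
            seed_path = out_dir  # the next stage is seeded from this output
        meta_file = self.root / "pipeline-metadata.json"
        meta_file.write_text(json.dumps({"stages": metadata}, indent=2))
        return results


def make_numbers(seed_path, out_dir):
    (out_dir / "data.json").write_text(json.dumps(list(range(10))))


def keep_even(stage_output_path):
    rows = json.loads((stage_output_path / "data.json").read_text())
    out = stage_output_path.parent / "filtered"
    out.mkdir(exist_ok=True)
    (out / "data.json").write_text(json.dumps([r for r in rows if r % 2 == 0]))
    return out


def square(seed_path, out_dir):
    rows = json.loads((seed_path / "data.json").read_text())
    (out_dir / "data.json").write_text(json.dumps([r * r for r in rows]))


root = Path(tempfile.mkdtemp())
pipeline = PipelineSketch(root)
pipeline.add_stage("candidates", make_numbers)
pipeline.add_stage("enriched", square, after=keep_even)
results = pipeline.run()

enriched = json.loads((results["enriched"] / "data.json").read_text())
metadata = json.loads((root / "pipeline-metadata.json").read_text())
```

The runner itself knows nothing about what a stage does, which mirrors the statement that the engine does not know about pipelines: each stage is just a build call plus a path handed to the next one.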
