From 002750f7c4f6d15ddead7f78ebf1c243af402121 Mon Sep 17 00:00:00 2001 From: Nabin Mulepati Date: Wed, 25 Mar 2026 13:17:45 -0600 Subject: [PATCH 1/9] save progress --- .agents/skills/create-pr/SKILL.md | 53 ++++++---- .agents/skills/review-code/SKILL.md | 13 +++ .github/ISSUE_TEMPLATE/bug-report.yml | 20 ++++ .github/ISSUE_TEMPLATE/config.yml | 5 +- .github/ISSUE_TEMPLATE/development-task.yml | 16 ++- .github/ISSUE_TEMPLATE/feature-request.yml | 16 +++ .github/PULL_REQUEST_TEMPLATE.md | 19 ++++ architecture/agent-introspection.md | 79 +++++++++++++-- architecture/cli.md | 81 +++++++++++++-- architecture/config.md | 74 ++++++++++++-- architecture/dataset-builders.md | 104 ++++++++++++++++++-- architecture/engine.md | 87 ++++++++++++++-- architecture/mcp.md | 60 +++++++++-- architecture/models.md | 78 +++++++++++++-- architecture/overview.md | 68 +++++++++++-- architecture/plugins.md | 94 ++++++++++++++++-- architecture/sampling.md | 70 +++++++++++-- plans/427/pr-2-status.md | 91 +++++++++++++++++ 18 files changed, 916 insertions(+), 112 deletions(-) create mode 100644 .github/PULL_REQUEST_TEMPLATE.md create mode 100644 plans/427/pr-2-status.md diff --git a/.agents/skills/create-pr/SKILL.md b/.agents/skills/create-pr/SKILL.md index b3f4fe941..6ff013511 100644 --- a/.agents/skills/create-pr/SKILL.md +++ b/.agents/skills/create-pr/SKILL.md @@ -1,6 +1,6 @@ --- name: create-pr -description: Create a GitHub PR with a well-formatted description including summary, categorized changes, and attention areas +description: Create a GitHub PR with a well-formatted description matching the repository PR template argument-hint: [special instructions] disable-model-invocation: true metadata: @@ -9,7 +9,7 @@ metadata: # Create Pull Request -Create a well-formatted GitHub pull request for the current branch. +Create a well-formatted GitHub pull request for the current branch. The PR description must conform to the repository's PR template (`.github/PULL_REQUEST_TEMPLATE.md`). 
## Arguments @@ -75,35 +75,44 @@ If commits have mixed types, use the primary/most significant type. git push -u origin ``` -2. **Create PR** using this template: +2. **Build the PR body** using the repository's template structure: ```markdown ## 📋 Summary -[1-2 sentence overview of what this PR accomplishes] +[1-3 sentences: what this PR does and why. Focus on the "why".] + +## 🔗 Related Issue + +[Fixes #NNN or Closes #NNN — link to the issue this addresses] ## 🔄 Changes -### ✨ Added -- [New features/files - link to key files when helpful] +- [Bullet list of key changes, grouped logically] +- [Link to key files when helpful for reviewers] +- [Reference commits for specific changes in multi-commit PRs] + +## 🧪 Testing + +- [x] `make test` passes +- [x] Unit tests added/updated (or: N/A — no testable logic) +- [ ] E2E tests added/updated (if applicable) -### 🔧 Changed -- [Modified functionality - reference commits for specific changes] +## ✅ Checklist -### 🗑️ Removed -- [Deleted items] +- [x] Follows commit message conventions +- [x] Commits are signed off (DCO) +- [ ] Architecture docs updated (if applicable) +``` -### 🐛 Fixed -- [Bug fixes - if applicable] +If there are genuinely important attention areas for reviewers, add an **Attention Areas** section after Changes: +```markdown ## 🔍 Attention Areas > ⚠️ **Reviewers:** Please pay special attention to the following: -- [`path/to/critical/file.py`](https://github.com///blob//path/to/critical/file.py) - [Why this needs attention] - ---- -🤖 *Generated with AI* +- [`path/to/critical/file.py`](https://github.com///blob//path/to/critical/file.py) — [Why this needs attention] ``` 3. **Execute**: @@ -118,15 +127,18 @@ If commits have mixed types, use the primary/most significant type.
## Section Guidelines -- **Summary**: Always include - be concise and focus on the "why" -- **Changes**: Group by type, omit empty sections +- **Summary**: Always include — be concise and focus on the "why", not just the "what" +- **Related Issue**: Always include if an issue exists. Use `Fixes #NNN` for bugs, `Closes #NNN` for features/tasks +- **Changes**: Bullet list grouped logically. Omit trivial changes (formatting, imports) unless they are the point of the PR +- **Testing**: Check off items that apply. Mark N/A items explicitly rather than leaving them unchecked without explanation +- **Checklist**: Check off items that are true. Leave unchecked with a note if something doesn't apply - **Attention Areas**: Only include if there are genuinely important items; omit for simple PRs - **Links**: Include links to code and commits where helpful for reviewers: - - **File links require full URLs** - relative paths don't work in PR descriptions + - **File links require full URLs** — relative paths don't work in PR descriptions - Link to a file: `[filename](https://github.com///blob//path/to/file.py)` - Link to specific lines: `[description](https://github.com///blob//path/to/file.py#L42-L50)` - Use the branch name (from Step 1) in the URL so links point to the PR's version of files - - Reference commits: `abc1234` - GitHub auto-links short commit SHAs in PR descriptions + - Reference commits: `abc1234` — GitHub auto-links short commit SHAs in PR descriptions - For multi-commit PRs, reference individual commits when describing specific changes ## Edge Cases @@ -135,3 +147,4 @@ If commits have mixed types, use the primary/most significant type.
- **Uncommitted work**: Warn and ask before proceeding - **Large PRs** (>20 files): Summarize by directory/module - **Single commit**: PR title can match commit message +- **No related issue**: Note "N/A" in the Related Issue section rather than omitting it diff --git a/.agents/skills/review-code/SKILL.md b/.agents/skills/review-code/SKILL.md index 910664269..326a909f4 100644 --- a/.agents/skills/review-code/SKILL.md +++ b/.agents/skills/review-code/SKILL.md @@ -106,6 +106,7 @@ Before diving into details, build a mental model: 4. **Identify the primary goal** (feature, refactor, bugfix, etc.) 5. **Note cross-cutting concerns** (e.g., a rename that touches many files vs. substantive logic changes) 6. **Check existing feedback** (PR mode): inspect both inline comments (Step 1, item 5) and PR-level review bodies (Step 1, item 5b) so you don't duplicate feedback already given +7. **Check PR template conformance** (PR mode): verify the PR description includes the required template sections (📋 Summary, 🔗 Related Issue, 🔄 Changes, 🧪 Testing, ✅ Checklist). Flag missing or empty sections as a warning in the review. The template lives at `.github/PULL_REQUEST_TEMPLATE.md` ## Step 4: Review Each Changed File (Multi-Pass) @@ -242,6 +243,18 @@ Separate each finding with a blank line. Use bold file-and-title as a heading li #### Suggestions — Consider improving > Style improvements, minor simplifications, or optional enhancements that would improve code quality.
+### PR Template Conformance (PR mode only) + +Check that the PR description follows the repository template (`.github/PULL_REQUEST_TEMPLATE.md`): + +- **📋 Summary** — present and describes the "why", not just the "what" +- **🔗 Related Issue** — links to an issue (`Fixes #NNN` or `Closes #NNN`), or explicitly states N/A +- **🔄 Changes** — bullet list of key changes +- **🧪 Testing** — checklist items checked or marked N/A with explanation +- **✅ Checklist** — all items addressed + +Flag missing or empty sections as a warning. This check is skipped in branch mode. + ### What Looks Good Call out 2-3 things done well (good abstractions, thorough tests, clean refactoring, etc.). Positive feedback is part of a good review. diff --git a/.github/ISSUE_TEMPLATE/bug-report.yml b/.github/ISSUE_TEMPLATE/bug-report.yml index b5b4232a2..cd48c5cb4 100644 --- a/.github/ISSUE_TEMPLATE/bug-report.yml +++ b/.github/ISSUE_TEMPLATE/bug-report.yml @@ -44,8 +44,28 @@ body: placeholder: A clear and concise description of what you expected to happen. validations: required: true + - type: textarea + id: agent-diagnostic + attributes: + label: Agent Diagnostic / Prior Investigation + description: | + If you used an agent, paste the output from its investigation (e.g., from the `search-docs` or `search-github` skills). + If you couldn't or didn't use an agent, briefly say why and include the troubleshooting you already tried. + placeholder: | + Paste agent output here, or describe the manual investigation you performed. - type: textarea id: context attributes: label: Additional context placeholder: Add any other context about the problem here (e.g., screenshots, logs, browser version).
+ - type: checkboxes + id: checklist + attributes: + label: Checklist + options: + - label: I reproduced this issue or provided a minimal example + required: true + - label: I searched the docs/issues myself, or had my agent do so + required: true + - label: If I used an agent, I included its diagnostics above + required: false diff --git a/.github/ISSUE_TEMPLATE/config.yml b/.github/ISSUE_TEMPLATE/config.yml index c88eeb6d4..a238e8be1 100644 --- a/.github/ISSUE_TEMPLATE/config.yml +++ b/.github/ISSUE_TEMPLATE/config.yml @@ -2,4 +2,7 @@ blank_issues_enabled: false contact_links: - name: 💬 Ask a Question url: https://github.com/NVIDIA-NeMo/DataDesigner/discussions - about: Please use GitHub Discussions for general questions. + about: >- + Have a question? Try pointing your agent at the repo first — it can search docs, + find issues, and more. See CONTRIBUTING.md for the recommended workflow. + If that doesn't help, use GitHub Discussions. diff --git a/.github/ISSUE_TEMPLATE/development-task.yml b/.github/ISSUE_TEMPLATE/development-task.yml index d9711cb2d..2f1417267 100644 --- a/.github/ISSUE_TEMPLATE/development-task.yml +++ b/.github/ISSUE_TEMPLATE/development-task.yml @@ -1,5 +1,5 @@ name: 🛠️ Development Task -description: Track internal development work, refactoring, or infrastructure +description: Track internal development work, refactoring, or infrastructure changes labels: ["task"] body: - type: dropdown @@ -25,6 +25,20 @@ body: attributes: label: Technical Details & Implementation Plan placeholder: Describe the technical approach, files affected, or logic changes. + - type: textarea + id: investigation + attributes: + label: Investigation / Context + description: | + Relevant issue links, architecture context, or notes from prior exploration. + placeholder: Link related issues, reference architecture docs, or describe relevant context.
+ - type: textarea + id: agent-plan + attributes: + label: Agent Plan / Findings + description: | + If an agent investigated this task, paste its findings or proposed plan here. + placeholder: Paste agent output here, if applicable. - type: input id: dependencies attributes: diff --git a/.github/ISSUE_TEMPLATE/feature-request.yml b/.github/ISSUE_TEMPLATE/feature-request.yml index c8e8c60c0..12132f0c9 100644 --- a/.github/ISSUE_TEMPLATE/feature-request.yml +++ b/.github/ISSUE_TEMPLATE/feature-request.yml @@ -38,8 +38,24 @@ body: attributes: label: Describe alternatives you've considered placeholder: A clear and concise description of any alternative solutions or features you've considered. + - type: textarea + id: agent-investigation + attributes: + label: Agent Investigation + description: | + If your agent explored the codebase to assess feasibility (e.g., using the `search-docs` or `search-github` skills), paste its findings here. + placeholder: Paste agent output here, if applicable. - type: textarea id: context attributes: label: Additional context placeholder: Add any other context or screenshots about the feature request here. 
+ - type: checkboxes + id: checklist + attributes: + label: Checklist + options: + - label: I've reviewed existing issues and the documentation + required: true + - label: This is a design proposal, not a "please build this" request + required: true diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md new file mode 100644 index 000000000..29b081333 --- /dev/null +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -0,0 +1,19 @@ +## 📋 Summary + + +## 🔗 Related Issue + + +## 🔄 Changes + + +## 🧪 Testing + +- [ ] `make test` passes +- [ ] Unit tests added/updated +- [ ] E2E tests added/updated (if applicable) + +## ✅ Checklist +- [ ] Follows commit message conventions +- [ ] Commits are signed off (DCO) +- [ ] Architecture docs updated (if applicable) diff --git a/architecture/agent-introspection.md b/architecture/agent-introspection.md index 5f714ca3d..5e97073c9 100644 --- a/architecture/agent-introspection.md +++ b/architecture/agent-introspection.md @@ -1,21 +1,82 @@ # Agent Introspection -> Stub — to be populated. See source code at `packages/data-designer/src/data_designer/cli/`. +The agent introspection subsystem provides machine-readable CLI commands that let agents discover DataDesigner's type system, configuration state, and available operations at runtime. + +Source: `packages/data-designer/src/data_designer/cli/commands/agent.py` and `packages/data-designer/src/data_designer/cli/utils/agent_introspection.py` ## Overview - + +Agent introspection solves a specific problem: agents working with DataDesigner need to know what column types, sampler types, validator types, and processor types are available — including any installed plugins. Rather than hardcoding this knowledge or parsing source code, agents can call `data-designer agent` commands to get structured, up-to-date information.
## Key Components - + +### Commands + +All commands live under the `data-designer agent` group: + +| Command | Purpose | +|---------|---------| +| `data-designer agent context` | Full context dump: version, paths, type catalogs, model aliases, persona state, available operations | +| `data-designer agent types [family]` | Type catalog for one or all families, with descriptions and source file locations | +| `data-designer agent state model-aliases` | Configured model aliases with usability status (missing provider, missing API key, etc.) | +| `data-designer agent state persona-datasets` | Available persona datasets with download status per locale | + +### FamilySpec + +Maps a **family name** to a **discriminated union type** and its **discriminator field**: + +| Family | Union Type | Discriminator | +|--------|-----------|---------------| +| `column-types` | `ColumnConfigT` | `column_type` | +| `sampler-types` | `SamplerParamsT` | `sampler_type` | +| `validator-types` | `ValidatorParamsT` | `validator_type` | +| `processor-types` | `ProcessorConfigT` | `processor_type` | +| `constraint-types` | `ColumnConstraintT` | `constraint_type` | + +### Type Discovery + +`discover_family_types` walks `typing.get_args(type_union)`, reads each Pydantic model's discriminator field annotation (must be `Literal[...]`), and builds a map of discriminator string → model class. Detects and reports duplicate discriminator values. + +`get_family_catalog` yields the class name and first docstring paragraph for each type — enough for an agent to understand what each type does without reading source code. + +`get_family_source_files` uses `inspect.getfile` and normalizes paths under `data_designer/` (absolute path fallback for plugin types outside the tree).
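As a rough illustration of the discovery step described above, the sketch below walks a union with `typing.get_args` and reads each member's `Literal` discriminator. It uses plain annotated classes instead of Pydantic models so that it runs on the standard library alone. `SamplerConfig`, `TextConfig`, `ColumnUnion`, `discover_types`, and `first_doc_paragraph` are hypothetical names chosen for this example, not the project's actual code.

```python
from typing import Literal, Union, get_args, get_origin, get_type_hints


# Hypothetical stand-ins for config classes; the real ones are Pydantic models.
class SamplerConfig:
    """Draws values from a statistical sampler.

    Longer notes that a catalog would skip.
    """

    column_type: Literal["sampler"]


class TextConfig:
    """Generates free text with an LLM."""

    column_type: Literal["llm-text"]


ColumnUnion = Union[SamplerConfig, TextConfig]


def discover_types(type_union: object, discriminator: str) -> dict[str, type]:
    """Map each discriminator string to the union member that declares it."""
    found: dict[str, type] = {}
    for member in get_args(type_union):
        annotation = get_type_hints(member).get(discriminator)
        if get_origin(annotation) is not Literal:
            raise TypeError(f"{member.__name__}.{discriminator} must be Literal[...]")
        for value in get_args(annotation):
            if value in found:
                raise ValueError(f"duplicate discriminator value: {value!r}")
            found[value] = member
    return found


def first_doc_paragraph(cls: type) -> str:
    """Return the first docstring paragraph as a short catalog description."""
    doc = (cls.__doc__ or "").strip()
    return doc.split("\n\n")[0]


catalog = {
    name: first_doc_paragraph(cls)
    for name, cls in discover_types(ColumnUnion, "column_type").items()
}
```

Because the map is derived from the union itself, any class added to the union shows up in the catalog without a second inventory to maintain.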
+ +### State Commands + +Reuse the CLI's repository stack: +- **Model aliases**: `ModelRepository` + `ProviderRepository` + `get_providers_with_missing_api_keys` to report usability status (configured, missing provider, missing API key) +- **Personas**: `PersonaRepository` + `DownloadService` for locale availability and download status + +### Error Handling + +`AgentIntrospectionError` carries a `code`, `message`, and `details` dict. Commands catch these and output structured error information to stderr with exit code 1, making errors parseable by agents. + +### Command Registration + +`AGENT_COMMANDS` in `agent_command_defs.py` drives both the lazy Typer command map in `main.py` and `get_operations()` in introspection. This single source of truth ensures the operations table in `agent context` output stays in sync with the actual commands. ## Data Flow - + +``` +Agent calls: data-designer agent types column-types + → Typer dispatches to agent.get_types("column-types") + → FamilySpec maps "column-types" → ColumnConfigT union + → discover_family_types walks union members + → get_family_catalog extracts names + descriptions + → get_family_source_files resolves source locations + → Formatted output returned to agent +``` ## Design Decisions - + +- **Declarative type discovery from Pydantic unions** rather than maintaining a separate type inventory. The discriminated unions are the source of truth for what types exist (including plugins), so introspection reads directly from them. +- **Structured errors with codes** enable agents to handle failures programmatically (retry, report, escalate) rather than parsing human-readable error messages. +- **Single command registration source** (`AGENT_COMMANDS`) prevents the operations table from drifting out of sync with actual CLI commands. +- **Source file resolution** helps agents navigate to implementations when they need to understand a type's behavior, not just its existence.
## Cross-References - -- [System Architecture](overview.md) -- [CLI](cli.md) -- [Config Layer](config.md) + +- [System Architecture](overview.md) — where agent introspection fits +- [CLI](cli.md) — the CLI architecture that hosts these commands +- [Config Layer](config.md) — the discriminated unions that introspection reads +- [Plugins](plugins.md) — how plugin types appear in introspection results diff --git a/architecture/cli.md b/architecture/cli.md index 12b4d99b5..db171691d 100644 --- a/architecture/cli.md +++ b/architecture/cli.md @@ -1,20 +1,85 @@ # CLI -> Stub — to be populated. See source code at `packages/data-designer/src/data_designer/cli/`. +The CLI (`data-designer`) provides an interactive command-line interface for configuring models, providers, tools, and personas, as well as running dataset generation. It uses a layered architecture for config management and delegates generation to the public `DataDesigner` API. + +Source: `packages/data-designer/src/data_designer/cli/` ## Overview - + +The CLI is built on Typer with lazy command loading to keep startup fast. Config management commands follow a **command → controller → service → repository** layering pattern. Generation commands bypass this stack and use the public `DataDesigner` class directly. ## Key Components - + +### Entry Point + +`data-designer` is registered as a console script pointing to `data_designer.cli.main:main`. On startup: +1. `ensure_cli_default_model_settings()` initializes default model/provider configs +2. `app()` launches the Typer application + +### Lazy Command Loading + +`create_lazy_typer_group` and `_LazyCommand` stubs defer importing command modules until a command is actually invoked. This keeps `data-designer --help` fast — only the command names and descriptions are loaded eagerly; the full module (and its dependencies) loads on first use.
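The deferred-import idea above can be sketched without Typer. The block below is a minimal illustration, assuming a stub that stores only a name, help text, and an import path. `LazyCommand` and `COMMANDS` are hypothetical names, and a standard-library function stands in for a heavy command module; the project's `_LazyCommand` may differ in its details.

```python
import importlib
from typing import Any, Callable, Optional


class LazyCommand:
    """Holds a name and help text eagerly; imports the implementation on first call."""

    def __init__(self, name: str, help_text: str, module_path: str, attr: str) -> None:
        self.name = name
        self.help_text = help_text
        self._module_path = module_path
        self._attr = attr
        self._target: Optional[Callable[..., Any]] = None

    @property
    def loaded(self) -> bool:
        return self._target is not None

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        if self._target is None:
            # The import cost is paid here, not at CLI startup.
            module = importlib.import_module(self._module_path)
            self._target = getattr(module, self._attr)
        return self._target(*args, **kwargs)


# Hypothetical registry; json.dumps stands in for a real command function.
COMMANDS = {
    "dump": LazyCommand("dump", "Serialize a value to JSON", "json", "dumps"),
}

# Building help output touches only the eager metadata.
help_lines = [f"{cmd.name}: {cmd.help_text}" for cmd in COMMANDS.values()]
loaded_before = COMMANDS["dump"].loaded
result = COMMANDS["dump"]({"rows": 3})
loaded_after = COMMANDS["dump"].loaded
```

Listing commands never triggers the import, which is the property that keeps a help screen cheap even when individual commands pull in large dependencies.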
+ +### Layering Pattern (Config Management) + +Config management commands (models, providers, tools, personas) follow a consistent four-layer pattern: + +| Layer | Role | Example | +|-------|------|---------| +| **Command** | Thin Typer entry, wires `DATA_DESIGNER_HOME` | `models_command` → `ModelController(DATA_DESIGNER_HOME).run()` | +| **Controller** | UX flow: menus, forms, success/error display | `ModelController` composes repos + services + `ModelFormBuilder` | +| **Service** | Domain rules: uniqueness, merge, delete-all | `ModelService.add/update/delete` over `ModelRepository` | +| **Repository** | File I/O for typed config registries | `ModelRepository` extends `ConfigRepository[ModelConfigRegistry]` | + +Repositories: `ModelRepository`, `ProviderRepository`, `ToolRepository`, `MCPProviderRepository`, `PersonaRepository`. + +Services mirror the repository domains with business logic (validation, conflict resolution). + +### Generation Commands + +`preview`, `create`, and `validate` commands use `GenerationController`, which: +1. Loads config via `load_config_builder` +2. Calls `DataDesigner.preview()`, `DataDesigner.create()`, or `DataDesigner.validate()` directly +3. Handles output display and error formatting + +This keeps generation aligned with the public Python API — the CLI is a thin wrapper, not a separate code path.
+ +### UI Utilities + +- `cli/ui.py` — Rich console helpers for formatted output +- `cli/forms/` — interactive form builders for config creation/editing +- `cli/utils/config_loader.py` — config file resolution and loading +- `sample_records_pager.py` — paginated display of generated records ## Data Flow - + +### Config Management +``` +User invokes command (e.g., `data-designer models add`) + → Command function wires DATA_DESIGNER_HOME + → Controller presents interactive form + → Service validates and applies changes + → Repository reads/writes config files +``` + +### Generation +``` +User invokes command (e.g., `data-designer create config.yaml`) + → GenerationController loads config + → DataDesigner.create() runs the full pipeline + → Results displayed via Rich console +``` ## Design Decisions - + +- **Lazy command loading** keeps CLI startup under ~200ms regardless of how many commands exist. Heavy imports (engine, models) only load when the relevant command runs. +- **Controller/service/repo for config, direct API for generation** — config management benefits from the layered pattern (testable services, swappable repositories). Generation doesn't need this indirection; it delegates to the same `DataDesigner` class that Python users call directly. +- **`DATA_DESIGNER_HOME`** centralizes all CLI-managed state (model configs, provider configs, tool configs, personas) in a single directory, defaulting to `~/.data_designer/`. +- **Rich-based UI** provides formatted tables, progress bars, and interactive prompts without requiring a web interface.
## Cross-References - -- [System Architecture](overview.md) -- [Agent Introspection](agent-introspection.md) + +- [System Architecture](overview.md) — where the CLI fits +- [Agent Introspection](agent-introspection.md) — the `agent` command group +- [Config Layer](config.md) — config objects the CLI manages +- [Models](models.md) — model/provider configuration diff --git a/architecture/config.md b/architecture/config.md index 894211d67..497cd1dee 100644 --- a/architecture/config.md +++ b/architecture/config.md @@ -1,21 +1,77 @@ # Config Layer -> Stub — to be populated. See source code at `packages/data-designer-config/src/data_designer/config/`. +The config layer (`data_designer.config`) defines the declarative surface of DataDesigner. Users describe what their data should look like; the config layer validates and structures those declarations. It never calls the engine directly. + +Source: `packages/data-designer-config/src/data_designer/config/` ## Overview - + +The config layer provides: +- **`DataDesignerConfigBuilder`** — fluent builder for constructing dataset configs +- **`DataDesignerConfig`** — the root config object holding columns, models, constraints, processors, and profilers +- **Column configs** — a discriminated union of Pydantic models, one per column type +- **Model configs** — LLM endpoint configuration with inference parameters +- **Sampler params** — statistical generator parameters with their own discriminated union +- **Plugin integration** — runtime extension of config unions via entry-point plugins ## Key Components - + +### Builder API + +`DataDesignerConfigBuilder` is the primary construction surface. It holds mutable state (column configs, constraints, processors) and produces an immutable `DataDesignerConfig` on `build()`.
+ +- **Fluent mutators**: `add_column`, `add_constraint`, `add_processor`, `add_profiler`, `add_model_config`, `add_tool_config`, `with_seed_dataset` +- **Column shorthand**: pass `name` + `column_type` + kwargs instead of a full config instance; the builder resolves the correct config class via `get_column_config_from_kwargs` +- **Config loading**: `from_config` accepts dicts, file paths, URLs, or `BuilderConfig` objects; normalizes shorthand formats into the full structure + +`BuilderConfig` wraps `DataDesignerConfig` with a `library_version` field validated against the running version. + +### Column Configs + +All column configs inherit from `SingleColumnConfig(ConfigBase, ABC)`, which provides `name`, `drop`, `allow_resize`, and the `column_type` discriminator field. + +Concrete types include: `SamplerColumnConfig`, `LLMTextColumnConfig`, `LLMStructuredColumnConfig`, `LLMCodeColumnConfig`, `LLMJudgeColumnConfig`, `EmbeddingColumnConfig`, `ImageColumnConfig`, `ValidationColumnConfig`, `ExpressionColumnConfig`, `SeedDatasetColumnConfig`, `CustomColumnConfig`. + +Each fixes `column_type: Literal["..."]` with a kebab-case string. The full union `ColumnConfigT` is built at module load time and extended by plugins. + +### Discriminated Unions + +Pydantic discriminated unions are the backbone of config deserialization: + +- **`DataDesignerConfig.columns`**: `list[Annotated[ColumnConfigT, Field(discriminator="column_type")]]` — picks the right config class from the `column_type` field +- **`SamplerColumnConfig.params`**: `Annotated[SamplerParamsT, Discriminator("sampler_type")]` — nested discrimination for sampler parameters +- **`InferenceParamsT`**: discriminated on `generation_type` (chat completion, embedding, image) + +A `model_validator(mode="before")` on `SamplerColumnConfig` injects `sampler_type` into nested param dicts when users omit it, enabling a cleaner shorthand.
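The shorthand described above amounts to normalizing the raw input before dispatching on the discriminator. The sketch below imitates that with dataclasses and a plain function in place of a Pydantic before-validator, so it runs on the standard library alone. It assumes the column carries its own `sampler_type` key next to `params`; `UniformParams`, `CategoryParams`, `PARAMS_BY_TYPE`, `inject_sampler_type`, and `parse_params` are hypothetical names for this example.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class UniformParams:
    low: float
    high: float
    sampler_type: str = "uniform"


@dataclass
class CategoryParams:
    values: list
    sampler_type: str = "category"


# Hypothetical lookup from discriminator value to params class.
PARAMS_BY_TYPE = {"uniform": UniformParams, "category": CategoryParams}


def inject_sampler_type(column: dict[str, Any]) -> dict[str, Any]:
    """Copy the column's sampler_type into params when the user left it out."""
    params = column.get("params")
    if isinstance(params, dict) and "sampler_type" not in params:
        # Build new dicts so the caller's input is not mutated.
        column = {**column, "params": {**params, "sampler_type": column["sampler_type"]}}
    return column


def parse_params(column: dict[str, Any]) -> Any:
    """Normalize first, then dispatch on the discriminator value."""
    params = inject_sampler_type(column)["params"]
    return PARAMS_BY_TYPE[params["sampler_type"]](**params)


shorthand = {"name": "age", "sampler_type": "uniform", "params": {"low": 18, "high": 65}}
parsed = parse_params(shorthand)
```

Running the normalization before validation means the discriminated parse always sees the tag, whether or not the user wrote it.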
+ +### Model Configs + +`ModelConfig` holds `alias`, `model`, `inference_parameters` (discriminated), optional `provider`, and `skip_health_check`. Inference parameters support distribution-valued fields (`temperature`, `top_p` can be `UniformDistribution` or `ManualDistribution` with a `sample()` method). + +`ModelProvider` configures the endpoint: URL, provider type (default `openai`), auth, headers, extra body parameters. + +### ConfigBase + +`ConfigBase` is the shared Pydantic base: `extra="forbid"`, enums serialized as values. It must not import other `data_designer.*` modules to keep it as a minimal dependency island. ## Data Flow - + +1. User calls builder methods or loads YAML/JSON +2. Builder resolves column type → config class via `get_column_config_cls_from_type` (built-in map, then plugin fallback) +3. For sampler columns, `_resolve_sampler_kwargs` maps `sampler_type` → params class via `SAMPLER_PARAMS` +4. `build()` triggers Pydantic validation on the full `DataDesignerConfig` +5. The validated config is passed to the engine for compilation and execution ## Design Decisions - + +- **Config objects are data, not behavior.** They define structure and constraints but never call the engine. This keeps the dependency direction clean (engine depends on config, not the reverse). +- **Discriminated unions over class hierarchies** for column types. Pydantic handles deserialization dispatch; adding a new type means adding a config class with the right `Literal` discriminator, not modifying a factory. +- **Plugin injection at the type level.** `PluginManager.inject_into_column_config_type_union` ORs plugin config classes into `ColumnConfigT` so Pydantic validation and static typing stay aligned with installed plugins. +- **Lazy imports via `__getattr__`.** `data_designer.config.__init__` maps public names to `(module_path, attribute_name)` and loads on first access, keeping `import data_designer.config` fast.
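The last bullet relies on module-level `__getattr__` (PEP 562). A minimal sketch of that pattern follows. To stay self-contained it attaches the hook to a module object built at runtime, whereas a real package would define `__getattr__` at the top level of its `__init__.py`. `_LAZY`, `_make_lazy_module`, and `fake_config` are hypothetical names, and standard-library targets stand in for the package's submodules.

```python
import importlib
import types

# Public name -> (module_path, attribute_name); stdlib targets stand in for submodules.
_LAZY = {
    "OrderedDict": ("collections", "OrderedDict"),
    "dumps": ("json", "dumps"),
}


def _make_lazy_module(name: str, lazy_map: dict) -> types.ModuleType:
    """Build a module whose listed attributes are imported on first access."""
    module = types.ModuleType(name)

    def __getattr__(attr: str):
        # Only called when normal attribute lookup on the module fails.
        try:
            module_path, attribute_name = lazy_map[attr]
        except KeyError:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}") from None
        value = getattr(importlib.import_module(module_path), attribute_name)
        setattr(module, attr, value)  # cache so later lookups skip this hook
        return value

    module.__getattr__ = __getattr__
    return module


pkg = _make_lazy_module("fake_config", _LAZY)
cached_before = "dumps" in vars(pkg)
encoded = pkg.dumps([1])
cached_after = "dumps" in vars(pkg)
```

Caching the resolved attribute on the module keeps the hook off the hot path: after the first access, lookups behave like ordinary attribute reads.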
## Cross-References - -- [System Architecture](overview.md) -- [Engine Layer](engine.md) -- [Plugins](plugins.md) + +- [System Architecture](overview.md) — package relationships and data flow +- [Engine Layer](engine.md) — how configs are compiled and executed +- [Plugins](plugins.md) — entry-point discovery and union injection +- [Sampling](sampling.md) — sampler parameter types and constraints diff --git a/architecture/dataset-builders.md b/architecture/dataset-builders.md index bf055c7af..b2de64405 100644 --- a/architecture/dataset-builders.md +++ b/architecture/dataset-builders.md @@ -1,21 +1,107 @@ # Dataset Builders -> Stub — to be populated. See source code at `packages/data-designer-engine/src/data_designer/engine/dataset_builders/`. +The dataset builder subsystem orchestrates the end-to-end generation of a dataset from compiled column configs. It supports two execution modes: a sequential batch loop and an async DAG-based scheduler. + +Source: `packages/data-designer-engine/src/data_designer/engine/dataset_builders/` ## Overview - + +`DatasetBuilder` is the central orchestrator. It receives a compiled `DataDesignerConfig`, instantiates column generators from the registry, and executes them in dependency order. The execution mode is selected by the `DATA_DESIGNER_ASYNC_ENGINE` environment variable. + +Both modes produce the same output: batched parquet files managed by `DatasetBatchManager`, with post-generation processing and profiling. ## Key Components - + +### DatasetBuilder + +Entry point for generation. `build()` branches: +- **Sequential path** (default): `DatasetBatchManager.start` → batch loop → `_run_batch` per batch → `finish()` → `ProcessorRunner.run_after_generation` → `model_registry.log_model_usage` +- **Async path** (`DATA_DESIGNER_ASYNC_ENGINE=1`): `_prepare_async_run` → `AsyncTaskScheduler.run()` → telemetry and metadata + +### Sequential Execution (`_run_batch`) + +Iterates compiled column order.
For each generator: +1. `log_pre_generation()` — logs model and optional MCP tool alias +2. **From-scratch generators** (empty buffer): `generate_from_scratch` → optional `run_pre_batch` after first seed column +3. **`CELL_BY_CELL` generators**: `_fan_out_with_threads` or `_fan_out_with_async` — parallel cell generation +4. **`FULL_COLUMN` generators**: `generate` on the whole batch DataFrame; optional resize via `allow_resize` + +### Async Execution (`_build_async`) + +Preparation (`_prepare_async_run`): +1. Builds `gen_map` — maps each column name to its generator instance (multi-column configs share a single instance) +2. Creates `ExecutionGraph` from column dependencies +3. Partitions rows into row groups by `buffer_size` +4. Constructs `CompletionTracker`, `RowGroupBufferManager`, `AsyncTaskScheduler` +5. Hooks `ProcessorRunner` for pre-batch and post-batch stages + +`AsyncTaskScheduler` runs on a dedicated async loop with semaphore-based concurrency, salvage rounds for failed tasks, and order-dependent locks for columns that must execute sequentially.
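To make the semaphore and salvage-round ideas concrete, here is a small `asyncio` sketch. It is not the `AsyncTaskScheduler` implementation: `run_with_salvage` and its parameters are hypothetical, and it omits row groups, dependency tracking, and order-dependent locks. It only shows bounded concurrency plus a retry pass for tasks that failed in the previous round.

```python
import asyncio
from typing import Awaitable, Callable


async def run_with_salvage(
    tasks: dict[str, Callable[[], Awaitable[str]]],
    max_concurrency: int = 2,
    salvage_rounds: int = 1,
) -> tuple[dict[str, str], list[str]]:
    """Run tasks under a semaphore, then retry the failures in later rounds."""
    semaphore = asyncio.Semaphore(max_concurrency)
    results: dict[str, str] = {}
    pending = list(tasks)

    async def attempt(name: str) -> None:
        async with semaphore:  # caps the number of in-flight tasks
            results[name] = await tasks[name]()

    for _ in range(1 + salvage_rounds):
        outcomes = await asyncio.gather(
            *(attempt(name) for name in pending), return_exceptions=True
        )
        # Failed tasks wait until the whole round has finished, then run again.
        pending = [n for n, o in zip(pending, outcomes) if isinstance(o, Exception)]
        if not pending:
            break
    return results, pending


calls = {"flaky": 0}


async def ok() -> str:
    return "ok"


async def flaky() -> str:
    # Fails on the first call only, imitating a transient model error.
    calls["flaky"] += 1
    if calls["flaky"] == 1:
        raise RuntimeError("transient failure")
    return "recovered"


results, failed = asyncio.run(run_with_salvage({"a": ok, "b": flaky}))
```

Collecting failures with `return_exceptions=True` lets healthy tasks finish the round undisturbed, so a retry never blocks work that would have succeeded anyway.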
+ +### Execution Graph + +`ExecutionGraph` (in `dataset_builders/utils/execution_graph.py`) models column dependencies: +- Upstream/downstream sets derived from `required_columns` and side-effect columns +- `GenerationStrategy` per column (CELL_BY_CELL or FULL_COLUMN) +- Kahn topological sort for execution order +- `split_upstream_by_strategy` — separates batch-level from cell-level dependencies + +### CompletionTracker + +Tracks per-row-group, per-column completion state: +- **Cell-level**: completed cell indices for `CELL_BY_CELL` columns +- **Batch-level**: full-column completion flags for `FULL_COLUMN` columns +- **Frontier**: computes ready tasks when backed by `ExecutionGraph` +- Handles dropped rows and downstream task enqueuing + +### DAG (Config-Level) + +`dataset_builders/utils/dag.py` provides `topologically_sort_column_configs` — builds a NetworkX graph from `required_columns` and side-effect columns, returns a topological ordering. Used by both execution modes for initial column ordering.
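The dependency ordering described above can be sketched with Kahn's algorithm in a few lines. This version uses only the standard library rather than NetworkX, and it assumes every required column is itself a key of the input mapping. `topological_order` and the sample column names are hypothetical.

```python
from collections import deque


def topological_order(required: dict[str, set[str]]) -> list[str]:
    """Order columns so that each one comes after the columns it requires."""
    indegree = {col: len(deps) for col, deps in required.items()}
    downstream: dict[str, list[str]] = {col: [] for col in required}
    for col, deps in required.items():
        for dep in deps:
            downstream[dep].append(col)

    # Columns with no unmet requirements can run first.
    ready = deque(col for col in required if indegree[col] == 0)
    order: list[str] = []
    while ready:
        col = ready.popleft()
        order.append(col)
        for child in downstream[col]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    if len(order) != len(required):
        raise ValueError("cycle detected in column dependencies")
    return order


# Declared out of order on purpose: the sort does not depend on config order.
columns = {
    "judge": {"answer"},
    "answer": {"topic"},
    "topic": set(),
}
order = topological_order(columns)
```

A leftover column with a nonzero in-degree means the graph has a cycle, which is why the length check doubles as cycle detection.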
+ +### DatasetBatchManager + +Manages in-memory row buffers and persistence: +- `finish_batch` → writes parquet via `ArtifactStorage` +- Updates dataset metadata between batches +- The async path uses `RowGroupBufferManager` for per-row-group DataFrames and checkpointing ## Data Flow - + +### Sequential +``` +DatasetBuilder.build() + → DatasetBatchManager.start() + → for each batch: + → for each generator (topological order): + → generate_from_scratch / generate (FULL_COLUMN) / fan_out (CELL_BY_CELL) + → DatasetBatchManager.finish_batch() → parquet + → ProcessorRunner.run_after_generation() + → model_registry.log_model_usage() +``` + +### Async +``` +DatasetBuilder.build() + → _prepare_async_run() + → ExecutionGraph.create() + → CompletionTracker.with_graph() + → AsyncTaskScheduler(semaphores, salvage_rounds) + → scheduler.run() + → for each row group, dispatch ready tasks from frontier + → tasks execute generators, update CompletionTracker + → checkpoints via RowGroupBufferManager + → collect TaskTraces, emit telemetry +``` ## Design Decisions - + +- **Dual execution engines behind one API.** The sequential engine is simpler and easier to debug; the async engine adds row-group parallelism for throughput. Users switch via an environment variable without changing their code. +- **DAG-driven ordering** ensures columns with dependencies (e.g., a judge column that depends on a text column) are generated in the correct order, regardless of the order they appear in the config. +- **Salvage rounds in async mode** retry failed tasks after all other tasks in a round complete, improving resilience against transient LLM failures without blocking the entire generation. +- **Separate config-level and runtime DAGs.** The config-level DAG (`dag.py`) determines column ordering; the runtime `ExecutionGraph` adds strategy-aware dependency tracking for the async scheduler.
 ## Cross-References
-
-- [System Architecture](overview.md)
-- [Engine Layer](engine.md)
-- [Config Layer](config.md)
+
+- [System Architecture](overview.md) - end-to-end data flow
+- [Engine Layer](engine.md) - compilation and generator hierarchy
+- [Models](models.md) - how generators access LLMs
+- [Config Layer](config.md) - column configs and dependency declarations
diff --git a/architecture/engine.md b/architecture/engine.md
index 110ca18fb..6a4ff770f 100644
--- a/architecture/engine.md
+++ b/architecture/engine.md
@@ -1,22 +1,89 @@
 # Engine Layer
-> Stub - to be populated. See source code at `packages/data-designer-engine/src/data_designer/engine/`.
+The engine layer (`data_designer.engine`) compiles declarative configs into executable generation plans and runs them. It owns column generators, dataset builders, model access, MCP integration, sampling, validation, and profiling.
+
+Source: `packages/data-designer-engine/src/data_designer/engine/`
 ## Overview
-
+
+The engine is the largest package, organized into focused subsystems:
+
+| Subsystem | Path | Role |
+|-----------|------|------|
+| Column generators | `column_generators/` | Registry + concrete generators for each column type |
+| Dataset builders | `dataset_builders/` | Sync/async orchestration, DAG, batching |
+| Models | `models/` | Facade, registry, clients, parsers, recipes, usage |
+| MCP | `mcp/` | Tool registry, facade, I/O service |
+| Sampling | `sampling_gen/` | Schema, DAG, data sources, person/entity helpers |
+| Processing | `processing/` | Processors, Ginja (Jinja for generation), Gsonschema |
+| Validators | `validators/` | Runtime row/batch validation |
+| Analysis | `analysis/` | Dataset/column profiling |
+| Registry | `registry/` | Generic `TaskRegistry` base + `DataDesignerRegistry` aggregator |
+| Resources | `resources/` | Seed/person readers, managed datasets |
+| Storage | `storage/` | Artifact and media storage |
+
+Top-level modules handle cross-cutting
concerns: `compiler.py` (config compilation), `validation.py` (static config validation), `context.py` (execution context), `configurable_task.py` (base for all tasks), `secret_resolver.py`, `model_provider.py`.
 ## Key Components
-
+
+### Compilation Pipeline
+
+`compiler.py` transforms a `DataDesignerConfig` into an execution-ready form:
+
+1. Enriches the config with seed columns and an internal UUID column
+2. Runs static validation (`validation.py`) - checks Jinja references, code columns, processor targets, constraint consistency
+3. Produces `Violation` objects with typed `ViolationType` for structured error reporting
+
+### Registry System
+
+`TaskRegistry` (in `registry/base.py`) is the generic base: maps an enum value to a task class + config class. Uses `__new__`-based singleton per subclass to prevent duplicate instances.
+
+`DataDesignerRegistry` bundles the three registries used by `DatasetBuilder`:
+- `ColumnGeneratorRegistry` - column type → generator class
+- `ColumnProfilerRegistry` - column type → profiler class
+- Processor registry
+
+`create_default_column_generator_registry()` registers all built-in types and merges plugin entry points.
+
+### Column Generator Hierarchy
+
+```
+ConfigurableTask
+  └── ColumnGenerator (abstract: get_generation_strategy, generate/agenerate)
+        ├── FromScratchColumnGenerator (can_generate_from_scratch)
+        ├── ColumnGeneratorWithModelRegistry
+        │     └── ColumnGeneratorWithModel (cached model, inference params, MCP)
+        ├── ColumnGeneratorCellByCell (strategy: CELL_BY_CELL, generate(dict))
+        └── ColumnGeneratorFullColumn (strategy: FULL_COLUMN, generate(DataFrame))
+```
+
+Each concrete generator (e.g., `SamplerColumnGenerator`, `LLMTextColumnGenerator`) combines the appropriate base classes. The `GenerationStrategy` enum (`CELL_BY_CELL` or `FULL_COLUMN`) determines how the dataset builder dispatches work.
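How a strategy value might drive dispatch can be sketched as follows. The two generator classes and `run_generator` are hypothetical stand-ins, and a list of dicts replaces the DataFrame that the full-column path receives in the hierarchy above; only the `GenerationStrategy` member names come from the text.

```python
from enum import Enum


class GenerationStrategy(Enum):
    CELL_BY_CELL = "cell_by_cell"
    FULL_COLUMN = "full_column"


class UppercaseCell:
    """Hypothetical cell-by-cell generator: receives one row dict at a time."""

    strategy = GenerationStrategy.CELL_BY_CELL

    def generate(self, row: dict) -> dict:
        return {**row, "shout": row["text"].upper()}


class RowIndexColumn:
    """Hypothetical full-column generator: receives the whole batch at once."""

    strategy = GenerationStrategy.FULL_COLUMN

    def generate(self, rows: list[dict]) -> list[dict]:
        return [{**row, "index": i} for i, row in enumerate(rows)]


def run_generator(generator, rows: list[dict]) -> list[dict]:
    """Dispatch on the declared strategy, as an orchestrator might."""
    if generator.strategy is GenerationStrategy.CELL_BY_CELL:
        # Cell-by-cell work is independent per row, so it could be fanned out.
        return [generator.generate(row) for row in rows]
    return generator.generate(rows)
```

Because the orchestrator only looks at the strategy, a new generator can be added without touching `run_generator`.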
+
+### ResourceProvider
+
+Bundles everything a generator needs at runtime: `ModelRegistry`, `MCPRegistry`, `ArtifactStorage`, seed readers, person readers, secret resolver. Passed to generators during initialization.
 ## Data Flow
-
+
+1. `DatasetBuilder` receives a `DataDesignerConfig` and a `DataDesignerRegistry`
+2. Compilation produces a topologically sorted list of column configs
+3. Generators are instantiated from the registry for each column config
+4. The builder executes generators in dependency order (see [Dataset Builders](dataset-builders.md))
+5. Post-generation processors and profilers run on the completed dataset
 ## Design Decisions
-
+
+- **Registry + strategy pattern** decouples column type definitions (config) from generation behavior (engine). Adding a new column type means registering a config class and a generator class - no changes to orchestration code.
+- **`ConfigurableTask` as the universal base** ensures all tasks (generators, profilers, processors) share config validation and resource access patterns.
+- **Static validation before execution** catches config errors (missing references, invalid templates) before any LLM calls are made, failing fast and cheaply.
+- **Sync/async bridge** on `ColumnGenerator` allows generators to be written as async and called from sync contexts via `_run_coroutine_sync` / `asyncio.to_thread`.
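One way a sync/async bridge of the kind described in the last bullet could work is sketched below. Both helper functions are hypothetical illustrations built only on the standard library; they are not the project's `_run_coroutine_sync`, whose actual behavior is not shown in this patch.

```python
import asyncio
import threading


def run_coroutine_sync(coro):
    """Run `coro` to completion from synchronous code and return its result."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No event loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(coro)

    # A loop is already running here. Run the coroutine on a helper thread with
    # its own loop instead of nesting loops. This blocks the caller until done.
    outcome = {}

    def worker():
        try:
            outcome["value"] = asyncio.run(coro)
        except BaseException as exc:
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


async def call_sync_from_async(func, *args):
    """The opposite direction: run blocking code without stalling the event loop."""
    return await asyncio.to_thread(func, *args)
```

The helper-thread branch trades simplicity for throughput, since the calling loop is stalled while it waits; a production bridge would likely need a more careful design.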
 ## Cross-References
-
-- [System Architecture](overview.md)
-- [Config Layer](config.md)
-- [Dataset Builders](dataset-builders.md)
-- [Models](models.md)
+
+- [System Architecture](overview.md) - package relationships
+- [Config Layer](config.md) - column configs and builder API
+- [Dataset Builders](dataset-builders.md) - sync/async execution, DAG
+- [Models](models.md) - model facade and client adapters
+- [MCP](mcp.md) - tool execution integration
+- [Sampling](sampling.md) - sampler generators
+- [Plugins](plugins.md) - how plugins register generators
diff --git a/architecture/mcp.md b/architecture/mcp.md
index a7884d452..233af0c9b 100644
--- a/architecture/mcp.md
+++ b/architecture/mcp.md
@@ -1,21 +1,63 @@
 # MCP
-> Stub - to be populated. See source code at `packages/data-designer-engine/src/data_designer/engine/`.
+The MCP (Model Context Protocol) subsystem enables tool-augmented LLM generation. It manages tool discovery, session pooling, and parallel tool execution for column generators that use external tools.
+
+Source: `packages/data-designer-engine/src/data_designer/engine/mcp/`
 ## Overview
-
+
+MCP integration allows column generators to augment LLM completions with tool calls. When a column config specifies a `tool_alias`, the model facade routes tool calls through the MCP subsystem, which handles session management, tool schema discovery, and parallel execution.
+
+The subsystem has three layers:
+- **`MCPIOService`** - low-level I/O: session pooling, tool listing, tool execution on a background async loop
+- **`MCPFacade`** - scoped to a single tool config: schema formatting, completion response processing, tool call execution
+- **`MCPRegistry`** - maps tool aliases to configs, lazy facade construction, health checks
 ## Key Components
-
+
+### MCPIOService
+
+Singleton module-level service that manages the async I/O layer:
+
+- **Background async loop** - runs on a daemon thread; sync callers use `asyncio.run_coroutine_threadsafe` to bridge
+- **Session pool** - `_sessions` keyed by provider cache key (JSON of provider config); `_get_or_create_session` with in-flight deduplication prevents redundant connections
+- **Tool listing** - cached per session; coalescing for concurrent list requests via `_inflight_tools` prevents duplicate discovery calls
+- **Tool execution** - parallel tool calls within a single completion response
+
+Module-level functions (`list_tools`, `call_tools`, `clear_session_pool`) delegate to the singleton instance.
+
+### MCPFacade
+
+Scoped to one `ToolConfig`. Provides the interface that `ModelFacade` uses:
+
+- **`get_tool_schemas()`** - returns tool schemas in OpenAI function-calling format
+- **`process_completion_response`** - extracts tool calls from a completion, executes them in parallel via `MCPIOService`, returns `ChatMessage` list with results
+- **`refuse_completion_response`** - handles tool-call turn limits (prevents infinite tool loops)
+
+### MCPRegistry
+
+Maps `tool_alias` → `ToolConfig`. Lazy `MCPFacade` construction mirrors `ModelRegistry`. Provides health checks for configured tools.
 ## Data Flow
-
+
+1. Column config declares `tool_alias` referencing a configured MCP tool
+2. Generator's `ModelFacade` includes tool schemas in the completion request
+3. LLM returns a completion with tool calls
+4.
`ModelFacade` delegates to `MCPFacade.process_completion_response`
+5. `MCPFacade` extracts tool calls, executes them in parallel via `MCPIOService`
+6. Tool results are formatted as `ChatMessage`s and fed back to the LLM for another completion round
+7. Process repeats until the LLM produces a final response or the turn limit is reached
 ## Design Decisions
-
+
+- **Single background async loop** avoids creating event loops per request. All MCP I/O funnels through one loop on a daemon thread, with sync callers bridging via `run_coroutine_threadsafe`.
+- **Session pooling with in-flight deduplication** prevents redundant connections when multiple generators discover tools from the same provider concurrently.
+- **Tool schema coalescing** - concurrent `list_tools` calls for the same session share a single in-flight request, reducing startup latency when many columns use the same tool.
+- **Turn limits on tool loops** prevent runaway tool-call chains. `refuse_completion_response` gracefully terminates when the limit is reached.
 ## Cross-References
-
-- [System Architecture](overview.md)
-- [Engine Layer](engine.md)
-- [Models](models.md)
+
+- [System Architecture](overview.md) - where MCP fits in the stack
+- [Models](models.md) - how `ModelFacade` integrates MCP tool loops
+- [Engine Layer](engine.md) - `ResourceProvider` provides `MCPRegistry` to generators
+- [Config Layer](config.md) - `ToolConfig` definition
diff --git a/architecture/models.md b/architecture/models.md
index 46e26e242..5e0824796 100644
--- a/architecture/models.md
+++ b/architecture/models.md
@@ -1,20 +1,82 @@
 # Models
-> Stub - to be populated. See source code at `packages/data-designer-engine/src/data_designer/engine/models/`.
+The model subsystem provides a unified interface for LLM access: chat completions, embeddings, and image generation. It handles client creation, retry, rate-limit throttling, usage tracking, and MCP tool integration.
+
+Source: `packages/data-designer-engine/src/data_designer/engine/models/`
 ## Overview
-
+
+The model subsystem is layered:
+
+```
+ModelRegistry (lazy facade-per-alias)
+  └── ModelFacade (completion, embeddings, image gen, MCP tool loops)
+        └── ThrottledModelClient (AIMD rate limiting)
+              └── ModelClient (OpenAI-compatible or Anthropic adapter)
+                    └── RetryTransport (httpx-level retries)
+```
+
+Generators never interact with HTTP clients directly. They request a `ModelFacade` by alias from the `ModelRegistry`, which handles lazy construction and shared throttle state.
 ## Key Components
-
+
+### ModelClient (Protocol)
+
+Defines the contract: sync/async chat, embeddings, image generation, `supports_*` capability checks, `close` / `aclose`. Two implementations:
+
+- **`OpenAICompatibleClient`** - wraps the OpenAI SDK; works with any OpenAI-compatible endpoint (NIM, vLLM, etc.)
+- **`AnthropicClient`** - wraps the Anthropic SDK
+
+### Client Factory
+
+`create_model_client` routes by provider type to the appropriate adapter. Optionally wraps with:
+- **`RetryTransport`** - httpx-level retries via `httpx_retries.RetryTransport`. Rate-limit 429s are excluded from transport retries when `strip_rate_limit_codes=True` so they surface to the throttle layer.
+- **`ThrottledModelClient`** - AIMD (Additive Increase, Multiplicative Decrease) concurrency control per throttle domain.
+
+### ThrottleManager
+
+Manages concurrency limits per `ThrottleDomain` (CHAT, EMBEDDING, IMAGE, HEALTHCHECK), keyed by `(provider_name, model_id)`. Thread-safe with a shared lock for sync/async access.
+
+`ThrottledModelClient` wraps each API call in a context manager that acquires/releases throttle capacity and adjusts limits on success (additive increase) or rate-limit errors (multiplicative decrease).
+
+### ModelFacade
+
+The primary interface for generators. Holds a `ModelConfig`, `ModelClient`, optional `MCPRegistry`, and `ModelUsageStats`.
+
+- **`completion` / `acompletion`** - consolidates kwargs from inference params + provider extras, calls the client, tracks usage
+- **`embeddings` / `aembeddings`** - embedding generation
+- **`image_generation` / `aimage_generation`** - image generation
+- **MCP tool loops** - when a tool config is active, processes tool calls from completions via `MCPFacade`, feeds results back, and tracks tool usage stats
+
+### ModelRegistry
+
+Lazy `ModelFacade` construction per alias. Registers a shared `ThrottleManager` across all facades for coordinated rate limiting. Provides `get_model_usage_stats` and `log_model_usage` for post-build reporting.
+
+### Usage Tracking
+
+`ModelUsageStats` aggregates `TokenUsageStats`, `RequestUsageStats`, `ToolUsageStats`, and `ImageUsageStats` per model. Tracked on every successful or failed request for cost and performance visibility.
 ## Data Flow
-
+
+1. Generator requests a model by alias from `ModelRegistry`
+2. Registry lazily creates `ModelFacade` with the appropriate client and throttle config
+3. Generator calls `completion()` with prompt/messages
+4. `ModelFacade` builds kwargs, calls `ThrottledModelClient`
+5. Throttle layer acquires capacity, delegates to `ModelClient`
+6. `ModelClient` makes the HTTP request through `RetryTransport`
+7. Response flows back; usage is tracked; if MCP tools are configured, tool calls are executed and results fed back for another completion round
 ## Design Decisions
-
+
+- **Facade pattern** hides HTTP, retry, throttle, and MCP complexity from generators. Generators see `completion()` and get back parsed results.
+- **AIMD throttling at the application layer** rather than relying solely on HTTP retries. This provides smoother throughput under rate limits - the transport retry handles transient failures, while the throttle manager adjusts concurrency to avoid sustained 429 storms.
+- **429s excluded from transport retries** so rate-limit signals reach the throttle manager immediately rather than being masked by retry delays.
+- **Distribution-valued inference parameters** (`temperature`, `top_p` as `UniformDistribution` or `ManualDistribution`) enable controlled randomness across a dataset without per-row config changes.
+- **Lazy facade construction** avoids health-checking or connecting to models that are configured but never used in a particular generation run.
 ## Cross-References
-
-- [System Architecture](overview.md)
-- [Engine Layer](engine.md)
+
+- [System Architecture](overview.md) - where models fit in the stack
+- [Engine Layer](engine.md) - how generators use models
+- [MCP](mcp.md) - tool execution integrated into completions
+- [Config Layer](config.md) - `ModelConfig` and `ModelProvider` definitions
diff --git a/architecture/overview.md b/architecture/overview.md
index c829d8f6c..a3f60d14f 100644
--- a/architecture/overview.md
+++ b/architecture/overview.md
@@ -1,22 +1,70 @@
 # System Architecture
-> Stub - to be populated. See the three packages under `packages/`.
+DataDesigner is split across three installable packages that merge at runtime into a single `data_designer` namespace via PEP 420 implicit namespace packages (no top-level `__init__.py`).
 ## Overview
-
+
+```
+┌─────────────────────────────────────────────────────────┐
+│  data-designer (interface + CLI + integrations)         │
+│  DataDesigner class, CLI commands, HuggingFace Hub      │
+├─────────────────────────────────────────────────────────┤
+│  data-designer-engine (execution)                       │
+│  Generators, builders, models, MCP, sampling,           │
+│  validators, profilers, processing                      │
+├─────────────────────────────────────────────────────────┤
+│  data-designer-config (declaration)                     │
+│  Column configs, model configs, sampler params,         │
+│  builder API, plugin system, lazy imports               │
+└─────────────────────────────────────────────────────────┘
+```
+
+**Dependency direction:** interface → engine → config. No reverse imports.
+
+Users declare what their data should look like through config objects (columns, types, relationships, validation rules). The engine compiles those configs into an execution plan and generates the dataset. The interface package provides the public `DataDesigner` class and CLI that wire everything together.
 ## Key Components
-
+
+| Component | Package | Entry Point |
+|-----------|---------|-------------|
+| `DataDesigner` | `data-designer` | Public API - `create()`, `preview()`, `validate()` |
+| `DataDesignerConfigBuilder` | `data-designer-config` | Fluent builder for dataset configs |
+| `DatasetBuilder` | `data-designer-engine` | Orchestrates generation (sync or async) |
+| `ModelFacade` / `ModelRegistry` | `data-designer-engine` | LLM client abstraction with retry, throttle, usage tracking |
+| `MCPFacade` / `MCPRegistry` | `data-designer-engine` | Tool execution via Model Context Protocol |
+| `ColumnGeneratorRegistry` | `data-designer-engine` | Maps column types to generator implementations |
+| `PluginRegistry` | `data-designer-config` | Discovers and registers entry-point plugins |
+| CLI (`data-designer`) | `data-designer` | Typer-based CLI with lazy command loading |
 ## Data Flow
-
+
+1. **Declaration** - User builds a `DataDesignerConfig` via the builder API or loads YAML/JSON. Columns are a discriminated union on `column_type`; sampler columns add a second discriminated layer on `sampler_type`.
+
+2. **Compilation** - `compile_data_designer_config` enriches the config (seed columns, internal UUID column), runs static validation (Jinja references, code columns, processors), and produces a compiled column order via topological sort.
+
+3. **Generation** - `DatasetBuilder` instantiates column generators from the registry, then executes one of two paths:
+   - **Sequential** (default): batch loop over columns in topological order. Each generator produces its column via `CELL_BY_CELL` (threaded fan-out) or `FULL_COLUMN` strategy.
+   - **Async** (`DATA_DESIGNER_ASYNC_ENGINE=1`): builds an `ExecutionGraph`, partitions rows into groups, and dispatches tasks via `AsyncTaskScheduler` with semaphore-based concurrency, salvage rounds, and per-row-group checkpointing.
+
+4.
**Post-processing** - `ProcessorRunner` applies transformations (pre-batch, post-batch, after-generation). Profilers analyze the generated dataset.
+
+5. **Results** - `DatasetCreationResults` wraps the artifact storage, analysis, config, and metadata. Supports `load_dataset()`, record sampling, and `push_to_hub()`.
 ## Design Decisions
-
+
+- **PEP 420 namespace packages** allow the three packages to be installed independently while sharing the `data_designer` namespace. This enables lighter installs (e.g., config-only for validation tooling) without import conflicts.
+- **Lazy imports throughout** - `__getattr__`-based lazy loading in `data_designer.config` and `data_designer.interface`, plus `lazy_heavy_imports` for numpy/pandas, keep startup fast.
+- **Dual execution engines** share the same `DatasetBuilder` API. The async engine adds row-group parallelism and DAG-aware scheduling without changing the public interface.
+- **Registries as singletons** - `TaskRegistry.__new__` ensures one instance per registry subclass, preventing duplicate registration and enabling consistent plugin injection.
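The singleton-per-subclass behavior described in the last bullet can be sketched as below. The class names and the `register` method are hypothetical; the sketch only illustrates the `__new__` technique and is not the project's `TaskRegistry`.

```python
class SingletonPerSubclass:
    """Each subclass gets exactly one instance; subclasses do not share it."""

    _instances: dict[type, "SingletonPerSubclass"] = {}

    def __new__(cls):
        if cls not in SingletonPerSubclass._instances:
            instance = super().__new__(cls)
            instance.entries = {}  # initialised once, not on every construction
            SingletonPerSubclass._instances[cls] = instance
        return SingletonPerSubclass._instances[cls]

    def register(self, key: str, value: object) -> None:
        if key in self.entries:
            raise ValueError(f"{key!r} is already registered")
        self.entries[key] = value


class GeneratorRegistry(SingletonPerSubclass):
    """Hypothetical registry; every construction returns the same object."""


class ProfilerRegistry(SingletonPerSubclass):
    """A second hypothetical registry with its own independent instance."""
```

State is set up inside `__new__` rather than `__init__` because `__init__` would run again on every construction and wipe the shared entries.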
 ## Cross-References
-
-- [Config Layer](config.md)
-- [Engine Layer](engine.md)
-- [Models](models.md)
-- [Dataset Builders](dataset-builders.md)
+
+- [Config Layer](config.md) - builder API, column types, model configs, plugin system
+- [Engine Layer](engine.md) - compilation, generators, registries
+- [Models](models.md) - model facade, adapters, retry/throttle
+- [Dataset Builders](dataset-builders.md) - sync/async orchestration, DAG, batching
+- [MCP](mcp.md) - tool execution, session pooling
+- [Sampling](sampling.md) - statistical generators, person/entity data
+- [CLI](cli.md) - command structure, controller/service/repo pattern
+- [Agent Introspection](agent-introspection.md) - type discovery, state commands
+- [Plugins](plugins.md) - entry-point discovery, registry injection
diff --git a/architecture/plugins.md b/architecture/plugins.md
index 09f3ce143..3a4e33a60 100644
--- a/architecture/plugins.md
+++ b/architecture/plugins.md
@@ -1,21 +1,97 @@
 # Plugins
-> Stub - to be populated. See source code at `packages/data-designer-config/src/data_designer/`.
+The plugin system allows third-party packages to extend DataDesigner with new column types, seed source types, and processor types. Plugins are discovered via Python entry points and injected into the config layer's discriminated unions at import time.
+
+Source: `packages/data-designer-config/src/data_designer/plugins/` and `packages/data-designer-config/src/data_designer/plugin_manager.py`
 ## Overview
-
+
+DataDesigner's type system is built on Pydantic discriminated unions. The plugin system extends these unions at runtime so that:
+- User configs can reference plugin-provided types by name
+- Pydantic validation and deserialization work correctly for plugin types
+- The engine's registries can dispatch to plugin-provided generators
+
+Plugins are standard Python packages that declare entry points in the `data_designer.plugins` group.
 ## Key Components
-
+
+### Plugin Descriptor
+
+`Plugin` (in `plugins/plugin.py`) is a Pydantic model describing a plugin:
+- **`impl_qualified_name`** - fully qualified name of the implementation class (e.g., generator)
+- **`config_qualified_name`** - fully qualified name of the config class
+- **`PluginType`** - one of `COLUMN_GENERATOR`, `SEED_SOURCE`, or `PROCESSOR`
+
+Validators ensure:
+- Both modules exist and are importable
+- The config class has the correct `Literal` discriminator field (`column_type`, `seed_type`, or `processor_type` depending on plugin type)
+- The plugin `name` is derived from the discriminator field's default value
+
+### PluginRegistry
+
+Singleton (`__new__` + class-level `_instance`) that scans `importlib.metadata.entry_points(group="data_designer.plugins")` on first construction. Each entry point is loaded and expected to return a `Plugin` instance.
+
+Plugins can be disabled globally with `DISABLE_DATA_DESIGNER_PLUGINS=true`.
+
+### PluginManager
+
+Thin facade over `PluginRegistry` providing typed injection methods:
+- `inject_into_column_config_type_union` - extends `ColumnConfigT`
+- `inject_into_seed_source_type_union` - extends `SeedSourceT`
+- `inject_into_processor_type_union` - extends `ProcessorConfigT`
+
+Each method ORs the plugin's config class into the existing type union (`type_union |= plugin.config_cls`).
+
+### Integration Points
+
+Plugin injection happens at module load time in the config layer:
+- `column_types.py` instantiates `PluginManager()` and extends `ColumnConfigT`
+- `seed_source_types.py` extends `SeedSourceT`
+- `processor_types.py` extends `ProcessorConfigT`
+
+On the engine side, `create_default_column_generator_registry()` merges plugin entry points into the `ColumnGeneratorRegistry`, mapping plugin column types to their generator implementations.
 ## Data Flow
-
+
+### Discovery (at import time)
+```
+import data_designer.config.column_types
+  → PluginManager() → PluginRegistry()
+      → scan entry_points(group="data_designer.plugins")
+      → load each entry point → Plugin instance
+  → inject_into_column_config_type_union
+      → ColumnConfigT now includes plugin config classes
+```
+
+### Usage (at runtime)
+```
+User config includes column_type: "my-plugin-type"
+  → Pydantic discriminated union matches plugin config class
+  → DatasetBuilder looks up generator in ColumnGeneratorRegistry
+  → Plugin's generator class handles generation
+```
+
+### Relationship to Custom Columns
+
+Plugins and custom columns serve different use cases:
+
+| | Entry-Point Plugins | Custom Columns (`@custom_column_generator`) |
+|---|---|---|
+| **Scope** | Installable packages, new column types | In-process callables, same session |
+| **Discovery** | `importlib.metadata.entry_points` | Decorator attaches metadata to callable |
+| **Type system** | New `column_type` discriminator value | Uses built-in `custom` column type |
+| **Distribution** | pip-installable | Code in the user's script/notebook |
 ## Design Decisions
-
+
+- **Entry points over explicit registration** - plugins are discovered automatically when installed. Users don't need to modify DataDesigner configs or code to activate a plugin; `pip install` is sufficient.
+- **Union injection at import time** ensures Pydantic validation works for plugin types without any runtime setup. The tradeoff is that plugin discovery runs on first import of the config layer.
+- **`DISABLE_DATA_DESIGNER_PLUGINS`** provides an escape hatch for environments where plugin loading is undesirable (testing, CI, restricted environments).
+- **Singleton registry** prevents duplicate plugin scanning when multiple modules import the config layer.
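Entry-point discovery with an opt-out flag might look roughly like this. The group name and the environment variable come from the text above; the function name, the injectable `environ` parameter and the case-insensitive `"true"` comparison are assumptions, since the patch does not show how the flag is parsed. The `group=` keyword of `entry_points` requires Python 3.10 or later.

```python
import os
from importlib.metadata import entry_points

PLUGIN_GROUP = "data_designer.plugins"
DISABLE_ENV_VAR = "DISABLE_DATA_DESIGNER_PLUGINS"


def discover_plugins(environ=None) -> dict[str, object]:
    """Load every entry point in the plugin group, keyed by entry-point name."""
    environ = os.environ if environ is None else environ
    # Assumed parsing: any casing of "true" disables discovery entirely.
    if environ.get(DISABLE_ENV_VAR, "").lower() == "true":
        return {}
    # ep.load() imports the target object; per the text it should be a Plugin.
    return {ep.name: ep.load() for ep in entry_points(group=PLUGIN_GROUP)}
```

Passing the environment in as a parameter keeps the sketch easy to exercise in tests without mutating `os.environ`.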
 ## Cross-References
-
-- [System Architecture](overview.md)
-- [Config Layer](config.md)
-- [Engine Layer](engine.md)
+
+- [System Architecture](overview.md) - where plugins fit in the stack
+- [Config Layer](config.md) - discriminated unions that plugins extend
+- [Engine Layer](engine.md) - generator registry that plugins populate
+- [Agent Introspection](agent-introspection.md) - how plugin types appear in type discovery
diff --git a/architecture/sampling.md b/architecture/sampling.md
index 97d811b67..53aa17de2 100644
--- a/architecture/sampling.md
+++ b/architecture/sampling.md
@@ -1,21 +1,73 @@
 # Sampling
-> Stub - to be populated. See source code at `packages/data-designer-engine/src/data_designer/engine/sampling_gen/`.
+The sampling subsystem generates statistically distributed data without LLM calls. It handles built-in sampler types (UUID, Category, Gaussian, Person, DateTime, etc.), constraint-based rejection sampling, and locale-aware person/entity generation.
+
+Source: `packages/data-designer-engine/src/data_designer/engine/sampling_gen/`
 ## Overview
-
+
+Sampling is used for columns that don't need LLM generation - identifiers, categories, numerical distributions, timestamps, and person data. The subsystem builds a schema DAG from sampler configs, validates acyclicity, and generates data column-by-column with optional inter-column constraints.
 ## Key Components
-
+
+### DatasetGenerator
+
+The main entry point for sampler-based generation. Given a `DataSchema` (or `SamplerMultiColumnConfig`):
+
+1. Builds a NetworkX DAG from the schema's column dependencies
+2. Topologically sorts columns
+3. Generates each column with rejection sampling when constraints are present
+4. Shared kwargs include `people_gen_resource` for person-type samplers
+
+### DataSchema and DAG
+
+`DataSchema` defines the sampler columns and their relationships. `Dag` validation ensures acyclicity.
Edges come from:
+- Conditional parameters (column A's distribution depends on column B's value)
+- Required columns (explicit dependencies)
+- Constraints (inter-column relationships like "start_date < end_date")
+
+### Constraint System
+
+`ConstraintChecker` enforces inter-column constraints during generation:
+- **`ScalarInequalityConstraint`** - column value vs. a constant
+- **`ColumnInequalityConstraint`** - column value vs. another column's value
+
+Rejection sampling retries generation when constraints are violated, up to a configurable limit.
+
+### Person/Entity Generation
+
+`PeopleGen` (abstract) → `PeopleGenFaker` (Faker-based) provides locale-aware person data:
+
+- **Faker integration** - generates names, addresses, and base attributes by locale
+- **Managed datasets** - for locales in `LOCALES_WITH_MANAGED_DATASETS`, uses pre-built datasets via `ManagedDatasetGenerator` for higher quality and consistency
+- **Derived fields** - `Person` entity computes birth dates, emails, phone numbers, national IDs with locale-specific behavior (e.g., US-only SSN format)
+
+`PersonReader` on `ResourceProvider` loads managed person datasets when person samplers are used.
+
+### SamplerColumnGenerator
+
+The engine-side generator for sampler columns. Extends `FromScratchColumnGenerator` with `FULL_COLUMN` strategy. Uses `DatasetGenerator` internally, passing the appropriate `PeopleGen` resource.
 ## Data Flow
-
+
+1. `SamplerColumnConfig` declares `sampler_type` and `params` (discriminated union)
+2. `SamplerColumnGenerator` creates a `DatasetGenerator` with the schema
+3. `DatasetGenerator` topologically sorts columns, then for each:
+   - Samples values from the configured distribution
+   - Applies constraint checking via rejection sampling
+   - For person columns, delegates to `PeopleGen` with the configured locale
+4.
Returns a DataFrame with all sampler columns populated
 ## Design Decisions
-
+
+- **Rejection sampling over constraint propagation** keeps the implementation simple and general. Most constraints are satisfied quickly; the retry limit prevents infinite loops on unsatisfiable constraints.
+- **Managed datasets for person data** provide realistic, locale-consistent person records that Faker alone cannot guarantee (e.g., matching name ethnicity to locale, consistent address formatting).
+- **Separate DAG from the main execution DAG** - sampler columns have their own dependency graph within the `DatasetGenerator`, independent of the broader column execution DAG in `DatasetBuilder`. This is because sampler columns are generated as a batch before LLM columns.
+- **Discriminated union for sampler params** mirrors the column config pattern - each sampler type has its own params class with a `Literal` discriminator, enabling type-safe deserialization and validation.
 ## Cross-References
-
-- [System Architecture](overview.md)
-- [Engine Layer](engine.md)
-- [Config Layer](config.md)
+
+- [System Architecture](overview.md) - where sampling fits in the data flow
+- [Engine Layer](engine.md) - `SamplerColumnGenerator` in the generator hierarchy
+- [Config Layer](config.md) - `SamplerColumnConfig`, `SamplerParamsT`, constraints
+- [Dataset Builders](dataset-builders.md) - how sampler generators are orchestrated
diff --git a/plans/427/pr-2-status.md b/plans/427/pr-2-status.md
new file mode 100644
index 000000000..16e9e1f4b
--- /dev/null
+++ b/plans/427/pr-2-status.md
@@ -0,0 +1,91 @@
+# PR 2 Status - Phase 3 + Architecture Content
+
+**Branch:** `nmulepati/docs/427-agent-first-dev-pr-2`
+**Last updated:** 2026-03-25
+
+---
+
+## Completed
+
+### Step 7 - Issue Templates (4 files modified)
+
+- **`bug-report.yml`** - added "Agent Diagnostic / Prior Investigation" textarea and a checklist (reproduced issue, searched docs, included diagnostics)
+-
**`feature-request.yml`** - added "Agent Investigation" textarea and a checklist (reviewed existing issues, this is a design proposal)
+- **`development-task.yml`** - clarified description, added "Investigation / Context" and "Agent Plan / Findings" textareas
+- **`config.yml`** - updated the Discussions link copy to suggest trying an agent first
+
+### Step 8 - PR Template (1 new file)
+
+- **`.github/PULL_REQUEST_TEMPLATE.md`** - lean template: Summary, Related Issue, Changes, Testing checklist, Checklist
+
+### Step 9 - CODEOWNERS
+
+No changes needed. Single-group ownership confirmed: `* @NVIDIA-NeMo/data_designer_reviewers`
+
+### Step 11 - Skill Template Conformance (2 files modified)
+
+- **`create-pr/SKILL.md`** - rewrote the PR body template to match the new `.github/PULL_REQUEST_TEMPLATE.md` structure (Summary / Related Issue / Changes / Testing / Checklist), with Attention Areas as optional
+- **`review-code/SKILL.md`** - added step 7 to "Understand the Scope" (check PR template conformance) and a new "PR Template Conformance" section in the review output format
+
+### Step 12 - Architecture Docs (10 files populated)
+
+All stubs replaced with real content based on source code exploration:
+
+| File | Content |
+|------|---------|
+| `overview.md` | System architecture, package diagram, key components table, end-to-end data flow, design decisions |
+| `config.md` | Builder API, column configs, discriminated unions, model configs, plugin integration |
+| `engine.md` | Compilation pipeline, registry system, generator hierarchy, ResourceProvider |
+| `models.md` | Model facade layers, client adapters, AIMD throttling, retry strategy, usage tracking |
+| `dataset-builders.md` | Sequential vs async execution, ExecutionGraph, CompletionTracker, DAG, batching |
+| `mcp.md` | MCPIOService, session pooling, tool schema coalescing, facade, registry |
+| `sampling.md` | DatasetGenerator, constraint system, person/entity generation, locale
support | +| `cli.md` | Lazy loading, controller/service/repo pattern, generation commands | +| `agent-introspection.md` | FamilySpec, type discovery from unions, state commands, error handling | +| `plugins.md` | Entry-point discovery, PluginRegistry, union injection, custom columns comparison | + +--- + +## Remaining + +### Step 10 โ€” Label Creation + +Create workflow labels via `gh label create`. Not a file change โ€” requires GitHub API access. + +Labels to create: + +| Label | Purpose | +|-------|---------| +| `agent-ready` | Human-approved, agent can build | +| `review-ready` | Agent has posted a plan, needs human review | +| `in-progress` | Agent is actively building | +| `pr-opened` | Implementation complete, PR submitted | +| `spike` | Needs deeper investigation | +| `needs-more-context` | Issue missing reproduction/investigation context | +| `good-first-issue` | Suitable for new contributors (with agents) | + +--- + +## All Changed Files + +``` + M .agents/skills/create-pr/SKILL.md + M .agents/skills/review-code/SKILL.md + M .github/ISSUE_TEMPLATE/bug-report.yml + M .github/ISSUE_TEMPLATE/config.yml + M .github/ISSUE_TEMPLATE/development-task.yml + M .github/ISSUE_TEMPLATE/feature-request.yml + M architecture/agent-introspection.md + M architecture/cli.md + M architecture/config.md + M architecture/dataset-builders.md + M architecture/engine.md + M architecture/mcp.md + M architecture/models.md + M architecture/overview.md + M architecture/plugins.md + M architecture/sampling.md +?? .github/PULL_REQUEST_TEMPLATE.md +``` + +16 modified, 1 new โ€” nothing committed yet. 
From afdf71301c76ace0e04b0e8f92afd7ca11e3f603 Mon Sep 17 00:00:00 2001
From: Nabin Mulepati
Date: Mon, 30 Mar 2026 10:12:38 -0600
Subject: [PATCH 2/9] undo review-code skill change

---
 .agents/skills/review-code/SKILL.md | 13 -------------
 plans/427/pr-2-status.md            | 30 ++++++++++++++---------------
 2 files changed, 15 insertions(+), 28 deletions(-)

diff --git a/.agents/skills/review-code/SKILL.md b/.agents/skills/review-code/SKILL.md
index 326a909f4..910664269 100644
--- a/.agents/skills/review-code/SKILL.md
+++ b/.agents/skills/review-code/SKILL.md
@@ -106,7 +106,6 @@ Before diving into details, build a mental model:
 4. **Identify the primary goal** (feature, refactor, bugfix, etc.)
 5. **Note cross-cutting concerns** (e.g., a rename that touches many files vs. substantive logic changes)
 6. **Check existing feedback** (PR mode): inspect both inline comments (Step 1, item 5) and PR-level review bodies (Step 1, item 5b) so you don't duplicate feedback already given
-7. **Check PR template conformance** (PR mode): verify the PR description includes the required template sections (📋 Summary, 🔗 Related Issue, 🔄 Changes, 🧪 Testing, ✅ Checklist). Flag missing or empty sections as a warning in the review. The template lives at `.github/PULL_REQUEST_TEMPLATE.md`
 ## Step 4: Review Each Changed File (Multi-Pass)
@@ -243,18 +242,6 @@ Separate each finding with a blank line. Use bold file-and-title as a heading li
 #### Suggestions — Consider improving
 > Style improvements, minor simplifications, or optional enhancements that would improve code quality.
-### PR Template Conformance (PR mode only)
-
-Check that the PR description follows the repository template (`.github/PULL_REQUEST_TEMPLATE.md`):
-
-- **📋 Summary** — present and describes the "why", not just the "what"
-- **🔗 Related Issue** — links to an issue (`Fixes #NNN` or `Closes #NNN`), or explicitly states N/A
-- **🔄 Changes** — bullet list of key changes
-- **🧪 Testing** — checklist items checked or marked N/A with explanation
-- **✅ Checklist** — all items addressed
-
-Flag missing or empty sections as a warning. This check is skipped in branch mode.
-
 ### What Looks Good
 Call out 2-3 things done well (good abstractions, thorough tests, clean refactoring, etc.). Positive feedback is part of a good review.
diff --git a/plans/427/pr-2-status.md b/plans/427/pr-2-status.md
index 16e9e1f4b..d978f208e 100644
--- a/plans/427/pr-2-status.md
+++ b/plans/427/pr-2-status.md
@@ -1,7 +1,7 @@
 # PR 2 Status — Phase 3 + Architecture Content
 **Branch:** `nmulepati/docs/427-agent-first-dev-pr-2`
-**Last updated:** 2026-03-25
+**Last updated:** 2026-03-30
 ---
@@ -44,25 +44,25 @@ All stubs replaced with real content based on source code exploration:
 | `agent-introspection.md` | FamilySpec, type discovery from unions, state commands, error handling |
 | `plugins.md` | Entry-point discovery, PluginRegistry, union injection, custom columns comparison |
----
+### Step 10 — Label Creation (7 labels created via GitHub API)
-## Remaining
+Created workflow labels via `gh label create`:
-### Step 10 — Label Creation
+| Label | Color | Purpose |
+|-------|-------|---------|
+| `agent-ready` | `#0E8A16` (green) | Human-approved, agent can build |
+| `review-ready` | `#FBCA04` (yellow) | Agent has posted a plan, needs human review |
+| `in-progress` | `#1D76DB` (blue) | Agent is actively building |
+| `pr-opened` | `#5319E7` (purple) | Implementation complete, PR submitted |
+| `spike` | `#D93F0B` (orange) | Needs deeper investigation |
+| `needs-more-context` | `#E99695` (pink) | Issue missing reproduction/investigation context |
+| `good-first-issue` | `#7057ff` (violet) | Suitable for new contributors (with agents) — replaced GitHub default `good first issue` |
-Create workflow labels via `gh label create`. Not a file change — requires GitHub API access.
+---
-Labels to create:
+## All Steps Complete
-| Label | Purpose |
-|-------|---------|
-| `agent-ready` | Human-approved, agent can build |
-| `review-ready` | Agent has posted a plan, needs human review |
-| `in-progress` | Agent is actively building |
-| `pr-opened` | Implementation complete, PR submitted |
-| `spike` | Needs deeper investigation |
-| `needs-more-context` | Issue missing reproduction/investigation context |
-| `good-first-issue` | Suitable for new contributors (with agents) |
+No remaining work for PR 2.
 ---

From c516b40c1ec10fc4a75db475c5deb37c951b54f6 Mon Sep 17 00:00:00 2001
From: Nabin Mulepati
Date: Mon, 30 Mar 2026 10:17:32 -0600
Subject: [PATCH 3/9] delete status file

---
 plans/427/pr-2-status.md | 91 ----------------------------------------
 1 file changed, 91 deletions(-)
 delete mode 100644 plans/427/pr-2-status.md

diff --git a/plans/427/pr-2-status.md b/plans/427/pr-2-status.md
deleted file mode 100644
index d978f208e..000000000
--- a/plans/427/pr-2-status.md
+++ /dev/null
@@ -1,91 +0,0 @@
-# PR 2 Status — Phase 3 + Architecture Content
-
-**Branch:** `nmulepati/docs/427-agent-first-dev-pr-2`
-**Last updated:** 2026-03-30
-
----
-
-## Completed
-
-### Step 7 — Issue Templates (4 files modified)
-
-- **`bug-report.yml`** — added "Agent Diagnostic / Prior Investigation" textarea and a checklist (reproduced issue, searched docs, included diagnostics)
-- **`feature-request.yml`** — added "Agent Investigation" textarea and a checklist (reviewed existing issues, this is a design proposal)
-- **`development-task.yml`** — clarified description, added "Investigation / Context" and "Agent Plan / Findings" textareas
-- **`config.yml`** — updated the Discussions link copy to suggest trying an agent first
-
-### Step 8 — PR Template (1 new file)
-
-- **`.github/PULL_REQUEST_TEMPLATE.md`** — lean template: Summary, Related Issue, Changes, Testing checklist, Checklist
-
-### Step 9 — CODEOWNERS
-
-No changes needed. Single-group ownership confirmed: `* @NVIDIA-NeMo/data_designer_reviewers`
-
-### Step 11 — Skill Template Conformance (2 files modified)
-
-- **`create-pr/SKILL.md`** — rewrote the PR body template to match the new `.github/PULL_REQUEST_TEMPLATE.md` structure (Summary / Related Issue / Changes / Testing / Checklist), with Attention Areas as optional
-- **`review-code/SKILL.md`** — added step 7 to "Understand the Scope" (check PR template conformance) and a new "PR Template Conformance" section in the review output format
-
-### Step 12 — Architecture Docs (10 files populated)
-
-All stubs replaced with real content based on source code exploration:
-
-| File | Content |
-|------|---------|
-| `overview.md` | System architecture, package diagram, key components table, end-to-end data flow, design decisions |
-| `config.md` | Builder API, column configs, discriminated unions, model configs, plugin integration |
-| `engine.md` | Compilation pipeline, registry system, generator hierarchy, ResourceProvider |
-| `models.md` | Model facade layers, client adapters, AIMD throttling, retry strategy, usage tracking |
-| `dataset-builders.md` | Sequential vs async execution, ExecutionGraph, CompletionTracker, DAG, batching |
-| `mcp.md` | MCPIOService, session pooling, tool schema coalescing, facade, registry |
-| `sampling.md` | DatasetGenerator, constraint system, person/entity generation, locale support |
-| `cli.md` | Lazy loading, controller/service/repo pattern, generation commands |
-| `agent-introspection.md` | FamilySpec, type discovery from unions, state commands, error handling |
-| `plugins.md` | Entry-point discovery, PluginRegistry, union injection, custom columns comparison |
-
-### Step 10 — Label Creation (7 labels created via GitHub API)
-
-Created workflow labels via `gh label create`:
-
-| Label | Color | Purpose |
-|-------|-------|---------|
-| `agent-ready` | `#0E8A16` (green) | Human-approved, agent can build |
-| `review-ready` | `#FBCA04` (yellow) | Agent has posted a plan, needs human review |
-| `in-progress` | `#1D76DB` (blue) | Agent is actively building |
-| `pr-opened` | `#5319E7` (purple) | Implementation complete, PR submitted |
-| `spike` | `#D93F0B` (orange) | Needs deeper investigation |
-| `needs-more-context` | `#E99695` (pink) | Issue missing reproduction/investigation context |
-| `good-first-issue` | `#7057ff` (violet) | Suitable for new contributors (with agents) — replaced GitHub default `good first issue` |
-
----
-
-## All Steps Complete
-
-No remaining work for PR 2.
-
----
-
-## All Changed Files
-
-```
- M .agents/skills/create-pr/SKILL.md
- M .agents/skills/review-code/SKILL.md
- M .github/ISSUE_TEMPLATE/bug-report.yml
- M .github/ISSUE_TEMPLATE/config.yml
- M .github/ISSUE_TEMPLATE/development-task.yml
- M .github/ISSUE_TEMPLATE/feature-request.yml
- M architecture/agent-introspection.md
- M architecture/cli.md
- M architecture/config.md
- M architecture/dataset-builders.md
- M architecture/engine.md
- M architecture/mcp.md
- M architecture/models.md
- M architecture/overview.md
- M architecture/plugins.md
- M architecture/sampling.md
-?? .github/PULL_REQUEST_TEMPLATE.md
-```
-
-16 modified, 1 new — nothing committed yet.
From 3b66a654246024173848266927d49eb88d5b37d7 Mon Sep 17 00:00:00 2001
From: Nabin Mulepati
Date: Mon, 6 Apr 2026 09:03:21 -0600
Subject: [PATCH 4/9] small tweaks

---
 .agents/skills/create-pr/SKILL.md | 39 +++++++++++++++++++++++++------
 .github/PULL_REQUEST_TEMPLATE.md  | 10 +++++++-
 architecture/cli.md               |  2 +-
 architecture/dataset-builders.md  |  1 +
 4 files changed, 43 insertions(+), 9 deletions(-)

diff --git a/.agents/skills/create-pr/SKILL.md b/.agents/skills/create-pr/SKILL.md
index 6ff013511..cf87e6d11 100644
--- a/.agents/skills/create-pr/SKILL.md
+++ b/.agents/skills/create-pr/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: create-pr
-description: Create a GitHub PR with a well-formatted description matching the repository PR template
+description: Create a GitHub PR with a well-formatted description matching the repository PR template (flat Changes by default; optional Added/Changed/Removed/Fixed grouping)
 argument-hint: [special instructions]
 disable-model-invocation: true
 metadata:
@@ -40,7 +40,9 @@ Run these commands in parallel to understand the changes:
 ## Step 2: Analyze and Categorize Changes
-### By Change Type (from commits and diff)
+Use change types below to **decide** how to write the Changes section (flat vs grouped). You still describe testing under **Testing**, not under these buckets.
+
+### By change type (internal checklist)
 - ✨ **Added**: New files, features, capabilities
 - 🔧 **Changed**: Modified existing functionality
 - 🗑️ **Removed**: Deleted files or features
@@ -48,7 +50,11 @@ Run these commands in parallel to understand the changes:
 - 📚 **Docs**: Documentation updates
 - 🧪 **Tests**: Test additions/modifications
-### Identify Attention Areas 🔍
+### When to use optional grouping in **Changes**
+- **Flat bullet list** (default): Small PRs, single theme, or when categories would be sparse or redundant.
+- **Grouped subheadings** (`### ✨ Added`, `### 🔧 Changed`, `### 🗑️ Removed`, `### 🐛 Fixed`): Large PRs, release-note-style summaries, or clearly distinct fix + feature mixes. **Omit any empty section** — do not leave placeholder headings.
+
+### Identify attention areas
 Flag for special reviewer attention:
 - Files with significant changes (>100 lines)
 - Changes to base classes, interfaces, or public API
@@ -75,7 +81,9 @@ If commits have mixed types, use the primary/most significant type.
    git push -u origin
    ```
-2. **Build the PR body** using the repository's template structure:
+2. **Build the PR body** using the repository's template structure.
+
+**Default — flat Changes** (remove the HTML comment block from the template when filling in, or replace with your bullets only):
 ```markdown
 ## 📋 Summary
@@ -88,7 +96,7 @@ If commits have mixed types, use the primary/most significant type.
 ## 🔄 Changes
-- [Bullet list of key changes, grouped logically]
+- [Bullet list of key changes]
 - [Link to key files when helpful for reviewers]
 - [Reference commits for specific changes in multi-commit PRs]
@@ -105,6 +113,23 @@ If commits have mixed types, use the primary/most significant type.
 - [ ] Architecture docs updated (if applicable)
 ```
+**Optional — grouped Changes** (only when Step 2 criteria apply; omit empty sections):
+
+```markdown
+## 🔄 Changes
+
+### ✨ Added
+- [...]
+
+### 🔧 Changed
+- [...]
+
+### 🐛 Fixed
+- [...]
+```
+
+(Include `### 🗑️ Removed` only when something was deleted.)
+
 If there are genuinely important attention areas for reviewers, add an **Attention Areas** section after Changes:
 ```markdown
@@ -129,7 +154,7 @@ If there are genuinely important attention areas for reviewers, add an **Attenti
 - **Summary**: Always include — be concise and focus on the "why", not just the "what"
 - **Related Issue**: Always include if an issue exists. Use `Fixes #NNN` for bugs, `Closes #NNN` for features/tasks
-- **Changes**: Bullet list grouped logically. Omit trivial changes (formatting, imports) unless they are the point of the PR
+- **Changes**: Default to a flat list. Use Added/Changed/Removed/Fixed subheadings only for large or mixed PRs; never emit empty subsection headings
 - **Testing**: Check off items that apply. Mark N/A items explicitly rather than leaving them unchecked without explanation
 - **Checklist**: Check off items that are true. Leave unchecked with a note if something doesn't apply
 - **Attention Areas**: Only include if there are genuinely important items; omit for simple PRs
@@ -145,6 +170,6 @@ If there are genuinely important attention areas for reviewers, add an **Attenti
 - **No changes**: Inform user there's nothing to create a PR for
 - **Uncommitted work**: Warn and ask before proceeding
-- **Large PRs** (>20 files): Summarize by directory/module
+- **Large PRs** (>20 files): Summarize by directory/module; grouped Changes often helps here
 - **Single commit**: PR title can match commit message
 - **No related issue**: Note "N/A" in the Related Issue section rather than omitting it
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
index 29b081333..7863fff9a 100644
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -5,7 +5,15 @@
 ## 🔄 Changes
-
+
 ## 🧪 Testing
diff --git a/architecture/cli.md b/architecture/cli.md
index db171691d..05484e2cd 100644
--- a/architecture/cli.md
+++ b/architecture/cli.md
@@ -72,7 +72,7 @@ User invokes command (e.g., `data-designer create config.yaml`)
 ## Design Decisions
-- **Lazy command loading** keeps CLI startup under ~200ms regardless of how many commands exist. Heavy imports (engine, models) only load when the relevant command runs.
+- **Lazy command loading** keeps `data-designer --help` responsive: command modules (and their heavy dependencies, such as the engine and model stacks) load only when a command is invoked, not at process startup.
 - **Controller/service/repo for config, direct API for generation** — config management benefits from the layered pattern (testable services, swappable repositories). Generation doesn't need this indirection; it delegates to the same `DataDesigner` class that Python users call directly.
 - **`DATA_DESIGNER_HOME`** centralizes all CLI-managed state (model configs, provider configs, tool configs, personas) in a single directory, defaulting to `~/.data_designer/`.
 - **Rich-based UI** provides formatted tables, progress bars, and interactive prompts without requiring a web interface.
diff --git a/architecture/dataset-builders.md b/architecture/dataset-builders.md
index b2de64405..9c10cf531 100644
--- a/architecture/dataset-builders.md
+++ b/architecture/dataset-builders.md
@@ -81,6 +81,7 @@ DatasetBuilder.build()
 ### Async
 ```
 DatasetBuilder.build()
+  → _build_async()
   → _prepare_async_run()
   → ExecutionGraph.create()
   → CompletionTracker.with_graph()

From b59190bdf3446b37500ea57df63a56d83bb0acb8 Mon Sep 17 00:00:00 2001
From: Nabin Mulepati
Date: Mon, 6 Apr 2026 09:07:31 -0600
Subject: [PATCH 5/9] Fix 429 info

---
 architecture/models.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/architecture/models.md b/architecture/models.md
index 5e0824796..eb7776e6d 100644
--- a/architecture/models.md
+++ b/architecture/models.md
@@ -30,7 +30,7 @@ Defines the contract: sync/async chat, embeddings, image generation, `supports_*
 ### Client Factory
 `create_model_client` routes by provider type to the appropriate adapter. Optionally wraps with:
-- **`RetryTransport`** — httpx-level retries via `httpx_retries.RetryTransport`. Rate-limit 429s are excluded from transport retries when `strip_rate_limit_codes=True` so they surface to the throttle layer.
+- **`RetryTransport`** — httpx-level retries via `httpx_retries.RetryTransport`. `HttpModelClient` sets `strip_rate_limit_codes=True` for the async client and `False` for the sync client (`http_model_client.py`), which controls whether 429 responses are eligible for transport-layer retries.
 - **`ThrottledModelClient`** — AIMD (Additive Increase, Multiplicative Decrease) concurrency control per throttle domain.
 ### ThrottleManager
@@ -69,8 +69,8 @@ Lazy `ModelFacade` construction per alias. Registers a shared `ThrottleManager`
 ## Design Decisions
 - **Facade pattern** hides HTTP, retry, throttle, and MCP complexity from generators. Generators see `completion()` and get back parsed results.
-- **AIMD throttling at the application layer** rather than relying solely on HTTP retries. This provides smoother throughput under rate limits — the transport retry handles transient failures, while the throttle manager adjusts concurrency to avoid sustained 429 storms.
-- **429s excluded from transport retries** so rate-limit signals reach the throttle manager immediately rather than being masked by retry delays.
+- **AIMD throttling at the application layer** rather than relying solely on HTTP retries. This provides smoother throughput under rate limits — the transport layer still handles many transient failures, while the throttle manager adjusts concurrency to avoid sustained 429 storms.
+- **429 handling depends on sync vs async `HttpModelClient`** — The async client uses `strip_rate_limit_codes=True`, so 429s are not retried at the transport layer and rate-limit signals reach `ThrottledModelClient` / AIMD quickly. The sync client uses `strip_rate_limit_codes=False`, so 429s may still be retried transparently at the transport layer before surfacing to callers.
 - **Distribution-valued inference parameters** (`temperature`, `top_p` as `UniformDistribution` or `ManualDistribution`) enable controlled randomness across a dataset without per-row config changes.
 - **Lazy facade construction** avoids health-checking or connecting to models that are configured but never used in a particular generation run.

From 2d7167805e321a3e0ee685f4f74d080d7c7e1319 Mon Sep 17 00:00:00 2001
From: Nabin Mulepati
Date: Mon, 6 Apr 2026 09:12:26 -0600
Subject: [PATCH 6/9] update workind on skill info

---
 .github/ISSUE_TEMPLATE/bug-report.yml      | 2 +-
 .github/ISSUE_TEMPLATE/feature-request.yml | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/.github/ISSUE_TEMPLATE/bug-report.yml b/.github/ISSUE_TEMPLATE/bug-report.yml
index cd48c5cb4..d5ea56aad 100644
--- a/.github/ISSUE_TEMPLATE/bug-report.yml
+++ b/.github/ISSUE_TEMPLATE/bug-report.yml
@@ -49,7 +49,7 @@ body:
     attributes:
       label: Agent Diagnostic / Prior Investigation
       description: |
-        If you used an agent, paste the output from its investigation (e.g., from the `search-docs` or `search-github` skills).
+        If you used an agent, paste the output from its investigation (for example, what it found in the docs or issue tracker).
         If you couldn't or didn't use an agent, briefly say why and include the troubleshooting you already tried.
       placeholder: |
         Paste agent output here, or describe the manual investigation you performed.
diff --git a/.github/ISSUE_TEMPLATE/feature-request.yml b/.github/ISSUE_TEMPLATE/feature-request.yml
index 12132f0c9..96601c940 100644
--- a/.github/ISSUE_TEMPLATE/feature-request.yml
+++ b/.github/ISSUE_TEMPLATE/feature-request.yml
@@ -43,7 +43,7 @@ body:
     attributes:
       label: Agent Investigation
       description: |
-        If your agent explored the codebase to assess feasibility (e.g., using the `search-docs` or `search-github` skills), paste its findings here.
+        If your agent explored the codebase to assess feasibility (for example by searching project documentation or existing issues), paste its findings here.
       placeholder: Paste agent output here, if applicable.
   - type: textarea
     id: context

From 3944ec0cd2ee37894514b9c72536dd7cffd0aab7 Mon Sep 17 00:00:00 2001
From: Nabin Mulepati
Date: Mon, 6 Apr 2026 09:40:57 -0600
Subject: [PATCH 7/9] updates

---
 .agents/skills/review-code/SKILL.md | 16 ++++++++++++++++
 architecture/agent-introspection.md |  2 +-
 architecture/mcp.md                 |  4 ++--
 architecture/overview.md            |  2 +-
 4 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/.agents/skills/review-code/SKILL.md b/.agents/skills/review-code/SKILL.md
index 7e0b59ae7..31b450acb 100644
--- a/.agents/skills/review-code/SKILL.md
+++ b/.agents/skills/review-code/SKILL.md
@@ -94,6 +94,11 @@ Read the following files at the repository root to load the project's standards
 - **`STYLEGUIDE.md`** — code style rules (formatting, naming, imports, type annotations), design principles (DRY, KISS, YAGNI, SOLID), common pitfalls, lazy loading and `TYPE_CHECKING` patterns
 - **`DEVELOPMENT.md`** — testing patterns and expectations
+**Documentation sources (load when the changeset touches matching areas):**
+
+- **`architecture/*.md`** — subsystem maps aligned with `packages/` (e.g. `engine/mcp/` ↔ `architecture/mcp.md`). Use to verify the PR does not leave recorded architecture false relative to new behavior.
+- **`docs/`** — published user-facing documentation. Cross-check when public API, CLI behavior, or config surface changes would affect what readers are told.
+
 Use these guidelines as the baseline for the entire review. Project-specific rules take precedence over general best practices.
 ## Step 3: Understand the Scope
@@ -147,6 +152,17 @@ Re-read the changed files with a focus on **structure and design of the new/modi
 - Obvious inefficiencies introduced by this change (N+1 queries, repeated computation, unnecessary copies)
 - Appropriate data structures for the access pattern
+**Documentation alignment (same pass — scoped, not a full docs audit):**
+
+When **code** under `packages/` changes behavior, structure, or public contracts in a way that a maintainer would reasonably describe in `architecture/` or `docs/`:
+
+1. Identify the closest **`architecture/.md`** (and any obvious `docs/` pages) for that subsystem.
+2. If the PR **also edits** those docs, sanity-check that the edits match the code.
+3. If the PR **does not** edit docs but the change **contradicts** what `architecture/` or `docs/` currently asserts, flag it (**Warnings** if contributors rely on that text; **Suggestions** if impact is narrow). Suggest updating the same PR or an explicit follow-up issue.
+4. **Skip** this check for pure refactors with no observable behavior change, typo-only PRs, or changes already limited to documentation.
+
+The local **`search-docs`** skill can help locate `docs/` pages by topic when the right file is not obvious.
+
 ### Pass 3: Standards, Testing & Polish
 Final pass focused on **project conventions and test quality for new/modified code only**:
diff --git a/architecture/agent-introspection.md b/architecture/agent-introspection.md
index 5e97073c9..6a17c763a 100644
--- a/architecture/agent-introspection.md
+++ b/architecture/agent-introspection.md
@@ -6,7 +6,7 @@ Source: `packages/data-designer/src/data_designer/cli/commands/agent.py` and `pa
 ## Overview
-Agent introspection solves a specific problem: agents working with DataDesigner need to know what column types, sampler types, validator types, and processor types are available — including any installed plugins. Rather than hardcoding this knowledge or parsing source code, agents can call `data-designer agent` commands to get structured, up-to-date information.
+Agent introspection solves a specific problem: when an agent helps someone **author a dataset configuration** (columns, samplers, validators, processors, and related options), it needs an accurate catalog of what is available — including types added by installed plugins. Rather than hardcoding that knowledge or parsing source code, the agent can call `data-designer agent` commands to get structured, up-to-date information.
 ## Key Components
diff --git a/architecture/mcp.md b/architecture/mcp.md
index 233af0c9b..86f14bed1 100644
--- a/architecture/mcp.md
+++ b/architecture/mcp.md
@@ -17,14 +17,14 @@ The subsystem has three layers:
 ### MCPIOService
-Singleton module-level service that manages the async I/O layer:
+The `io.py` module exposes MCP I/O through **one shared `MCPIOService` instance** (`_MCP_IO_SERVICE`) created at import; `atexit` registers `shutdown`. Async state (loop, sessions, caches) lives on that instance.
 - **Background async loop** — runs on a daemon thread; sync callers use `asyncio.run_coroutine_threadsafe` to bridge
 - **Session pool** — `_sessions` keyed by provider cache key (JSON of provider config); `_get_or_create_session` with in-flight deduplication prevents redundant connections
 - **Tool listing** — cached per session; coalescing for concurrent list requests via `_inflight_tools` prevents duplicate discovery calls
 - **Tool execution** — parallel tool calls within a single completion response
-Module-level functions (`list_tools`, `call_tools`, `clear_session_pool`) delegate to the singleton instance.
+Module-level functions (`list_tools`, `call_tools`, `clear_session_pool`) delegate to `_MCP_IO_SERVICE`.
 ### MCPFacade
diff --git a/architecture/overview.md b/architecture/overview.md
index a3f60d14f..d170b604f 100644
--- a/architecture/overview.md
+++ b/architecture/overview.md
@@ -55,7 +55,7 @@ Users declare what their data should look like through config objects (columns,
 - **PEP 420 namespace packages** allow the three packages to be installed independently while sharing the `data_designer` namespace. This enables lighter installs (e.g., config-only for validation tooling) without import conflicts.
 - **Lazy imports throughout** — `__getattr__`-based lazy loading in `data_designer.config` and `data_designer.interface`, plus `lazy_heavy_imports` for numpy/pandas, keep startup fast.
 - **Dual execution engines** share the same `DatasetBuilder` API. The async engine adds row-group parallelism and DAG-aware scheduling without changing the public interface.
-- **Registries as singletons** — `TaskRegistry.__new__` ensures one instance per registry subclass, preventing duplicate registration and enabling consistent plugin injection.
+- **`TaskRegistry` subclasses: one instance per class** — `TaskRegistry.__new__` (`registry/base.py`) ensures a single instance of each concrete registry (column generators, profilers, processors). **`ModelRegistry`** and **`MCPRegistry`** are ordinary classes, constructed per run with injected dependencies. **`PluginRegistry`** (`plugins/registry.py`) uses `__new__` so entry points are discovered once per process.
 ## Cross-References

From 2c540f9e493a6e4b6fd9f75e555b216cfdd42d4a Mon Sep 17 00:00:00 2001
From: Nabin Mulepati
Date: Mon, 6 Apr 2026 10:11:50 -0600
Subject: [PATCH 8/9] Update architecture/overview.md

Co-authored-by: Johnny Greco
---
 architecture/overview.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/architecture/overview.md b/architecture/overview.md
index d170b604f..30c91bdfb 100644
--- a/architecture/overview.md
+++ b/architecture/overview.md
@@ -15,7 +15,7 @@ DataDesigner is split across three installable packages that merge at runtime in
 ├─────────────────────────────────────────────────────────┤
 │ data-designer-config (declaration) │
 │ Column configs, model configs, sampler params, │
-│ builder API, plugin system, lazy imports │
+│ builder API, plugin system, lazy imports │
 └─────────────────────────────────────────────────────────┘
 ```

From 3d3fc1dc8e5b00f64f50d43e98c473f55c90b9f7 Mon Sep 17 00:00:00 2001
From: Nabin Mulepati
Date: Mon, 6 Apr 2026 11:56:17 -0600
Subject: [PATCH 9/9] fix: correct symbol names and CLI commands in architecture docs

Address review comments:
- models.md: describe clients as native httpx adapters, not SDK wrappers
- agent-introspection.md: use actual family keys (columns, samplers, etc.)
  not column-types
- cli.md: use correct command `data-designer config models`
- plugins.md: SEED_READER not SEED_SOURCE, inject_into_processor_config_type_union

Made-with: Cursor
---
 architecture/agent-introspection.md | 16 ++++++++--------
 architecture/cli.md                 |  4 ++--
 architecture/models.md              |  4 ++--
 architecture/plugins.md             |  4 ++--
 4 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/architecture/agent-introspection.md b/architecture/agent-introspection.md
index 6a17c763a..36bfd01bd 100644
--- a/architecture/agent-introspection.md
+++ b/architecture/agent-introspection.md
@@ -27,11 +27,11 @@ Maps a **family name** to a **discriminated union type** and its **discriminator
 | Family | Union Type | Discriminator |
 |--------|-----------|---------------|
-| `column-types` | `ColumnConfigT` | `column_type` |
-| `sampler-types` | `SamplerParamsT` | `sampler_type` |
-| `validator-types` | `ValidatorParamsT` | `validator_type` |
-| `processor-types` | `ProcessorConfigT` | `processor_type` |
-| `constraint-types` | `ColumnConstraintT` | `constraint_type` |
+| `columns` | `ColumnConfigT` | `column_type` |
+| `samplers` | `SamplerParamsT` | `sampler_type` |
+| `validators` | `ValidatorParamsT` | `validator_type` |
+| `processors` | `ProcessorConfigT` | `processor_type` |
+| `constraints` | `ColumnConstraintT` | `constraint_type` |
 ### Type Discovery
@@ -58,9 +58,9 @@ Reuse the CLI's repository stack:
 ## Data Flow
 ```
-Agent calls: data-designer agent types column-types
-  → Typer dispatches to agent.get_types("column-types")
-  → FamilySpec maps "column-types" → ColumnConfigT union
+Agent calls: data-designer agent types columns
+  → Typer dispatches to agent.get_types("columns")
+  → FamilySpec maps "columns" → ColumnConfigT union
   → discover_family_types walks union members
   → get_family_catalog extracts names + descriptions
   → get_family_source_files resolves source locations
diff --git a/architecture/cli.md b/architecture/cli.md
index 05484e2cd..eba2a3ca7 100644
--- a/architecture/cli.md
+++ b/architecture/cli.md
@@ -55,9 +55,9 @@ This keeps generation aligned with the public Python API — the CLI is a thin w
 ### Config Management
 ```
-User invokes command (e.g., `data-designer models add`)
+User invokes command (e.g., `data-designer config models`)
   → Command function wires DATA_DESIGNER_HOME
-  → Controller presents interactive form
+  → Controller presents interactive menu
   → Service validates and applies changes
   → Repository reads/writes config files
 ```
diff --git a/architecture/models.md b/architecture/models.md
index eb7776e6d..d7af0cdac 100644
--- a/architecture/models.md
+++ b/architecture/models.md
@@ -24,8 +24,8 @@ Generators never interact with HTTP clients directly. They request a `ModelFacad
 Defines the contract: sync/async chat, embeddings, image generation, `supports_*` capability checks, `close` / `aclose`. Two implementations:
-- **`OpenAICompatibleClient`** — wraps the OpenAI SDK; works with any OpenAI-compatible endpoint (NIM, vLLM, etc.)
-- **`AnthropicClient`** — wraps the Anthropic SDK
+- **`OpenAICompatibleClient`** — native httpx adapter for OpenAI-compatible endpoints (NIM, vLLM, etc.)
+- **`AnthropicClient`** โ€” native httpx adapter for the Anthropic Messages API ### Client Factory diff --git a/architecture/plugins.md b/architecture/plugins.md index 3a4e33a60..96514d68d 100644 --- a/architecture/plugins.md +++ b/architecture/plugins.md @@ -20,7 +20,7 @@ Plugins are standard Python packages that declare entry points in the `data_desi `Plugin` (in `plugins/plugin.py`) is a Pydantic model describing a plugin: - **`impl_qualified_name`** โ€” fully qualified name of the implementation class (e.g., generator) - **`config_qualified_name`** โ€” fully qualified name of the config class -- **`PluginType`** โ€” one of `COLUMN_GENERATOR`, `SEED_SOURCE`, or `PROCESSOR` +- **`PluginType`** โ€” one of `COLUMN_GENERATOR`, `SEED_READER`, or `PROCESSOR` Validators ensure: - Both modules exist and are importable @@ -38,7 +38,7 @@ Plugins can be disabled globally with `DISABLE_DATA_DESIGNER_PLUGINS=true`. Thin facade over `PluginRegistry` providing typed injection methods: - `inject_into_column_config_type_union` โ€” extends `ColumnConfigT` - `inject_into_seed_source_type_union` โ€” extends `SeedSourceT` -- `inject_into_processor_type_union` โ€” extends `ProcessorConfigT` +- `inject_into_processor_config_type_union` โ€” extends `ProcessorConfigT` Each method ORs the plugin's config class into the existing type union (`type_union |= plugin.config_cls`).