ci: add executable smoke checks and harden runner memory

andreatgretel · andreatgretel · commit 30a6de2d07a0 · 2026-04-14T04:16:48.000Z
Add executable smoke checks to test-health and code-quality recipes
that exercise real code paths (config build, validate, import timing,
registry completeness, error hierarchy, input rejection) without
needing an LLM provider. Checks are split into fixed canaries (same
every run) and creative checks (agent varies inputs each run).
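
The rejection pattern used by the creative checks can be sketched as a small
helper. This is a minimal sketch, not code from the recipes: `check_rejects`,
`build_gaussian`, and `build_category` are hypothetical stand-ins so the
example runs without DataDesigner installed. `build_category` is deliberately
lax to show what a FAIL line looks like.

```python
from typing import Callable


def check_rejects(label: str, attempt: Callable[[], object]) -> str:
    """Run `attempt`, which is expected to raise, and return an OK/FAIL line."""
    try:
        attempt()
    except Exception as exc:
        return f"OK: {label} rejected ({type(exc).__name__})"
    return f"FAIL: {label} was silently accepted"


# Hypothetical validator that correctly rejects a negative std.
def build_gaussian(std: float) -> dict:
    if std < 0:
        raise ValueError("std must be non-negative")
    return {"sampler_type": "gaussian", "std": std}


# Hypothetical validator that forgets to reject an empty values list.
def build_category(values: list) -> dict:
    return {"sampler_type": "category", "values": values}


results = [
    check_rejects("negative gaussian std", lambda: build_gaussian(-1.0)),
    check_rejects("empty category values", lambda: build_category([])),
]
for line in results:
    print(line)
```

In a real run the lambdas would call the config builder instead of the
stand-ins, and any FAIL line would be reported as a critical finding.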

Harden runner memory: define JSON schema in _runner.md with TTL and
size rules, validate state file after agent runs, only update
last_run on success, drop unused audit-log.md. Add make install-dev
workflow step so recipes can run Python against the installed packages.
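
The TTL and size rules for `known_issues` could be applied along these lines.
This is a sketch under stated assumptions: `prune_known_issues` is a
hypothetical helper that is not part of this commit, and it reads "oldest" as
"oldest `last_seen`", which the recipe text does not spell out.

```python
from datetime import date, timedelta

TTL = timedelta(weeks=4)   # entries not seen for longer than this are dropped
MAX_ISSUES = 100           # cap on the number of known_issues entries


def prune_known_issues(state: dict, today: date) -> dict:
    """Return a copy of `state` with expired and excess known_issues removed."""
    cutoff = today - TTL
    fresh = [
        issue
        for issue in state.get("known_issues", [])
        if date.fromisoformat(issue["last_seen"]) >= cutoff
    ]
    # Assumption: when over the cap, keep the most recently seen entries.
    # ISO dates sort chronologically as plain strings.
    fresh.sort(key=lambda issue: issue["last_seen"], reverse=True)
    return {**state, "known_issues": fresh[:MAX_ISSUES]}


state = {
    "suite": "test-health",
    "known_issues": [
        {"id": "a1", "last_seen": "2026-04-14"},
        {"id": "b2", "last_seen": "2026-03-01"},
    ],
    "baselines": {},
}
pruned = prune_known_issues(state, today=date(2026, 4, 14))
print([issue["id"] for issue in pruned["known_issues"]])
```

The helper returns a new dict rather than mutating its input, so the caller
can compare the state before and after pruning.
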
diff --git a/.agents/recipes/_runner.md b/.agents/recipes/_runner.md
@@ -9,6 +9,48 @@ DataDesigner is an NVIDIA NeMo framework for creating synthetic datasets.
 See AGENTS.md at the repo root for an overview and links to detailed docs
 (architecture, style guide, development workflow).
 
+## Environment
+
+The workflow runs `make install-dev` and adds `.venv/bin` to PATH before
+your recipe executes. All three DataDesigner packages are installed in
+development mode. You can use `python` directly to run code - it resolves
+to the project venv.
+
+## Runner memory
+
+Each recipe has access to `{{memory_path}}/runner-state.json`. The workflow
+loads it from cache before your run and saves it after. Use this schema:
+
+```json
+{
+  "suite": "test-health",
+  "last_run": "2026-04-14T08:00:00Z",
+  "known_issues": [
+    {
+      "id": "short-hash-of-finding",
+      "category": "hollow-test",
+      "summary": "test_foo only asserts not None",
+      "first_seen": "2026-04-07",
+      "last_seen": "2026-04-14"
+    }
+  ],
+  "baselines": {
+    "test_to_source_ratio": "45/52",
+    "type_coverage_pct": 78,
+    "import_time_s": 1.8
+  }
+}
+```
+
+Rules:
+- **`known_issues`**: skip re-reporting issues already here. Update
+  `last_seen` if the issue is still present. Remove entries whose
+  `last_seen` is more than 4 weeks old - the issue was likely fixed.
+- **`baselines`**: store current metric values. Compare against these on
+  the next run to detect trends (improving or regressing).
+- **Keep it small.** The whole file should stay under 50KB. If
+  `known_issues` grows past 100 entries, prune the entries with the oldest
+  `last_seen` first.
+
 ## Constraints
 
 - **No interactive prompts.** If something is ambiguous, make a reasonable choice
diff --git a/.agents/recipes/code-quality/recipe.md b/.agents/recipes/code-quality/recipe.md
@@ -106,7 +106,63 @@ Known gap: `packages/data-designer-config/src/data_designer/custom_column.py`
 and `packages/data-designer-config/src/data_designer/analysis/` have been
 flagged before.
 
-### 4. TODO/FIXME/HACK aging
+### 4. Executable quality checks
+
+Run a few checks that exercise real code paths to catch regressions that
+static analysis misses. The workflow runs `make install-dev` and adds
+`.venv/bin` to PATH, so `python` resolves to the project venv.
+
+#### 4a. Error type hierarchy (fixed - run as written)
+
+Verify that the project's error types are importable and properly
+structured. Silent breakage here means third-party exceptions leak to users:
+
+```bash
+python -c "
+from data_designer.errors import DataDesignerError
+assert issubclass(DataDesignerError, Exception), 'DataDesignerError must be an Exception'
+print('OK: error hierarchy intact')
+" 2>&1 || echo "WARN: error hierarchy check failed"
+```
+
+#### 4b. Input validation checks (creative - vary each run)
+
+Verify the config builder rejects bad inputs rather than silently
+producing corrupt configs. **Design your own invalid inputs each run**
+to maximize coverage over time.
+
+Examples of things to test (pick 2-3 per run, and invent new ones):
+- Invalid `column_type` string (should raise)
+- `column_type='sampler'` without `sampler_type` (should raise)
+- Empty builder `.build()` (should handle gracefully)
+- Duplicate column names (should raise or deduplicate clearly)
+- Invalid sampler params (e.g., `gaussian` with negative `std`, `category`
+  with empty `values` list)
+- Column names with special characters or very long strings
+- Recently changed validators (check `git log --oneline -10 -- packages/*/src/data_designer/config/`)
+
+**API reference:**
+
+```python
+from data_designer.config.config_builder import DataDesignerConfigBuilder
+
+# Test that invalid input is rejected (not silently accepted)
+try:
+    DataDesignerConfigBuilder().add_column(
+        name='x', column_type='nonexistent_type'
+    ).build()
+    print('FAIL: invalid column type was silently accepted')
+except Exception as e:
+    print(f'OK: invalid column type rejected ({type(e).__name__})')
+```
+
+The pattern: try something that should fail, print FAIL if it succeeds
+silently, print OK if it raises. A FAIL means a validation regression
+that could lead to silent data corruption.
+
+Report what you tested and why. Any FAIL is a critical finding.
+
+### 5. TODO/FIXME/HACK aging
 
 Inventory markers with their git blame age:
 
@@ -155,6 +211,14 @@ Write the report to `/tmp/audit-{{suite}}.md`:
 
 **Coverage:** ~X% of public functions fully annotated (previous: Y%)
 
+### Executable quality checks
+
+| Check | Type | Status | Detail |
+|-------|------|--------|--------|
+| Error hierarchy | fixed | OK/FAIL | DataDesignerError is properly structured |
+| (describe input tested) | creative | OK/FAIL | (what was tested and why) |
+| ... | creative | ... | ... |
+
 ### TODO/FIXME/HACK inventory
 
 | File | Line | Marker | Age (days) | Commit |
@@ -168,6 +232,7 @@ Write the report to `/tmp/audit-{{suite}}.md`:
 - N complexity hotspots (M trending up)
 - N exception hygiene issues (M new)
 - Type coverage: X% (delta: +/-N% from last run)
+- Executable checks: N of M passed (any FAIL is critical)
 - N aging TODO/FIXME markers (M new)
 ```
 
diff --git a/.agents/recipes/test-health/recipe.md b/.agents/recipes/test-health/recipe.md
@@ -111,7 +111,132 @@ These should use `data_designer.lazy_heavy_imports`. Cross-reference with
 the structure recipe's findings if available in runner memory, but don't
 skip this check - it directly affects user experience.
 
-### 4. Test isolation verification
+### 4. Executable smoke checks
+
+Run lightweight checks that exercise real code paths. These catch silent
+data corruption, column registration gaps, and config wiring issues that
+static analysis misses. None of these require an LLM provider.
+
+**Important**: the workflow runs `make install-dev` and adds `.venv/bin` to
+PATH, so `python` resolves to the project venv with all packages installed.
+
+There are two kinds of smoke checks: **fixed canaries** that must run
+identically every time (deterministic regressions), and **creative checks**
+where you should vary the inputs each run to maximize coverage over time.
+
+#### Fixed canaries (run these exactly as written)
+
+**4a. Package import verification**
+
+```bash
+python -c "
+from data_designer.config.config_builder import DataDesignerConfigBuilder
+from data_designer.engine.compiler import compile_data_designer_config
+from data_designer.interface.data_designer import DataDesigner
+print('OK: all packages import')
+"
+```
+
+If any import fails, this is a critical finding - it means the package
+layering is broken.
+
+**4b. Import performance timing**
+
+```bash
+python -c "
+import time
+start = time.monotonic()
+from data_designer.interface.data_designer import DataDesigner
+elapsed = time.monotonic() - start
+budget = 3.0
+status = 'OK' if elapsed < budget else 'FAIL'
+print(f'{status}: import took {elapsed:.2f}s (budget: {budget:.0f}s)')
+"
+```
+
+**4c. Column type registry completeness**
+
+```bash
+python -c "
+from data_designer.config.column_types import (
+    DataDesignerColumnType,
+    get_column_config_cls_from_type,
+)
+
+missing = []
+for ct in DataDesignerColumnType:
+    try:
+        cls = get_column_config_cls_from_type(ct)
+        if cls is None:
+            missing.append(ct.value)
+    except Exception as e:
+        missing.append(f'{ct.value} ({e})')
+
+if missing:
+    for m in missing:
+        print(f'FAIL: {m}')
+else:
+    print(f'OK: all {len(list(DataDesignerColumnType))} column types resolve to config classes')
+" 2>&1 || echo "WARN: registry check could not run"
+```
+
+#### Creative checks (vary these each run)
+
+For each run, **design your own** config build and validation checks. The
+goal is to exercise different code paths over time rather than testing the
+same config every day.
+
+**What to vary:**
+- **Sampler types**: pick a different mix each run. Available sampler types:
+  `uuid`, `category`, `subcategory`, `uniform`, `gaussian`, `bernoulli`,
+  `bernoulli_mixture`, `binomial`, `poisson`, `scipy`, `person_from_faker`,
+  `datetime`, `timedelta`. Try 2-5 columns per config.
+- **Column count**: sometimes build a single-column config, sometimes 8+
+- **Edge cases**: empty params where defaults should apply, extreme param
+  values (e.g., `gaussian` with `std=0`), columns with constraints
+- **Recently changed code**: check `git log --oneline -20 -- packages/` for
+  recently modified column types or sampler params, and prioritize testing
+  those
+
+**What to always check:**
+1. Config build round-trip: column count and names survive `.build()`
+2. Validation: `DataDesigner.validate(builder)` succeeds for valid configs
+3. Rejection: invalid inputs raise, not silently produce bad configs
+
+**API reference** for writing checks:
+
+```python
+from data_designer.config.config_builder import DataDesignerConfigBuilder
+from data_designer.interface.data_designer import DataDesigner
+import tempfile
+
+# Build a config - use keyword args: name, column_type, sampler_type, params
+builder = (
+    DataDesignerConfigBuilder()
+    .add_column(name='id', column_type='sampler', sampler_type='uuid')
+    .add_column(name='cat', column_type='sampler', sampler_type='category',
+                params={'values': ['A', 'B', 'C']})
+)
+config = builder.build()
+
+# Verify columns survived the build
+assert len(config.columns) >= 2
+names = {c.name for c in config.columns}
+assert 'id' in names and 'cat' in names
+
+# Validate through the full stack (no LLM needed for sampler-only)
+dd = DataDesigner(artifact_path=tempfile.mkdtemp(), model_providers=[])
+dd.validate(builder)
+```
+
+Run at least 2 creative checks per audit. Document what you chose and why
+in the report (e.g., "tested poisson+datetime combo because poisson params
+were modified in commit abc1234").
+
+**Report smoke check results in a separate table.** If any check fails,
+that is a higher-priority finding than static analysis results.
+
+### 5. Test isolation verification
 
 The CI runs three separate test jobs: config-only, engine+config, and
 full stack. Check that test files respect these boundaries:
@@ -158,6 +283,24 @@ Write the report to `/tmp/audit-{{suite}}.md`:
 | test_import_perf.py exists | yes/no |
 | Heavy top-level imports | N found |
 
+### Executable smoke checks
+
+**Fixed canaries:**
+
+| Check | Status | Detail |
+|-------|--------|--------|
+| Package imports | OK/FAIL | All three packages import cleanly |
+| Import timing | OK/FAIL | X.XXs (budget: 3s) |
+| Registry completeness | OK/FAIL/WARN | Column types resolve to config classes |
+
+**Creative checks** (describe what you tested and why):
+
+| Check | Sampler types used | Status | Detail |
+|-------|--------------------|--------|--------|
+| Config build #1 | e.g. uuid+poisson+datetime | OK/FAIL | ... |
+| Validate #1 | ... | OK/FAIL | ... |
+| ... | ... | ... | ... |
+
 ### Test isolation
 
 | Test file | Violation |
@@ -169,6 +312,7 @@ Write the report to `/tmp/audit-{{suite}}.md`:
 - N source files without tests (M new since last run)
 - N hollow tests detected (high confidence only)
 - Import perf: N heavy top-level imports
+- Smoke checks: N passed, M failed (list any FAILs - these are critical)
 - N test isolation violations
 ```
 
diff --git a/.github/workflows/agentic-ci-daily.yml b/.github/workflows/agentic-ci-daily.yml
@@ -112,13 +112,19 @@ jobs:
           }
           EOF
           fi
-          if [ ! -f .agentic-ci-state/audit-log.md ]; then
-            echo "# Audit Log: ${SUITE}" > .agentic-ci-state/audit-log.md
-            echo "" >> .agentic-ci-state/audit-log.md
-          fi
           echo "Runner memory state:"
           cat .agentic-ci-state/runner-state.json
 
+      - name: Install dev environment
+        run: |
+          make install-dev
+          echo "${{ github.workspace }}/.venv/bin" >> "$GITHUB_PATH"
+          .venv/bin/python -c "
+          from data_designer.config._version import __version__ as cv
+          from data_designer.engine._version import __version__ as ev
+          print(f'  config: {cv}  engine: {ev}')
+          " 2>/dev/null || echo "  (version check skipped)"
+
       - name: Pre-flight checks
         env:
           ANTHROPIC_BASE_URL: ${{ secrets.AGENTIC_CI_API_BASE_URL }}
@@ -183,33 +189,24 @@ jobs:
         continue-on-error: true
 
       - name: Update runner memory
-        if: always()
+        if: success()
         env:
           SUITE: ${{ matrix.suite }}
         run: |
-          # Update last_run timestamp in state
-          if command -v python3 &> /dev/null; then
-            python3 -c "
-          import json, datetime
-          with open('.agentic-ci-state/runner-state.json') as f:
-              state = json.load(f)
+          # Validate the agent didn't corrupt the state file
+          python3 -c "
+          import json, datetime, sys
+          try:
+              with open('.agentic-ci-state/runner-state.json') as f:
+                  state = json.load(f)
+          except (json.JSONDecodeError, FileNotFoundError) as e:
+              print(f'::warning::runner-state.json is invalid ({e}), resetting')
+              state = {'suite': '${SUITE}', 'known_issues': [], 'baselines': {}}
           state['last_run'] = datetime.datetime.utcnow().isoformat() + 'Z'
           state['suite'] = '${SUITE}'
           with open('.agentic-ci-state/runner-state.json', 'w') as f:
               json.dump(state, f, indent=2)
           "
-          else
-            echo "python3 not available, skipping state update"
-          fi
-
-          # Append to audit log
-          echo "" >> .agentic-ci-state/audit-log.md
-          echo "## $(date -u +%Y-%m-%d)" >> .agentic-ci-state/audit-log.md
-          if [ -s "/tmp/audit-${SUITE}.md" ]; then
-            echo "Findings reported." >> .agentic-ci-state/audit-log.md
-          else
-            echo "No findings." >> .agentic-ci-state/audit-log.md
-          fi
 
       - name: Write job summary
         if: always()