Commit 30a6de2

ci: add executable smoke checks and harden runner memory
Add executable smoke checks to the test-health and code-quality recipes. The checks exercise real code paths (config build, validate, import timing, registry completeness, error hierarchy, input rejection) without needing an LLM provider. They are split into fixed canaries (same every run) and creative checks (the agent varies inputs each run).

Harden runner memory: define the JSON schema in _runner.md with TTL and size rules, validate the state file after agent runs, only update last_run on success, and drop the unused audit-log.md.

Add a make install-dev workflow step so recipes can run Python against the installed packages.
1 parent deb5468 commit 30a6de2

4 files changed

Lines changed: 273 additions & 25 deletions

File tree

.agents/recipes/_runner.md

Lines changed: 42 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,48 @@ DataDesigner is an NVIDIA NeMo framework for creating synthetic datasets.
99
See AGENTS.md at the repo root for an overview and links to detailed docs
1010
(architecture, style guide, development workflow).
1111

12+
## Environment
13+
14+
The workflow runs `make install-dev` and adds `.venv/bin` to PATH before
15+
your recipe executes. All three DataDesigner packages are installed in
16+
development mode. You can use `python` directly to run code - it resolves
17+
to the project venv.
18+
19+
## Runner memory
20+
21+
Each recipe has access to `{{memory_path}}/runner-state.json`. The workflow
22+
loads it from cache before your run and saves it after. Use this schema:
23+
24+
```json
25+
{
26+
  "suite": "test-health",
27+
  "last_run": "2026-04-14T08:00:00Z",
28+
  "known_issues": [
29+
    {
30+
      "id": "short-hash-of-finding",
31+
      "category": "hollow-test",
32+
      "summary": "test_foo only asserts not None",
33+
      "first_seen": "2026-04-07",
34+
      "last_seen": "2026-04-14"
35+
    }
36+
  ],
37+
  "baselines": {
38+
    "test_to_source_ratio": "45/52",
39+
    "type_coverage_pct": 78,
40+
    "import_time_s": 1.8
41+
  }
42+
}
43+
```
44+
45+
Rules:
46+
- **`known_issues`**: skip re-reporting issues already here. Update
47+
`last_seen` if the issue is still present. Remove entries whose
48+
`last_seen` is more than 4 weeks old - the issue was likely fixed.
49+
- **`baselines`**: store current metric values. Compare against these on
50+
the next run to detect trends (improving or regressing).
51+
- **Keep it small.** The whole file should stay under 50KB. If
52+
`known_issues` grows past 100 entries, prune those with the oldest `last_seen` first.
53+
1254
## Constraints
1355

1456
- **No interactive prompts.** If something is ambiguous, make a reasonable choice
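
The pruning rules added to `_runner.md` above (4-week TTL, 100-entry cap, 50KB budget) could be applied along these lines. This is a minimal sketch, not code from the commit: `prune_state`, `fits_size_budget`, and the three constants are hypothetical names, and "oldest" is assumed to mean oldest `last_seen`.

```python
import json
from datetime import date, timedelta

# Limits taken from the rules in _runner.md; the constant names are illustrative.
MAX_AGE = timedelta(weeks=4)
MAX_ISSUES = 100
MAX_BYTES = 50 * 1024


def prune_state(state: dict, today: date) -> dict:
    """Return a copy of state with stale and excess known_issues removed."""
    cutoff = today - MAX_AGE
    fresh = [
        issue
        for issue in state.get("known_issues", [])
        if date.fromisoformat(issue["last_seen"]) >= cutoff
    ]
    # ISO dates sort lexicographically, so this keeps the most recently seen.
    fresh.sort(key=lambda issue: issue["last_seen"], reverse=True)
    return {**state, "known_issues": fresh[:MAX_ISSUES]}


def fits_size_budget(state: dict) -> bool:
    """Check the serialized state against the 50KB guideline."""
    return len(json.dumps(state, indent=2).encode("utf-8")) < MAX_BYTES
```

An entry whose `last_seen` is exactly 28 days old is kept here, since the rule says "more than 4 weeks".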

.agents/recipes/code-quality/recipe.md

Lines changed: 66 additions & 1 deletion
@@ -106,7 +106,63 @@ Known gap: `packages/data-designer-config/src/data_designer/custom_column.py`
106106
and `packages/data-designer-config/src/data_designer/analysis/` have been
107107
flagged before.
108108

109-
### 4. TODO/FIXME/HACK aging
109+
### 4. Executable quality checks
110+
111+
Run a few checks that exercise real code paths to catch regressions that
112+
static analysis misses. The workflow runs `make install-dev` and adds
113+
`.venv/bin` to PATH, so `python` resolves to the project venv.
114+
115+
#### 4a. Error type hierarchy (fixed - run as written)
116+
117+
Verify that the project's error types are importable and properly
118+
structured. Silent breakage here means third-party exceptions leak to users:
119+
120+
```bash
121+
python -c "
122+
from data_designer.errors import DataDesignerError
123+
assert issubclass(DataDesignerError, Exception), 'DataDesignerError must be an Exception'
124+
print('OK: error hierarchy intact')
125+
" 2>&1 || echo "WARN: error hierarchy check failed"
126+
```
127+
128+
#### 4b. Input validation checks (creative - vary each run)
129+
130+
Verify the config builder rejects bad inputs rather than silently
131+
producing corrupt configs. **Design your own invalid inputs each run**
132+
to maximize coverage over time.
133+
134+
Examples of things to test (pick 2-3 per run, and invent new ones):
135+
- Invalid `column_type` string (should raise)
136+
- `column_type='sampler'` without `sampler_type` (should raise)
137+
- Empty builder `.build()` (should handle gracefully)
138+
- Duplicate column names (should raise or deduplicate clearly)
139+
- Invalid sampler params (e.g., `gaussian` with negative `std`, `category`
140+
with empty `values` list)
141+
- Column names with special characters or very long strings
142+
- Recently changed validators (check `git log --oneline -10 -- packages/*/src/data_designer/config/`)
143+
144+
**API reference:**
145+
146+
```python
147+
from data_designer.config.config_builder import DataDesignerConfigBuilder
148+
149+
# Test that invalid input is rejected (not silently accepted)
150+
try:
151+
    DataDesignerConfigBuilder().add_column(
152+
        name='x', column_type='nonexistent_type'
153+
    ).build()
154+
    print('FAIL: invalid column type was silently accepted')
155+
except Exception as e:
156+
    print(f'OK: invalid column type rejected ({type(e).__name__})')
157+
```
158+
159+
The pattern: try something that should fail, print FAIL if it succeeds
160+
silently, print OK if it raises. A FAIL means a validation regression
161+
that could lead to silent data corruption.
162+
163+
Report what you tested and why. Any FAIL is a critical finding.
164+
165+
### 5. TODO/FIXME/HACK aging
110166

111167
Inventory markers with their git blame age:
112168

@@ -155,6 +211,14 @@ Write the report to `/tmp/audit-{{suite}}.md`:
155211

156212
**Coverage:** ~X% of public functions fully annotated (previous: Y%)
157213

214+
### Executable quality checks
215+
216+
| Check | Type | Status | Detail |
217+
|-------|------|--------|--------|
218+
| Error hierarchy | fixed | OK/FAIL | DataDesignerError is properly structured |
219+
| (describe input tested) | creative | OK/FAIL | (what was tested and why) |
220+
| ... | creative | ... | ... |
221+
158222
### TODO/FIXME/HACK inventory
159223

160224
| File | Line | Marker | Age (days) | Commit |
@@ -168,6 +232,7 @@ Write the report to `/tmp/audit-{{suite}}.md`:
168232
- N complexity hotspots (M trending up)
169233
- N exception hygiene issues (M new)
170234
- Type coverage: X% (delta: +/-N% from last run)
235+
- Executable checks: N of M passed (any FAIL is critical)
171236
- N aging TODO/FIXME markers (M new)
172237
```
173238
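
The "try something that should fail" pattern described in section 4b of the code-quality recipe can be factored into a small helper so each creative check is one line. The sketch below is illustrative only: `expect_rejection` is a hypothetical helper, and `parse_std` is a stand-in validator rather than any DataDesigner API.

```python
from typing import Callable


def expect_rejection(label: str, attempt: Callable[[], object]) -> str:
    """Run attempt(); report OK if it raises, FAIL if it returns normally."""
    try:
        attempt()
    except Exception as exc:
        return f"OK: {label} rejected ({type(exc).__name__})"
    return f"FAIL: {label} was silently accepted"


# Hypothetical stand-in validator, NOT part of DataDesigner.
def parse_std(value: float) -> float:
    if value < 0:
        raise ValueError("std must be non-negative")
    return value


# A rejected input yields an OK line.
print(expect_rejection("negative std", lambda: parse_std(-1.0)))
# An accepted input yields a FAIL line; shown here only to illustrate the format.
print(expect_rejection("zero std", lambda: parse_std(0.0)))
```

In a real run the lambda would wrap a builder call such as the one in the recipe's API reference, and each returned line would map onto one row of the report table.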

.agents/recipes/test-health/recipe.md

Lines changed: 145 additions & 1 deletion
@@ -111,7 +111,132 @@ These should use `data_designer.lazy_heavy_imports`. Cross-reference with
111111
the structure recipe's findings if available in runner memory, but don't
112112
skip this check - it directly affects user experience.
113113

114-
### 4. Test isolation verification
114+
### 4. Executable smoke checks
115+
116+
Run lightweight checks that exercise real code paths. These catch silent
117+
data corruption, column registration gaps, and config wiring issues that
118+
static analysis misses. None of these require an LLM provider.
119+
120+
**Important**: the workflow runs `make install-dev` and adds `.venv/bin` to PATH,
121+
so `python` resolves to the project venv with all packages installed.
122+
123+
There are two kinds of smoke checks: **fixed canaries** that must run
124+
identically every time (deterministic regressions), and **creative checks**
125+
where you should vary the inputs each run to maximize coverage over time.
126+
127+
#### Fixed canaries (run these exactly as written)
128+
129+
**4a. Package import verification**
130+
131+
```bash
132+
python -c "
133+
from data_designer.config.config_builder import DataDesignerConfigBuilder
134+
from data_designer.engine.compiler import compile_data_designer_config
135+
from data_designer.interface.data_designer import DataDesigner
136+
print('OK: all packages import')
137+
"
138+
```
139+
140+
If any import fails, this is a critical finding - it means the package
141+
layering is broken.
142+
143+
**4b. Import performance timing**
144+
145+
```bash
146+
python -c "
147+
import time
148+
start = time.monotonic()
149+
from data_designer.interface.data_designer import DataDesigner
150+
elapsed = time.monotonic() - start
151+
budget = 3.0
152+
status = 'OK' if elapsed < budget else 'FAIL'
153+
print(f'{status}: import took {elapsed:.2f}s (budget: {budget:.0f}s)')
154+
"
155+
```
156+
157+
**4c. Column type registry completeness**
158+
159+
```bash
160+
python -c "
161+
from data_designer.config.column_types import (
162+
    DataDesignerColumnType,
163+
    get_column_config_cls_from_type,
164+
)
165+
166+
missing = []
167+
for ct in DataDesignerColumnType:
168+
    try:
169+
        cls = get_column_config_cls_from_type(ct)
170+
        if cls is None:
171+
            missing.append(ct.value)
172+
    except Exception as e:
173+
        missing.append(f'{ct.value} ({e})')
174+
175+
if missing:
176+
    for m in missing:
177+
        print(f'FAIL: {m}')
178+
else:
179+
    print(f'OK: all {len(list(DataDesignerColumnType))} column types resolve to config classes')
180+
" 2>&1 || echo "WARN: registry check could not run"
181+
```
182+
183+
#### Creative checks (vary these each run)
184+
185+
For each run, **design your own** config build and validation checks. The
186+
goal is to exercise different code paths over time rather than testing the
187+
same config every day.
188+
189+
**What to vary:**
190+
- **Sampler types**: pick a different mix each run. Available sampler types:
191+
`uuid`, `category`, `subcategory`, `uniform`, `gaussian`, `bernoulli`,
192+
`bernoulli_mixture`, `binomial`, `poisson`, `scipy`, `person_from_faker`,
193+
`datetime`, `timedelta`. Try 2-5 columns per config.
194+
- **Column count**: sometimes build a single-column config, sometimes 8+
195+
- **Edge cases**: empty params where defaults should apply, extreme param
196+
values (e.g., `gaussian` with `std=0`), columns with constraints
197+
- **Recently changed code**: check `git log --oneline -20 -- packages/` for
198+
recently modified column types or sampler params, and prioritize testing
199+
those
200+
201+
**What to always check:**
202+
1. Config build round-trip: column count and names survive `.build()`
203+
2. Validation: `DataDesigner.validate(builder)` succeeds for valid configs
204+
3. Rejection: invalid inputs raise, not silently produce bad configs
205+
206+
**API reference** for writing checks:
207+
208+
```python
209+
from data_designer.config.config_builder import DataDesignerConfigBuilder
210+
from data_designer.interface.data_designer import DataDesigner
211+
import tempfile
212+
213+
# Build a config - use keyword args: name, column_type, sampler_type, params
214+
builder = (
215+
    DataDesignerConfigBuilder()
216+
    .add_column(name='id', column_type='sampler', sampler_type='uuid')
217+
    .add_column(name='cat', column_type='sampler', sampler_type='category',
218+
                params={'values': ['A', 'B', 'C']})
219+
)
220+
config = builder.build()
221+
222+
# Verify columns survived the build
223+
assert len(config.columns) >= 2
224+
names = {c.name for c in config.columns}
225+
assert 'id' in names and 'cat' in names
226+
227+
# Validate through the full stack (no LLM needed for sampler-only)
228+
dd = DataDesigner(artifact_path=tempfile.mkdtemp(), model_providers=[])
229+
dd.validate(builder)
230+
```
231+
232+
Run at least 2 creative checks per audit. Document what you chose and why
233+
in the report (e.g., "tested poisson+datetime combo because poisson params
234+
were modified in commit abc1234").
235+
236+
**Report smoke check results in a separate table.** If any check fails,
237+
that is a higher-priority finding than static analysis results.
238+
239+
### 5. Test isolation verification
115240

116241
The CI runs three separate test jobs: config-only, engine+config, and
117242
full stack. Check that test files respect these boundaries:
@@ -158,6 +283,24 @@ Write the report to `/tmp/audit-{{suite}}.md`:
158283
| test_import_perf.py exists | yes/no |
159284
| Heavy top-level imports | N found |
160285

286+
### Executable smoke checks
287+
288+
**Fixed canaries:**
289+
290+
| Check | Status | Detail |
291+
|-------|--------|--------|
292+
| Package imports | OK/FAIL | All three packages import cleanly |
293+
| Import timing | OK/FAIL | X.XXs (budget: 3s) |
294+
| Registry completeness | OK/FAIL/WARN | Column types resolve to config classes |
295+
296+
**Creative checks** (describe what you tested and why):
297+
298+
| Check | Sampler types used | Status | Detail |
299+
|-------|--------------------|--------|--------|
300+
| Config build #1 | e.g. uuid+poisson+datetime | OK/FAIL | ... |
301+
| Validate #1 | ... | OK/FAIL | ... |
302+
| ... | ... | ... | ... |
303+
161304
### Test isolation
162305

163306
| Test file | Violation |
@@ -169,6 +312,7 @@ Write the report to `/tmp/audit-{{suite}}.md`:
169312
- N source files without tests (M new since last run)
170313
- N hollow tests detected (high confidence only)
171314
- Import perf: N heavy top-level imports
315+
- Smoke checks: N passed, M failed (list any FAILs - these are critical)
172316
- N test isolation violations
173317
```
174318
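
Because every smoke check in the test-health recipe prints a line starting with `OK:`, `FAIL:`, or `WARN:`, the summary line for the report can be derived mechanically from captured output. The following is a minimal sketch under that assumption; `tally_statuses` and `summary_line` are hypothetical helpers, not part of the commit.

```python
from collections import Counter


def tally_statuses(output: str) -> Counter:
    """Count lines by their leading status token (OK, FAIL, or WARN)."""
    counts: Counter = Counter()
    for line in output.splitlines():
        status, sep, _ = line.partition(":")
        if sep and status.strip() in {"OK", "FAIL", "WARN"}:
            counts[status.strip()] += 1
    return counts


def summary_line(counts: Counter) -> str:
    """Format the tally in the shape the report summary uses."""
    return f"Smoke checks: {counts['OK']} passed, {counts['FAIL']} failed"
```

Lines without a recognized status prefix are ignored, so ordinary log output mixed into the capture does not skew the counts. WARN lines are tallied separately and would need to be reported alongside the summary.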

.github/workflows/agentic-ci-daily.yml

Lines changed: 20 additions & 23 deletions
@@ -112,13 +112,19 @@ jobs:
112112
}
113113
EOF
114114
fi
115-
if [ ! -f .agentic-ci-state/audit-log.md ]; then
116-
echo "# Audit Log: ${SUITE}" > .agentic-ci-state/audit-log.md
117-
echo "" >> .agentic-ci-state/audit-log.md
118-
fi
119115
echo "Runner memory state:"
120116
cat .agentic-ci-state/runner-state.json
121117
118+
- name: Install dev environment
119+
run: |
120+
make install-dev
121+
echo "${{ github.workspace }}/.venv/bin" >> "$GITHUB_PATH"
122+
.venv/bin/python -c "
123+
from data_designer.config._version import __version__ as cv
124+
from data_designer.engine._version import __version__ as ev
125+
print(f' config: {cv} engine: {ev}')
126+
" 2>/dev/null || echo " (version check skipped)"
127+
122128
- name: Pre-flight checks
123129
env:
124130
ANTHROPIC_BASE_URL: ${{ secrets.AGENTIC_CI_API_BASE_URL }}
@@ -183,33 +189,24 @@ jobs:
183189
continue-on-error: true
184190

185191
- name: Update runner memory
186-
if: always()
192+
if: success()
187193
env:
188194
SUITE: ${{ matrix.suite }}
189195
run: |
190-
# Update last_run timestamp in state
191-
if command -v python3 &> /dev/null; then
192-
python3 -c "
193-
import json, datetime
194-
with open('.agentic-ci-state/runner-state.json') as f:
195-
state = json.load(f)
196+
# Validate the agent didn't corrupt the state file
197+
python3 -c "
198+
import json, datetime, sys
199+
try:
200+
    with open('.agentic-ci-state/runner-state.json') as f:
201+
        state = json.load(f)
202+
except (json.JSONDecodeError, FileNotFoundError) as e:
203+
    print(f'::warning::runner-state.json is invalid ({e}), resetting')
204+
    state = {'suite': '${SUITE}', 'known_issues': [], 'baselines': {}}
196205
state['last_run'] = datetime.datetime.utcnow().isoformat() + 'Z'
197206
state['suite'] = '${SUITE}'
198207
with open('.agentic-ci-state/runner-state.json', 'w') as f:
199208
    json.dump(state, f, indent=2)
200209
"
201-
else
202-
echo "python3 not available, skipping state update"
203-
fi
204-
205-
# Append to audit log
206-
echo "" >> .agentic-ci-state/audit-log.md
207-
echo "## $(date -u +%Y-%m-%d)" >> .agentic-ci-state/audit-log.md
208-
if [ -s "/tmp/audit-${SUITE}.md" ]; then
209-
echo "Findings reported." >> .agentic-ci-state/audit-log.md
210-
else
211-
echo "No findings." >> .agentic-ci-state/audit-log.md
212-
fi
213210
214211
- name: Write job summary
215212
if: always()
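
The "Update runner memory" step in this workflow loads the state file, resets it if it cannot be parsed, and stamps `last_run`. The same logic could be written as a standalone module, sketched below. Two things here go beyond what the workflow does and are assumptions: the per-field type repair, and the use of a timezone-aware timestamp (`datetime.utcnow()` is deprecated as of Python 3.12). The function names are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def load_or_reset_state(path: Path, suite: str) -> dict:
    """Load runner-state.json, falling back to an empty state if unusable."""
    try:
        state = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(state, dict):
            raise ValueError("top-level JSON value is not an object")
    except (OSError, ValueError) as exc:  # JSONDecodeError is a ValueError
        print(f"::warning::runner-state.json is invalid ({exc}), resetting")
        state = {}
    # Assumption: also repair fields whose type does not match the schema.
    if not isinstance(state.get("known_issues"), list):
        state["known_issues"] = []
    if not isinstance(state.get("baselines"), dict):
        state["baselines"] = {}
    state["suite"] = suite
    return state


def stamp_last_run(state: dict, now: Optional[datetime] = None) -> None:
    """Set last_run to a UTC timestamp in the schema's format."""
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    state["last_run"] = moment.strftime("%Y-%m-%dT%H:%M:%SZ")
```

Formatting with `strftime` drops microseconds, which matches the `"2026-04-14T08:00:00Z"` shape shown in the `_runner.md` schema example.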
