Commit b220f36

ci: add daily audit suites with 5 rotating recipes and scheduled workflow (#543)
* ci: add daily audit suites with 5 recipes and scheduled workflow

  Add the daily maintenance infrastructure (Phase 2+3 of the agentic CI plan). A new workflow runs one audit suite per weekday via day-of-week rotation, with runner memory persisted via actions/cache.

  Recipes: docs-and-references (Mon), dependencies (Tue), structure (Wed), code-quality (Thu), test-health (Fri). Each targets gaps that CI and ruff don't cover: cross-reference validation, transitive dep analysis, lazy import compliance, complexity trends, and test-to-source mapping.

  Reports go to the Actions step summary. Code changes use /create-pr.

* ci: add executable smoke checks and harden runner memory

  Add executable smoke checks to test-health and code-quality recipes that exercise real code paths (config build, validate, import timing, registry completeness, error hierarchy, input rejection) without needing an LLM provider. Checks are split into fixed canaries (same every run) and creative checks (agent varies inputs each run).

  Harden runner memory: define JSON schema in _runner.md with TTL and size rules, validate state file after agent runs, only update last_run on success, drop unused audit-log.md.

  Add make install-dev workflow step so recipes can run Python against the installed packages.

* ci: fix codex review findings - test paths, provider check, step gating

  Fix issues found by Codex review:
  - Fix test paths: tests/ does not exist at repo root, use packages/*/tests/ and packages/data-designer/tests/test_import_perf.py
  - Remove DataDesigner(model_providers=[]) from smoke checks - raises NoModelProvidersError; keep config-layer checks only
  - Fix audit step gating: remove continue-on-error, use step outcome to gate runner memory update (|| true + continue-on-error made the step always "succeed", defeating the success() condition)

* ci: fix review findings - heredoc, state validation, lazy import wording

  Fix heredoc with indented EOF terminator that never terminates - replace with printf.

  Run state validation on all outcomes (not just success) so corrupted state from a failed audit is caught before caching. Only stamp last_run when audit succeeds.

  Align test-health lazy import section with its own Constraints (report count only, don't duplicate structure audit).

  Also fixes datetime.utcnow() deprecation and shell variable injection in Python string by using os.environ instead.
1 parent a965bc1 commit b220f36

9 files changed

Lines changed: 1378 additions & 13 deletions

.agents/recipes/_runner.md

Lines changed: 45 additions & 2 deletions
@@ -9,6 +9,48 @@ DataDesigner is an NVIDIA NeMo framework for creating synthetic datasets.
 See AGENTS.md at the repo root for an overview and links to detailed docs
 (architecture, style guide, development workflow).

+## Environment
+
+The workflow runs `make install-dev` and adds `.venv/bin` to PATH before
+your recipe executes. All three DataDesigner packages are installed in
+development mode. You can use `python` directly to run code - it resolves
+to the project venv.
+
+## Runner memory
+
+Each recipe has access to `{{memory_path}}/runner-state.json`. The workflow
+loads it from cache before your run and saves it after. Use this schema:
+
+```json
+{
+  "suite": "test-health",
+  "last_run": "2026-04-14T08:00:00Z",
+  "known_issues": [
+    {
+      "id": "short-hash-of-finding",
+      "category": "hollow-test",
+      "summary": "test_foo only asserts not None",
+      "first_seen": "2026-04-07",
+      "last_seen": "2026-04-14"
+    }
+  ],
+  "baselines": {
+    "test_to_source_ratio": "45/52",
+    "type_coverage_pct": 78,
+    "import_time_s": 1.8
+  }
+}
+```
+
+Rules:
+- **`known_issues`**: skip re-reporting issues already here. Update
+  `last_seen` if the issue is still present. Remove entries whose
+  `last_seen` is more than 4 weeks old - the issue was likely fixed.
+- **`baselines`**: store current metric values. Compare against these on
+  the next run to detect trends (improving or regressing).
+- **Keep it small.** The whole file should stay under 50KB. If
+  `known_issues` grows past 100 entries, prune the oldest resolved ones.
+
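The pruning rules above can be sketched in a few lines of Python. This is only an illustration, not code from the commit: `prune_known_issues`, `MAX_ISSUES`, and `STALE_AFTER` are hypothetical names, and treating "oldest" as "least recently seen" is an assumption about how the cap should be applied.

```python
import json
from datetime import date, timedelta

MAX_ISSUES = 100                  # entry cap from the rules above
STALE_AFTER = timedelta(weeks=4)  # drop entries not seen for longer than this


def prune_known_issues(state: dict, today: date) -> dict:
    """Return a copy of `state` with stale and excess `known_issues` removed."""
    cutoff = today - STALE_AFTER
    fresh = [
        issue
        for issue in state.get("known_issues", [])
        if date.fromisoformat(issue["last_seen"]) >= cutoff
    ]
    # Assumption: "oldest" means least recently seen, so sort newest first
    # and keep at most MAX_ISSUES entries.
    fresh.sort(key=lambda issue: issue["last_seen"], reverse=True)
    return {**state, "known_issues": fresh[:MAX_ISSUES]}


state = {
    "suite": "test-health",
    "known_issues": [
        {"id": "a1", "last_seen": "2026-04-14"},
        {"id": "b2", "last_seen": "2026-03-01"},  # older than 4 weeks
    ],
}
pruned = prune_known_issues(state, today=date(2026, 4, 14))
print(json.dumps(pruned["known_issues"]))
```

The function returns a new dict rather than mutating its input, so a caller can compare the state before and after pruning when writing the report.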
 ## Constraints

 - **No interactive prompts.** If something is ambiguous, make a reasonable choice
@@ -34,5 +76,6 @@ Write all output to a temp file (e.g., `/tmp/recipe-output.md`). The workflow
 will handle posting it. Do not post directly to GitHub - the workflow controls
 output routing.

-If your recipe produces code changes, make them on the current branch. The
-workflow will open a PR from the diff.
+If your recipe produces code changes, commit them on a new branch and use
+`/create-pr` to open a pull request. The branch name should follow the
+pattern `agentic-ci/chore/{suite}-YYYYMMDD`.
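As a rough illustration of the weekday rotation named in the commit message and the branch pattern above, the mapping could look like the following. The helper names are hypothetical, and both the weekend behaviour (no suite) and the choice of calendar date are assumptions; the actual workflow logic is not shown in this diff.

```python
from datetime import date
from typing import Optional

# Rotation as listed in the commit message (date.weekday(): Monday == 0).
SUITE_BY_WEEKDAY = {
    0: "docs-and-references",
    1: "dependencies",
    2: "structure",
    3: "code-quality",
    4: "test-health",
}


def suite_for(day: date) -> Optional[str]:
    """Return the suite scheduled for `day`, or None on weekends (assumed)."""
    return SUITE_BY_WEEKDAY.get(day.weekday())


def branch_name(suite: str, day: date) -> str:
    """Build a branch name following agentic-ci/chore/{suite}-YYYYMMDD."""
    return f"agentic-ci/chore/{suite}-{day:%Y%m%d}"


day = date(2026, 4, 16)  # a Thursday
suite = suite_for(day)
if suite is not None:
    print(branch_name(suite, day))
```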
Lines changed: 251 additions & 0 deletions
@@ -0,0 +1,251 @@
+---
+name: code-quality
+description: Audit code quality gaps not covered by ruff - complexity trends, exception hygiene, type coverage, TODO aging
+trigger: schedule
+tool: claude-code
+timeout_minutes: 20
+max_turns: 30
+permissions:
+  contents: write
+---
+
+# Code Quality Audit
+
+Catch quality drift that CI doesn't cover. Write findings to
+`/tmp/audit-{{suite}}.md`.
+
+**What CI already enforces** (do NOT duplicate):
+- Ruff rules: W, F, I, ICN, PIE, TID, UP006, UP007, UP045
+- Ruff format with 120-char line length, double quotes
+- Test coverage >= 90% aggregate
+
+**What CI does NOT enforce** (this recipe's focus):
+- C901 cyclomatic complexity (not in ruff select)
+- ANN type annotation completeness (not in ruff select)
+- BLE001 bare except handling (not in ruff select)
+- Google-style docstring format (D* rules not enabled)
+- Complexity growth trends over time
+- TODO/FIXME aging
+
+## Runner memory
+
+Read `{{memory_path}}/runner-state.json` for baselines from previous runs
+(complexity scores, type coverage, TODO inventory). After the audit, update
+`baselines` with current values and `known_issues` with new findings. Skip
+re-reporting known issues. Flag metrics that are trending in the wrong
+direction compared to the previous baseline.
+
+## Instructions
+
+### 1. Complexity hotspots
+
+Try ruff C901 first (may not be in the config but can be invoked directly):
+```bash
+ruff check packages/*/src/ --select C901 --output-format json 2>/dev/null || true
+```
+
+If ruff is not available or C901 produces no output, manually inspect the
+largest source files for functions with:
+- Deep nesting (3+ levels of if/for/try)
+- Many branches (>5 if/elif chains)
+- Long method bodies (>60 lines)
+
+**Track trends**: compare against the previous run's baseline in runner
+memory. A function at complexity 12 that was 8 last week is more concerning
+than one that has been at 15 for months. Report the delta.
+
+Focus on `packages/data-designer-engine/src/` (core execution) and
+`packages/data-designer/src/data_designer/interface/` (public API) where
+complexity tends to accumulate.
+
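The trend comparison in step 1 could be sketched as below. This is a hedged illustration: it assumes ruff's JSON diagnostics carry `code`, `filename`, and `message` fields and that the C901 message has the shape "`name` is too complex (12 > 10)"; the `file::function` key format and the helper names are hypothetical. Ruff only reports functions above its configured threshold, so anything below it will not appear in either map.

```python
import json
import re

# Assumed C901 message shape: "`name` is too complex (12 > 10)".
C901_RE = re.compile(r"`(?P<name>[^`]+)` is too complex \((?P<score>\d+) > \d+\)")


def complexity_scores(ruff_json: str) -> dict:
    """Map "file::function" to the complexity reported by ruff."""
    scores = {}
    for diag in json.loads(ruff_json or "[]"):
        match = C901_RE.search(diag.get("message", ""))
        if diag.get("code") == "C901" and match:
            scores[f"{diag['filename']}::{match['name']}"] = int(match["score"])
    return scores


def complexity_deltas(current: dict, baseline: dict) -> dict:
    """Return score changes for functions present in both runs."""
    return {
        key: score - baseline[key]
        for key, score in current.items()
        if key in baseline and score != baseline[key]
    }


# Hypothetical ruff output and a baseline stored by the previous run.
sample = json.dumps([
    {"code": "C901", "filename": "engine/run.py",
     "message": "`execute` is too complex (18 > 10)"},
    {"code": "C901", "filename": "engine/io.py",
     "message": "`load` is too complex (15 > 10)"},
])
baseline = {"engine/run.py::execute": 15, "engine/io.py::load": 15}

current = complexity_scores(sample)
deltas = complexity_deltas(current, baseline)
for key, delta in deltas.items():
    print(f"{key}: C901 {current[key]} ({delta:+d} since last run)")
```

Only `execute` is reported here, since `load` has not moved from its baseline, which matches the recipe's emphasis on growth over absolute values.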
+### 2. Exception hygiene
+
+Check for patterns that violate the project's "errors normalize at
+boundaries" principle (AGENTS.md):
+
+```bash
+# Bare except clauses (should use specific exception types)
+grep -rn "except:" packages/*/src/ --include='*.py' | grep -v "# noqa"
+
+# Swallowed exceptions (except + pass/continue with no logging)
+grep -rn -A1 "except" packages/*/src/ --include='*.py' | grep -B1 "pass$\|continue$"
+```
+
+The key principle: internal code should NOT leak raw third-party exceptions.
+Module boundary functions (public API, entry points) should wrap external
+exceptions in `data_designer` error types. Check:
+- Functions in `packages/data-designer/src/` that catch third-party exceptions
+  (httpx, pydantic, etc.) - are they re-raised as `data_designer` errors?
+- Plugin loading code (`data_designer/plugins/`) - bare `except:` has been
+  found here before
+
+### 3. Type annotation coverage
+
+The repo requires typed code (AGENTS.md: "all functions, methods, and class
+attributes require type annotations") but has no ANN ruff rules enforcing
+this. Check for gaps:
+
+```bash
+# Public functions missing return type annotations
+# (-e keeps grep from parsing the leading "-" of "-> " as an option)
+grep -rn "def " packages/*/src/ --include='*.py' \
+  | grep -v -e "-> " \
+  | grep -v "def _" \
+  | grep -v "__init__\|__repr__\|__str__\|__eq__\|__hash__" \
+  | grep -v "test_"
+```
+
+Also check for `Any` usage that could be more specific:
+```bash
+grep -rn ": Any\| -> Any" packages/*/src/ --include='*.py'
+```
+
+**Track coverage percentage**: count public functions with full annotations
+vs total public functions. Compare against previous baseline.
+
+Known gap: `packages/data-designer-config/src/data_designer/custom_column.py`
+and `packages/data-designer-config/src/data_designer/analysis/` have been
+flagged before.
+
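For the coverage percentage in step 3, an `ast`-based count is one possible alternative to the grep pipeline, since it also handles signatures that span several lines. The sketch below is an assumption about how "fully annotated" might be defined (a return annotation plus an annotation on every parameter except `self`/`cls`); `annotation_coverage` is a hypothetical helper, not part of the repo.

```python
import ast


def annotation_coverage(source: str) -> tuple:
    """Return (fully_annotated, total) for public functions in `source`."""
    annotated = total = 0
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if node.name.startswith("_"):
            continue  # private and dunder functions are out of scope here
        total += 1
        args = node.args
        params = args.posonlyargs + args.args + args.kwonlyargs
        if args.vararg:
            params.append(args.vararg)
        if args.kwarg:
            params.append(args.kwarg)
        # self/cls conventionally carry no annotation, so ignore them.
        params = [p for p in params if p.arg not in ("self", "cls")]
        if node.returns is not None and all(p.annotation is not None for p in params):
            annotated += 1
    return annotated, total


SAMPLE = '''
def typed(x: int, *, flag: bool = False) -> str:
    return str(x)

def untyped(x):
    return x

class Thing:
    def method(self, y: int) -> int:
        return y

    def _helper(self, z):
        return z
'''

annotated, total = annotation_coverage(SAMPLE)
print(f"{annotated}/{total} public functions fully annotated")
```

Running this over each file under `packages/*/src/` and summing the two counts would give the ratio to store in `baselines`.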
+### 4. Executable quality checks
+
+Run a few checks that exercise real code paths to catch regressions that
+static analysis misses. The workflow runs `make install-dev` and puts
+`.venv/bin` on PATH, so `python` resolves to the project venv.
+
+#### 4a. Error type hierarchy (fixed - run as written)
+
+Verify that the project's error types are importable and properly
+structured. Silent breakage here means third-party exceptions leak to users:
+
+```bash
+python -c "
+from data_designer.errors import DataDesignerError
+assert issubclass(DataDesignerError, Exception), 'DataDesignerError must be an Exception'
+print('OK: error hierarchy intact')
+" 2>&1 || echo "WARN: error hierarchy check failed"
+```
+
+#### 4b. Input validation checks (creative - vary each run)
+
+Verify the config builder rejects bad inputs rather than silently
+producing corrupt configs. **Design your own invalid inputs each run**
+to maximize coverage over time.
+
+Examples of things to test (pick 2-3 per run, and invent new ones):
+- Invalid `column_type` string (should raise)
+- `column_type='sampler'` without `sampler_type` (should raise)
+- Empty builder `.build()` (should handle gracefully)
+- Duplicate column names (should raise or deduplicate clearly)
+- Invalid sampler params (e.g., `gaussian` with negative `std`, `category`
+  with empty `values` list)
+- Column names with special characters or very long strings
+- Recently changed validators (check `git log --oneline -10 -- packages/*/src/data_designer/config/`)
+
+**API reference:**
+
+```python
+from data_designer.config.config_builder import DataDesignerConfigBuilder
+
+# Test that invalid input is rejected (not silently accepted)
+try:
+    DataDesignerConfigBuilder().add_column(
+        name='x', column_type='nonexistent_type'
+    ).build()
+    print('FAIL: invalid column type was silently accepted')
+except Exception as e:
+    print(f'OK: invalid column type rejected ({type(e).__name__})')
+```
+
+The pattern: try something that should fail, print FAIL if it succeeds
+silently, print OK if it raises. A FAIL means a validation regression
+that could lead to silent data corruption.
+
+Report what you tested and why. Any FAIL is a critical finding.
+
+### 5. TODO/FIXME/HACK aging
+
+Inventory markers with their git blame age:
+
+```bash
+grep -rn "TODO\|FIXME\|HACK" packages/*/src/ --include='*.py'
+```
+
+For each marker, get the commit date:
+```bash
+# Example: get blame date for a specific line
+git blame -L 42,42 --date=short path/to/file.py
+```
+
+**Only flag items older than 30 days.** Recent TODOs are part of normal
+development flow. For old items, include:
+- File and line number
+- The marker text
+- Age in days
+- The commit that introduced it (short SHA)
+
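Computing the age and short SHA for step 5 might look like the sketch below. Note that it swaps in `git blame --porcelain` output rather than the `--date=short` form shown above, because the porcelain header exposes an `author-time` epoch that is easier to parse; using author time rather than committer time is an assumption, and the helper names and sample data are hypothetical.

```python
from datetime import datetime, timezone

MAX_AGE_DAYS = 30  # threshold from the recipe


def parse_blame_porcelain(porcelain: str) -> tuple:
    """Extract (short_sha, author_datetime) from `git blame --porcelain` output."""
    lines = porcelain.splitlines()
    short_sha = lines[0].split()[0][:7]  # first header line starts with the full SHA
    time_line = next(line for line in lines if line.startswith("author-time "))
    committed = datetime.fromtimestamp(int(time_line.split()[1]), tz=timezone.utc)
    return short_sha, committed


def age_in_days(committed: datetime, now: datetime) -> int:
    """Whole days elapsed between the blamed commit and `now`."""
    return (now - committed).days


# Hypothetical porcelain output for one blamed line.
authored = int(datetime(2026, 3, 1, tzinfo=timezone.utc).timestamp())
sample = "\n".join([
    "abc1234" + "0" * 33 + " 42 42 1",
    "author Jane Doe",
    f"author-time {authored}",
    "author-tz +0000",
    "summary add retry logic",
    "filename path/to/file.py",
    "\t# TODO: fix this",
])

sha, committed = parse_blame_porcelain(sample)
age = age_in_days(committed, now=datetime(2026, 4, 15, tzinfo=timezone.utc))
if age > MAX_AGE_DAYS:
    print(f"path/to/file.py:42 TODO: fix this | {age} days | {sha}")
```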
+## Output format
+
+Write the report to `/tmp/audit-{{suite}}.md`:
+
+```markdown
+<!-- agentic-ci-daily-{{suite}} -->
+## Code Quality Audit - {{date}}
+
+### Complexity hotspots
+
+| File | Function | Complexity | Trend |
+|------|----------|-----------|-------|
+| ... | ... | C901: 18 | +3 since last run |
+
+### Exception hygiene
+
+| File | Line | Pattern | Recommendation |
+|------|------|---------|----------------|
+| plugins/plugin.py | 99 | bare except | Catch ImportError/ModuleNotFoundError |
+
+### Type annotation coverage
+
+| File | Function | Issue |
+|------|----------|-------|
+| custom_column.py | generate | Missing return type |
+
+**Coverage:** ~X% of public functions fully annotated (previous: Y%)
+
+### Executable quality checks
+
+| Check | Type | Status | Detail |
+|-------|------|--------|--------|
+| Error hierarchy | fixed | OK/FAIL | DataDesignerError is properly structured |
+| (describe input tested) | creative | OK/FAIL | (what was tested and why) |
+| ... | creative | ... | ... |
+
+### TODO/FIXME/HACK inventory
+
+| File | Line | Marker | Age (days) | Commit |
+|------|------|--------|-----------|--------|
+| ... | ... | TODO: fix this | 45 | abc1234 |
+
+**Aging items:** N markers older than 30 days (M new since last run)
+
+### Summary
+
+- N complexity hotspots (M trending up)
+- N exception hygiene issues (M new)
+- Type coverage: X% (delta: +/-N% from last run)
+- Executable checks: N/2 passed (any FAIL is critical)
+- N aging TODO/FIXME markers (M new)
+```
+
+If no findings in any category, write `NO_FINDINGS` on the first line instead.
+
+## Constraints
+
+- Do not modify any files. This is a read-only audit.
+- Do not flag test files for type coverage or exception hygiene. Tests have
+  different standards.
+- Do not duplicate ruff checks (W, F, I, ICN, PIE, TID, UP*). Those are
+  already enforced in CI.
+- For complexity, focus on growth trends rather than absolute values.
+- For TODOs, only flag items older than 30 days.
+- For type annotations, focus on public API surface. Internal helpers with
+  obvious types from context are lower priority.