fix(ci): improve Agentic CI daily audit reliability by andreatgretel · Pull Request #632 · NVIDIA-NeMo/DataDesigner

andreatgretel · 2026-05-11T16:05:47Z

Summary

Improves Agentic CI reliability in focused places:

Honors each daily recipe's declared max_turns.
Keeps docs and test-health audits bounded enough to produce a report before
they run out of turns.
Keeps structure and code-quality on truthful 50-turn budgets based on recent
successful run history.
Prevents CI recipes from spawning local subagents that may default to an
inaccessible model.
Lets deterministic structure fixes batch same-category backlog entries while
staying under the existing scope gates.
Raises custom API pre-flight timeouts from 10s to 30s where the agentic
workflows still used the old shorter probe.

Changes

Changed

Updated .github/workflows/agentic-ci-daily.yml to read max_turns from
recipe frontmatter instead of always passing 50 to Claude.
Hardened max_turns parsing so inline comments or quoted YAML values do not
break claude --max-turns.
Raised custom API pre-flight curl --max-time from 10s to 30s in daily,
repository triage, and PR review workflows.
Tightened .agents/recipes/docs-and-references/recipe.md so it writes a
partial report early and samples bounded docs/source sets.
Added the same early partial report and turn-budget guard to
.agents/recipes/test-health/recipe.md.
Set structure and code-quality recipe budgets to 50 after recent
successful runs used 34 and 31 turns respectively.
Added a shared runner constraint in .agents/recipes/_runner.md to keep CI
recipes in the main agent session instead of delegated/local agents.
Generalized .agents/recipes/_fix-policy.md and
.agents/recipes/_phase-fix.md to allow suite-declared batchable mechanical
fixes.
Opted structure / missing-future into batching in
.agents/recipes/structure/recipe.md, capped batches at 3 files, and
documented batch grouping by package test target.

Why

Prevents the workflow/recipe mismatch where recipe budgets were ignored by
daily audit execution.
Reduces the chance that docs and test-health audits spend the full run
exploring and leave no useful artifact.
Avoids accidentally breaking structure and code-quality by dropping their
effective budget below recent successful runs.
Avoids local-agent failures where delegated tasks select a default model the
CI key cannot access, then the parent agent keeps running until max turns.
Avoids one-file PRs for purely mechanical same-category structure fixes when
the combined diff still satisfies the localized-fix bar.
Avoids transient agentic workflow failures when the inference endpoint
responds slower than 10 seconds.

Recent failure scan

May 13 daily audit: custom endpoint probe exceeded the old 10s budget, then
passed on retry. Covered by 30s daily pre-flight.
May 11 docs audit: delegated local agents failed auth against a default Haiku
task model, and the main recipe later hit 50 turns with no report. Covered by
no-subagent runner guidance plus docs turn-budget changes.
May 4 docs audit: hit error_max_turns after 50 turns with no report.
Covered by docs turn-budget changes and recipe max_turns enforcement.
Apr 24 test-health audit: hit Reached max turns (30) with no report.
Covered by test-health turn-budget changes.
Apr 21 repository triage: custom endpoint pre-flight failed with HTTP 400
during the old model/config period. Current health-probe covers CLI/model
compatibility; this PR also aligns triage timeout with the health probe.

Claude review

Claude review found no blocking issues. Follow-ups addressed in this PR:

robust max_turns parsing
explicit fix-phase 50-turn rationale
50-turn budgets for structure and code-quality based on run history
explicit 3-file batch cap
documented batch grouping by package test target

Validation

make install-dev
.venv/bin/ruff check --fix .
.venv/bin/ruff format .
Parsed agentic workflow YAML files with PyYAML.
Verified parsed max_turns values for all recipes.
git diff --check
Commit hooks passed: trailing whitespace, EOF, YAML, large file, merge
conflict, mixed line ending. Ruff hooks skipped on the latest commits because
no Python files changed.


github-actions · 2026-05-14T18:34:45Z

Code Review: PR #632 — fix(ci): improve Agentic CI daily audit reliability

Summary

This PR is a focused reliability fix for the Agentic CI daily audit workflow. It does five things:

Honors per-recipe max_turns — the daily workflow now parses max_turns from each recipe's frontmatter instead of hard-coding 50.
Sets recipe budgets to match historical reality — structure and code-quality move 30 → 50 (recent successful runs used 31–34 turns); docs-and-references and test-health add early-partial-report and turn-budget guards instead of bumping the cap.
Forbids subagents in CI recipes — adds a _runner.md rule so delegated/Task/Explore agents (which may default to a model the CI key cannot reach) are not spawned.
Allows category-batched mechanical fixes — _fix-policy.md and _phase-fix.md now permit batching siblings of the same suite/category through one PR; structure/missing-future is opted in with a 3-file cap.
Raises pre-flight curl --max-time 10s → 30s in three workflows.

Scope is correct: only .agents/recipes/*.md and .github/workflows/agentic-ci-*.yml. No code changes.

Findings

Code correctness

max_turns parser is robust enough. The awk + grep pipeline in .github/workflows/agentic-ci-daily.yml:190-194 correctly:
- Stops at the closing --- (section == 2 { exit }), so a max_turns: 999 line in the recipe body is ignored.
- Strips inline YAML comments and quotes via grep -oE '[0-9]+' | head -n1.
- Falls back to 50 when the field is missing (MAX_TURNS=${MAX_TURNS:-50}).
  Edge case worth noting: a recipe with max_turns: 0 does not fall through to the :-50 default, because ${VAR:-X} substitutes only when the variable is empty or unset. 0 is non-empty, so the parsed value is kept and this works correctly. ✅
Fix-phase still uses fixed 50 turns — only the audit phase reads recipe max_turns. The inline comment at agentic-ci-daily.yml:263-264 makes this explicit and the rationale (scope gates already bound fix work) is reasonable.
Batch policy is internally consistent. Each batched finding gets its own hidden marker and its own attempted_fixes entry, so crash recovery via marker grep still reconstructs state correctly. The 3-file cap matches the localized-fix bar.
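The parser behaviors listed above can be checked with a small sketch. It reuses the awk + grep pipeline quoted later in this thread; the function name `parse_max_turns` and the three sample recipe files are hypothetical and exist only for illustration. Whether `claude --max-turns 0` is a useful value is not covered here.

```shell
# Hypothetical helper: wraps the frontmatter pipeline discussed in this review.
parse_max_turns() {
  awk -F': *' '
    /^---$/ { section++; next }                            # count frontmatter delimiters
    section == 1 && $1 == "max_turns" { print $2; exit }   # only inside frontmatter
    section == 2 { exit }                                  # stop at the closing ---
  ' "$1" | grep -oE '[0-9]+' | head -n1 || true            # keep the first run of digits
}

tmp=$(mktemp -d)
# Quoted value with an inline comment, plus a decoy line in the recipe body.
printf '%s\n' '---' 'name: docs' 'max_turns: "40"  # raised after review' '---' 'max_turns: 999' > "$tmp/quoted.md"
# Explicit zero.
printf '%s\n' '---' 'name: zero' 'max_turns: 0' '---' > "$tmp/zero.md"
# No max_turns in frontmatter, only in the body.
printf '%s\n' '---' 'name: none' '---' 'max_turns: 999' > "$tmp/missing.md"

quoted=$(parse_max_turns "$tmp/quoted.md");   quoted=${quoted:-50}
zero=$(parse_max_turns "$tmp/zero.md");       zero=${zero:-50}
missing=$(parse_max_turns "$tmp/missing.md"); missing=${missing:-50}

echo "quoted=$quoted zero=$zero missing=$missing"
rm -rf "$tmp"
```

The run should print `quoted=40 zero=0 missing=50`: the body line is ignored, the quoted value is reduced to its digits, and only the missing field takes the default.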

Project conventions

Follows the existing recipe/policy structure. New rules slot into the same headings (Fix policy contract, Standard fix procedure).
Honors _runner.md's "no workflow modifications" boundary at the agent level — this PR is from a human contributor, not the agent.
Conventional commit style preserved.

Risks / things to watch

Self-imposed turn budgets in docs/test-health are advisory. The "stop after 20 tool calls / 2 new findings" guidance in docs-and-references/recipe.md:46-50 and test-health/recipe.md:39-49 depends on the agent counting and obeying — there is no hard enforcement. If a future model ignores it, the only backstop is the max_turns cap. The early-partial-report write (step 2) is the real reliability win here; that one is concrete and will produce an artifact even on max-turns.
The "no subagents" rule is policy, not enforcement. Same caveat — it relies on the agent reading and obeying _runner.md. Acceptable for now, but if subagent-related failures recur, consider stripping the Task/Explore tools from the CI invocation's allowed-tools list.
Batching introduces a new failure mode. If one file in a 3-file missing-future batch breaks tests, the policy says abandon the whole batch — meaning the other (correct) two are blocked too. This is the right call for atomicity, but it does mean a single noisy file can starve siblings. The top-5 candidate fallback partially mitigates this.
curl --max-time 30 matches the model health probe — good consistency. No retry/backoff added, but a 3× margin is a reasonable single-shot improvement.
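The PR does not add retry or backoff. If one were wanted later, a minimal sketch could look like the following. The `retry` helper, its argument order, and the `flaky` stand-in command are assumptions made for illustration; none of them come from the workflows.

```shell
# Hypothetical helper, not part of the PR: retry a command with linear backoff.
# Usage: retry <max_attempts> <base_delay_seconds> <command> [args...]
retry() {
  retry_max=$1; retry_delay=$2; shift 2
  retry_n=1
  while :; do
    "$@" && return 0                                # success: stop retrying
    [ "$retry_n" -ge "$retry_max" ] && return 1     # out of attempts
    sleep $((retry_delay * retry_n))                # wait longer after each failure
    retry_n=$((retry_n + 1))
  done
}

# Stand-in for a slow endpoint: fails twice, then succeeds.
count_file=$(mktemp)
echo 0 > "$count_file"
flaky() {
  c=$(($(cat "$count_file") + 1))
  echo "$c" > "$count_file"
  [ "$c" -ge 3 ]
}

retry 5 0 flaky
retry_status=$?
attempts_used=$(cat "$count_file")
echo "status=$retry_status attempts=$attempts_used"
rm -f "$count_file"
```

In a workflow step the stand-in would be replaced by the real probe, for example `retry 3 5 curl --silent --fail --max-time 30 "$ENDPOINT_URL"`, where `ENDPOINT_URL` is a placeholder name.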

Test coverage

N/A — these are CI configuration / recipe text changes. The PR description lists manual validation (YAML parse, max_turns value verification across all recipes, ruff). That's appropriate for the change kind.

Performance / security

No performance impact on the product code path.
No new secrets handling. The curl probe still posts only the structural health-check payload; raising the timeout doesn't change exposure.

Suggestions (non-blocking)

Consider hard-failing on missing max_turns rather than defaulting to 50. A typo like max_trns: 30 would silently use 50, masking the recipe's actual intent. Logging the parsed value (which the PR already does on agentic-ci-daily.yml:195) helps catch this in run logs, but MAX_TURNS=${MAX_TURNS:?recipe missing max_turns} would catch it earlier.
Document the 3-file batch cap rationale in _fix-policy.md (it's currently in structure/recipe.md only). Other suites adding batchable categories will need the same number; pulling it into the shared policy avoids drift.
Sample-size guidance in docs recipe is now inconsistent: section 2 says "at most 10 candidate links", section 3 says "at most 3 architecture files", section 4 says "3-5 high-value pages". With the global "stop after 20 tool calls" cap, these can collide. Worth one editorial pass to ensure the per-section numbers sum to something achievable inside the global cap.
The _phase-fix.md change says "you may add sibling entries from the existing fix_backlog after re-verifying each one" — re-verification is also covered in _fix-policy.md step 4.2. The duplication is fine for clarity but worth noting if the policies ever drift.
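The first suggestion above, failing hard on a missing max_turns, relies on the `${VAR:?message}` expansion. A minimal sketch of how it behaves follows; the wrapper name `require_max_turns` is hypothetical, and the subshell is there only so the demonstration itself keeps running.

```shell
# Hypothetical wrapper: succeeds and prints the value when it is non-empty,
# fails when the value is empty (as it would be for a typo such as max_trns).
require_max_turns() {
  (
    MAX_TURNS=$1
    : "${MAX_TURNS:?recipe missing max_turns}"   # aborts this subshell if empty or unset
    echo "$MAX_TURNS"
  ) 2>/dev/null
}

if value=$(require_max_turns "30"); then
  echo "parsed: $value"
fi

if ! require_max_turns "" >/dev/null; then
  echo "empty value rejected"
fi
```

Unlike `${MAX_TURNS:-50}`, this form stops the step instead of substituting a default, which trades resilience for earlier detection of a misspelled key.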

Verdict

Approve. Targeted reliability fixes with a clear paper trail (the "Recent failure scan" section maps each change to a specific historical failure). The risks are well-contained: scope is limited to CI configuration, no product code is touched, and the failure modes added by batching are explicitly bounded (3-file cap, atomic abandon). The advisory-only nature of the in-recipe turn budgets is the weakest part — but pairing them with the early-partial-report pattern means even a non-compliant agent run produces a usable artifact, which is the actual reliability goal.

greptile-apps · 2026-05-14T18:36:34Z

Greptile Summary

This PR improves reliability of the Agentic CI daily audit workflows by fixing a recipe/workflow max_turns mismatch, hardening turn budgets for docs and test-health suites, and aligning curl pre-flight timeouts across all three workflow files.

The daily workflow now parses each recipe's max_turns frontmatter field (with a robust awk + grep pipeline) and passes it to claude --max-turns, instead of hardcoding 50 for every suite; the fix phase intentionally keeps its own hardcoded 50-turn budget.
docs-and-references and test-health recipes gain explicit early-report and sampling stop-conditions (20 tool calls or 2 new findings) to ensure a usable partial artifact is always produced; structure and code-quality have their frontmatter budgets raised to 50 to reflect recent successful run history.
_runner.md gains a "No subagents" constraint to prevent CI failures when the default delegated model is inaccessible; _fix-policy.md and structure/recipe.md extend the fix procedure to support batching up to 3 same-category missing-future findings per PR, with one marker and one attempted_fixes entry per finding.

Confidence Score: 5/5

All changes are CI workflow and agent recipe configuration updates with no production code impact; the max_turns parsing is robust and defaults safely to 50.

The max_turns awk pipeline correctly handles inline comments, quoted values, and missing keys with a safe default. Batch fix logic in _fix-policy.md is internally consistent with crash recovery, attempted_fixes recording, and the structure recipe. The fix phase intentionally retains its own hardcoded 50-turn budget and is documented. No logic errors or correctness issues found across any of the 10 changed files.

No files require special attention.

Important Files Changed

Filename	Overview
.github/workflows/agentic-ci-daily.yml	Adds per-recipe max_turns parsing via awk + grep pipeline with inline-comment and quoted-value safety; fix phase correctly retains hardcoded 50; curl pre-flight raised to 30s.
.github/workflows/agentic-ci-issue-triage.yml	Single-line change: curl pre-flight timeout raised from 10s to 30s, consistent with daily and PR-review workflows.
.github/workflows/agentic-ci-pr-review.yml	Single-line change: curl pre-flight timeout raised from 10s to 30s, aligned with other workflow files.
.agents/recipes/_fix-policy.md	Adds batching support: step 4.1 collects siblings, step 4.2 re-verifies and removes stale primary/siblings, crash recovery now parses multiple markers; all internal references are consistent.
.agents/recipes/_phase-fix.md	Step-number reference to _fix-policy.md removed (making it resilient to future renumbering); batch PR recording guidance added, consistent with _fix-policy.md.
.agents/recipes/_runner.md	Added 'No subagents' rule to prevent CI failures from delegated agent model-access errors.
.agents/recipes/structure/recipe.md	max_turns raised to 50; missing-future category opted into batching with 3-file cap, same-test-target grouping, and one marker+entry per file — consistent with _fix-policy.md.
.agents/recipes/code-quality/recipe.md	max_turns raised from 30 to 50 based on recent successful run history (31 turns used).
.agents/recipes/docs-and-references/recipe.md	New turn-budget section added: writes partial report immediately, stops after 20 tool calls or 2 new findings per section, ensuring a usable artifact even if interrupted.
.agents/recipes/test-health/recipe.md	Same turn-budget section added as docs-and-references: early partial report, bounded sampling, and explicit stop conditions.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Daily workflow triggered] --> B[Select recipe suite]
    B --> C[Parse max_turns from recipe frontmatter\nawk + grep pipeline]
    C --> D{max_turns found?}
    D -- No --> E[Default to 50]
    D -- Yes --> F[Use recipe value]
    E --> G[Run audit phase\nclaude --max-turns MAX_TURNS]
    F --> G
    G --> H{Audit success?}
    H -- No --> Z[End]
    H -- Yes --> I[Check fix_backlog size]
    I --> J{Backlog > 0\nand suite eligible?}
    J -- No --> Z
    J -- Yes --> K[Snapshot attempted_fixes]
    K --> L[Run fix phase\nclaude --max-turns 50]
    L --> M[Select primary candidate]
    M --> N{Category batchable?}
    N -- Yes --> O[Collect siblings with same\ntest_target, batch <= 3]
    N -- No --> P[Single finding]
    O --> Q[Re-verify all findings still apply]
    P --> Q
    Q --> R{All valid?}
    R -- Primary stale --> S[Remove primary from fix_backlog\nnext candidate]
    R -- Sibling stale --> T[Remove sibling from fix_backlog\ncontinue with smaller batch]
    T --> R
    S --> M
    R -- All valid --> U[Apply fix / batch]
    U --> V[Run package tests]
    V --> W[Push branch, open PR\none hidden marker per finding]
    W --> X[Record one attempted_fixes entry\nper fixed finding]
    X --> Y[Validate fix scope gate]
    Y --> Z[End]

Reviews (5): Last reviewed commit: "fix(ci): harden agentic max turns parsin..."


johnnygreco · 2026-05-26T15:30:42Z

Thanks for putting this together, @andreatgretel!

Summary

This tightens Agentic CI budgets and run instructions, makes daily audit --max-turns reflect recipe frontmatter, increases API pre-flight timeouts, and documents bounded batching for mechanical structure fixes. The implementation matches the PR's stated reliability goals; I only found a couple of small robustness/docs nits.

Findings

Suggestions — Take it or leave it

.github/workflows/agentic-ci-daily.yml:190 — Make the fallback survive an unparsable value

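What: MAX_TURNS=${MAX_TURNS:-50} is intended to default to 50, but in the GitHub Actions bash shell the preceding MAX_TURNS=$(...) assignment can exit non-zero when the extraction pipeline finds no digits. grep returns 1, pipefail propagates that status through the pipeline, and because the step also runs with errexit (-e) the script stops before the fallback line is reached.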
Why: Current recipes all parse correctly, so this is not blocking. A future recipe missing max_turns or using a malformed value would fail the audit job instead of safely falling back to 50.
Suggestion: Make the extraction pipeline non-fatal before applying the fallback:

MAX_TURNS=$(awk -F': *' '
  /^---$/ { section++; next }
  section == 1 && $1 == "max_turns" { print $2; exit }
  section == 2 { exit }
' "${RECIPE_DIR}/recipe.md" | grep -oE '[0-9]+' | head -n1 || true)
MAX_TURNS=${MAX_TURNS:-50}
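The difference the trailing `|| true` makes can be reproduced outside the workflow. This is a sketch under assumptions: an empty input stands in for a recipe whose max_turns has no digits, and each variant runs in its own `bash -c` process so that its `set -eo pipefail` behaves as it would in a workflow step.

```shell
# Variant without the guard: the assignment fails and errexit aborts the script.
strict='set -eo pipefail
value=$(printf "" | grep -oE "[0-9]+" | head -n1)
echo "${value:-50}"'

# Variant with the guard: the pipeline status is forced to 0, so the default applies.
tolerant='set -eo pipefail
value=$(printf "" | grep -oE "[0-9]+" | head -n1 || true)
echo "${value:-50}"'

strict_out=$(bash -c "$strict") && strict_status=0 || strict_status=$?
tolerant_out=$(bash -c "$tolerant") && tolerant_status=0 || tolerant_status=$?

echo "strict:   status=$strict_status output='$strict_out'"
echo "tolerant: status=$tolerant_status output='$tolerant_out'"
```

The strict variant should exit non-zero with no output, while the tolerant variant should exit 0 and print 50.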

.agents/recipes/_phase-fix.md:12 — Re-verification cross-reference now points at batching

What: _phase-fix.md still says re-verification is _fix-policy.md step 4.1, but this PR moved re-verification to step 4.2 by adding sibling collection as the new step 4.1.
Why: The surrounding prose is clear, but these files are prompt material. Stale exact step references can make an agent attach the MUST to the wrong substep when following the policy mechanically.
Suggestion: Change the reference to step 4.2, or avoid the number entirely with "the per-candidate re-verification substep" so future renumbering does not stale it again.

What Looks Good

The daily workflow now uses the recipe-declared audit budget while leaving fix-phase turns at 50 with an explicit rationale.
The docs and test-health recipes now create a partial report early, which directly addresses the "ran out of turns with no artifact" failure mode.
The batch-fix policy is careful about using only existing backlog entries, capping the batch at the localized 3-file limit, and writing one hidden marker per fixed finding for crash recovery.

Verdict

Ship it (with nits) — No blocking issues. The suggestions above are small hardening/clarity improvements.

This review was generated by an AI assistant.

johnnygreco · 2026-05-26T15:55:15Z

Implemented the two review suggestions and pushed them in 6d3cc855:

- Hardened daily audit max_turns parsing so malformed or missing values fall back to 50 under pipefail.
- Removed the stale step-number reference in _phase-fix.md so re-verification points at the policy section rather than an outdated substep.

Validated with git diff --check, PyYAML workflow parsing, recipe max-turn parsing for all recipes, and a synthetic missing-max_turns fallback case.

fix(ci): limit docs audit turn usage

e6373e8

github-actions Bot mentioned this pull request May 12, 2026

Agentic CI: Issue & PR Triage Tracker #562

Open

andreatgretel added 2 commits May 13, 2026 14:57

Merge remote-tracking branch 'origin/main' into andreatgretel/fix/agentic-ci-docs-audit-turn-limit

2b2a8a5

fix(agentic-ci): batch mechanical structure fixes

527a33b

andreatgretel changed the title ~~fix(ci): limit docs audit turn usage~~ fix(agentic-ci): improve daily audit reliability May 13, 2026

fix(agentic-ci): harden CI recipes and preflights

4acac75

andreatgretel changed the title ~~fix(agentic-ci): improve daily audit reliability~~ fix(ci): improve Agentic CI daily audit reliability May 13, 2026

fix(agentic-ci): align recipe turn budgets

f650384

andreatgretel marked this pull request as ready for review May 14, 2026 18:30

andreatgretel requested a review from a team as a code owner May 14, 2026 18:30

andreatgretel temporarily deployed to agentic-ci May 14, 2026 18:30 — with GitHub Actions Inactive

greptile-apps Bot reviewed May 14, 2026


Comment thread .agents/recipes/_fix-policy.md

andreatgretel and others added 3 commits May 14, 2026 23:44

fix(agentic-ci): prune stale primary backlog entries

4274a35

Merge branch 'main' into andreatgretel/fix/agentic-ci-docs-audit-turn-limit

5bf8455

Merge branch 'main' into andreatgretel/fix/agentic-ci-docs-audit-turn-limit

bdcc2b1

fix(ci): harden agentic max turns parsing

6d3cc85

johnnygreco approved these changes May 26, 2026


johnnygreco merged commit e4f2409 into main May 26, 2026
50 checks passed

johnnygreco merged 9 commits into main from andreatgretel/fix/agentic-ci-docs-audit-turn-limit


2 participants
