fix(ci): improve Agentic CI daily audit reliability #632

Merged
johnnygreco merged 9 commits into main from andreatgretel/fix/agentic-ci-docs-audit-turn-limit
May 26, 2026

@andreatgretel andreatgretel commented May 11, 2026

Summary

Improves Agentic CI reliability in focused places:

  • Honors each daily recipe's declared max_turns.
  • Keeps docs and test-health audits bounded enough to produce a report before
    they run out of turns.
  • Keeps structure and code-quality on truthful 50-turn budgets based on recent
    successful run history.
  • Prevents CI recipes from spawning local subagents that may default to an
    inaccessible model.
  • Lets deterministic structure fixes batch same-category backlog entries while
    staying under the existing scope gates.
  • Raises custom API pre-flight timeouts from 10s to 30s where the agentic
    workflows still used the old shorter probe.

Changes

Changed

  • Updated .github/workflows/agentic-ci-daily.yml to read max_turns from
    recipe frontmatter instead of always passing 50 to Claude.
  • Hardened max_turns parsing so inline comments or quoted YAML values do not
    break claude --max-turns.
  • Raised custom API pre-flight curl --max-time from 10s to 30s in daily,
    repository triage, and PR review workflows.
  • Tightened .agents/recipes/docs-and-references/recipe.md so it writes a
    partial report early and samples bounded docs/source sets.
  • Added the same early partial report and turn-budget guard to
    .agents/recipes/test-health/recipe.md.
  • Set structure and code-quality recipe budgets to 50 after recent
    successful runs used 34 and 31 turns respectively.
  • Added a shared runner constraint in .agents/recipes/_runner.md to keep CI
    recipes in the main agent session instead of delegated/local agents.
  • Generalized .agents/recipes/_fix-policy.md and
    .agents/recipes/_phase-fix.md to allow suite-declared batchable mechanical
    fixes.
  • Opted structure / missing-future into batching in
    .agents/recipes/structure/recipe.md, capped batches at 3 files, and
    documented batch grouping by package test target.
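
The batching rule in the last bullet can be sketched as a small selection step. This is a minimal illustration only: the tab-separated backlog layout, the finding ids, the `unused-import` category, and the `select_batch` helper are all hypothetical, and it assumes each finding touches exactly one file so that a cap of 3 findings equals the 3-file cap. Re-verification and the scope gates are not modeled.

```shell
# Hypothetical backlog: id <TAB> category <TAB> test_target, one finding per line.
backlog=$(printf '%s\t%s\t%s\n' \
  f1 missing-future pkg-a \
  f2 missing-future pkg-b \
  f3 missing-future pkg-a \
  f4 unused-import  pkg-a \
  f5 missing-future pkg-a \
  f6 missing-future pkg-a)

# Treat the first entry as the primary candidate, then add siblings that share
# its category and test target until the batch reaches the cap.
select_batch() {
  # $1: backlog text, $2: maximum batch size
  printf '%s\n' "$1" | awk -F'\t' -v cap="$2" '
    NR == 1 { cat = $2; target = $3 }
    $2 == cat && $3 == target && n < cap { print $1; n++ }
  '
}

select_batch "$backlog" 3   # prints f1, f3, f5 (one per line)
```

Here `f2` is skipped because it belongs to a different test target, `f4` because it is a different category, and `f6` because the cap is already reached.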

Why

  • Prevents the workflow/recipe mismatch where recipe budgets were ignored by
    daily audit execution.
  • Reduces the chance that docs and test-health audits spend the full run
    exploring and leave no useful artifact.
  • Avoids accidentally breaking structure and code-quality by dropping their
    effective budget below recent successful runs.
  • Avoids local-agent failures where delegated tasks select a default model the
    CI key cannot access, then the parent agent keeps running until max turns.
  • Avoids one-file PRs for purely mechanical same-category structure fixes when
    the combined diff still satisfies the localized-fix bar.
  • Avoids transient agentic workflow failures when the inference endpoint
    responds slower than 10 seconds.

Recent failure scan

  • May 13 daily audit: custom endpoint probe exceeded the old 10s budget, then
    passed on retry. Covered by 30s daily pre-flight.
  • May 11 docs audit: delegated local agents failed auth against a default Haiku
    task model, and the main recipe later hit 50 turns with no report. Covered by
    no-subagent runner guidance plus docs turn-budget changes.
  • May 4 docs audit: hit error_max_turns after 50 turns with no report.
    Covered by docs turn-budget changes and recipe max_turns enforcement.
  • Apr 24 test-health audit: hit "Reached max turns (30)" with no report.
    Covered by test-health turn-budget changes.
  • Apr 21 repository triage: custom endpoint pre-flight failed with HTTP 400
    during the old model/config period. Current health-probe covers CLI/model
    compatibility; this PR also aligns triage timeout with the health probe.

Claude review

Claude review found no blocking issues. Follow-ups addressed in this PR:

  • robust max_turns parsing
  • explicit fix-phase 50-turn rationale
  • 50-turn budgets for structure and code-quality based on run history
  • explicit 3-file batch cap
  • documented batch grouping by package test target

Validation

  • make install-dev
  • .venv/bin/ruff check --fix .
  • .venv/bin/ruff format .
  • Parsed agentic workflow YAML files with PyYAML.
  • Verified parsed max_turns values for all recipes.
  • git diff --check
  • Commit hooks passed: trailing whitespace, EOF, YAML, large file, merge
    conflict, mixed line ending. Ruff hooks skipped on the latest commits because
    no Python files changed.

@andreatgretel andreatgretel changed the title from "fix(ci): limit docs audit turn usage" to "fix(agentic-ci): improve daily audit reliability" May 13, 2026
@andreatgretel andreatgretel changed the title from "fix(agentic-ci): improve daily audit reliability" to "fix(ci): improve Agentic CI daily audit reliability" May 13, 2026
@andreatgretel andreatgretel marked this pull request as ready for review May 14, 2026 18:30
@andreatgretel andreatgretel requested a review from a team as a code owner May 14, 2026 18:30
@github-actions

Code Review: PR #632 — fix(ci): improve Agentic CI daily audit reliability

Summary

This PR is a focused reliability fix for the Agentic CI daily audit workflow. It does five things:

  1. Honors per-recipe max_turns — the daily workflow now parses max_turns from each recipe's frontmatter instead of hard-coding 50.
  2. Sets recipe budgets to match historical reality: structure and code-quality move 30 → 50 (recent successful runs used 31–34 turns); docs-and-references and test-health add early-partial-report and turn-budget guards instead of bumping the cap.
  3. Forbids subagents in CI recipes — adds a _runner.md rule so delegated/Task/Explore agents (which may default to a model the CI key cannot reach) are not spawned.
  4. Allows category-batched mechanical fixes: _fix-policy.md and _phase-fix.md now permit batching siblings of the same suite/category through one PR; structure/missing-future is opted in with a 3-file cap.
  5. Raises pre-flight curl --max-time 10s → 30s in three workflows.

Scope is correct: only .agents/recipes/*.md and .github/workflows/agentic-ci-*.yml. No code changes.

Findings

Code correctness

  • max_turns parser is robust enough. The awk + grep pipeline in .github/workflows/agentic-ci-daily.yml:190-194 correctly:
    • Stops at the closing --- (section == 2 { exit }), so a max_turns: 999 line in the recipe body is ignored.
    • Strips inline YAML comments and quotes via grep -oE '[0-9]+' | head -n1.
    • Falls back to 50 when the field is missing (MAX_TURNS=${MAX_TURNS:-50}).
      Edge case worth noting: a recipe with max_turns: 0 does not fall through to the :-50 default. ${VAR:-X} substitutes only when the variable is empty or unset, and 0 is non-empty, so the declared value is passed through unchanged. ✅
  • Fix-phase still uses fixed 50 turns — only the audit phase reads recipe max_turns. The inline comment at agentic-ci-daily.yml:263-264 makes this explicit and the rationale (scope gates already bound fix work) is reasonable.
  • Batch policy is internally consistent. Each batched finding gets its own hidden <!-- agentic-ci finding=<id> suite=<suite> --> marker and its own attempted_fixes entry, so crash recovery via marker grep still reconstructs state correctly. The 3-file cap matches the localized-fix bar.
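
The `:-50` fallback behavior described in the first finding can be checked in isolation. A minimal sketch, assuming bash; `resolve_max_turns` is a hypothetical helper name used only for illustration and does not appear in the workflow:

```shell
# Hypothetical helper: apply the same ${VAR:-50} fallback to a parsed value.
resolve_max_turns() {
  local parsed="${1-}"   # value extracted from frontmatter, possibly empty
  echo "${parsed:-50}"   # substitutes 50 only when empty or unset
}

resolve_max_turns 30   # prints 30
resolve_max_turns 0    # prints 0 (non-empty, so no fallback)
resolve_max_turns ""   # prints 50
resolve_max_turns      # prints 50
```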

Project conventions

  • Follows the existing recipe/policy structure. New rules slot into the same headings (Fix policy contract, Standard fix procedure).
  • Honors _runner.md's "no workflow modifications" boundary at the agent level — this PR is from a human contributor, not the agent.
  • Conventional commit style preserved.

Risks / things to watch

  • Self-imposed turn budgets in docs/test-health are advisory. The "stop after 20 tool calls / 2 new findings" guidance in docs-and-references/recipe.md:46-50 and test-health/recipe.md:39-49 depends on the agent counting and obeying — there is no hard enforcement. If a future model ignores it, the only backstop is the max_turns cap. The early-partial-report write (step 2) is the real reliability win here; that one is concrete and will produce an artifact even on max-turns.
  • The "no subagents" rule is policy, not enforcement. Same caveat — it relies on the agent reading and obeying _runner.md. Acceptable for now, but if subagent-related failures recur, consider stripping the Task/Explore tools from the CI invocation's allowed-tools list.
  • Batching introduces a new failure mode. If one file in a 3-file missing-future batch breaks tests, the policy says abandon the whole batch — meaning the other (correct) two are blocked too. This is the right call for atomicity, but it does mean a single noisy file can starve siblings. The top-5 candidate fallback partially mitigates this.
  • curl --max-time 30 matches the model health probe — good consistency. No retry/backoff added, but a 3× margin is a reasonable single-shot improvement.

Test coverage

  • N/A — these are CI configuration / recipe text changes. The PR description lists manual validation (YAML parse, max_turns value verification across all recipes, ruff). That's appropriate for the change kind.

Performance / security

  • No performance impact on the product code path.
  • No new secrets handling. The curl probe still posts only the structural health-check payload; raising the timeout doesn't change exposure.

Suggestions (non-blocking)

  1. Consider hard-failing on missing max_turns rather than defaulting to 50. A typo like max_trns: 30 would silently use 50, masking the recipe's actual intent. Logging the parsed value (which the PR already does on agentic-ci-daily.yml:195) helps catch this in run logs, but MAX_TURNS=${MAX_TURNS:?recipe missing max_turns} would catch it earlier.
  2. Document the 3-file batch cap rationale in _fix-policy.md (it's currently in structure/recipe.md only). Other suites adding batchable categories will need the same number; pulling it into the shared policy avoids drift.
  3. Sample-size guidance in docs recipe is now inconsistent: section 2 says "at most 10 candidate links", section 3 says "at most 3 architecture files", section 4 says "3-5 high-value pages". With the global "stop after 20 tool calls" cap, these can collide. Worth one editorial pass to ensure the per-section numbers sum to something achievable inside the global cap.
  4. The _phase-fix.md change says "you may add sibling entries from the existing fix_backlog after re-verifying each one" — re-verification is also covered in _fix-policy.md step 4.2. The duplication is fine for clarity but worth noting if the policies ever drift.
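
Suggestion 1 relies on the `${VAR:?message}` expansion, which aborts a non-interactive shell when the variable is empty or unset. A minimal sketch of that behavior, assuming bash; `require_max_turns` is a hypothetical wrapper, and the subshell is there only so the demonstration does not terminate the calling script:

```shell
# Hypothetical helper: fail loudly instead of defaulting when no value was parsed.
require_max_turns() {
  # Runs in a subshell so the failed expansion does not terminate the caller.
  ( MAX_TURNS="${1-}"; echo "${MAX_TURNS:?recipe missing max_turns}" )
}

require_max_turns 30   # prints 30

if ! require_max_turns "" 2>/dev/null; then
  echo "missing max_turns rejected"
fi
```

The trade-off is the one the suggestion names: a typo in the frontmatter key stops the job early rather than silently running with 50 turns.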

Verdict

Approve. Targeted reliability fixes with a clear paper trail (the "Recent failure scan" section maps each change to a specific historical failure). The risks are well-contained: scope is limited to CI configuration, no product code is touched, and the failure modes added by batching are explicitly bounded (3-file cap, atomic abandon). The advisory-only nature of the in-recipe turn budgets is the weakest part — but pairing them with the early-partial-report pattern means even a non-compliant agent run produces a usable artifact, which is the actual reliability goal.

greptile-apps Bot commented May 14, 2026

Greptile Summary

This PR improves reliability of the Agentic CI daily audit workflows by fixing a recipe/workflow max_turns mismatch, hardening turn budgets for docs and test-health suites, and aligning curl pre-flight timeouts across all three workflow files.

  • The daily workflow now parses each recipe's max_turns frontmatter field (with a robust awk + grep pipeline) and passes it to claude --max-turns, instead of hardcoding 50 for every suite; the fix phase intentionally keeps its own hardcoded 50-turn budget.
  • docs-and-references and test-health recipes gain explicit early-report and sampling stop-conditions (20 tool calls or 2 new findings) to ensure a usable partial artifact is always produced; structure and code-quality have their frontmatter budgets raised to 50 to reflect recent successful run history.
  • _runner.md gains a "No subagents" constraint to prevent CI failures when the default delegated model is inaccessible; _fix-policy.md and structure/recipe.md extend the fix procedure to support batching up to 3 same-category missing-future findings per PR, with one marker and one attempted_fixes entry per finding.
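
The one-marker-per-finding convention is what makes crash recovery by marker grep possible. A minimal sketch of such a grep, assuming the marker format quoted in the earlier review (`<!-- agentic-ci finding=<id> suite=<suite> -->`); the PR body, the finding ids, and the `list_findings` helper are hypothetical:

```shell
# Hypothetical PR body containing one hidden marker per batched finding.
pr_body='Adds missing future imports.
<!-- agentic-ci finding=struct-101 suite=structure -->
<!-- agentic-ci finding=struct-102 suite=structure -->'

# Extract the finding ids recorded for one suite.
list_findings() {
  # $1: PR body text, $2: suite name
  printf '%s\n' "$1" \
    | grep -oE "<!-- agentic-ci finding=[^ ]+ suite=$2 -->" \
    | sed -E 's/.*finding=([^ ]+) .*/\1/'
}

list_findings "$pr_body" structure   # prints struct-101 and struct-102, one per line
```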

Confidence Score: 5/5

All changes are CI workflow and agent recipe configuration updates with no production code impact; the max_turns parsing is robust and defaults safely to 50.

The max_turns awk pipeline correctly handles inline comments, quoted values, and missing keys with a safe default. Batch fix logic in _fix-policy.md is internally consistent with crash recovery, attempted_fixes recording, and the structure recipe. The fix phase intentionally retains its own hardcoded 50-turn budget and is documented. No logic errors or correctness issues found across any of the 10 changed files.

No files require special attention.

Important Files Changed

| Filename | Overview |
|---|---|
| .github/workflows/agentic-ci-daily.yml | Adds per-recipe max_turns parsing via awk + grep pipeline with inline-comment and quoted-value safety; fix phase correctly retains hardcoded 50; curl pre-flight raised to 30s. |
| .github/workflows/agentic-ci-issue-triage.yml | Single-line change: curl pre-flight timeout raised from 10s to 30s, consistent with daily and PR-review workflows. |
| .github/workflows/agentic-ci-pr-review.yml | Single-line change: curl pre-flight timeout raised from 10s to 30s, aligned with other workflow files. |
| .agents/recipes/_fix-policy.md | Adds batching support: step 4.1 collects siblings, step 4.2 re-verifies and removes stale primary/siblings, crash recovery now parses multiple markers; all internal references are consistent. |
| .agents/recipes/_phase-fix.md | Step-number reference to _fix-policy.md removed (making it resilient to future renumbering); batch PR recording guidance added, consistent with _fix-policy.md. |
| .agents/recipes/_runner.md | Added 'No subagents' rule to prevent CI failures from delegated agent model-access errors. |
| .agents/recipes/structure/recipe.md | max_turns raised to 50; missing-future category opted into batching with 3-file cap, same-test-target grouping, and one marker+entry per file — consistent with _fix-policy.md. |
| .agents/recipes/code-quality/recipe.md | max_turns raised from 30 to 50 based on recent successful run history (31 turns used). |
| .agents/recipes/docs-and-references/recipe.md | New turn-budget section added: writes partial report immediately, stops after 20 tool calls or 2 new findings per section, ensuring a usable artifact even if interrupted. |
| .agents/recipes/test-health/recipe.md | Same turn-budget section added as docs-and-references: early partial report, bounded sampling, and explicit stop conditions. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Daily workflow triggered] --> B[Select recipe suite]
    B --> C[Parse max_turns from recipe frontmatter\nawk + grep pipeline]
    C --> D{max_turns found?}
    D -- No --> E[Default to 50]
    D -- Yes --> F[Use recipe value]
    E --> G[Run audit phase\nclaude --max-turns MAX_TURNS]
    F --> G
    G --> H{Audit success?}
    H -- No --> Z[End]
    H -- Yes --> I[Check fix_backlog size]
    I --> J{Backlog > 0\nand suite eligible?}
    J -- No --> Z
    J -- Yes --> K[Snapshot attempted_fixes]
    K --> L[Run fix phase\nclaude --max-turns 50]
    L --> M[Select primary candidate]
    M --> N{Category batchable?}
    N -- Yes --> O[Collect siblings with same\ntest_target, batch <= 3]
    N -- No --> P[Single finding]
    O --> Q[Re-verify all findings still apply]
    P --> Q
    Q --> R{All valid?}
    R -- Primary stale --> S[Remove primary from fix_backlog\nnext candidate]
    R -- Sibling stale --> T[Remove sibling from fix_backlog\ncontinue with smaller batch]
    T --> R
    S --> M
    R -- All valid --> U[Apply fix / batch]
    U --> V[Run package tests]
    V --> W[Push branch, open PR\none hidden marker per finding]
    W --> X[Record one attempted_fixes entry\nper fixed finding]
    X --> Y[Validate fix scope gate]
    Y --> Z[End]

Reviews (5): Last reviewed commit: "fix(ci): harden agentic max turns parsin..."

Comment thread .agents/recipes/_fix-policy.md
@johnnygreco

Thanks for putting this together, @andreatgretel!

Summary

This tightens Agentic CI budgets and run instructions, makes daily audit --max-turns reflect recipe frontmatter, increases API pre-flight timeouts, and documents bounded batching for mechanical structure fixes. The implementation matches the PR's stated reliability goals; I only found a couple of small robustness/docs nits.

Findings

Suggestions — Take it or leave it

.github/workflows/agentic-ci-daily.yml:190 — Make the fallback survive an unparsable value

  • What: MAX_TURNS=${MAX_TURNS:-50} is intended to default to 50, but in the GitHub Actions bash shell the assignment can exit non-zero if the extraction pipeline finds no digits, because grep returns 1 and the script has pipefail enabled.
  • Why: Current recipes all parse correctly, so this is not blocking. A future recipe missing max_turns or using a malformed value would fail the audit job instead of safely falling back to 50.
  • Suggestion: Make the extraction pipeline non-fatal before applying the fallback:
MAX_TURNS=$(awk -F': *' '
  /^---$/ { section++; next }
  section == 1 && $1 == "max_turns" { print $2; exit }
  section == 2 { exit }
' "${RECIPE_DIR}/recipe.md" | grep -oE '[0-9]+' | head -n1 || true)
MAX_TURNS=${MAX_TURNS:-50}
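
The failure mode and the guard can be reproduced locally without a workflow run. A minimal sketch, assuming bash with `pipefail` enabled as the finding describes; `parse_max_turns` wraps the pipeline above under a hypothetical name, and the two temporary recipe files are synthetic:

```shell
# bash sketch; parse_max_turns and the temp recipes are hypothetical names.
parse_max_turns() {
  awk -F': *' '
    /^---$/ { section++; next }
    section == 1 && $1 == "max_turns" { print $2; exit }
    section == 2 { exit }
  ' "$1" | grep -oE '[0-9]+' | head -n1
}

tmp=$(mktemp -d)
cat > "$tmp/ok.md" <<'EOF'
---
name: docs
max_turns: "30"  # audit budget
---
max_turns: 999
EOF
cat > "$tmp/missing.md" <<'EOF'
---
name: docs
---
body text
EOF

# Without the guard, pipefail surfaces grep's exit status 1 when nothing matches.
if ( set -o pipefail; parse_max_turns "$tmp/missing.md" >/dev/null ); then
  unguarded_status=0
else
  unguarded_status=1
fi

# With "|| true", the substitution succeeds and the :-50 fallback applies.
ok=$(set -o pipefail; parse_max_turns "$tmp/ok.md" || true)
missing=$(set -o pipefail; parse_max_turns "$tmp/missing.md" || true)
echo "unguarded_status=$unguarded_status ok=${ok:-50} missing=${missing:-50}"
rm -rf "$tmp"
```

This prints `unguarded_status=1 ok=30 missing=50`: the quoted, commented value parses to 30, the body-level `max_turns: 999` is ignored, and the missing field falls back to 50 only because the pipeline failure is swallowed first.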

.agents/recipes/_phase-fix.md:12 — Re-verification cross-reference now points at batching

  • What: _phase-fix.md still says re-verification is _fix-policy.md step 4.1, but this PR moved re-verification to step 4.2 by adding sibling collection as the new step 4.1.
  • Why: The surrounding prose is clear, but these files are prompt material. Stale exact step references can make an agent attach the MUST to the wrong substep when following the policy mechanically.
  • Suggestion: Change the reference to step 4.2, or avoid the number entirely with "the per-candidate re-verification substep" so future renumbering does not stale it again.

What Looks Good

  • The daily workflow now uses the recipe-declared audit budget while leaving fix-phase turns at 50 with an explicit rationale.
  • The docs and test-health recipes now create a partial report early, which directly addresses the "ran out of turns with no artifact" failure mode.
  • The batch-fix policy is careful about using only existing backlog entries, capping the batch at the localized 3-file limit, and writing one hidden marker per fixed finding for crash recovery.

Verdict

Ship it (with nits) — No blocking issues. The suggestions above are small hardening/clarity improvements.


This review was generated by an AI assistant.

@johnnygreco

Implemented the two review suggestions and pushed them in 6d3cc855:

  • Hardened daily audit max_turns parsing so malformed or missing values fall back to 50 under pipefail.
  • Removed the stale step-number reference in _phase-fix.md so re-verification points at the policy section rather than an outdated substep.

Validated with git diff --check, PyYAML workflow parsing, recipe max-turn parsing for all recipes, and a synthetic missing-max_turns fallback case.

@johnnygreco johnnygreco merged commit e4f2409 into main May 26, 2026
50 checks passed