[github] auto-mode review-remediation: surface the failing-test output to the fixing agent + revisit the 2-cy#4130
Conversation
|
👀 Quinn is reviewing — verdict (PASS / WARN / FAIL) + findings to follow. |
📝 WalkthroughWalkthroughThis PR replaces auto-mode's transient CI failure message queue with structured evidence extraction from GitHub checks and logs. It surfaces concrete test assertions, expected/received values, and file locations to the fixing agent. It also makes the remediation budget dynamic by detecting measurable progress (fewer failures per cycle) and extending rather than escalating when progress is observed. ChangesStructured CI Failure Evidence and Progress-Aware Remediation
Sequence Diagram(s)sequenceDiagram
participant ReviewProcessor
participant Collector as collectCIFailureEvidence
participant GitHub as GitHub API
participant Parser as parseTestFailureDetails
participant Enforcer as checkProgressAndExtend
participant Feature as feature._remediationHistory
ReviewProcessor->>Collector: collect evidence for failed checks
Collector->>GitHub: list failed check runs & fetch annotations
GitHub-->>Collector: check metadata + annotations/logs
Collector->>Parser: extract from annotation or log
Parser-->>Collector: test file, assertion, values, location
Collector-->>ReviewProcessor: CIFailureEvidence[]
ReviewProcessor->>Feature: persist remediation history
ReviewProcessor->>Enforcer: check progress vs prior cycle
Enforcer->>Feature: read prior failure count
Feature-->>Enforcer: previous ciFailureCount
Enforcer-->>ReviewProcessor: {extended: boolean, reason: string}
alt Progress detected (fewer failures)
ReviewProcessor->>ReviewProcessor: continue EXECUTE with evidence
else Budget exhausted, no progress
ReviewProcessor->>ReviewProcessor: escalate to operator
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
QA Audit — PR #4130 | [github] auto-mode review-remediation: surface the failing-test output to the fixing agent + revisit the 2-cy
VERDICT: WARN
CI Status
- harness-eval: queued
- checks: queued
- build: queued
- test: queued
- Close external code contributions: skipped
⏳ CI non-terminal — 4 checks still running. This is a non-blocking comment; formal verdict will land on next pass once CI is terminal.
Diff Review
- Removes
pendingCIInjectionsmap +sendCIFailureToAgent/consumePendingCIInjectionfromauto-mode-service.ts— old in-process CI injection mechanism is fully replaced - Adds
ci-failure-evidence-collector.ts(391 lines) — fetches structured GitHub check-run annotations viagh api, falls back to 500-line log excerpt extraction with failure pattern matching
Observations
- MEDIUM/Gap: The diff does not show how
CIFailureEvidenceCollectoris wired intolead-engineer-review-merge-processors.ts. The PR description statesReviewProcessorwill use it, but the integration call-site is absent from the diff. Without that wiring, the new collector is dead code. Verify the processor invokes the collector in a follow-up or confirm it lands in the same PR. - LOW:
execAsyncwithghCLI usesprojectPathas cwd — acceptable since projectPath is internal to the protoMaker session context, but worth a security eye on path isolation if untrusted repos can be loaded as features. - LOW:
MAX_LOG_LINES = 500cap is reasonable; no risk of context blowout. - INFO: Graceful degradation (warn + empty array on failure) is correct for a background collector.
— Quinn, QA Engineer
|
Submitted COMMENT review on #4130. |
There was a problem hiding this comment.
QA Audit — PR #4130 | [github] auto-mode review-remediation: surface the failing-test output to the fixing agent + revisit the 2-cy
VERDICT: WARN
CI Status
- ci-complete: failure
- harness-eval: success
- checks: failure
- build: failure
- test: failure
- Close external code contributions: skipped
Diff Review
This PR replaces a simple in-memory CI injection pipeline (pendingCIInjections Map in auto-mode-service.ts) with a structured ci-failure-evidence-collector.ts that fetches failure annotations from GitHub check runs and parses structured evidence from raw logs. The removals in auto-mode-service.ts are clean (no other callers of sendCIFailureToAgent visible in diff). The new collector is well-scoped: all execAsync calls are constrained to projectPath with timeouts (10s/15s), errors are handled gracefully with logger.warn, and log excerpts are hard-capped at 500 lines.
Observations
- GAP:
build: failuremay stem from unverified type imports —CIFailureEvidencefrom@protolabsai/typesandStateContextfrom./lead-engineer-types.jsare not visible in the diff and cannot be confirmed as existing in this tree. If these types are missing or mismatched,tscwould fail at build. Author should confirm or add the types. - GAP: The diff is truncated at 200 lines (897 total); the remaining ~700 lines of
ci-failure-evidence-collector.tswere not reviewed. Structural review (clawpatch) also skipped — checkout cache could not resolvemainorHEAD~1as git refs in this environment. - CONTEXT:
checksandtestfailures are expected — they are the CI failures this feature is designed to surface to the fixing agent (the PR's stated purpose). They do not indicate a bug introduced by this PR. - POSITIVE: No unresolved CodeRabbit threads. No security-sensitive paths touched. Old injection code removed cleanly with all call sites removed in the same diff.
— Quinn, QA Engineer
|
Submitted |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
apps/server/src/services/lead-engineer-review-merge-processors.ts (2)
423-425:⚠️ Potential issue | 🟠 Major | ⚡ Quick winClear
ctx.ciFailureEvidencewhen CI evidence is unavailable or not relevant.
ctx.ciFailureEvidenceis only overwritten on successful collection. If CI evidence was set in a prior cycle, it can leak into later non-CI remediation (or CI cycles with collection failure), andExecuteProcessorwill prioritize stale evidence in recovery context.Suggested fix
if (reviewState === 'changes_requested') { + ctx.ciFailureEvidence = undefined; ... } if (reviewState === 'ci_failed') { + ctx.ciFailureEvidence = undefined; // reset before recollecting ... if (evidences.length > 0) { ctx.ciFailureEvidence = evidences; ... } else if (ciFailureNames.length > 0) { + ctx.ciFailureEvidence = undefined; ctx.reviewFeedback = `CI checks failed: ${ciFailureNames.join(', ')}`; } }Also applies to: 468-486
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/server/src/services/lead-engineer-review-merge-processors.ts` around lines 423 - 425, The code currently only sets ctx.ciFailureEvidence when CI evidence is successfully collected, allowing stale evidence to persist across cycles; update the branches in lead-engineer-review-merge-processors.ts where feedbackParts is handled and in the other CI-related block (the region around where ctx.reviewFeedback is set and the block covering lines 468-486) to explicitly clear ctx.ciFailureEvidence (set it to undefined/null) whenever CI evidence is unavailable, irrelevant, or collection fails so ExecuteProcessor will not see stale evidence; ensure every path that represents "no CI evidence" (including collection failure and non-CI remediation paths) assigns ctx.ciFailureEvidence = undefined and keep these changes adjacent to the existing ctx.reviewFeedback/feedbackParts logic and the CI collection failure handling.
488-523:⚠️ Potential issue | 🟠 Major | ⚡ Quick winProgress extension is computed against the wrong snapshot and inconsistent counters.
The code appends the current
_remediationHistoryentry before callingcheckProgressAndExtend(), then compares withevidences.lengthwhile persisted history usesciFailureNames.length. This can suppress valid extensions or trigger inconsistent extension behavior.Suggested fix direction
- // Record remediation history for progress-aware budget extension - history.push({ timestamp: ..., ciFailureCount: ciFailureNames.length, ... }); - await featureLoader.update(..._remediationHistory: history); - const ciBudgetResult = ciEnforcer.checkAndIncrement(...); if (!ciBudgetResult.allowed) { + // Compare against previous persisted snapshot (before writing current cycle) const progressCheck = ciEnforcer.checkProgressAndExtend(ctx.feature, evidences); ... } + // Persist current snapshot after evaluation, using the same count basis used for progress checks. + const effectiveFailureCount = evidences.length > 0 ? evidences.length : ciFailureNames.length; + history.push({ timestamp: ..., ciFailureCount: effectiveFailureCount, failingCheckNames: ciFailureNames }); + await featureLoader.update(..._remediationHistory: history);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@apps/server/src/services/lead-engineer-review-merge-processors.ts` around lines 488 - 523, The progress-extension logic is using an updated remediation snapshot and inconsistent counters which can hide or mis-trigger extensions; before mutating ctx.feature._remediationHistory, capture the pre-update snapshot and the corresponding failure count (use ciFailureNames.length as the canonical CI-failure count), pass that snapshot and count into RemediationBudgetEnforcer.checkProgressAndExtend(ctx.feature, evidences) (or a new overload) so the extension comparison is performed against the same, pre-persisted state, then only append the new history entry and persist after the progress check; ensure checkProgressAndExtend and any comparisons use ciFailureNames.length (not evidences.length) for consistency.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@apps/server/src/services/ci-failure-evidence-collector.ts`:
- Around line 34-35: The two regexes meant to match diff-added/removed lines
currently use /\b-d\b/ and /\b\+d\b/ (which match a literal "d"); replace them
with regexes that look for leading "+" or "-" characters such as /^-/ and /^\+/
(or /^-\s/ and /^\+\s/ if you want to allow a single space after the sign) so
the diff context lines are correctly detected by the code that processes diffs
(replace the existing /\b-d\b/ and /\b\+d\b/ occurrences with these corrected
patterns).
- Around line 254-319: The loop over failedChecks is re-calling
getFailedStepLog() for each check which causes unnecessary GH API calls and
latency; instead, call getFailedStepLog() at most once and reuse its result.
Refactor the code around the failedChecks loop (the block that calls
getCheckRunOutput, parses annotations via parseTestFailureDetails, and currently
calls getFailedStepLog) to: 1) determine if any check might need the workflow
log (i.e., no annotations and no summary) and only then call getFailedStepLog
once into a local variable (e.g., failedStepLog); 2) inside the loop use that
cached failedStepLog to populate evidence.logExcerpt and to call
parseTestFailureDetails instead of invoking getFailedStepLog per-check; and 3)
keep existing behavior for using annotations, output.summary,
parseTestFailureDetails, evidence.testName, evidence.location, etc., but replace
the per-check getFailedStepLog call with the single cached value.
---
Outside diff comments:
In `@apps/server/src/services/lead-engineer-review-merge-processors.ts`:
- Around line 423-425: The code currently only sets ctx.ciFailureEvidence when
CI evidence is successfully collected, allowing stale evidence to persist across
cycles; update the branches in lead-engineer-review-merge-processors.ts where
feedbackParts is handled and in the other CI-related block (the region around
where ctx.reviewFeedback is set and the block covering lines 468-486) to
explicitly clear ctx.ciFailureEvidence (set it to undefined/null) whenever CI
evidence is unavailable, irrelevant, or collection fails so ExecuteProcessor
will not see stale evidence; ensure every path that represents "no CI evidence"
(including collection failure and non-CI remediation paths) assigns
ctx.ciFailureEvidence = undefined and keep these changes adjacent to the
existing ctx.reviewFeedback/feedbackParts logic and the CI collection failure
handling.
- Around line 488-523: The progress-extension logic is using an updated
remediation snapshot and inconsistent counters which can hide or mis-trigger
extensions; before mutating ctx.feature._remediationHistory, capture the
pre-update snapshot and the corresponding failure count (use
ciFailureNames.length as the canonical CI-failure count), pass that snapshot and
count into RemediationBudgetEnforcer.checkProgressAndExtend(ctx.feature,
evidences) (or a new overload) so the extension comparison is performed against
the same, pre-persisted state, then only append the new history entry and
persist after the progress check; ensure checkProgressAndExtend and any
comparisons use ciFailureNames.length (not evidences.length) for consistency.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 06365562-c4bf-442b-beac-3e21f9fcded0
📒 Files selected for processing (9)
apps/server/src/services/auto-mode-service.tsapps/server/src/services/ci-failure-evidence-collector.tsapps/server/src/services/lead-engineer-execute-processor.tsapps/server/src/services/lead-engineer-review-merge-processors.tsapps/server/src/services/lead-engineer-types.tsapps/server/src/services/remediation-budget-enforcer.tslibs/types/src/ci-reaction.tslibs/types/src/feature.tslibs/types/src/index.ts
💤 Files with no reviewable changes (1)
- apps/server/src/services/auto-mode-service.ts
| /\b-d\b/, // diff removed lines | ||
| /\b\+d\b/, // diff added lines |
There was a problem hiding this comment.
Fix diff-line regexes; current patterns don't match real +/- diff lines.
The patterns at Line 34 and Line 35 match literal d tokens, so diff context lines are effectively never captured.
Suggested fix
- /\b-d\b/, // diff removed lines
- /\b\+d\b/, // diff added lines
+ /^\s*-\s/, // diff removed lines
+ /^\s*\+\s/, // diff added lines📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| /\b-d\b/, // diff removed lines | |
| /\b\+d\b/, // diff added lines | |
| /^\s*-\s/, // diff removed lines | |
| /^\s*\+\s/, // diff added lines |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@apps/server/src/services/ci-failure-evidence-collector.ts` around lines 34 -
35, The two regexes meant to match diff-added/removed lines currently use
/\b-d\b/ and /\b\+d\b/ (which match a literal "d"); replace them with regexes
that look for leading "+" or "-" characters such as /^-/ and /^\+/ (or /^-\s/
and /^\+\s/ if you want to allow a single space after the sign) so the diff
context lines are correctly detected by the code that processes diffs (replace
the existing /\b-d\b/ and /\b\+d\b/ occurrences with these corrected patterns).
| for (const check of failedChecks) { | ||
| const evidence: CIFailureEvidence = { | ||
| checkName: check.name, | ||
| checkUrl: check.html_url, | ||
| }; | ||
|
|
||
| // Fetch check run output | ||
| const output = await getCheckRunOutput(slug, check.id, projectPath); | ||
|
|
||
| if (output) { | ||
| // Prefer annotations | ||
| if (output.annotations && output.annotations.length > 0) { | ||
| evidence.annotations = output.annotations; | ||
|
|
||
| // Extract details from annotations | ||
| const failureAnnotations = output.annotations.filter( | ||
| (a) => a.annotation_level === 'failure' | ||
| ); | ||
|
|
||
| if (failureAnnotations.length > 0) { | ||
| const first = failureAnnotations[0]; | ||
| if (first.path) { | ||
| evidence.testFile = first.path; | ||
| } | ||
| if (first.start_line) { | ||
| evidence.location = `${first.path || 'unknown'}:${first.start_line}`; | ||
| } | ||
| if (first.message) { | ||
| const parsed = parseTestFailureDetails(first.message); | ||
| Object.assign(evidence, parsed); | ||
| if (!evidence.testName && first.title) { | ||
| evidence.testName = first.title; | ||
| } | ||
| } | ||
| if (first.raw_details) { | ||
| const parsed = parseTestFailureDetails(first.raw_details); | ||
| Object.assign(evidence, parsed); | ||
| } | ||
| } | ||
|
|
||
| // If annotations exist but no failure-level ones, use summary | ||
| if (!evidence.testName && output.summary) { | ||
| const parsed = parseTestFailureDetails(output.summary); | ||
| Object.assign(evidence, parsed); | ||
| } | ||
| } else if (output.summary && output.summary.trim()) { | ||
| // No annotations, parse summary | ||
| const parsed = parseTestFailureDetails(output.summary); | ||
| Object.assign(evidence, parsed); | ||
| // Use summary as log excerpt if it's substantial | ||
| if (output.summary.trim().length > 50) { | ||
| evidence.logExcerpt = output.summary.trim().slice(0, 3000); | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // If we still have no log excerpt, try fetching failed step logs | ||
| if (!evidence.logExcerpt && !evidence.annotations?.length) { | ||
| const log = await getFailedStepLog(projectPath, prNumber); | ||
| if (log && log.trim().length > 10) { | ||
| evidence.logExcerpt = log.trim().slice(0, 5000); | ||
| // Try to parse details from the log | ||
| const parsed = parseTestFailureDetails(log); | ||
| Object.assign(evidence, parsed); | ||
| } | ||
| } |
There was a problem hiding this comment.
Avoid re-fetching failed workflow logs inside the per-check loop.
When annotations are missing, the code calls getFailedStepLog() per failed check. That multiplies gh calls and can add large latency in multi-failure PRs, increasing timeout/escalation risk.
Suggested refactor
export async function collectCIFailureEvidence(ctx: StateContext): Promise<CIFailureEvidence[]> {
@@
- for (const check of failedChecks) {
+ let sharedFailedLog: string | null = null;
+ let sharedFailedLogLoaded = false;
+
+ for (const check of failedChecks) {
@@
- if (!evidence.logExcerpt && !evidence.annotations?.length) {
- const log = await getFailedStepLog(projectPath, prNumber);
+ if (!evidence.logExcerpt && !evidence.annotations?.length) {
+ if (!sharedFailedLogLoaded) {
+ sharedFailedLog = await getFailedStepLog(projectPath, prNumber);
+ sharedFailedLogLoaded = true;
+ }
+ const log = sharedFailedLog;
if (log && log.trim().length > 10) {
evidence.logExcerpt = log.trim().slice(0, 5000);
const parsed = parseTestFailureDetails(log);
Object.assign(evidence, parsed);
}
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@apps/server/src/services/ci-failure-evidence-collector.ts` around lines 254 -
319, The loop over failedChecks is re-calling getFailedStepLog() for each check
which causes unnecessary GH API calls and latency; instead, call
getFailedStepLog() at most once and reuse its result. Refactor the code around
the failedChecks loop (the block that calls getCheckRunOutput, parses
annotations via parseTestFailureDetails, and currently calls getFailedStepLog)
to: 1) determine if any check might need the workflow log (i.e., no annotations
and no summary) and only then call getFailedStepLog once into a local variable
(e.g., failedStepLog); 2) inside the loop use that cached failedStepLog to
populate evidence.logExcerpt and to call parseTestFailureDetails instead of
invoking getFailedStepLog per-check; and 3) keep existing behavior for using
annotations, output.summary, parseTestFailureDetails, evidence.testName,
evidence.location, etc., but replace the per-check getFailedStepLog call with
the single cached value.
Summary
SPARC PRD: Auto-Mode Review-Remediation — Failing-Test Evidence + Cycle Budget Revisit
Situation
Auto-mode's PR review-remediation loop operates as follows:
ReviewProcessor(lead-engineer-review-merge-processors.ts) polls the PR viagh pr viewfor check results and review feedbackCI / test, CI / lint) and sets `ctx.reviewFeedback = 'CI ...Closes #4124
Created automatically by Automaker
Summary by CodeRabbit
New Features
Improvements