Replace authority hierarchy with synthesis approach for RCA

janisz · janisz · commit dd79eff2e848 · 2026-05-08T14:15:09.000+02:00
Change root cause determination from a strict authority hierarchy to a
synthesis of all three agent reports into one plausible narrative.

Key changes:
- Combine all agent findings instead of using Infrastructure Detective
  as final authority
- Add minority_report field to highlight dissenting perspectives (≥50%
  confidence)
- Weight evidence by confidence but don't exclude lower-confidence
  insights
- Update examples to show both consensus and conflicting scenarios
- Rename 'Agent Authority Hierarchy' to 'Agent Contribution Weighting'

This allows presenting the most plausible explanation while preserving
alternative interpretations that may be valuable for investigation.
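
The synthesis rule described above can be sketched roughly as follows. This is an illustrative sketch only, not the workflow's implementation: the `Finding` type, the `synthesize` function, and the choice to lead with the highest-confidence agent are assumptions made for the example. Only the `root_cause` and `minority_report` field names and the ≥50% threshold come from this commit. The sketch also drops dissent below 50%, which the rules leave unspecified.

```python
from dataclasses import dataclass
from typing import List, Optional

# Threshold for surfacing a dissenting view (the ">=50% confidence" rule).
MINORITY_THRESHOLD = 50


@dataclass
class Finding:
    """Hypothetical, simplified shape of one agent's report."""
    agent: str           # e.g. "Infrastructure Detective"
    classification: str  # e.g. "infrastructure-flake" or "code-bug"
    confidence: int      # percent, 0-100
    summary: str         # one-sentence explanation from the agent


def synthesize(findings: List[Optional[Finding]]) -> dict:
    """Merge agent findings into root_cause plus an optional minority_report."""
    # An agent that failed or returned nothing is represented as None.
    usable = [f for f in findings if f is not None]
    if not usable:
        return {
            "root_cause": "Insufficient data to determine root cause",
            "minority_report": None,
        }

    # Assumption: the highest-confidence finding leads the narrative.
    ranked = sorted(usable, key=lambda f: f.confidence, reverse=True)
    lead, rest = ranked[0], ranked[1:]

    # Agreeing findings are kept regardless of confidence, so that
    # lower-confidence context is not excluded from the narrative.
    agreeing = [f for f in rest if f.classification == lead.classification]

    # Dissent is reported only at or above the minority threshold.
    dissenting = [
        f for f in rest
        if f.classification != lead.classification
        and f.confidence >= MINORITY_THRESHOLD
    ]

    root_cause = " ".join(f.summary for f in [lead] + agreeing)
    minority = " ".join(f"{f.agent} suggests: {f.summary}" for f in dissenting)
    return {"root_cause": root_cause, "minority_report": minority or None}


# Example input: two agents agree, one dissents with 60% confidence.
example = synthesize([
    Finding("Code Archaeologist", "code-bug", 60,
            "A recent change may be a contributing factor."),
    Finding("Infrastructure Detective", "infrastructure-flake", 85,
            "Intermittent test runner timeout pattern."),
    Finding("Cross-Issue Correlator", "infrastructure-flake", 40,
            "Similar failures were resolved by retry."),
])
```

In this example the two agreeing summaries form `root_cause`, and the Code Archaeologist's view lands in `minority_report` because its confidence meets the 50% threshold. The real aggregation is performed by the prompt-driven workflow, so the string joining here only stands in for the narrative the agent would write.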
diff --git a/workflows/acs-triage/.claude/commands/triage.md b/workflows/acs-triage/.claude/commands/triage.md
@@ -196,12 +196,16 @@ After Stage 1, spawn specialized RCA agents for each CI_FAILURE issue:
 
    **Aggregation Process:**
    - Load findings from the three JSON files: archaeology-findings.json, infra-findings.json, and correlation-findings.json
-   - Determine root cause by following the authority hierarchy (Infrastructure Detective has final authority on flakes with ≥80% confidence)
-   - Classify failure category based on Infrastructure Detective's analysis (if confidence ≥80%), or use archaeology signals as fallback
+   - **Synthesize root cause** by combining all three agents' findings into one coherent, plausible narrative:
+     - Weight evidence by agent confidence but don't exclude lower-confidence insights that add context
+     - If agents agree, state the consensus
+     - If agents provide complementary information, integrate it (e.g., "Infrastructure pattern X triggered by recent code change Y")
+     - If agents disagree with reasonable confidence (≥50%), put the main finding in root_cause and the dissenting view in minority_report
+   - Classify failure category based on Infrastructure Detective's analysis (weighted most heavily when its confidence is ≥80%), with archaeology providing supporting signals
    - Extract affected components from archaeology's git blame results if available
    - Calculate unified confidence starting from Infrastructure Detective's base score, adding +10% for very recent changes and +5% for high-similarity matches
    - Assess risk based on failure category (infrastructure flakes = Low, high frequency or critical components = High, otherwise Medium)
-   - Extract proposed fix from Infrastructure Detective's suggested action
+   - Extract proposed fix from Infrastructure Detective's suggested action or archaeology context
    - Sanitize and extract relevant logs (max 500 chars, removing tokens, passwords, internal URLs, IPs, and employee emails)
    - Include problematic commit and PR from archaeology if available
    - Flag infrastructure flakes and include workaround recommendations
diff --git a/workflows/acs-triage/reference/rca-aggregation-rules.md b/workflows/acs-triage/reference/rca-aggregation-rules.md
@@ -2,24 +2,37 @@
 
 This document defines how findings from the three RCA agents (Code Archaeologist, Infrastructure Detective, Cross-Issue Correlator) are aggregated into a unified `deep_analysis` object.
 
-## Agent Authority Hierarchy
+## Agent Contribution Weighting
 
-When agents disagree on classification, use this authority hierarchy:
+When synthesizing findings from multiple agents:
 
-1. **Infrastructure Detective** - Has final authority on infrastructure flake classification when confidence ≥80%
-2. **Code Archaeologist** - Has authority on identifying problematic commits/PRs
-3. **Cross-Issue Correlator** - Provides supporting evidence, does not override other agents
+1. **Infrastructure Detective** - Primary source for pattern-based classification and infrastructure flake detection; weight increases when confidence is ≥80%
+2. **Code Archaeologist** - Primary source for commit/PR attribution and code change context; weight increases with recency of changes
+3. **Cross-Issue Correlator** - Primary source for frequency trends and historical patterns; weight increases with high-similarity matches (≥85%)
+
+**Integration Principle:** Combine all three perspectives rather than using strict hierarchy. When agents disagree, include the majority finding in the main root cause and note dissenting views in the minority report.
 
 ## Root Cause Determination
 
 **Algorithm:**
 
-1. If Infrastructure Detective classified this as an infrastructure flake with confidence ≥80%, use the Infrastructure Detective's reasoning as the root cause
-2. If Infrastructure Detective classified this as a code bug:
-   - Start with Infrastructure Detective's reasoning
-   - If Code Archaeologist found git blame results, append the archaeology reasoning
-   - Return the combined root cause
-3. Otherwise (unknown classification), return "Insufficient data to determine root cause"
+Synthesize findings from all three agents into a single coherent root cause narrative:
+
+1. **Combine all available evidence:**
+   - Start with the most concrete findings (Infrastructure Detective's pattern analysis, Code Archaeologist's git blame results)
+   - Incorporate frequency and historical context from Cross-Issue Correlator
+   - Weight evidence by agent confidence levels, but don't exclude low-confidence insights that add context
+
+2. **Generate unified root cause:**
+   - Integrate all perspectives into one coherent, plausible narrative
+   - If agents agree, state the consensus
+   - If agents provide complementary information, weave it together (e.g., "Infrastructure pattern X triggered by recent code change Y")
+   - If insufficient data across all agents, state "Insufficient data to determine root cause"
+
+3. **Add minority report (if applicable):**
+   - If agents disagree or provide alternative explanations with reasonable confidence (≥50%), include a "minority_report" field
+   - Format: Brief statement of the alternative perspective with attribution (e.g., "Code Archaeologist suggests recent refactor in PR #123 may be a contributing factor")
+   - This highlights uncertainty without committing to a single explanation when evidence is mixed
 
 ## Failure Category Classification
 
@@ -63,7 +76,8 @@ When agents disagree on classification, use this authority hierarchy:
 
 ```json
 {
-  "root_cause": "<unified root cause from determine_root_cause()>",
+  "root_cause": "<unified narrative synthesizing all three agents' findings>",
+  "minority_report": "<alternative perspectives from dissenting agents, if any; null if consensus>",
   "failure_category": "<from classify_failure()>",
   "affected_components": ["<from archaeology or existing ci_analysis>"],
   "confidence": "<High | Medium | Low from calculate_unified_confidence()>",
@@ -88,12 +102,13 @@ When agents disagree on classification, use this authority hierarchy:
 
 ### Scenario: Archaeology says code-bug, Infrastructure says flake
 
-**Resolution:** Infrastructure Detective wins if confidence ≥80%
+**Resolution:** Synthesize both perspectives
 
 **Decision Logic:**
-- If Infrastructure Detective confidence ≥80% → use Infrastructure Detective's classification
-- Otherwise, if Code Archaeologist confidence ≥70% → use "code-bug" (archaeology found recent code change)
-- Otherwise → use "unknown" (conflicting low-confidence signals)
+- **Primary finding (root_cause):** If Infrastructure Detective has confidence ≥80%, lead with its infrastructure flake classification but note the code change context from archaeology
+  - Example: "Infrastructure timeout pattern detected (intermittent test runner issues). Recent code change in PR #123 may have increased susceptibility to timing issues."
+- **Minority report:** If Code Archaeologist has confidence ≥70%, note: "Code Archaeologist identifies recent change in PR #123 as potential root cause rather than infrastructure"
+- **If both have low confidence (<70%):** State "Conflicting signals - infrastructure pattern suggests flake, but recent code changes warrant investigation" and mark confidence as Low
 
 ### Scenario: Multiple similar issues with different root causes
 
@@ -142,6 +157,8 @@ Before writing to `deep_analysis`, sanitize all text fields:
 
 ## Example Aggregation
 
+### Example 1: Consensus Scenario
+
 **Inputs:**
 
 - **Archaeology**: Found commit `abc123` in PR #12345, 4 days ago (very recent), code-under-test changed
@@ -152,7 +169,8 @@ Before writing to `deep_analysis`, sanitize all text fields:
 
 ```json
 {
-  "root_cause": "GraphQL schema validation error - template emits Boolean placeholders without resolvers. Recent code change (4 days ago) in PR #12345 likely introduced this regression.",
+  "root_cause": "GraphQL schema validation error - template emits Boolean placeholders without resolvers. Recent code change (4 days ago) in PR #12345 likely introduced this regression, similar to previously resolved issue ROX-11111.",
+  "minority_report": null,
   "failure_category": "code-bug",
   "affected_components": ["central/graphql/generator/codegen/codegen.go.tpl"],
   "confidence": "High",
@@ -185,6 +203,57 @@ Before writing to `deep_analysis`, sanitize all text fields:
 }
 ```
 
+### Example 2: Conflicting Perspectives
+
+**Inputs:**
+
+- **Archaeology**: No recent code changes in affected files (last change 3 months ago), confidence 60%
+- **Infrastructure**: Classified as infrastructure-flake with 85% confidence, intermittent test runner timeout pattern
+- **Correlation**: Found 3 similar issues with same timeout pattern in last 30 days, all resolved by retry
+
+**Aggregated Output:**
+
+```json
+{
+  "root_cause": "Intermittent test runner timeout pattern consistent with infrastructure instability. Multiple similar failures in past 30 days (3 occurrences) all resolved by retry without code changes.",
+  "minority_report": "Code Archaeologist notes no recent changes in affected test files (last change 3 months ago), suggesting this is not a regression but could indicate latent timing sensitivity in the test itself.",
+  "failure_category": "infrastructure",
+  "affected_components": ["qa/test/integration/sensor_test.go"],
+  "confidence": "High",
+  "risk_assessment": "Low",
+  "proposed_fix": "Retry build - infrastructure timeout pattern",
+  "relevant_logs": "timeout: test exceeded 10m deadline",
+
+  "problematic_commit": null,
+  "problematic_pr": null,
+
+  "is_infrastructure_flake": true,
+  "infrastructure_workaround": "Retry test execution with increased timeout threshold",
+
+  "similar_issues": [
+    {
+      "key": "ROX-12001",
+      "similarity": 88,
+      "root_cause": "Infrastructure timeout",
+      "solution": "Retry succeeded"
+    },
+    {
+      "key": "ROX-12055",
+      "similarity": 85,
+      "root_cause": "Test runner timeout",
+      "solution": "Retry succeeded"
+    }
+  ],
+  "failure_frequency": {
+    "count_30d": 3,
+    "classification": "Medium",
+    "trend": "increasing"
+  },
+
+  "investigation_method": "multi_agent_parallel"
+}
+```
+
 **Confidence Calculation:**
 - Base: 90 (Infrastructure Detective)
 - +10 (very recent code change)
@@ -196,3 +265,5 @@ Before writing to `deep_analysis`, sanitize all text fields:
 - **Investigation Method**: Always set to `"multi_agent_parallel"` when using multi-agent RCA
 - **Null Handling**: If an agent fails or returns no data, use `null` for its fields (don't fail the whole aggregation)
 - **Fallback**: If aggregation fails, fall back to single sequential analysis or description-only mode
+- **Minority Report**: Include dissenting perspectives with confidence ≥50% to highlight uncertainty; set to `null` when agents reach consensus
+- **Synthesis Over Hierarchy**: Combine all agent findings into a coherent narrative rather than strictly following an authority hierarchy