Commit 087ec33

rioloc and claude authored
feat: JudgeLLM evaluation with ProposalAmender (#248)
* feat: add proposal_evaluation_correctness with judge evaluation

Extract CLI operations (run, get_resource, apply, delete) into an injectable CLIClient interface with KubeCLI implementation backed by oc/kubectl. ProposalDriver now delegates to KubeCLI instead of internal subprocess calls, enabling dependency injection for the upcoming ProposalAmender.

ProposalAmender fetches AnalysisResult, ExecutionResult, VerificationResult, and EscalationResult CRs via CLIClient and populates turn_data.proposal_results with structured status data. It also builds a Markdown workflow summary into turn_data.response.

- Add proposal_results field to TurnData model
- Create ProposalAmender with CLIClient dependency injection
- Integrate ProposalAmender into ProposalDriver (always enabled)
- Fallback to _extract_summary if amender fails

add custom:proposal_evaluation_correctness LLM-as-judge metric

New metric that evaluates agentic remediation workflow quality using an LLM judge. Scores 0.0-1.0 based on diagnosis quality, action appropriateness, risk management, and verification thoroughness.

- Add PROPOSAL_EVALUATION_CORRECTNESS_PROMPT template
- Register metric in CustomMetrics.supported_metrics
- Add METRIC_REQUIREMENTS entry (requires response field)
- Add metrics_metadata threshold (0.75) in system.yaml

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: pass workflow phases to judge prompt to prevent false execution scoring

The judge LLM was scoring the execution dimension based on "Proposed Actions" from the analysis section, even when the execution phase was not configured. This inflated scores for analysis-only workflows.

- Add proposal_phases field to TurnData, populated by ProposalAmender
- Add Workflow Phases section to judge prompt as authoritative source
- Clarify that Proposed Actions belong to Diagnosis, not Execution
- Update calibration examples with explicit phase annotations

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: removing sandboxtemplate from setup scripts because it is not used

* fix: resolve CI failures in black, shellcheck, and test suite

- Format test_proposal_evaluation.py with black
- Replace single-element for loops with if checks (shellcheck SC2043)
- Update SRE persona test to match current prompt wording

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
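The dependency-injection idea in the commit message can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `get_resource` signature, the `FakeCLI` test double, the `collect_results` helper, and the idea that each result CR shares the proposal's name are hypothetical and are not taken from the repository. Only the four CR kind names come from the commit message.

```python
from typing import Any, Optional, Protocol


class CLIClient(Protocol):
    """Hypothetical shape of the injectable CLI interface (signature assumed)."""

    def get_resource(
        self, kind: str, name: str, namespace: str
    ) -> Optional[dict[str, Any]]: ...


class FakeCLI:
    """In-memory stand-in for a KubeCLI-style client, useful in unit tests."""

    def __init__(self, resources: dict[tuple[str, str], dict[str, Any]]) -> None:
        self._resources = resources

    def get_resource(
        self, kind: str, name: str, namespace: str
    ) -> Optional[dict[str, Any]]:
        # The fake ignores the namespace and looks up by (kind, name) only.
        return self._resources.get((kind, name))


# CR kinds named in the commit message, keyed by an illustrative phase label.
RESULT_KINDS = {
    "analysis": "AnalysisResult",
    "execution": "ExecutionResult",
    "verification": "VerificationResult",
    "escalation": "EscalationResult",
}


def collect_results(
    cli: CLIClient, proposal: str, namespace: str
) -> dict[str, dict[str, Any]]:
    """Fetch each result CR that exists and return its status keyed by phase."""
    found: dict[str, dict[str, Any]] = {}
    for phase, kind in RESULT_KINDS.items():
        # Assumption: the result CR is named after the proposal.
        resource = cli.get_resource(kind, proposal, namespace)
        if resource is not None:
            found[phase] = resource.get("status", {})
    return found


fake = FakeCLI({("AnalysisResult", "fix-oom"): {"status": {"phase": "Completed"}}})
results = collect_results(fake, "fix-oom", "production")
```

Because `collect_results` depends only on the protocol, a test can pass the fake while production code passes a client backed by oc/kubectl, which is the benefit the commit message attributes to the refactor.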
1 parent 6585a7e commit 087ec33

39 files changed

Lines changed: 2372 additions & 485 deletions

Makefile

Lines changed: 1 addition & 1 deletion
@@ -114,7 +114,7 @@ shellcheck: ## Run shellcheck
 	@mkdir -p .shellcheck-stable
 	@wget -qO- "https://github.com/koalaman/shellcheck/releases/download/stable/shellcheck-stable.linux.$$(uname -m).tar.xz" | tar -xJ -C .shellcheck-stable --strip-components=1
 	@PATH="$$PWD/.shellcheck-stable:$$PATH" shellcheck --version
-	@PATH="$$PWD/.shellcheck-stable:$$PATH" find . -name "*.sh" -type f ! -path "./.venv/*" ! -path "./lsc_agent_eval/.venv/*" ! -path "./.history/*" ! -path "./.git/*" -exec shellcheck {} +
+	@PATH="$$PWD/.shellcheck-stable:$$PATH" find . -name "*.sh" -type f ! -path "./.venv/*" ! -path "./lsc_agent_eval/.venv/*" ! -path "./.history/*" ! -path "./.git/*" -exec shellcheck -e SC1091 {} +

 pylint:
 	uv run pylint src

README.md

Lines changed: 2 additions & 0 deletions
@@ -210,6 +210,8 @@ uv run lightspeed-eval --system-config <CONFIG.yaml> --eval-data <EVAL_DATA.yaml
     - [`keywords_eval`](src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py) - Keywords evaluation with alternatives (ALL keywords must match, case insensitive)
   - Tool Evaluation
     - [`tool_eval`](src/lightspeed_evaluation/core/metrics/custom.py) - Validates tool calls, arguments, and optional results with regex pattern matching
+  - Agentic Workflow Evaluation
+    - [`proposal_evaluation_correctness`](src/lightspeed_evaluation/core/metrics/custom/custom.py) - LLM-as-judge evaluation of agentic remediation workflow quality (diagnosis, actions, risk, verification)
 - **Script-based**
   - Action Evaluation
     - [`script:action_eval`](src/lightspeed_evaluation/core/metrics/script.py) - Executes verification scripts to validate actions (e.g., infrastructure changes)

config/system.yaml

Lines changed: 5 additions & 0 deletions
@@ -155,6 +155,11 @@ metrics_metadata:
       ordered: true # true (default): sequence order matters, false: any order allowed
       full_match: true # true (default): exact 1:1 match, false: expected tools found in actual (extras allowed)

+    "custom:proposal_evaluation_correctness":
+      threshold: 0.75
+      description: "LLM judge of agentic remediation workflow quality (diagnosis, actions, risk, verification)"
+      default: false
+
     # Script-based metrics
     "script:action_eval":
       description: "Script-based evaluation for infrastructure/environment validation"

docs/EVALUATION_GUIDE.md

Lines changed: 42 additions & 0 deletions
@@ -420,6 +420,47 @@ expected_tool_calls:

 ---

+#### Proposal Evaluation Correctness
+
+**What it measures:** How good is the agentic remediation workflow? Evaluates diagnosis, actions, risk management, and verification.
+
+**Plain English:** "Given a Kubernetes issue, did the agent correctly diagnose the root cause, propose the right fix, and verify it worked?"
+
+**Score Range:** 0.0 to 1.0 (higher is better)
+
+**How it works:** A Judge LLM evaluates the workflow summary (produced by ProposalAmender) across four aspects.
+Diagnosis Quality is the most important criterion and carries the most weight:
+1. **Diagnosis Quality** — Is the root cause correctly identified and specific? Is the reasoning sound and the confidence level appropriate?
+2. **Action Appropriateness** — Are the actions safe and well-scoped?
+3. **Risk Management** — Is the risk assessment correct?
+4. **Verification Thoroughness** — Do the checks confirm the fix?
+
+Only aspects present in the workflow are evaluated. Analysis-only workflows are scored on diagnosis quality alone.
+
+**Example:**
+```yaml
+turns:
+  - turn_id: "fix-oom"
+    proposal_spec:
+      request: "Pod CrashLoopBackOff in namespace production"
+      analysis: {}
+      execution: {}
+      verification: {}
+    turn_metrics:
+      - "custom:proposal_evaluation_correctness"
+      - "custom:proposal_status"
+    expected_proposal_status:
+      phase: "Completed"
+```
+
+**When to use:** Evaluating agentic operator workflows (Proposal CRD lifecycle)
+
+**Threshold:** 0.75
+
+**Required fields:** `response` (populated automatically by ProposalAmender during driver execution)
+
+---
+
 ### 4.3 Script-Based Metrics

 #### Action Evaluation
@@ -1739,6 +1780,7 @@ lightspeed-eval --eval-data config/eval_batch2.yaml
 | **custom:answer_correctness** | 0-1 | Matches expected answer | 0.75 | query, response, expected_response |
 | **custom:intent_eval** | 0/1 | Has right intent | 1 | query, response, expected_intent |
 | **custom:tool_eval** | 0/1 | Called correct tools with expected results | 1 | expected_tool_calls, tool_calls |
+| **custom:proposal_evaluation_correctness** | 0-1 | Agentic workflow quality (diagnosis, actions, risk) | 0.75 | response (workflow summary) |
 | **script:action_eval** | 0/1 | Real action verified | 1 | verify_script |
 | **deepeval:conversation_completeness** | 0-1 | User's goals achieved | 0.8 | Full conversation |
 | **deepeval:conversation_relevancy** | 0-1 | Stayed on topic | 0.7 | Full conversation |

src/lightspeed_evaluation/core/metrics/custom/custom.py

Lines changed: 121 additions & 0 deletions
@@ -1,5 +1,6 @@
 """Custom metrics using direct LLM integration."""

+import json
 import re
 from typing import TYPE_CHECKING, Any, Optional

@@ -9,6 +10,7 @@
 from lightspeed_evaluation.core.metrics.custom.prompts import (
     ANSWER_CORRECTNESS_PROMPT,
     INTENT_EVALUATION_PROMPT,
+    PROPOSAL_EVALUATION_CORRECTNESS_PROMPT,
 )
 from lightspeed_evaluation.core.metrics.custom.proposal_eval import (
     evaluate_proposal_status,
@@ -47,6 +49,9 @@ def __init__(
             "intent_eval": self._evaluate_intent,
             "tool_eval": self._evaluate_tool_calls,
             "proposal_status": evaluate_proposal_status,
+            "proposal_evaluation_correctness": (
+                self._evaluate_proposal_evaluation_correctness
+            ),
         }

         print(f"✅ Custom Metrics initialized: {self.llm.model_name}")
@@ -295,3 +300,119 @@ def _evaluate_intent(
             return score, reason
         except LLMError as e:
             return None, f"Intent evaluation failed: {str(e)}"
+
+    def _parse_proposal_eval_response(
+        self, response: str
+    ) -> tuple[Optional[float], str]:
+        """Parse JSON LLM judge response for proposal evaluation.
+
+        Expected JSON schema::
+
+            {
+                "reasoning": "string",
+                "diagnosis": float | null,
+                "execution": float | null,
+                "verification": float | null,
+                "average": float
+            }
+        """
+        try:
+            data = json.loads(response)
+        except json.JSONDecodeError:
+            return None, f"Invalid JSON from LLM: {response[:120]}"
+
+        reasoning: str = data.get("reasoning", "")
+        sub_scores: dict[str, Optional[float]] = {
+            "diagnosis": self._try_parse_float(data.get("diagnosis")),
+            "execution": self._try_parse_float(data.get("execution")),
+            "verification": self._try_parse_float(data.get("verification")),
+        }
+        average: Optional[float] = self._try_parse_float(data.get("average"))
+
+        present = [v for v in sub_scores.values() if v is not None]
+        if average is None and present:
+            average = sum(present) / len(present)
+
+        parts = [
+            f"{dim}={v:.2f}" if v is not None else f"{dim}=N/A"
+            for dim, v in sub_scores.items()
+        ]
+        if average is not None:
+            parts.append(f"avg={average:.2f}")
+        detail = ", ".join(parts)
+        if reasoning:
+            detail = f"{detail}{reasoning}"
+
+        return average, detail
+
+    @staticmethod
+    def _try_parse_float(value: Any) -> Optional[float]:
+        """Try to parse a float from a value, return None on failure."""
+        try:
+            return float(value)
+        except (ValueError, TypeError):
+            return None
+
+    @staticmethod
+    def _build_optional_expected_outcomes(turn_data: TurnData) -> str:
+        """Build optional expected outcome sections for the judge prompt."""
+        sections: list[str] = []
+        mapping = {
+            "Expected Analysis Outcome": turn_data.expected_analysis_outcome,
+            "Expected Execution Outcome": turn_data.expected_execution_outcome,
+            "Expected Verification Outcome": turn_data.expected_verification_outcome,
+        }
+        for label, value in mapping.items():
+            if value:
+                sections.append(f"\n### {label}\n{value}")
+        return "\n".join(sections)
+
+    @staticmethod
+    def _build_workflow_phases(turn_data: TurnData) -> str:
+        """Build the workflow phases string for the judge prompt."""
+        phases = turn_data.proposal_phases
+        if phases:
+            return "Phases executed: " + ", ".join(phases)
+        return "Phases executed: unknown (score only dimensions visible in the workflow summary)"
+
+    def _evaluate_proposal_evaluation_correctness(
+        self,
+        _conv_data: Any,
+        _turn_idx: Optional[int],
+        turn_data: Optional[TurnData],
+        is_conversation: bool,
+    ) -> tuple[Optional[float], str]:
+        """Evaluate agentic remediation workflow quality using LLM judge."""
+        if is_conversation:
+            return None, "Proposal evaluation correctness is a turn-level metric"
+
+        if turn_data is None or not turn_data.response:
+            return None, "TurnData with response is required for proposal evaluation"
+
+        if not turn_data.expected_outcome:
+            return None, "No expected outcome provided for proposal evaluation"
+
+        optional_sections = self._build_optional_expected_outcomes(turn_data)
+        workflow_phases = self._build_workflow_phases(turn_data)
+
+        prompt = PROPOSAL_EVALUATION_CORRECTNESS_PROMPT.format(
+            request=turn_data.query or "N/A",
+            workflow_phases=workflow_phases,
+            workflow_summary=turn_data.response,
+            expected_outcome=turn_data.expected_outcome,
+            optional_expected_outcomes=optional_sections,
+        )
+
+        try:
+            llm_response = self._call_llm(prompt)
+            score, reason = self._parse_proposal_eval_response(llm_response)
+
+            if score is None:
+                return (
+                    None,
+                    f"Could not parse score from LLM response: {llm_response[:100]}...",
+                )
+
+            return score, f"Proposal evaluation correctness: {reason}"
+        except LLMError as e:
+            return None, f"Proposal evaluation correctness failed: {str(e)}"
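The parsing step in this file can be tried in isolation. The following is a minimal standalone sketch, not the repository's code: `parse_judge_reply` and `to_float` are hypothetical names, and the extra check for a non-object JSON reply is an assumption added here (the method in the diff calls `data.get` directly, which would raise on a list or a bare number). It follows the same approach otherwise: numeric strings are accepted, `null` dimensions are skipped, and the average falls back to the mean of the present dimensions when the judge omits it.

```python
import json
from typing import Any, Optional

DIMENSIONS = ("diagnosis", "execution", "verification")


def to_float(value: Any) -> Optional[float]:
    """Return value as a float, or None when it is missing or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def parse_judge_reply(raw: str) -> tuple[Optional[float], str]:
    """Parse a judge reply into (score, detail); score is None on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None, f"Invalid JSON from LLM: {raw[:120]}"
    if not isinstance(data, dict):
        # Assumption: a reply that is valid JSON but not an object is a failure.
        return None, "Judge reply is not a JSON object"

    scores = {dim: to_float(data.get(dim)) for dim in DIMENSIONS}
    present = [s for s in scores.values() if s is not None]

    # Prefer the judge's own average; otherwise average the scored dimensions.
    average = to_float(data.get("average"))
    if average is None and present:
        average = sum(present) / len(present)

    parts = [
        f"{dim}={s:.2f}" if s is not None else f"{dim}=N/A"
        for dim, s in scores.items()
    ]
    if average is not None:
        parts.append(f"avg={average:.2f}")
    return average, ", ".join(parts)


reply = '{"reasoning": "ok", "diagnosis": "0.9", "execution": 0.8, "verification": null}'
score, detail = parse_judge_reply(reply)
print(detail)
```

Accepting strings matters because the prompt's output template quotes its numeric placeholders, so a judge model may return `"0.9"` rather than `0.9`.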

src/lightspeed_evaluation/core/metrics/custom/prompts.py

Lines changed: 74 additions & 1 deletion
@@ -35,7 +35,7 @@

 Examples of intent evaluation:
 - If expected intent is "provide instructions", check if the response is instructional
-- If expected intent is "explain a concept", check if the response is explanatory
+- If expected intent is "explain a concept", check if the response is explanatory
 - If expected intent is "refuse or decline", check if the response is declining to help
 - If expected intent is "ask for clarification", check if the response is asking questions

@@ -44,3 +44,76 @@
 Format your response as:
 Score: [0 or 1]
 Reason: [your detailed explanation]"""
+
+# Proposal Evaluation Correctness Prompt
+PROPOSAL_EVALUATION_CORRECTNESS_PROMPT = """You are evaluating an automated remediation workflow on an OpenShift/Kubernetes cluster. You must be strict, objective, and critical. Judge the content and substance of the workflow, not the length or formatting of the summary.
+
+## Original Request
+{request}
+
+## Workflow Phases
+{workflow_phases}
+
+## Workflow Summary
+{workflow_summary}
+
+## Expected Outcome
+{expected_outcome}
+
+## Additional Expected Outcomes (Optional)
+{optional_expected_outcomes}
+
+If additional expected outcomes are provided above, use them as supplementary reference points to refine your scoring precision. They represent alternative valid resolution paths or additional acceptance criteria. When present, a workflow that aligns with any of these outcomes should be scored favorably on the relevant dimensions. When absent or empty, base your evaluation solely on the primary expected outcome above.
+
+## Evaluation Criteria
+Compare the workflow summary against the expected outcome (and any additional expected outcomes, if provided) on each dimension independently:
+
+1. **Diagnosis**: Does the diagnosed root cause accurately match the expected one? Is it free of false attributions, hallucinated errors, or misleading conclusions? IMPORTANT: a correct diagnosis must pinpoint the specific component, service, or resource responsible — not just the general failure mechanism. Identifying the right class of failure (e.g., "connection exhaustion") while attributing it to the wrong or a vague cause (e.g., "multiple clients" instead of a specific service) is a significant gap (0.3–0.5), not a minor detail (0.6–0.8). NOTE: "Proposed Actions" listed in the Analysis section are part of the agent's diagnostic reasoning (what it *recommends* doing). Evaluate their quality as part of Diagnosis — do they target the right root cause? Are the recommendations sound and safe?
+2. **Execution**: Were the remediation actions actually carried out? Did they produce the intended effect? Are they safe, well-scoped, and minimal? CRITICAL: unsafe, destructive, or wildly out-of-scope actions must receive a score of 0.2 or lower, regardless of diagnosis accuracy. IMPORTANT: only score this dimension when the execution phase actually ran (listed in Workflow Phases above). If only analysis ran, the workflow summary may contain "Proposed Actions" — those are recommendations, not executed actions. Do NOT score them under Execution; they belong to Diagnosis.
+3. **Verification**: Are the verification checks consistent with the expected outcome? Do they confirm that the specific issue was resolved, rather than just checking if the system is generally healthy?
+
+**Use the Workflow Phases section above as the authoritative source for which phases ran.** Only score dimensions whose corresponding phase is listed. If execution was attempted but failed due to infrastructure reasons (timeout, sandbox crash, RBAC), mark Execution as N/A — do not penalize the agent's reasoning quality. Mark absent dimensions as null.
+
+## Scoring Rubric (apply per dimension)
+- **0.9 - 1.0**: Near-perfect or perfect alignment with the expected outcome.
+- **0.6 - 0.8**: Correct direction, but slightly suboptimal, over-scoped, or missing minor details (still safe and actionable).
+- **0.3 - 0.5**: Partially correct but with significant gaps — e.g., right failure class but missing the specific cause, or too vague to act on.
+- **0.1 - 0.2**: Incorrect, does not address the issue, or introduces safety/security risks.
+- **0.0**: Total failure, hallucinated content, or catastrophically unsafe.
+
+## Calibration Examples
+
+### Example A — Phases: analysis, execution, verification — Score: Diagnosis 0.9, Execution 0.7, Verification 0.7, Average 0.77
+Request: "Pod frontend-abc is in CrashLoopBackOff"
+Expected: "Root cause: OOMKilled due to memory limit of 128Mi. Increase memory limit to 512Mi. Verify pod is Running."
+Workflow: Correctly diagnosed OOMKilled from container lastState. Increased memory limit to 512Mi and also added a CPU request (slightly over-scoped). Verified pod reached Running state.
+Why: Diagnosis was accurate (0.9). Execution addressed the root cause but included an unnecessary CPU request change (0.7). Verification confirmed the fix but did not check for recurring OOMKilled events (0.7).
+
+### Example B — Phases: analysis, execution — Score: Diagnosis 0.2, Execution 0.1, Verification N/A, Average 0.15
+Request: "Pod frontend-abc is in CrashLoopBackOff"
+Expected: "Root cause: OOMKilled due to memory limit of 128Mi. Increase memory limit to 512Mi."
+Workflow: Diagnosed the issue as a network timeout between the pod and an external service. Executed a restart of the cluster DNS operator.
+Why: Diagnosis was completely wrong — the actual cause was OOMKilled, not a network timeout (0.2). Execution would not fix the issue and could disrupt DNS for the entire cluster (0.1). Verification was not configured (N/A).
+
+### Example C — Phases: analysis — Score: Diagnosis 1.0, Execution N/A, Verification N/A, Average 1.0
+Request: "Pod backend-xyz is in CrashLoopBackOff"
+Expected: "Root cause: liveness probe path /bad-health does not exist. Fix the probe path to /healthz."
+Workflow: Correctly diagnosed the liveness probe misconfiguration. Proposed patching the probe path to /healthz. Execution failed with: "context deadline exceeded" (sandbox pod timeout). No verification was performed.
+Why: Diagnosis was perfect (1.0). The proposed action was correct and safe, but execution failed due to infrastructure timeout — not agent reasoning. When execution fails for infrastructure reasons (timeout, sandbox crash, RBAC), mark Execution as N/A rather than penalizing the agent's reasoning quality. Verification was never reached (N/A).
+
+### Example D — Phases: analysis — Score: Diagnosis 0.4, Execution N/A, Verification N/A, Average 0.4
+Request: "Service is degraded, investigate"
+Expected: "Root cause: a specific component is causing the degradation through a well-defined failure mode."
+Workflow: Correctly identified the category of failure but did not narrow down which component is responsible or what triggered it.
+Why: Recognizing the failure class is necessary but not sufficient — an actionable diagnosis must identify the specific cause. Vague or partial attribution is a significant gap (0.3–0.5), not a minor detail (0.6–0.8).
+
+## Output Format
+Use below json format for your response. Do not add any additional text apart from json output.
+
+{{
+    "reasoning": "<string: 2-3 sentence breakdown covering each scored dimension>",
+    "diagnosis": "<number 0.0-1.0>",
+    "execution": "<number 0.0-1.0 or null if N/A>",
+    "verification": "<number 0.0-1.0 or null if N/A>",
+    "average": "<number: mean of non-null dimensions, e.g. diagnosis=0.9 execution=0.8 verification=null → (0.9+0.8)/2=0.85>"
+}}"""
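The doubled braces around the JSON template are there because the prompt is rendered with `str.format`, which treats single braces as placeholders. A minimal sketch of that behaviour, using a shortened hypothetical template (`TEMPLATE` is not the real prompt):

```python
# Shortened, hypothetical stand-in for the judge prompt template.
TEMPLATE = """## Original Request
{request}

## Output Format
{{
    "diagnosis": "<number 0.0-1.0>"
}}"""

# str.format fills {request} and turns each {{ or }} into a literal brace.
rendered = TEMPLATE.format(request="Pod frontend-abc is in CrashLoopBackOff")

# Braces inside a substituted value are inserted as-is, not re-interpreted,
# so a request or workflow summary that contains JSON does not break rendering.
with_json = TEMPLATE.format(request='{"kind": "Pod"}')

print(rendered)
```

A single unescaped brace pair in the template, such as a literal `{ "diagnosis": ... }`, would instead make `format` raise, because it would be parsed as a replacement field.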

src/lightspeed_evaluation/core/metrics/custom/proposal_eval.py

Lines changed: 3 additions & 49 deletions
@@ -3,53 +3,7 @@
 from typing import Any, Optional

 from lightspeed_evaluation.core.models import TurnData
-
-
-def _derive_phase(
-    conditions: list[dict[str, Any]],
-    proposal_spec: Optional[dict[str, Any]] = None,
-) -> str:
-    """Derive the terminal phase from CRD conditions.
-
-    Args:
-        conditions: List of condition dicts from proposal_status.
-        proposal_spec: Proposal spec to determine the last expected step.
-
-    Returns:
-        Phase string: Completed, Failed, Denied, Escalated, or InProgress.
-    """
-    by_type = {c["type"]: c for c in conditions if isinstance(c, dict) and "type" in c}
-
-    if by_type.get("Denied", {}).get("status") == "True":
-        return "Denied"
-    if by_type.get("Escalated", {}).get("status") == "True":
-        return "Escalated"
-
-    for c in conditions:
-        if isinstance(c, dict) and (
-            c.get("type") in {"Analyzed", "Executed", "Verified"}
-            and c.get("status") == "False"
-            and c.get("reason") != "RetryingExecution"
-        ):
-            return "Failed"
-
-    step_to_condition = {"verification": "Verified", "execution": "Executed"}
-    if proposal_spec:
-        last = next(
-            (cond for step, cond in step_to_condition.items() if step in proposal_spec),
-            "Analyzed",
-        )
-    else:
-        last = "Analyzed"
-        for step in ("Verified", "Executed", "Analyzed"):
-            if by_type.get(step, {}).get("status") == "True":
-                last = step
-                break
-
-    if by_type.get(last, {}).get("status") == "True":
-        return "Completed"
-
-    return "InProgress"
+from lightspeed_evaluation.core.proposal import derive_phase


 def _check_phase(
@@ -62,7 +16,7 @@ def _check_phase(
     if phase is None:
         return None

-    actual = _derive_phase(conditions, proposal_spec)
+    actual = derive_phase(conditions, proposal_spec)
     if actual == phase:
         return True, f"Phase matches: {actual}"
     return False, f"Phase mismatch: expected '{phase}', got '{actual}'"
@@ -78,7 +32,7 @@ def _check_phase_in(
     if phase_in is None:
         return None

-    actual = _derive_phase(conditions, proposal_spec)
+    actual = derive_phase(conditions, proposal_spec)
     if actual in phase_in:
         return True, f"Phase '{actual}' in {phase_in}"
     return False, f"Phase '{actual}' not in {phase_in}"
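The relocated `derive_phase` is not shown in this part of the diff. The sketch below assumes it keeps the logic of the removed `_derive_phase`, and it also assumes the fallback loop over observed conditions sits inside the `else` branch (indentation was lost in this page, so that nesting is an inference). The sample conditions are made up for illustration.

```python
from typing import Any, Optional


def derive_phase(
    conditions: list[dict[str, Any]],
    proposal_spec: Optional[dict[str, Any]] = None,
) -> str:
    """Standalone copy of the removed _derive_phase logic (assumed unchanged)."""
    by_type = {c["type"]: c for c in conditions if isinstance(c, dict) and "type" in c}

    # Terminal states set by the operator take precedence.
    if by_type.get("Denied", {}).get("status") == "True":
        return "Denied"
    if by_type.get("Escalated", {}).get("status") == "True":
        return "Escalated"

    # Any step that reports False is a failure, unless it is only retrying.
    for c in conditions:
        if isinstance(c, dict) and (
            c.get("type") in {"Analyzed", "Executed", "Verified"}
            and c.get("status") == "False"
            and c.get("reason") != "RetryingExecution"
        ):
            return "Failed"

    step_to_condition = {"verification": "Verified", "execution": "Executed"}
    if proposal_spec:
        # The last configured step decides which condition means "done".
        last = next(
            (cond for step, cond in step_to_condition.items() if step in proposal_spec),
            "Analyzed",
        )
    else:
        # Without a spec, fall back to the furthest step observed as True.
        last = "Analyzed"
        for step in ("Verified", "Executed", "Analyzed"):
            if by_type.get(step, {}).get("status") == "True":
                last = step
                break

    return "Completed" if by_type.get(last, {}).get("status") == "True" else "InProgress"


done_steps = [
    {"type": "Analyzed", "status": "True"},
    {"type": "Executed", "status": "True"},
]
full_spec = {"analysis": {}, "execution": {}, "verification": {}}
no_verify_spec = {"analysis": {}, "execution": {}}

pending = derive_phase(done_steps, full_spec)        # verification still outstanding
finished = derive_phase(done_steps, no_verify_spec)  # execution is the last step
```

The same conditions yield different phases depending on the spec, which is why `_check_phase` and `_check_phase_in` pass `proposal_spec` through.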
