|
| 1 | +# Event-Driven Agent Evaluation: OpenShift Agentic Lightspeed |
| 2 | + |
| 3 | +Asutosh Samal (@asamal4), Carmelo Riolo (@rioloc), Alberto Falossi (@falox) |
| 4 | +Rev. 0.1 (Apr 27, 2026) |
| 5 | +Rev. 0.2 (May 12, 2026 — agents config override) |
| 6 | +Rev. 0.3 (May 15, 2026 — agent name change for Agentic OL) |
| 7 | + |
| 8 | +**Scope:** OpenShift Agentic Lightspeed integration, generic agent framework, evaluation mechanisms |
| 9 | + |
| 10 | +## 1\. Overview |
| 11 | + |
| 12 | +lightspeed-eval assumes a synchronous HTTP request-response model. OpenShift Agentic Lightspeed and similar systems are event-driven: CRDs are applied, workflows are executed, and cluster state changes. The "answer" is therefore a trajectory of events plus a final cluster state, not a single response body. |
| 13 | + |
| 14 | +**Solution:** Introduce a generic agents configuration layer in which the HTTP API becomes one agent type and OpenShift Agentic Lightspeed is the first non-HTTP agent. The framework evaluates agent results using deterministic assertions (parity with operator eval) and, in the future, LLM-as-judge. |
| 15 | + |
| 16 | +## 2\. Key Decisions & Open Questions |
| 17 | + |
| 18 | +| \# | Decision | Choice | |
| 19 | +| :---- | :---- | :---- | |
| 20 | +| D1 | Config structure | agents: top-level with default.agent (selection) \+ default.agent\_config (fallback properties) and named agent definitions. Same `agent` \+ `agent_config` field names in eval\_data for consistency. | |
| 21 | +| D2 | CRD operations approach | *Open: K8s Python client or kubectl/oc subprocess (see Section 6)* |
| 22 | +| D3 | Proposal input | Inline spec dict in eval\_data (same fields as operator EvalSuite) | |
| 23 | +| D4 | Evaluation metric | Single custom:proposal\_status with all assertion checks | |
| 24 | +| D5 | query for OpenShift Agentic Lightspeed | `request` is accepted as an alias for `query`; the driver injects it into the proposal CR |
| 25 | +| D6 | LLM-as-judge | Needed for behavioural testing \- TBD | |
| 26 | +| D7 | Polling vs Watch | Simple polling | |
| 27 | +| D8 | Backward compatibility | api: block auto-migrates to agents | |
| 28 | + |
| 29 | +## 3\. Configuration Architecture |
| 30 | + |
| 31 | +### 3.1 Agents Block (system.yaml) |
| 32 | + |
| 33 | +``` |
| 34 | +agents: |
| 35 | +  enabled: true # Master switch: disables all agent execution when false
| 36 | +
|
| 37 | +  default:
| 38 | +    agent: ols_api # Used when eval_data doesn't specify
| 39 | +    agent_config: # Fallback properties for agents that don't set their own
| 40 | +      timeout: 600
| 41 | +      retry: 3
| 42 | +
|
| 43 | +  ols_api:
| 44 | +    type: http_api
| 45 | +    api_base: http://localhost:8080
| 46 | +    endpoint_type: streaming
| 47 | +    provider: openai
| 48 | +    model: gpt-4o
| 49 | +
|
| 50 | +  openshift_agentic_lightspeed:
| 51 | +    type: proposal
| 52 | +    kubeconfig: ${KUBECONFIG}
| 53 | +    namespace: openshift-lightspeed
| 54 | +    auto_approve: true
| 55 | +    cleanup_proposals: true # Delete eval proposals after status captured
| 56 | +    timeout: 900 # Explicit: takes precedence over default.agent_config.timeout
| 57 | +    poll_interval: 2
| 58 | +``` |
| 59 | + |
| 60 | +**Structure:** `enabled` is a master switch at the agents level: it controls whether any agent execution happens (as opposed to using pre-filled data). A per-agent `enabled` flag could be considered in the future to enable or disable individual agents independently. `default` holds the agent selection (`agent`) and fallback properties (`agent_config`). Every other key with a `type:` field is an agent definition. CRD coordinates (crd\_group, crd\_version, crd\_plural, crd\_kind) are configurable to support other CRD-based agents. |
| 61 | + |
| 62 | +**`agent` + `agent_config` consistency:** The same field names are used in both system.yaml (`default.agent`, `default.agent_config`) and eval\_data (`agent`, `agent_config`) for clarity. |
| 63 | + |
| 64 | +### 3.2 Config Resolution |
| 65 | + |
| 66 | +``` |
| 67 | +eval_data.agent_config > agents.<name> typed fields > default.agent_config |
| 68 | +(highest priority) (agent-specific) (fallback for unset fields only) |
| 69 | +``` |
| 70 | + |
| 71 | +**Note:** `default.agent_config` only applies to fields the agent didn't explicitly set. This prevents system defaults from silently overriding agent-specific values. |
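
The precedence rule above can be sketched as a three-layer dictionary merge. This is a minimal illustration, not the framework's implementation: the function name `resolve_agent_config` and the plain-dict representation are assumptions made for the example.

```python
from typing import Any, Optional


def resolve_agent_config(
    agent_def: dict[str, Any],
    default_config: Optional[dict[str, Any]] = None,
    override: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Merge the three layers, lowest priority first (hypothetical helper)."""
    resolved = dict(default_config or {})  # default.agent_config: fallback only
    resolved.update(agent_def)             # agents.<name> fields win over defaults
    resolved.update(override or {})        # eval_data.agent_config wins over both
    return resolved


defaults = {"timeout": 600, "retry": 3}
agent = {"type": "proposal", "namespace": "openshift-lightspeed", "timeout": 900}
resolved = resolve_agent_config(agent, defaults, {"namespace": "custom-namespace"})
print(resolved)
```

Because the defaults are applied first and then overwritten, a default can only fill a field the agent left unset, which is the behaviour the note describes.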
| 72 | + |
| 73 | +### 3.3 Backward Compatibility |
| 74 | + |
| 75 | +The existing api: block should auto-migrate to agents: via a Pydantic model\_validator. Migration only fires when agents: is absent. When both exist, agents: takes precedence. |
| 76 | + |
| 77 | +``` |
| 78 | +# Migration output |
| 79 | +agents: |
| 80 | +  enabled: true/false # From api.enabled
| 81 | +  default:
| 82 | +    agent: api # Key name, distinct from type
| 83 | +  api: # Named agent definition
| 84 | +    type: http_api # Type is separate
| 85 | +    api_base: ...
| 86 | +``` |
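
One way to picture the migration is as a pure function over the raw config dict, which could then be called from a Pydantic `model_validator(mode="before")`. The sketch below is an assumption-laden illustration: the function name `migrate_api_block` is hypothetical, and treating a missing `api.enabled` as `True` is a guess, not something this document specifies.

```python
from typing import Any


def migrate_api_block(raw: dict[str, Any]) -> dict[str, Any]:
    """Move a legacy `api:` block under `agents:` (hypothetical helper)."""
    if "agents" in raw or "api" not in raw:
        # agents: takes precedence when present; nothing to do without api:
        return raw
    api = dict(raw["api"])
    enabled = api.pop("enabled", True)  # assumption: missing flag means enabled
    migrated = {k: v for k, v in raw.items() if k != "api"}
    migrated["agents"] = {
        "enabled": enabled,
        "default": {"agent": "api"},         # key name of the migrated agent
        "api": {"type": "http_api", **api},  # named definition, type kept separate
    }
    return migrated


legacy = {"api": {"enabled": False, "api_base": "http://localhost:8080"}}
print(migrate_api_block(legacy))
```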
| 87 | + |
| 88 | +### 3.4 eval\_data Agent Selection |
| 89 | + |
| 90 | +``` |
| 91 | +conversation_groups: |
| 92 | +  - conversation_group_id: legacy_tests # No agent = uses default
| 93 | +    turns: [...]
| 94 | +
|
| 95 | +  - conversation_group_id: openshift_agentic_lightspeed_tests # Explicit agent
| 96 | +    agent: openshift_agentic_lightspeed
| 97 | +    turns: [...]
| 98 | +
|
| 99 | +  - conversation_group_id: openshift_agentic_lightspeed_custom # Per-group config override
| 100 | +    agent: openshift_agentic_lightspeed
| 101 | +    agent_config:
| 102 | +      namespace: custom-namespace
| 103 | +      timeout: 1200
| 104 | +    turns: [...]
| 105 | +``` |
| 106 | + |
| 107 | +## 4\. Agent Driver Architecture |
| 108 | + |
| 109 | +### 4.1 Driver Interface |
| 110 | + |
| 111 | +``` |
| 112 | +class AgentDriver(ABC): |
| 113 | +    @abstractmethod
| 114 | +    def execute_turn(self, turn_data: TurnData, config: dict) -> Optional[str]:
| 115 | +        """Enrich turn_data in-place. Return error message or None."""
| 116 | +        ...
| 117 | +
|
| 118 | +    @abstractmethod
| 119 | +    def validate_config(self, config: dict) -> None: ...
| 120 | +``` |
| 121 | + |
| 122 | +**Caching:** There is a plan to move caching configuration to a framework level, applied uniformly to all components (agents, judge LLM, embedding model) rather than at individual component level. |
| 123 | + |
| 124 | +### 4.2 Driver Registry & Pipeline Integration |
| 125 | + |
| 126 | +``` |
| 127 | +AGENT_DRIVERS = { |
| 128 | +    "http_api": HttpApiDriver, # Wraps existing APIDataAmender
| 129 | +    "proposal": ProposalDriver, # Proposal lifecycle - managed by kubectl/oc or k8s client
| 130 | +} |
| 131 | +``` |
| 132 | + |
| 133 | +Pipeline change: the driver should replace the amender call. Metrics are agent-agnostic. |
| 134 | + |
| 135 | +``` |
| 136 | +Current: processor -> APIDataAmender -> MetricsEvaluator |
| 137 | +New: processor -> AgentDriver.execute_turn() -> MetricsEvaluator |
| 138 | +``` |
| 139 | + |
| 140 | +## 5\. OpenShift Agentic Lightspeed Flow |
| 141 | + |
| 142 | +### 5.1 Lifecycle |
| 143 | + |
| 144 | +``` |
| 145 | +1. Build Proposal CR ← Merge proposal_spec + request + agent config |
| 146 | +2. Create CR on cluster ← Auto-generated name: eval-<uuid8> |
| 147 | +3. Poll status ← Loop every poll_interval seconds |
| 148 | +4. Auto-approve ← If phase == Proposed and auto_approve enabled |
| 149 | +5. Terminal phase ← Completed / Failed / Denied / Escalated |
| 150 | +6. Populate turn_data ← proposal_status (full dict) + response (summary text) |
| 151 | +7. Cleanup proposal CR ← Delete the created CR (if cleanup_proposals enabled) |
| 152 | +8. Metrics evaluate ← custom:proposal_status on enriched data |
| 153 | +``` |
| 154 | +The driver manages the full proposal lifecycle, from creation through cleanup. Setup/cleanup scripts are only needed for **infrastructure** (deploying the agent, llmprovider, sandbox, and any other required CRs to the cluster). |
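
Steps 3 to 5 of the lifecycle (poll, auto-approve, stop at a terminal phase) can be sketched independently of how the cluster is accessed. In this sketch `fetch_status` and `approve` are injected callables standing in for either the K8s client or kubectl/oc; the function name `wait_for_terminal`, the top-level `phase` key in the status dict, and the non-terminal phase `Pending` used in the example are all assumptions.

```python
import time
from typing import Any, Callable, Optional

TERMINAL_PHASES = {"Completed", "Failed", "Denied", "Escalated"}


def wait_for_terminal(
    fetch_status: Callable[[], dict[str, Any]],
    approve: Callable[[], None],
    *,
    auto_approve: bool = True,
    timeout: float = 900,
    poll_interval: float = 2,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[Optional[dict[str, Any]], Optional[str]]:
    """Poll until a terminal phase is seen; return (status, error)."""
    deadline = time.monotonic() + timeout
    approved = False
    while time.monotonic() < deadline:
        status = fetch_status()
        phase = status.get("phase")  # assumed location of the phase field
        if phase in TERMINAL_PHASES:
            return status, None
        if phase == "Proposed" and auto_approve and not approved:
            approve()  # approve once, then keep polling
            approved = True
        sleep(poll_interval)
    return None, f"timed out after {timeout}s"


# Simulated status sequence; "Pending" is a made-up non-terminal phase.
phases = iter([{"phase": "Pending"}, {"phase": "Proposed"},
               {"phase": "Proposed"}, {"phase": "Completed"}])
approvals: list[int] = []
status, err = wait_for_terminal(
    lambda: next(phases), lambda: approvals.append(1), sleep=lambda s: None
)
print(status, err, len(approvals))
```

Injecting `sleep` keeps the loop testable without real delays, and the same loop works for whichever option is chosen in Section 6.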
| 155 | + |
| 156 | +### 5.2 Data Model |
| 157 | + |
| 158 | +TurnData new fields: |
| 159 | + |
| 160 | +| Field | Type | Source | Purpose | |
| 161 | +| :---- | :---- | :---- | :---- | |
| 162 | +| description | Optional\[str\] | User | Human-readable label for reports. Falls back to query. | |
| 163 | +| proposal\_spec | Optional\[dict\] | User | Inline proposal spec | |
| 164 | +| expected\_proposal\_status | Optional\[dict\] | User | Assertions to check against proposal\_status | |
| 165 | +| proposal\_status | Optional\[dict\] | Framework | Raw CRD status populated by driver. Saved in amended data. | |
| 166 | + |
| 167 | +**query** remains required. **request** is accepted as an alias. For OpenShift Agentic Lightspeed, the driver injects the query (or request) value into the proposal CR's request field. |
| 168 | + |
| 169 | +EvaluationData new fields: agent: Optional\[str\], agent\_config: Optional\[dict\] |
| 170 | + |
| 171 | +## 6\. Open Decision: K8s Python Client vs kubectl/oc |
| 172 | + |
| 173 | +Both implement the same AgentDriver interface. The rest of the framework is unaffected. |
| 174 | + |
| 175 | +| Factor | K8s Python Client | kubectl/oc Subprocess | |
| 176 | +| :---- | :---- | :---- | |
| 177 | +| New dependency | kubernetes package (\~50MB) | None (oc already needed for setup) | |
| 178 | +| Auth | Kubeconfig loading in code | Inherits from shell | |
| 179 | +| Code | \~200 LOC | \~100 LOC | |
| 180 | +| Debugging | Inspect Python objects | Copy-paste commands to terminal | |
| 181 | +| Consistency | Different tool than setup scripts | Same tool as setup scripts | |
| 182 | +| Errors | Python exceptions | Parse stderr \+ exit codes | |
| 183 | + |
| 184 | +Evaluation/assertion logic (**custom:proposal\_status**) will be Python regardless. This decision only affects CRD lifecycle operations (create, poll, approve, fetch). |
| 185 | + |
| 186 | +Tentative recommendation: lean toward kubectl/oc for consistency with the setup scripts and fewer dependencies. This decision is still open. |
| 187 | + |
| 188 | +## 7\. Evaluation: custom:proposal\_status |
| 189 | + |
| 190 | +### 7.1 Architecture |
| 191 | + |
| 192 | +A single metric should run all assertion checks from expected\_proposal\_status in sequence and stop at the first failure. This mirrors the operator's EvalSuite: one Expect block per case, one result. |
| 193 | + |
| 194 | +Checks should run in order: phase → timing → analysis → components → execution → verification. Each check returns None (no expectation, skip), (True, reason), or (False, reason). |
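
The three-valued check contract and fail-fast ordering can be sketched as follows. Only two of the six checks are shown, and the status field locations (`status["phase"]`, `status["verification"]["passed"]`) and the function names are assumptions for illustration, not the operator's confirmed schema.

```python
from typing import Any, Callable, Optional

CheckResult = Optional[tuple[bool, str]]
Check = Callable[[dict[str, Any], dict[str, Any]], CheckResult]


def check_phase(expected: dict[str, Any], status: dict[str, Any]) -> CheckResult:
    allowed = expected.get("phase_in") or (
        [expected["phase"]] if "phase" in expected else None
    )
    if allowed is None:
        return None  # no expectation: skip this check
    actual = status.get("phase")
    ok = actual in allowed
    return ok, f"phase {actual!r} {'in' if ok else 'not in'} {allowed}"


def check_verification(expected: dict[str, Any], status: dict[str, Any]) -> CheckResult:
    exp = expected.get("verification")
    if not exp or "passed" not in exp:
        return None
    actual = status.get("verification", {}).get("passed")
    ok = actual == exp["passed"]
    return ok, f"verification.passed is {actual!r}, expected {exp['passed']!r}"


# The full metric would list all six checks in the documented order.
CHECKS: list[Check] = [check_phase, check_verification]


def evaluate(expected: dict[str, Any], status: dict[str, Any]) -> tuple[bool, str]:
    reasons = []
    for check in CHECKS:
        result = check(expected, status)
        if result is None:
            continue
        ok, reason = result
        if not ok:
            return False, reason  # fail fast: later checks are not run
        reasons.append(reason)
    return True, "; ".join(reasons) or "no expectations set"


print(evaluate({"phase": "Completed"}, {"phase": "Completed"}))
```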
| 195 | + |
| 196 | +### 7.2 expected\_proposal\_status Structure |
| 197 | + |
| 198 | +``` |
| 199 | +expected_proposal_status: |
| 200 | + phase: Completed # Exact match |
| 201 | + phase_in: [Completed, Escalated] # Alternative: any of these |
| 202 | + max_duration: "5m" |
| 203 | + max_attempts: 3 |
| 204 | + analysis: |
| 205 | + min_options: 1 |
| 206 | + options: |
| 207 | + - risk_in: [low, medium] |
| 208 | + confidence_in: [medium, high] |
| 209 | + diagnosis_contains: ["scale", "replicas"] |
| 210 | + components: |
| 211 | + - type: remediation_summary |
| 212 | + match: { action: Scale, replicas: 3 } |
| 213 | + - type: risk_assessment |
| 214 | + match_contains: { summary: "low risk" } |
| 215 | + required: [mitigation_steps] |
| 216 | + - type: destructive_action |
| 217 | + absent: true |
| 218 | + execution: |
| 219 | + phase: Succeeded |
| 220 | + verification: |
| 221 | + passed: true |
| 222 | + summary_contains: "3 replicas running" |
| 223 | +``` |
| 224 | + |
| 225 | +This structure should map 1:1 to the operator's Expectations Go struct (camelCase → snake\_case). |
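
If the mapping is purely a naming change, converting an operator-style expectations document could look like the sketch below. The helper names are hypothetical, and apart from `minOptions` the camelCase keys in the example are assumed from the snake\_case names shown above.

```python
import re
from typing import Any


def to_snake(name: str) -> str:
    """Insert an underscore before each interior capital, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def snake_keys(value: Any) -> Any:
    """Recursively convert dict keys; lists and scalars pass through."""
    if isinstance(value, dict):
        return {to_snake(k): snake_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [snake_keys(v) for v in value]
    return value


operator_expect = {"analysis": {"minOptions": 1, "options": [{"riskIn": ["low"]}]}}
print(snake_keys(operator_expect))
```

Two caveats apply to this sketch: acronym-style keys would be split letter by letter, and free-form payload keys (for example those under `match`) would also be renamed, so a real converter would likely need an explicit field list.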
| 226 | + |
| 227 | +### 7.3 LLM-as-Judge (Future) |
| 228 | + |
| 229 | +LLM-based quality evaluation is a future phase. Approach TBD — may use existing metrics (e.g., custom:answer\_correctness on the remediation summary), a new metric, or a combination. |
| 230 | + |
| 231 | +### 7.4 Comparison: eval\_data vs Operator EvalSuite |
| 232 | + |
| 233 | +| Aspect | Operator EvalSuite | lightspeed-eval | |
| 234 | +| :---- | :---- | :---- | |
| 235 | +| Input | case.workflow \+ case.request inline | request (alias for query) \+ proposal\_spec.workflow | |
| 236 | +| Label | case.name | turn.description | |
| 237 | +| Assertions | case.expect (single block) | turn.expected\_proposal\_status (same semantics) | |
| 238 | +| Naming | camelCase (minOptions) | snake\_case (min\_options) | |
| 239 | +| Scope | One suite \= one workflow | Mixed agents in one eval run | |
| 240 | +| Extra | NA | Future: LLM-as-judge | |
| 241 | + |
| 242 | +## 8\. Dependencies |
| 243 | + |
| 244 | +If **K8s Python client** (Approach A): New dependency kubernetes\>=28.0.0 |
| 245 | + |
| 246 | +If **kubectl/oc subprocess** (Approach B): No new Python dependencies. oc/kubectl already required for setup scripts. |
| 247 | + |
| 248 | +Always required: Cluster access, RBAC permissions, operator installed for real evaluations. |
| 249 | + |
| 250 | +Not required: Operator eval CLI (lightspeed-eval drives the flow). |