Commit 0a14004

specs: agentic openshift lightspeed evaluation
1 parent 72eeedd commit 0a14004

1 file changed

Lines changed: 250 additions & 0 deletions

@@ -0,0 +1,250 @@
# Event-Driven Agent Evaluation: OpenShift Agentic Lightspeed

Asutosh Samal (@asamal4), Carmelo Riolo (@rioloc), Alberto Falossi (@falox)
Rev. 0.1 (Apr 27, 2026)
Rev. 0.2 (May 12, 2026; agents config override)
Rev. 0.3 (May 15, 2026; agent name change for Agentic OL)

**Scope:** OpenShift Agentic Lightspeed integration, generic agent framework, evaluation mechanisms

## 1\. Overview

lightspeed-eval currently assumes a synchronous HTTP request-response exchange. OpenShift Agentic Lightspeed and similar systems are event-driven: CRDs are applied, workflows are executed, and cluster state changes. The "answer" is therefore a trajectory of events plus a final cluster state rather than a single response body.

**Solution:** Introduce a generic agents configuration layer. The HTTP API becomes one agent type, and OpenShift Agentic Lightspeed is the first non-HTTP agent. The framework evaluates agent results using deterministic assertions (parity with operator eval) and, in the future, LLM-as-judge.

## 2\. Key Decisions & Open Questions

| \# | Decision | Choice |
| :---- | :---- | :---- |
| D1 | Config structure | Top-level agents: block with default.agent (selection) \+ default.agent\_config (fallback properties) and named agent definitions. The same `agent` \+ `agent_config` field names are used in eval\_data for consistency. |
| D2 | CRD operations approach | *Open: K8s Python client or kubectl/oc subprocess (see Section 6)* |
| D3 | Proposal input | Inline spec dict in eval\_data (same fields as operator EvalSuite) |
| D4 | Evaluation metric | Single custom:proposal\_status metric covering all assertion checks |
| D5 | query for OpenShift Agentic Lightspeed | request is accepted as an alias for query; the driver injects it into the proposal CR |
| D6 | LLM-as-judge | Needed for behavioural testing; approach TBD |
| D7 | Polling vs Watch | Simple polling |
| D8 | Backward compatibility | The existing api: block auto-migrates to agents: |

## 3\. Configuration Architecture

### 3.1 Agents Block (system.yaml)

```
agents:
  enabled: true                   # Master switch: disables all agent execution when false

  default:
    agent: ols_api                # Used when eval_data doesn't specify an agent
    agent_config:                 # Fallback properties for agents that don't set their own
      timeout: 600
      retry: 3

  ols_api:
    type: http_api
    api_base: http://localhost:8080
    endpoint_type: streaming
    provider: openai
    model: gpt-4o

  openshift_agentic_lightspeed:
    type: proposal
    kubeconfig: ${KUBECONFIG}
    namespace: openshift-lightspeed
    auto_approve: true
    cleanup_proposals: true       # Delete eval proposals after status is captured
    timeout: 900                  # Explicit: ignores default.agent_config.timeout
    poll_interval: 2              # Seconds between status polls
```

**Structure:** `enabled` is a master switch at the agents level: it controls whether any agent execution happens at all (as opposed to using pre-filled data). A per-agent `enabled` flag could be considered later to enable or disable individual agents independently. `default` holds the agent selection (`agent`) and the fallback properties (`agent_config`). Every other key with a `type:` field is an agent definition. CRD coordinates (crd\_group, crd\_version, crd\_plural, crd\_kind) are configurable to support other CRD-based agents.

**`agent` + `agent_config` consistency:** The same field names are used in both system.yaml (`default.agent`, `default.agent_config`) and eval\_data (`agent`, `agent_config`) for clarity.

### 3.2 Config Resolution

```
eval_data.agent_config  >  agents.<name> typed fields  >  default.agent_config
(highest priority)         (agent-specific)               (fallback for unset fields only)
```

**Note:** `default.agent_config` applies only to fields the agent did not set explicitly. This prevents system defaults from silently overriding agent-specific values.

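The precedence chain above can be illustrated with a small merge function. This is a minimal sketch, not the framework's implementation: `resolve_agent_config` is a hypothetical helper name, the merge is shallow (nested dicts are replaced, not merged), and the values are taken from the system.yaml and eval\_data examples in this spec.

```python
from typing import Any, Dict, Optional


def resolve_agent_config(
    default_cfg: Optional[Dict[str, Any]],
    agent_cfg: Dict[str, Any],
    override_cfg: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Merge the three layers, lowest priority first (hypothetical helper)."""
    resolved: Dict[str, Any] = {}
    resolved.update(default_cfg or {})   # fallback: only survives for fields nobody else sets
    resolved.update(agent_cfg)           # agent-specific values win over defaults
    resolved.update(override_cfg or {})  # eval_data.agent_config wins over everything
    return resolved


# Values mirror the examples in sections 3.1 and 3.4.
default_cfg = {"timeout": 600, "retry": 3}
agent_cfg = {"type": "proposal", "namespace": "openshift-lightspeed", "timeout": 900}

base = resolve_agent_config(default_cfg, agent_cfg)
custom = resolve_agent_config(
    default_cfg, agent_cfg, {"namespace": "custom-namespace", "timeout": 1200}
)
```

In `base`, the agent's explicit `timeout` of 900 is kept while `retry` is filled from the defaults; in `custom`, the per-group override replaces both `namespace` and `timeout`.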
### 3.3 Backward Compatibility

The existing api: block should auto-migrate to agents: via a Pydantic model\_validator. Migration fires only when agents: is absent; when both blocks exist, agents: takes precedence.

```
# Migration output
agents:
  enabled: true/false    # From api.enabled
  default:
    agent: api           # Key name, distinct from type
  api:                   # Named agent definition
    type: http_api       # Type is separate
    api_base: ...
```

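The migration rule can be sketched as a plain function over the raw config dict. This is an illustration only: `migrate_api_block` is a hypothetical name, dropping the legacy `api` key and defaulting `enabled` to true when `api.enabled` is missing are both assumptions, and in the real framework this logic would presumably be called from a Pydantic `model_validator` running before field validation.

```python
from typing import Any, Dict


def migrate_api_block(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the raw system config with api: migrated to agents:."""
    if "agents" in raw or "api" not in raw:
        return dict(raw)  # agents: takes precedence; nothing to migrate

    api_cfg = dict(raw["api"])
    enabled = api_cfg.pop("enabled", True)  # assumption: enabled when api.enabled is missing

    # Assumption: the legacy api: key is dropped once migrated.
    migrated = {key: value for key, value in raw.items() if key != "api"}
    migrated["agents"] = {
        "enabled": enabled,                      # from api.enabled
        "default": {"agent": "api"},             # key name, distinct from type
        "api": {"type": "http_api", **api_cfg},  # named agent definition
    }
    return migrated


legacy = {"api": {"enabled": False, "api_base": "http://localhost:8080"}}
migrated = migrate_api_block(legacy)
```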
### 3.4 eval\_data Agent Selection

```
conversation_groups:
  - conversation_group_id: legacy_tests                          # No agent = uses default
    turns: [...]

  - conversation_group_id: openshift_agentic_lightspeed_tests    # Explicit agent
    agent: openshift_agentic_lightspeed
    turns: [...]

  - conversation_group_id: openshift_agentic_lightspeed_custom   # Per-group config override
    agent: openshift_agentic_lightspeed
    agent_config:
      namespace: custom-namespace
      timeout: 1200
    turns: [...]
```

## 4\. Agent Driver Architecture

### 4.1 Driver Interface

```
from abc import ABC, abstractmethod
from typing import Optional


class AgentDriver(ABC):
    @abstractmethod
    def execute_turn(self, turn_data: "TurnData", config: dict) -> Optional[str]:
        """Enrich turn_data in-place. Return error message or None."""
        ...

    @abstractmethod
    def validate_config(self, config: dict) -> None: ...
```

**Caching:** The plan is to move caching configuration to the framework level, so that it applies uniformly to all components (agents, judge LLM, embedding model) rather than being configured per component.

### 4.2 Driver Registry & Pipeline Integration

```
AGENT_DRIVERS = {
    "http_api": HttpApiDriver,    # Wraps existing APIDataAmender
    "proposal": ProposalDriver,   # Proposal lifecycle, managed by kubectl/oc or k8s client
}
```

Pipeline change: the driver should replace the amender call. Metrics are agent-agnostic.

```
Current:  processor -> APIDataAmender -> MetricsEvaluator
New:      processor -> AgentDriver.execute_turn() -> MetricsEvaluator
```

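How the pipeline might pick a driver from the registry can be sketched as follows. Everything beyond `AgentDriver` and `AGENT_DRIVERS` is an assumption made for illustration: `select_driver` and `EchoDriver` are hypothetical names, `EchoDriver` stands in for the real `HttpApiDriver` and `ProposalDriver`, and turn data is modelled as a plain dict rather than the framework's TurnData model.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional, Tuple, Type


class AgentDriver(ABC):
    @abstractmethod
    def execute_turn(self, turn_data: Any, config: dict) -> Optional[str]: ...

    @abstractmethod
    def validate_config(self, config: dict) -> None: ...


class EchoDriver(AgentDriver):
    """Hypothetical stand-in for HttpApiDriver / ProposalDriver."""

    def execute_turn(self, turn_data: Any, config: dict) -> Optional[str]:
        turn_data["response"] = f"handled by {config['type']}"
        return None  # None signals success

    def validate_config(self, config: dict) -> None:
        if "type" not in config:
            raise ValueError("agent definition needs a 'type' field")


AGENT_DRIVERS: Dict[str, Type[AgentDriver]] = {
    "http_api": EchoDriver,
    "proposal": EchoDriver,
}


def select_driver(agents_cfg: dict, agent_name: Optional[str]) -> Tuple[AgentDriver, dict]:
    """Pick the named agent (or the default one) and instantiate its driver."""
    name = agent_name or agents_cfg["default"]["agent"]
    config = agents_cfg.get(name)
    if not isinstance(config, dict) or "type" not in config:
        raise KeyError(f"unknown agent: {name}")
    driver = AGENT_DRIVERS[config["type"]]()
    driver.validate_config(config)
    return driver, config


agents_cfg = {
    "default": {"agent": "ols_api"},
    "ols_api": {"type": "http_api"},
    "openshift_agentic_lightspeed": {"type": "proposal"},
}
driver, config = select_driver(agents_cfg, None)  # no agent in eval_data: use default
turn = {"query": "scale the deployment"}
error = driver.execute_turn(turn, config)
```

Because the evaluator only sees the enriched turn data, it needs no knowledge of which driver produced it.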
## 5\. OpenShift Agentic Lightspeed Flow

### 5.1 Lifecycle

```
1. Build Proposal CR     ← Merge proposal_spec + request + agent config
2. Create CR on cluster  ← Auto-generated name: eval-<uuid8>
3. Poll status           ← Loop every poll_interval seconds
4. Auto-approve          ← If phase == Proposed and auto_approve enabled
5. Terminal phase        ← Completed / Failed / Denied / Escalated
6. Populate turn_data    ← proposal_status (full dict) + response (summary text)
7. Cleanup proposal CR   ← Delete the created CR (if cleanup_proposals enabled)
8. Metrics evaluate      ← custom:proposal_status on enriched data
```

The driver manages the full proposal lifecycle, from create through cleanup. Setup/cleanup scripts are needed only for **infrastructure** (deploying the agent, llmprovider, sandbox, and any other required CRs to the cluster).

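Steps 2 to 7 of the lifecycle can be sketched as a polling loop with the cluster operations injected as callables, which keeps the sketch independent of the open K8s client vs kubectl/oc decision. `run_proposal` and its parameters are hypothetical names, and the non-terminal phases `Pending` and `Executing` used in the fake cluster are assumptions; only `Proposed` and the four terminal phases come from this spec.

```python
import time
import uuid
from typing import Any, Callable, Dict, Optional, Tuple

TERMINAL_PHASES = {"Completed", "Failed", "Denied", "Escalated"}


def run_proposal(
    create: Callable[[str], None],
    get_status: Callable[[str], Dict[str, Any]],
    approve: Callable[[str], None],
    delete: Callable[[str], None],
    config: Dict[str, Any],
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Optional[Dict[str, Any]], Optional[str]]:
    """Return (proposal_status, error); exactly one of them is None."""
    name = f"eval-{uuid.uuid4().hex[:8]}"  # auto-generated name: eval-<uuid8>
    deadline = time.monotonic() + config["timeout"]
    create(name)
    try:
        approved = False
        while time.monotonic() < deadline:
            status = get_status(name)
            phase = status.get("phase")
            if phase in TERMINAL_PHASES:
                return status, None
            if phase == "Proposed" and config.get("auto_approve") and not approved:
                approve(name)
                approved = True
            sleep(config["poll_interval"])
        return None, f"timed out after {config['timeout']}s waiting for {name}"
    finally:
        if config.get("cleanup_proposals"):
            delete(name)  # runs on success, timeout, and unexpected errors


# In-memory fake cluster so the sketch runs without a real API server.
phases = iter(["Pending", "Proposed", "Executing", "Completed"])
calls = []
status, error = run_proposal(
    create=lambda name: calls.append("create"),
    get_status=lambda name: {"phase": next(phases)},
    approve=lambda name: calls.append("approve"),
    delete=lambda name: calls.append("delete"),
    config={"timeout": 900, "poll_interval": 2, "auto_approve": True, "cleanup_proposals": True},
    sleep=lambda seconds: None,  # skip real waiting in the example
)
```

Placing the delete in a `finally` block is one way to make sure eval proposals do not accumulate on the cluster when a run times out.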
### 5.2 Data Model

New TurnData fields:

| Field | Type | Source | Purpose |
| :---- | :---- | :---- | :---- |
| description | Optional\[str\] | User | Human-readable label for reports. Falls back to query. |
| proposal\_spec | Optional\[dict\] | User | Inline proposal spec |
| expected\_proposal\_status | Optional\[dict\] | User | Assertions to check against proposal\_status |
| proposal\_status | Optional\[dict\] | Framework | Raw CRD status populated by the driver. Saved in amended data. |

**query** remains required, and **request** is accepted as an alias for it. For OpenShift Agentic Lightspeed, the driver injects the query/request value into the proposal CR's request field.

New EvaluationData fields: agent: Optional\[str\], agent\_config: Optional\[dict\]

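The alias and fallback rules for the new fields can be sketched with a standard-library dataclass. This is only an illustration: the real TurnData is an existing framework model, `TurnFields` and `parse_turn` are hypothetical names, and rejecting input that sets both `query` and `request` is an assumed policy that the spec does not define.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class TurnFields:
    query: str
    description: Optional[str] = None
    proposal_spec: Optional[Dict[str, Any]] = None
    expected_proposal_status: Optional[Dict[str, Any]] = None
    proposal_status: Optional[Dict[str, Any]] = None  # filled in by the driver

    @property
    def label(self) -> str:
        """Label shown in reports: description, falling back to query."""
        return self.description or self.query


def parse_turn(raw: Dict[str, Any]) -> TurnFields:
    """Build TurnFields from raw eval_data, accepting request as an alias for query."""
    data = dict(raw)
    if "request" in data:
        if "query" in data:
            raise ValueError("set either 'query' or 'request', not both")  # assumed policy
        data["query"] = data.pop("request")
    if "query" not in data:
        raise ValueError("'query' (or its alias 'request') is required")
    return TurnFields(**data)


turn = parse_turn(
    {
        "request": "Scale the frontend to 3 replicas",
        "expected_proposal_status": {"phase": "Completed"},
    }
)
```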
## 6\. Open Decision: K8s Python Client vs kubectl/oc

Both options implement the same AgentDriver interface, so the rest of the framework is unaffected by the choice.

| Factor | K8s Python Client | kubectl/oc Subprocess |
| :---- | :---- | :---- |
| New dependency | kubernetes package (\~50MB) | None (oc already needed for setup) |
| Auth | Kubeconfig loading in code | Inherits from shell |
| Code | \~200 LOC | \~100 LOC |
| Debugging | Inspect Python objects | Copy-paste commands to terminal |
| Consistency | Different tool than setup scripts | Same tool as setup scripts |
| Errors | Python exceptions | Parse stderr \+ exit codes |

The evaluation/assertion logic (**custom:proposal\_status**) will be written in Python regardless. This decision affects only the CRD lifecycle operations (create, poll, approve, fetch).

Recommendation (tentative): lean toward kubectl/oc for consistency and fewer dependencies.

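To make the subprocess column of the table concrete, the fetch operation under the kubectl/oc approach might look like the sketch below. The helper names `kubectl_get_cmd` and `fetch_status` are hypothetical, and the resource name `proposals` is an assumption (the CRD plural is configurable per section 3.1). The `run` parameter exists only so the example can be exercised without a cluster.

```python
import json
import subprocess
from typing import Any, Callable, Dict, List, Optional


def kubectl_get_cmd(
    resource: str,
    name: str,
    namespace: str,
    kubeconfig: Optional[str] = None,
    binary: str = "oc",
) -> List[str]:
    """Build the argv for fetching one custom resource as JSON."""
    cmd = [binary, "get", resource, name, "-n", namespace, "-o", "json"]
    if kubeconfig:
        cmd += ["--kubeconfig", kubeconfig]
    return cmd


def fetch_status(
    cmd: List[str], run: Callable[..., Any] = subprocess.run
) -> Dict[str, Any]:
    """Run the command and return the resource's status dict (empty if unset)."""
    result = run(cmd, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        # Subprocess approach: errors arrive as an exit code plus stderr text.
        raise RuntimeError(
            f"{cmd[0]} failed ({result.returncode}): {result.stderr.strip()}"
        )
    return json.loads(result.stdout).get("status", {})


# "proposals" is a placeholder resource name for illustration.
cmd = kubectl_get_cmd("proposals", "eval-1a2b3c4d", "openshift-lightspeed")
```

The resulting argv is also what a developer would paste into a terminal when debugging, which is the "copy-paste" advantage listed in the table.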
## 7\. Evaluation: custom:proposal\_status

### 7.1 Architecture

A single metric should run all assertion checks from expected\_proposal\_status in sequence and stop at the first failure. This mirrors the operator's EvalSuite: one Expect block per case, one result.

Checks should run in this order: phase → timing → analysis → components → execution → verification. Each check returns None (no expectation set, so the check is skipped), (True, reason), or (False, reason).

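The fail-fast chain with its three possible return values could look like the following sketch. Only two of the six checks are shown, the function names are hypothetical, and the status field paths (`phase`, `verification.passed`) are assumptions about the shape of the CRD status.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

CheckResult = Optional[Tuple[bool, str]]
Check = Callable[[Dict[str, Any], Dict[str, Any]], CheckResult]


def check_phase(status: Dict[str, Any], expected: Dict[str, Any]) -> CheckResult:
    """Exact match via phase, or membership via phase_in."""
    actual = status.get("phase")
    if "phase" in expected:
        ok = actual == expected["phase"]
        return ok, f"phase={actual}, expected {expected['phase']}"
    if "phase_in" in expected:
        ok = actual in expected["phase_in"]
        return ok, f"phase={actual}, expected one of {expected['phase_in']}"
    return None  # no phase expectation: skip


def check_verification(status: Dict[str, Any], expected: Dict[str, Any]) -> CheckResult:
    want = expected.get("verification", {}).get("passed")
    if want is None:
        return None
    got = status.get("verification", {}).get("passed")
    return got == want, f"verification.passed={got}, expected {want}"


# Ordered as in the spec; timing, analysis, components and execution omitted for brevity.
CHECKS: List[Check] = [check_phase, check_verification]


def evaluate(
    status: Dict[str, Any], expected: Dict[str, Any], checks: List[Check] = CHECKS
) -> Tuple[bool, str]:
    """Run checks in order and stop at the first failure."""
    passed = []
    for check in checks:
        result = check(status, expected)
        if result is None:
            continue  # check had nothing to assert
        ok, reason = result
        if not ok:
            return False, reason  # fail fast: later checks never run
        passed.append(reason)
    return True, "; ".join(passed) or "no expectations set"


failed = evaluate(
    {"phase": "Failed"},
    {"phase_in": ["Completed", "Escalated"], "verification": {"passed": True}},
)
```

Here `failed` carries the phase mismatch as its reason, and the verification check is never reached.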
### 7.2 expected\_proposal\_status Structure

```
expected_proposal_status:
  phase: Completed                    # Exact match
  phase_in: [Completed, Escalated]    # Alternative: any of these
  max_duration: "5m"
  max_attempts: 3
  analysis:
    min_options: 1
    options:
      - risk_in: [low, medium]
        confidence_in: [medium, high]
        diagnosis_contains: ["scale", "replicas"]
  components:
    - type: remediation_summary
      match: { action: Scale, replicas: 3 }
    - type: risk_assessment
      match_contains: { summary: "low risk" }
      required: [mitigation_steps]
    - type: destructive_action
      absent: true
  execution:
    phase: Succeeded
  verification:
    passed: true
    summary_contains: "3 replicas running"
```

This structure should map 1:1 to the operator's Expectations Go struct (camelCase → snake\_case).

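The camelCase to snake\_case mapping is mechanical, so an existing operator Expect block could in principle be converted automatically. The sketch below assumes simple camelCase keys without acronym runs; `camel_to_snake` and `snake_keys` are hypothetical helpers, and apart from `minOptions` the camelCase names in the example are inferred from the snake\_case fields above rather than taken from the Go struct.

```python
import re
from typing import Any


def camel_to_snake(name: str) -> str:
    """Convert a camelCase key such as minOptions to min_options."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


def snake_keys(value: Any) -> Any:
    """Recursively convert dict keys from camelCase to snake_case."""
    if isinstance(value, dict):
        return {camel_to_snake(key): snake_keys(item) for key, item in value.items()}
    if isinstance(value, list):
        return [snake_keys(item) for item in value]
    return value  # scalars (including string values) are left untouched


# phaseIn is an inferred field name; minOptions appears in section 7.4.
operator_expect = {"phaseIn": ["Completed", "Escalated"], "analysis": {"minOptions": 1}}
converted = snake_keys(operator_expect)
```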
### 7.3 LLM-as-Judge (Future)

LLM-based quality evaluation is planned for a future phase. The approach is TBD: it may use existing metrics (e.g., custom:answer\_correctness on the remediation summary), a new metric, or a combination of both.

### 7.4 Comparison: eval\_data vs Operator EvalSuite

| Aspect | Operator EvalSuite | lightspeed-eval |
| :---- | :---- | :---- |
| Input | case.workflow \+ case.request inline | request (alias for query) \+ proposal\_spec.workflow |
| Label | case.name | turn.description |
| Assertions | case.expect (single block) | turn.expected\_proposal\_status (same semantics) |
| Naming | camelCase (minOptions) | snake\_case (min\_options) |
| Scope | One suite \= one workflow | Mixed agents in one eval run |
| Extra | N/A | Future: LLM-as-judge |

## 8\. Dependencies

If **K8s Python client** (Approach A): new dependency kubernetes\>=28.0.0.

If **kubectl/oc subprocess** (Approach B): no new Python dependencies. oc/kubectl is already required for setup scripts.

Always required: cluster access, RBAC permissions, and an installed operator for real evaluations.

Not required: the operator eval CLI (lightspeed-eval drives the flow).
