Commit 3c860f1

refactor: use strands-evals OpenSearchProvider in agent-evals example
Switch the agent-evals example from direct OpenSearchTraceRetriever + Session conversion to strands-evals OpenSearchProvider. This dogfoods the provider that was upstreamed to strands-agents/evals and makes the pattern portable across observability backends (CloudWatch, Langfuse, OpenSearch).

Changes:
- main.py: use strands_evals.providers.OpenSearchProvider. Anchor scores on the last AgentInvocationSpan (matches provider's own output extraction). Remove mock mode now that the provider handles retrieval end-to-end.
- pyproject.toml: pin strands-agents-evals[opensearch] >= 0.1.15 (OpenSearchProvider landed in 0.1.15). Drop unused strands-agents.
- README: rewrite to match new flow, document EVAL_JUDGE_MODEL, point at eval_canary as the deterministic complementary path.

Verified E2E against local observability-stack: trace retrieval via provider works, HelpfulnessEvaluator runs on Bedrock Claude, score span emits with correct parent linkage and lands in OpenSearch joined to the evaluated trace.

Signed-off-by: Kyle Hounslow <kylhouns@amazon.com>
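
The commit message says scores are anchored on the last AgentInvocationSpan. The sketch below shows one way such a selection could look. It is illustrative only: the `Span` dataclass, its fields, and `last_agent_invocation` are hypothetical stand-ins, not the strands-evals types or the code in main.py, and "last" is assumed here to mean the span with the latest start time.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a retrieved span (not the strands-evals type)."""
    span_id: str
    kind: str       # e.g. "AgentInvocationSpan" or "ToolExecutionSpan"
    start_ns: int   # start timestamp in nanoseconds


def last_agent_invocation(spans: Sequence[Span]) -> Optional[Span]:
    """Return the latest-starting AgentInvocationSpan, or None if there is none."""
    candidates = [s for s in spans if s.kind == "AgentInvocationSpan"]
    # max() with default=None avoids a ValueError on an empty candidate list.
    return max(candidates, key=lambda s: s.start_ns, default=None)
```

The score span would then take the returned span as its parent, which is what keeps the score in the same trace as the evaluated agent run.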
1 parent a723f2e commit 3c860f1

4 files changed

Lines changed: 110 additions & 265 deletions

Lines changed: 27 additions & 60 deletions
@@ -1,12 +1,17 @@
-# Agent Evals — GenAI Observability SDK
+# Agent Evals — strands-evals + OpenSearch
 
-End-to-end evaluation loop: retrieve agent traces from OpenSearch, run LLM-as-judge evaluations, and write score spans back to OpenSearch.
+Run LLM-as-judge evaluations on agent traces stored in OpenSearch. Scores are written back as OTel spans and appear in the same trace waterfall as the original agent.
+
+## How it works
+
+1. `OpenSearchProvider` (from `strands-agents-evals`) fetches the trace by session or trace ID. It wraps `genai-observability-sdk-py` under the hood, so the same retrieval code works across CloudWatch, Langfuse, and OpenSearch backends.
+2. `HelpfulnessEvaluator` runs Bedrock Claude as the judge.
+3. `score()` (from `genai-observability-sdk-py`) emits the result as an OTel GenAI score span back to OpenSearch.
 
 ## Prerequisites
 
-- [observability-stack](../../) running (`docker compose up`)
-- Agent traces indexed in OpenSearch (run any example agent first)
-- AWS credentials configured for Bedrock access (only for LLM-as-judge mode, not needed with `--mock`)
+- observability-stack running (`docker compose up`) with trace data indexed.
+- AWS credentials with Bedrock access (default: `us.anthropic.claude-sonnet-4-20250514-v1:0`).
 
 ## Setup
 
@@ -17,68 +22,30 @@ uv sync
 ## Usage
 
 ```bash
-# LLM-as-judge using Bedrock Claude (requires AWS credentials)
-python main.py <conversation_id>
-
-# Target a specific trace ID
-python main.py --trace-id <trace_id>
-
-# Mock evaluator (no AWS credentials needed, for testing the pipeline)
-python main.py --mock <conversation_id>
-python main.py --mock --trace-id <trace_id>
-```
-
-### Quick start (no AWS needed)
-
-```bash
-# 1. Run the observability-stack with example agents
-cd ../.. && docker compose up -d
-
-# 2. Wait for canary to generate traces, then grab a trace ID from Dashboards
-
-# 3. Run mock eval against that trace
-python main.py --mock --trace-id <trace_id>
-
-# 4. Check OpenSearch Dashboards — the score span appears in the same trace
-```
-
-### LLM-as-judge (Bedrock)
-
-Requires AWS credentials with Bedrock access (`aws configure` or env vars).
-
-```bash
-# Evaluate by conversation ID
-python main.py conv_abc123
+# By session (conversation) ID
+uv run python main.py <session_id>
 
-# Evaluate by trace ID
-python main.py --trace-id d4479d70ec2aa787775b58cc65e77b88
+# By trace ID
+uv run python main.py --trace-id <trace_id>
 ```
 
-Uses `HelpfulnessEvaluator` from [strands-agents/evals](https://github.com/strands-agents/evals) with Claude on Bedrock.
-
-## How it works
-
-1. Retrieves agent traces from OpenSearch using `OpenSearchTraceRetriever` from [genai-observability-sdk-py](https://github.com/opensearch-project/genai-observability-sdk-py)
-2. Converts traces to [strands-agents/evals](https://github.com/strands-agents/evals) `Session` format
-3. Runs `HelpfulnessEvaluator` (LLM-as-judge via Bedrock Claude)
-4. Writes evaluation scores back to OpenSearch as OTel spans via `score()`
-
-Score spans appear in the same trace waterfall as the original agent spans, with attributes following the [OTel GenAI semantic conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-events/).
+Score spans appear in OpenSearch Dashboards on the same trace as the agent spans, tagged with `test.suite.run.id`, `test.case.id`, and `test.case.result.status` (pass/fail).
 
-## Environment variables
+## Configuration
 
-| Variable | Default | Description |
+| Variable | Default | Purpose |
 |---|---|---|
 | `OPENSEARCH_HOST` | `https://localhost:9200` | OpenSearch endpoint |
-| `OPENSEARCH_USER` | `admin` | OpenSearch username |
-| `OPENSEARCH_PASS` | `My_password_123!@#` | OpenSearch password |
-| `OTEL_EXPORTER_OTLP_ENDPOINT` | `localhost:4317` | OTel Collector gRPC endpoint |
-| `OTEL_SERVICE_NAME` | `genai-evals` | Service name for score spans |
+| `OPENSEARCH_USER` / `OPENSEARCH_PASS` | `admin` / `My_password_123!@#` | Basic auth |
+| `OTEL_EXPORTER_OTLP_ENDPOINT` | `localhost:4317` | OTel Collector gRPC |
+| `EVAL_JUDGE_MODEL` | `us.anthropic.claude-sonnet-4-20250514-v1:0` | Bedrock model ID |
 
-## Architecture
+## Applies to any agent framework
 
-```
-Agent traces (OpenSearch) → Retrieve → Evaluate (Bedrock) → Score spans → OpenSearch
-```
+Works against any agent emitting OTel GenAI semantic convention spans (Strands, LangChain, CrewAI, plain OTel SDK). `OpenSearchProvider` handles retrieval; the evaluator does not care which framework produced the traces.
+
+## Related
 
-Works with any agent framework (Strands, LangChain, CrewAI, plain OTel SDK) that emits GenAI semantic convention spans.
+- [strands-agents/evals](https://github.com/strands-agents/evals) — evaluator library + provider interfaces.
+- [opensearch-project/genai-observability-sdk-py](https://github.com/opensearch-project/genai-observability-sdk-py) — retrieval (`OpenSearchTraceRetriever`) and score write-back (`score()`).
+- For continuous background scoring without Bedrock, see [`docker-compose/agent-eval-canary/`](../../../docker-compose/agent-eval-canary/).
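
For readers of the Configuration table in the README diff above, here is a minimal sketch of how those environment variables and their documented defaults could be read. The variable names and defaults come from the table; the `EvalConfig` dataclass and `load_config` helper are hypothetical and may not match how main.py actually reads its settings.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical container for the settings listed in the README table."""
    opensearch_host: str
    opensearch_user: str
    opensearch_pass: str
    otlp_endpoint: str
    judge_model: str


def load_config(env: Mapping[str, str] = os.environ) -> EvalConfig:
    """Read each variable from env, falling back to the documented default."""
    return EvalConfig(
        opensearch_host=env.get("OPENSEARCH_HOST", "https://localhost:9200"),
        opensearch_user=env.get("OPENSEARCH_USER", "admin"),
        opensearch_pass=env.get("OPENSEARCH_PASS", "My_password_123!@#"),
        otlp_endpoint=env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "localhost:4317"),
        judge_model=env.get(
            "EVAL_JUDGE_MODEL", "us.anthropic.claude-sonnet-4-20250514-v1:0"
        ),
    )
```

Passing the environment as a mapping keeps the helper easy to exercise in tests, for example `load_config({"EVAL_JUDGE_MODEL": "some-other-model-id"})`, without touching the real process environment.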
