kylehounslow
diff --git a/examples/agent-evals/genai-sdk/README.md b/examples/agent-evals/genai-sdk/README.md
Lines changed: 27 additions & 60 deletions
@@ -1,12 +1,17 @@
-# Agent Evals — GenAI Observability SDK
+# Agent Evals — strands-evals + OpenSearch
 
-End-to-end evaluation loop: retrieve agent traces from OpenSearch, run LLM-as-judge evaluations, and write score spans back to OpenSearch.
+Run LLM-as-judge evaluations on agent traces stored in OpenSearch. Scores are written back as OTel spans and appear in the same trace waterfall as the original agent spans.
+
+## How it works
+
+1. `OpenSearchProvider` (from `strands-agents-evals`) fetches the trace by session or trace ID. It wraps `genai-observability-sdk-py` under the hood, so the same retrieval code works across CloudWatch, Langfuse, and OpenSearch backends.
+2. `HelpfulnessEvaluator` runs Bedrock Claude as the judge.
+3. `score()` (from `genai-observability-sdk-py`) emits the result as an OTel GenAI score span back to OpenSearch.
 
 ## Prerequisites
 
-- [observability-stack](../../) running (`docker compose up`)
-- Agent traces indexed in OpenSearch (run any example agent first)
-- AWS credentials configured for Bedrock access (only for LLM-as-judge mode, not needed with `--mock`)
+- observability-stack running (`docker compose up`) with trace data indexed.
+- AWS credentials with Bedrock access (default judge model: `us.anthropic.claude-sonnet-4-20250514-v1:0`).
 
 ## Setup
 
@@ -17,68 +22,30 @@ uv sync
 ## Usage
 
 ```bash
-# LLM-as-judge using Bedrock Claude (requires AWS credentials)
-python main.py <conversation_id>
-
-# Target a specific trace ID
-python main.py --trace-id <trace_id>
-
-# Mock evaluator (no AWS credentials needed, for testing the pipeline)
-python main.py --mock <conversation_id>
-python main.py --mock --trace-id <trace_id>
-```
-
-### Quick start (no AWS needed)
-
-```bash
-# 1. Run the observability-stack with example agents
-cd ../.. && docker compose up -d
-
-# 2. Wait for canary to generate traces, then grab a trace ID from Dashboards
-
-# 3. Run mock eval against that trace
-python main.py --mock --trace-id <trace_id>
-
-# 4. Check OpenSearch Dashboards — the score span appears in the same trace
-```
-
-### LLM-as-judge (Bedrock)
-
-Requires AWS credentials with Bedrock access (`aws configure` or env vars).
-
-```bash
-# Evaluate by conversation ID
-python main.py conv_abc123
+# By session (conversation) ID
+uv run python main.py <session_id>
 
-# Evaluate by trace ID
-python main.py --trace-id d4479d70ec2aa787775b58cc65e77b88
+# By trace ID
+uv run python main.py --trace-id <trace_id>
 ```
 
-Uses `HelpfulnessEvaluator` from [strands-agents/evals](https://github.com/strands-agents/evals) with Claude on Bedrock.
-
-## How it works
-
-1. Retrieves agent traces from OpenSearch using `OpenSearchTraceRetriever` from [genai-observability-sdk-py](https://github.com/opensearch-project/genai-observability-sdk-py)
-2. Converts traces to [strands-agents/evals](https://github.com/strands-agents/evals) `Session` format
-3. Runs `HelpfulnessEvaluator` (LLM-as-judge via Bedrock Claude)
-4. Writes evaluation scores back to OpenSearch as OTel spans via `score()`
-
-Score spans appear in the same trace waterfall as the original agent spans, with attributes following the [OTel GenAI semantic conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-events/).
+Score spans appear in OpenSearch Dashboards on the same trace as the agent spans, tagged with `test.suite.run.id`, `test.case.id`, and `test.case.result.status` (pass/fail).
 
-## Environment variables
+## Configuration
 
-| Variable | Default | Description |
+| Variable | Default | Purpose |
 |---|---|---|
 | `OPENSEARCH_HOST` | `https://localhost:9200` | OpenSearch endpoint |
-| `OPENSEARCH_USER` | `admin` | OpenSearch username |
-| `OPENSEARCH_PASS` | `My_password_123!@#` | OpenSearch password |
-| `OTEL_EXPORTER_OTLP_ENDPOINT` | `localhost:4317` | OTel Collector gRPC endpoint |
-| `OTEL_SERVICE_NAME` | `genai-evals` | Service name for score spans |
+| `OPENSEARCH_USER` / `OPENSEARCH_PASS` | `admin` / `My_password_123!@#` | Basic auth |
+| `OTEL_EXPORTER_OTLP_ENDPOINT` | `localhost:4317` | OTel Collector gRPC |
+| `EVAL_JUDGE_MODEL` | `us.anthropic.claude-sonnet-4-20250514-v1:0` | Bedrock model ID |
 
-## Architecture
+## Applies to any agent framework
 
-```
-Agent traces (OpenSearch) → Retrieve → Evaluate (Bedrock) → Score spans → OpenSearch
-```
+Works with any agent that emits spans following the OTel GenAI semantic conventions (Strands, LangChain, CrewAI, plain OTel SDK). `OpenSearchProvider` handles retrieval, and the evaluator is independent of the framework that produced the traces.
+
+## Related
 
-Works with any agent framework (Strands, LangChain, CrewAI, plain OTel SDK) that emits GenAI semantic convention spans.
+- [strands-agents/evals](https://github.com/strands-agents/evals) — evaluator library + provider interfaces.
+- [opensearch-project/genai-observability-sdk-py](https://github.com/opensearch-project/genai-observability-sdk-py) — retrieval (`OpenSearchTraceRetriever`) and score write-back (`score()`).
+- For continuous background scoring without Bedrock, see [`docker-compose/agent-eval-canary/`](../../../docker-compose/agent-eval-canary/).
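
A minimal sketch of how the settings in the Configuration table and the two invocation forms under Usage might be resolved in Python. This is not the example's actual `main.py`: `EvalConfig`, `load_config`, and `parse_target` are hypothetical names introduced for illustration. Only the environment variable names, their defaults, and the `--trace-id` flag are taken from the README.

```python
import argparse
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence, Tuple


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical container for the settings listed in the Configuration table."""
    opensearch_host: str
    opensearch_user: str
    opensearch_pass: str
    otlp_endpoint: str
    judge_model: str


def load_config(env: Optional[Mapping[str, str]] = None) -> EvalConfig:
    """Read settings from the environment, falling back to the README defaults."""
    env = os.environ if env is None else env
    return EvalConfig(
        opensearch_host=env.get("OPENSEARCH_HOST", "https://localhost:9200"),
        opensearch_user=env.get("OPENSEARCH_USER", "admin"),
        opensearch_pass=env.get("OPENSEARCH_PASS", "My_password_123!@#"),
        otlp_endpoint=env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "localhost:4317"),
        judge_model=env.get(
            "EVAL_JUDGE_MODEL", "us.anthropic.claude-sonnet-4-20250514-v1:0"
        ),
    )


def parse_target(argv: Sequence[str]) -> Tuple[str, str]:
    """Return ("trace", id) when --trace-id is given, else ("session", id)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("session_id", nargs="?", help="session (conversation) ID")
    parser.add_argument("--trace-id", help="evaluate a single trace instead")
    args = parser.parse_args(argv)
    if args.trace_id:
        return ("trace", args.trace_id)
    if args.session_id:
        return ("session", args.session_id)
    # parser.error prints usage and exits, so nothing is returned past this point.
    parser.error("provide a session ID or --trace-id")


if __name__ == "__main__":
    import sys

    kind, identifier = parse_target(sys.argv[1:])
    config = load_config()
    print(f"Evaluating {kind} {identifier} against {config.opensearch_host}")
```

Passing the environment as a mapping keeps the defaults testable without touching the real process environment. Giving `--trace-id` precedence when both a session ID and a trace ID are supplied is an assumption of this sketch.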