refactor: use strands-evals OpenSearchProvider in agent-evals example
Switch the agent-evals example from direct OpenSearchTraceRetriever +
Session conversion to strands-evals OpenSearchProvider. This dogfoods
the provider that was upstreamed to strands-agents/evals and makes the
pattern portable across observability backends (CloudWatch, Langfuse,
OpenSearch).
Changes:
- main.py: use strands_evals.providers.OpenSearchProvider. Anchor
  scores on the last AgentInvocationSpan (matches the provider's own
  output extraction). Remove mock mode now that the provider handles
  retrieval end-to-end.
- pyproject.toml: require strands-agents-evals[opensearch] >= 0.1.15
  (OpenSearchProvider landed in 0.1.15). Drop unused strands-agents.
- README: rewrite to match the new flow, document EVAL_JUDGE_MODEL, point
  at eval_canary as the deterministic complementary path.
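To illustrate the "anchor scores on the last AgentInvocationSpan" change, here is a minimal sketch. `SpanInfo` and `last_agent_invocation` are hypothetical stand-ins, not the strands-evals types, and the sketch assumes "last" means the latest start time; the actual main.py may select the span differently.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SpanInfo:
    """Hypothetical, simplified view of a retrieved span (not a library type)."""
    span_id: str
    span_type: str  # assumed to carry the span class name, e.g. "AgentInvocationSpan"
    start_ns: int   # assumed start timestamp in nanoseconds


def last_agent_invocation(spans: Sequence[SpanInfo]) -> Optional[SpanInfo]:
    """Return the agent-invocation span with the latest start time, or None."""
    candidates = [s for s in spans if s.span_type == "AgentInvocationSpan"]
    return max(candidates, key=lambda s: s.start_ns, default=None)
```

The returned span's ID would then serve as the parent of the score span, so the score lands on the same span the final agent output came from.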
Verified E2E against local observability-stack: trace retrieval via
provider works, HelpfulnessEvaluator runs on Bedrock Claude, score
span emits with correct parent linkage and lands in OpenSearch joined
to the evaluated trace.
Signed-off-by: Kyle Hounslow <kylhouns@amazon.com>
End-to-end evaluation loop: retrieve agent traces from OpenSearch, run LLM-as-judge evaluations, and write score spans back to OpenSearch.
+ Run LLM-as-judge evaluations on agent traces stored in OpenSearch. Scores are written back as OTel spans and appear in the same trace waterfall as the original agent spans.
+
+ ## How it works
+
+ 1. `OpenSearchProvider` (from `strands-agents-evals`) fetches the trace by session or trace ID. It wraps `genai-observability-sdk-py` under the hood, so the same retrieval code works across CloudWatch, Langfuse, and OpenSearch backends.
+ 2. `HelpfulnessEvaluator` runs Bedrock Claude as the judge.
+ 3. `score()` (from `genai-observability-sdk-py`) emits the result as an OTel GenAI score span back to OpenSearch.
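The three steps above can be sketched as a fetch, judge, emit pipeline. This is an illustration of the control flow only: `Trace`, `Score`, and `evaluate_trace` are hypothetical names, and the stubs stand in for `OpenSearchProvider`, `HelpfulnessEvaluator`, and `score()`, whose real signatures are not shown in this commit.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Trace:
    """Hypothetical minimal trace: just what a judge needs."""
    trace_id: str
    user_input: str
    agent_output: str


@dataclass
class Score:
    """Hypothetical evaluation result tied to the evaluated trace."""
    trace_id: str
    name: str
    value: float


def evaluate_trace(
    fetch: Callable[[str], Trace],
    judge: Callable[[Trace], float],
    emit: Callable[[Score], None],
    trace_id: str,
) -> Score:
    trace = fetch(trace_id)   # step 1: retrieval (the provider's role)
    value = judge(trace)      # step 2: LLM-as-judge (the evaluator's role)
    result = Score(trace.trace_id, "helpfulness", value)
    emit(result)              # step 3: write-back (the role of score())
    return result


# Stubs so the sketch runs without OpenSearch or Bedrock.
_store = {"t-1": Trace("t-1", "What is OTel?", "An observability framework.")}
emitted: list[Score] = []
result = evaluate_trace(
    fetch=_store.__getitem__,
    judge=lambda t: 1.0 if t.agent_output else 0.0,  # trivial stand-in judge
    emit=emitted.append,
    trace_id="t-1",
)
```

Keeping retrieval, judging, and write-back behind plain callables is one way to get the backend portability the commit message describes: swapping the observability backend only changes what is passed as `fetch`.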
Uses `HelpfulnessEvaluator` from [strands-agents/evals](https://github.com/strands-agents/evals) with Claude on Bedrock.
-
- ## How it works
-
- 1. Retrieves agent traces from OpenSearch using `OpenSearchTraceRetriever` from [genai-observability-sdk-py](https://github.com/opensearch-project/genai-observability-sdk-py)
- 2. Converts traces to [strands-agents/evals](https://github.com/strands-agents/evals) `Session` format
- 3. Runs `HelpfulnessEvaluator` (LLM-as-judge via Bedrock Claude)
- 4. Writes evaluation scores back to OpenSearch as OTel spans via `score()`
-
- Score spans appear in the same trace waterfall as the original agent spans, with attributes following the [OTel GenAI semantic conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-events/).
+ Score spans appear in OpenSearch Dashboards on the same trace as the agent spans, tagged with `test.suite.run.id`, `test.case.id`, and `test.case.result.status` (pass/fail).
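As a rough illustration of the tagging described above, the sketch below builds the three attributes for one score span. The attribute keys come from the README text; the helper name `score_span_attributes` and the 0.5 pass threshold are assumptions for this example, not values taken from the example code.

```python
from __future__ import annotations


def score_span_attributes(
    run_id: str,
    case_id: str,
    score: float,
    threshold: float = 0.5,  # assumed cutoff; the real example may differ
) -> dict[str, str]:
    """Build the test.* attributes for a score span (hypothetical helper)."""
    status = "pass" if score >= threshold else "fail"
    return {
        "test.suite.run.id": run_id,
        "test.case.id": case_id,
        "test.case.result.status": status,
    }
```

Attributes like these let a dashboard query group score spans by run and filter on failed cases without parsing the numeric score.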
Works against any agent emitting OTel GenAI semantic convention spans (Strands, LangChain, CrewAI, plain OTel SDK). `OpenSearchProvider` handles retrieval; the evaluator does not care which framework produced the traces.
+
+ ## Related

- Works with any agent framework (Strands, LangChain, CrewAI, plain OTel SDK) that emits GenAI semantic convention spans.