Commit 64f364f

docs: fix SDK API references to match actual v0.2.8 code (opensearch-project#146)
* docs: fix SDK API references to match actual v0.2.8 code

Cross-checked all SDK documentation against the actual opensearch-genai-observability-sdk-py v0.2.8 source code and fixed multiple discrepancies:

- Rename `Experiment` to `Benchmark` throughout (the SDK class is `Benchmark`, not `Experiment`)
- Fix `ExperimentSummary` -> `BenchmarkSummary`, `CaseResult` -> `TestCaseResult` in result type references
- Add missing exports to API overview table: `BenchmarkResult`, `BenchmarkSummary`, `TestCaseResult`, `ScoreSummary`
- Remove non-existent individual extras (`[cohere]`, `[mistral]`, `[groq]`, `[ollama]`) from auto-instrumentation table - these are only available via `[otel-instrumentors]` bundle
- Add missing `[google]` extra to installation section
- Add missing env vars `OTEL_EXPORTER_OTLP_TRACES_PROTOCOL` and `OTEL_EXPORTER_OTLP_PROTOCOL` to environment variables table
- Show actual default endpoint URL instead of "Data Prepper default"
- Add `Benchmark` constructor parameters documentation
- Add manual (non-context-manager) usage example for `Benchmark`
- Fix evaluation-integrations.mdx variable names for consistency

Signed-off-by: Vamsi Manohar <reddyvam@amazon.com>

* docs: fix evaluation-integrations and framework integrations pages

evaluation-integrations.mdx:
- Add version caveat Aside for third-party framework APIs
- Add install commands (pip install) for each section
- DeepEval: add tip about deepeval.evaluate() alternative
- RAGAS: add version note about v0.2+ API changes, fix iteration to use df.iterrows() instead of enumerate(itertuples())
- MLflow: fix to use DataFrame input, add mlflow.start_run() context, add version note about model_type deprecation in 2.12+
- pytest: move register() to conftest.py session-scoped fixture to avoid re-initialization across test files, add install command, rename scorer to avoid shadowing builtin
- Fix related links: restore "Experiments" in link text to match actual page title at /docs/agent-health/evaluations/experiments/

integrations.mdx:
- CrewAI: suggest [otel-instrumentors] extra which includes CrewAI auto-instrumentation, add note about what it provides
- OpenAI Agents SDK: add import comment clarifying package name

Signed-off-by: Vamsi Manohar <reddyvam@amazon.com>

* docs: remove agent-health experiments link from evaluation-integrations

Address review comment — keep related links consistent with evaluation.mdx.

Signed-off-by: Vamshi Vijay Nakkirtha <vamsimanohar@gmail.com>
Signed-off-by: Vamsi Manohar <reddyvam@amazon.com>

---------

Signed-off-by: Vamsi Manohar <reddyvam@amazon.com>
1 parent ea76e11 commit 64f364f

6 files changed

Lines changed: 141 additions & 86 deletions

docs/starlight-docs/src/content/docs/ai-observability/evaluation-integrations.mdx

Lines changed: 78 additions & 47 deletions
@@ -3,7 +3,13 @@ title: "Evaluation Integrations"
 description: "Bridge external evaluation frameworks like DeepEval, RAGAS, MLflow, and pytest into the observability stack"
 ---
 
-The `Experiment` class bridges any evaluation framework into the observability stack. Run your evaluations with your preferred tool, then upload the results as OTel spans so everything is queryable in one place.
+import { Aside } from '@astrojs/starlight/components';
+
+The `Benchmark` class bridges any evaluation framework into the observability stack. Run your evaluations with your preferred tool, then upload the results as OTel spans so everything is queryable in one place.
+
+<Aside type="note">
+Third-party framework APIs change frequently. The examples below show the integration pattern — adapt imports and method calls to your installed version. The SDK side (`Benchmark`, `b.log()`, `evaluate()`, `EvalScore`) is stable.
+</Aside>
 
 ---
 
@@ -12,15 +18,15 @@ The `Experiment` class bridges any evaluation framework into the observability s
 [DeepEval](https://github.com/confident-ai/deepeval) provides LLM-as-judge metrics like faithfulness, answer relevancy, and hallucination detection.
 
 ```bash
-pip install deepeval
+pip install opensearch-genai-observability-sdk-py deepeval
 ```
 
 ```python
-from opensearch_genai_observability_sdk_py import register, Experiment
+from opensearch_genai_observability_sdk_py import register, Benchmark
 from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
 from deepeval.test_case import LLMTestCase
 
-register(service_name="deepeval-experiment")
+register(service_name="deepeval-eval")
 
 # Define your test cases
 test_cases = [
@@ -42,7 +48,7 @@ test_cases = [
 relevancy = AnswerRelevancyMetric(model="gpt-4o")
 faithfulness = FaithfulnessMetric(model="gpt-4o")
 
-with Experiment("deepeval_run", metadata={"framework": "deepeval"}) as exp:
+with Benchmark("deepeval_run", metadata={"framework": "deepeval"}) as b:
     for case in test_cases:
         tc = LLMTestCase(
             input=case["input"],
@@ -52,7 +58,7 @@ with Experiment("deepeval_run", metadata={"framework": "deepeval"}) as exp:
         )
         relevancy.measure(tc)
         faithfulness.measure(tc)
-        exp.log(
+        b.log(
             input=case["input"],
             output=case["output"],
             expected=case["expected"],
@@ -64,23 +70,27 @@ with Experiment("deepeval_run", metadata={"framework": "deepeval"}) as exp:
         )
 ```
 
+<Aside type="tip">
+DeepEval also has a `deepeval.evaluate()` function that runs all metrics at once. You can use either approach — the key is extracting the numeric scores and passing them to `b.log()`.
+</Aside>
+
 ---
 
 ## RAGAS
 
 [RAGAS](https://docs.ragas.io/) evaluates RAG pipelines with metrics like context precision, faithfulness, and answer correctness.
 
 ```bash
-pip install ragas
+pip install opensearch-genai-observability-sdk-py ragas datasets
 ```
 
 ```python
-from opensearch_genai_observability_sdk_py import register, Experiment
+from opensearch_genai_observability_sdk_py import register, Benchmark
 from ragas import evaluate as ragas_evaluate
 from ragas.metrics import faithfulness, answer_correctness, context_precision
 from datasets import Dataset
 
-register(service_name="ragas-experiment")
+register(service_name="ragas-eval")
 
 # Prepare your RAG evaluation dataset
 data = {
@@ -106,75 +116,97 @@ ragas_result = ragas_evaluate(
     metrics=[faithfulness, answer_correctness, context_precision],
 )
 
-# Upload results to OpenSearch via Experiment
-with Experiment("ragas_eval", metadata={"framework": "ragas"}) as exp:
-    for i, row in enumerate(ragas_result.to_pandas().itertuples()):
-        exp.log(
+# Upload results to OpenSearch via Benchmark
+df = ragas_result.to_pandas()
+with Benchmark("ragas_eval", metadata={"framework": "ragas"}) as b:
+    for i, row in df.iterrows():
+        b.log(
             input=data["question"][i],
             output=data["answer"][i],
             expected=data["ground_truth"][i],
-            scores={
-                "faithfulness": row.faithfulness,
-                "answer_correctness": row.answer_correctness,
-                "context_precision": row.context_precision,
-            },
+            scores={col: row[col] for col in df.columns if col not in data},
             case_name=data["question"][i][:50],
         )
 ```
 
+<Aside type="note">
+RAGAS v0.2+ changed its dataset format and metric API. If you're on v0.2+, check the [RAGAS migration guide](https://docs.ragas.io/) for the updated `evaluate()` signature. The SDK-side `Benchmark.log()` call stays the same — only the RAGAS imports and `evaluate()` call change.
+</Aside>
+
 ---
 
 ## MLflow
 
 [MLflow](https://mlflow.org/) tracks ML experiments. Export MLflow evaluation results into the observability stack:
 
 ```bash
-pip install mlflow
+pip install opensearch-genai-observability-sdk-py mlflow
 ```
 
 ```python
-from opensearch_genai_observability_sdk_py import register, Experiment
+from opensearch_genai_observability_sdk_py import register, Benchmark
 import mlflow
+import pandas as pd
 
-register(service_name="mlflow-experiment")
+register(service_name="mlflow-eval")
 
-# Run MLflow evaluation
-eval_data = [
-    {"inputs": {"question": "What is OpenSearch?"}, "ground_truth": "search engine"},
-    {"inputs": {"question": "What is OTEL?"}, "ground_truth": "observability framework"},
-]
+# Prepare evaluation data as a DataFrame
+eval_df = pd.DataFrame([
+    {"inputs": "What is OpenSearch?", "ground_truth": "search engine"},
+    {"inputs": "What is OTEL?", "ground_truth": "observability framework"},
+])
 
-mlflow_result = mlflow.evaluate(
-    model="openai:/gpt-4o",
-    data=eval_data,
-    model_type="question-answering",
-)
+# Run MLflow evaluation
+with mlflow.start_run():
+    mlflow_result = mlflow.evaluate(
+        model="openai:/gpt-4o",
+        data=eval_df,
+        targets="ground_truth",
+        model_type="question-answering",
+    )
 
 # Upload to observability stack
-with Experiment("mlflow_eval", metadata={"framework": "mlflow"}) as exp:
-    for _, row in mlflow_result.tables["eval_results_table"].iterrows():
-        exp.log(
-            input=row["inputs"],
-            output=row["outputs"],
+results_df = mlflow_result.tables["eval_results_table"]
+with Benchmark("mlflow_eval", metadata={"framework": "mlflow"}) as b:
+    for _, row in results_df.iterrows():
+        b.log(
+            input=row.get("inputs", ""),
+            output=row.get("outputs", ""),
             expected=row.get("ground_truth", ""),
             scores={
-                col: row[col]
-                for col in mlflow_result.metrics
-                if col in row and row[col] is not None
+                k: v for k, v in mlflow_result.metrics.items()
+                if isinstance(v, (int, float))
             },
         )
 ```
 
+<Aside type="note">
+MLflow's `evaluate()` API varies across versions. The `model_type` parameter was deprecated in MLflow 2.12+ in favor of `evaluators`. Check [MLflow evaluate docs](https://mlflow.org/docs/latest/llms/llm-evaluate/index.html) for your version.
+</Aside>
+
 ---
 
 ## pytest
 
 Use `evaluate()` directly in your test suite for CI/CD integration:
 
+```bash
+pip install opensearch-genai-observability-sdk-py pytest
+```
+
 ```python
-from opensearch_genai_observability_sdk_py import register, evaluate, EvalScore
+# conftest.py — initialize tracing once for all tests
+import pytest
+from opensearch_genai_observability_sdk_py import register
+
+@pytest.fixture(scope="session", autouse=True)
+def _init_tracing():
+    register(service_name="pytest-eval")
+```
 
-register(service_name="pytest-eval")
+```python
+# test_agent.py
+from opensearch_genai_observability_sdk_py import evaluate, EvalScore
 
 def accuracy_scorer(input, output, expected) -> EvalScore:
     is_correct = expected.lower() in output.lower()
@@ -184,8 +216,8 @@ def accuracy_scorer(input, output, expected) -> EvalScore:
         label="pass" if is_correct else "fail",
     )
 
-def latency_scorer(input, output, expected) -> EvalScore:
-    return EvalScore(name="response_length", value=len(output))
+def response_length_scorer(input, output, expected) -> EvalScore:
+    return EvalScore(name="response_length", value=float(len(output)))
 
 def my_agent(input: str) -> str:
     # Replace with your agent logic
@@ -199,18 +231,17 @@ def test_agent_quality():
             {"input": "What is OpenSearch?", "expected": "search"},
             {"input": "What is OTEL?", "expected": "opentelemetry"},
         ],
-        scores=[accuracy_scorer, latency_scorer],
+        scores=[accuracy_scorer, response_length_scorer],
     )
     avg_accuracy = result.summary.scores["accuracy"].avg
     assert avg_accuracy >= 0.8, f"Accuracy dropped to {avg_accuracy}"
 ```
 
-Run with: `pytest test_agent.py` - results are recorded as OTel experiment spans and available in OpenSearch Dashboards.
+Run with: `pytest test_agent.py` - results are recorded as OTel benchmark spans and available in OpenSearch Dashboards.
 
 ---
 
 ## Related links
 
-- [Evaluation & Scoring](/docs/ai-observability/evaluation/) - core `score()`, `evaluate()`, `Experiment` API
+- [Evaluation & Scoring](/docs/ai-observability/evaluation/) - core `score()`, `evaluate()`, `Benchmark` API
 - [Python SDK reference](/docs/send-data/ai-agents/python/) - full SDK documentation
-- [Agent Health - Experiments](/docs/agent-health/evaluations/experiments/) - UI and CLI-based experiment workflows
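
The RAGAS change in this file builds `scores` from every DataFrame column that is not a key of `data`, so non-numeric or NaN columns could slip through depending on the RAGAS version. Below is a minimal sketch of a guard, assuming `b.log()` expects `dict[str, float]` as the `log()` parameter table states. `numeric_scores` is a hypothetical helper and not part of the SDK.

```python
import math
from numbers import Real
from typing import Any, Mapping


def numeric_scores(row: Mapping[str, Any], exclude: frozenset = frozenset()) -> dict:
    """Keep only finite numeric values from a result row (hypothetical helper)."""
    scores = {}
    for key, value in row.items():
        if key in exclude:
            continue
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, Real):
            continue
        # Drop NaN and infinities, e.g. a metric that failed to compute.
        if math.isnan(value) or math.isinf(value):
            continue
        scores[str(key)] = float(value)
    return scores


row = {
    "question": "What is OpenSearch?",
    "faithfulness": 0.92,
    "answer_correctness": float("nan"),
    "context_precision": 1,
}
print(numeric_scores(row, exclude=frozenset({"question"})))
# {'faithfulness': 0.92, 'context_precision': 1.0}
```

With a pandas row, the same helper could be called as `numeric_scores(row.to_dict(), exclude=frozenset(data))` before passing the result to `b.log(scores=...)`.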

docs/starlight-docs/src/content/docs/ai-observability/evaluation.mdx

Lines changed: 39 additions & 20 deletions
@@ -11,7 +11,7 @@ The Python SDK provides three evaluation capabilities that all emit data through
 
 - **`score()`** - attach quality scores to individual traces or spans
 - **`evaluate()`** - run an agent against a dataset with automated scorer functions
-- **`Experiment`** - upload pre-computed results from any evaluation framework
+- **`Benchmark`** - upload pre-computed results from any evaluation framework
 
 All evaluation data lands in the same OpenSearch index as your traces, so you can query scores alongside agent spans.
 
@@ -76,7 +76,7 @@ def run(query: str) -> str:
 
 ## `evaluate()` - run experiments
 
-Executes a task function against each item in a dataset, runs scorer functions, and records everything as OTel experiment spans.
+Executes a task function against each item in a dataset, runs scorer functions, and records everything as OTel benchmark spans.
 
 ```python
 from opensearch_genai_observability_sdk_py import register, observe, evaluate, Op, EvalScore
@@ -113,11 +113,11 @@ print(result.summary)
 
 | Parameter | Type | Description |
 |---|---|---|
-| `name` | `str` | Experiment name. |
+| `name` | `str` | Benchmark name (`test.suite.name`). Stable across runs. |
 | `task` | `Callable` | Function that takes input and returns output. Use `@observe` for full tracing. |
 | `data` | `list[dict]` | Test cases: `"input"` (required), `"expected"`, `"case_id"`, `"case_name"` (optional). |
 | `scores` | `list[Callable]` | Scorer functions - each receives `(input, output, expected)`. |
-| `metadata` | `dict` | Attached to the root experiment span. |
+| `metadata` | `dict` | Attached to the root benchmark span. Reserved keys (`test.*`, `gen_ai.operation.name`) are filtered with a warning. |
 | `record_io` | `bool` | Record input/output/expected as span attributes. Default `False`. |
 
 ### Scorer functions
@@ -154,44 +154,48 @@ class EvalScore:
 
 ```mermaid
 flowchart TD
-    A["test_suite_run - experiment root"] --> B["test_case - case 1"]
+    A["test_suite_run - benchmark root"] --> B["test_case - case 1"]
     A --> C["test_case - case 2"]
     B --> D["invoke_agent my_agent"]
     B --> E["evaluation result events"]
     D --> F["execute_tool ..."]
 ```
 
-Agent traces from the task become children of `test_case` spans - full waterfall from experiment to individual LLM calls.
+Agent traces from the task become children of `test_case` spans - full waterfall from benchmark to individual LLM calls.
 
 ### Result types
 
 ```python
 result = evaluate(...)
-result.summary  # ExperimentSummary
+result.summary  # BenchmarkSummary
 result.summary.scores  # dict[str, ScoreSummary] - avg, min, max, count per metric
-result.cases  # list[CaseResult] - per-case input, output, scores, status
+result.cases  # list[TestCaseResult] - per-case input, output, scores, status
 ```
 
+`BenchmarkResult` contains:
+- `summary: BenchmarkSummary` — benchmark name, run ID, total cases, error count, duration, and per-metric `ScoreSummary` (avg, min, max, count)
+- `cases: list[TestCaseResult]` — per-case case_id, input, output, expected, scores dict, error, status, scorer_errors
+
 ---
 
-## `Experiment` - upload pre-computed results
+## `Benchmark` - upload pre-computed results
 
-Use `Experiment` when you already have evaluation results from another framework (RAGAS, DeepEval, pytest, custom) and want to upload them as OTel spans.
+Use `Benchmark` when you already have evaluation results from another framework (RAGAS, DeepEval, pytest, custom) and want to upload them as OTel spans.
 
 ```python
-from opensearch_genai_observability_sdk_py import register, Experiment
+from opensearch_genai_observability_sdk_py import register, Benchmark
 
 register(service_name="eval-upload")
 
-with Experiment("ragas_eval_v2", metadata={"framework": "ragas"}) as exp:
-    exp.log(
+with Benchmark("ragas_eval_v2", metadata={"framework": "ragas"}) as b:
+    b.log(
         input="What is OpenSearch?",
         output="OpenSearch is an open-source search engine.",
         expected="search and analytics engine",
         scores={"faithfulness": 0.92, "relevance": 0.88},
         case_name="opensearch_definition",
     )
-    exp.log(
+    b.log(
         input="How does RAG work?",
         output="RAG retrieves documents then generates answers.",
         scores={"faithfulness": 0.95, "relevance": 0.91},
@@ -200,6 +204,22 @@ with Experiment("ragas_eval_v2", metadata={"framework": "ragas"}) as exp:
 # summary printed on close
 ```
 
+You can also use `Benchmark` without a context manager:
+
+```python
+b = Benchmark(name="my-eval")
+b.log(input="q1", output="a1", scores={"accuracy": 1.0})
+summary = b.close()
+```
+
+### Constructor parameters
+
+| Parameter | Type | Default | Description |
+|---|---|---|---|
+| `name` | `str` | | Benchmark name (`test.suite.name`). Stable across runs. |
+| `metadata` | `dict` | `None` | Attached to the root span. Reserved keys (`test.*`, `gen_ai.operation.name`) are filtered with a warning. |
+| `record_io` | `bool` | `False` | Record input/output/expected as span attributes. |
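
The `metadata` row says reserved keys (`test.*`, `gen_ai.operation.name`) are filtered with a warning. The sketch below illustrates what that behavior could look like so callers know which keys to avoid. It is not the SDK's implementation, and `filter_reserved`, `RESERVED_PREFIXES`, and `RESERVED_KEYS` are hypothetical names.

```python
import warnings

# Assumption: "test.*" means any key starting with "test."
RESERVED_PREFIXES = ("test.",)
RESERVED_KEYS = frozenset({"gen_ai.operation.name"})


def filter_reserved(metadata):
    """Return a copy of metadata without reserved keys, warning once per dropped key."""
    kept = {}
    for key, value in (metadata or {}).items():
        if key in RESERVED_KEYS or key.startswith(RESERVED_PREFIXES):
            warnings.warn(f"metadata key {key!r} is reserved and was dropped", stacklevel=2)
            continue
        kept[key] = value
    return kept


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    clean = filter_reserved(
        {"framework": "ragas", "test.suite.name": "x", "gen_ai.operation.name": "y"}
    )

print(clean)        # {'framework': 'ragas'}
print(len(caught))  # 2
```

In practice it may be simplest to namespace custom metadata under a prefix of your own so it cannot collide with the reserved keys.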
+
 ### `log()` parameters
 
 | Parameter | Type | Description |
@@ -208,12 +228,12 @@ with Experiment("ragas_eval_v2", metadata={"framework": "ragas"}) as exp:
 | `output` | any | Agent output. |
 | `expected` | any | Ground truth. |
 | `scores` | `dict[str, float]` | Pre-computed scores. |
-| `metadata` | `dict` | Per-case metadata. |
+| `metadata` | `dict` | Per-case metadata. Reserved keys are filtered. |
 | `error` | `str` | Error message (sets status to `"fail"`). |
-| `case_id` | `str` | Explicit ID. Defaults to SHA256 of input. |
-| `case_name` | `str` | Human-readable name. |
-| `trace_id` | `str` | Creates OTel span link to an agent trace. |
-| `span_id` | `str` | Span-level linking (with `trace_id`). |
+| `case_id` | `str` | Explicit ID. Defaults to SHA256 hash of input. |
+| `case_name` | `str` | Human-readable name (`test.case.name`). |
+| `trace_id` | `str` | Creates OTel span link to an agent trace. Must be provided with `span_id`. |
+| `span_id` | `str` | Span-level linking. Must be provided with `trace_id`. |
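
The `case_id` row says the ID defaults to a SHA256 hash of the input. The sketch below shows one way such a stable ID could be derived. How the SDK serializes the input before hashing is not stated here, so the canonical-JSON step and the name `default_case_id` are assumptions for illustration only.

```python
import hashlib
import json


def default_case_id(input_value) -> str:
    """Derive a stable hex ID from an input (hypothetical helper)."""
    # Assumption: serialize as JSON with sorted keys so equal inputs hash equally.
    payload = json.dumps(input_value, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


a = default_case_id({"question": "What is OpenSearch?", "lang": "en"})
b = default_case_id({"lang": "en", "question": "What is OpenSearch?"})
print(a == b)  # True
print(len(a))  # 64
```

A content-derived ID keeps the same test case comparable across runs. Passing an explicit `case_id` is the safer choice when inputs contain values that vary between runs, such as timestamps.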
 
 ---
 
@@ -394,4 +414,3 @@ retriever = OpenSearchTraceRetriever(
 - [Evaluation Integrations](/docs/ai-observability/evaluation-integrations/) - use DeepEval, RAGAS, MLflow, pytest with the observability stack
 - [Python SDK reference](/docs/send-data/ai-agents/python/) - `register`, `observe`, `enrich` documentation
 - [Agent Tracing UI](/docs/ai-observability/agent-tracing/) - explore traces in OpenSearch Dashboards
-- [Agent Health - Experiments](/docs/agent-health/evaluations/experiments/) - UI and CLI-based experiment workflows
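
The result types added in this file describe a per-metric `ScoreSummary` with avg, min, max, and count. The sketch below shows how such a summary could be aggregated from per-case score dicts, using the score values from the `Benchmark` example. `ScoreSummarySketch` and `summarize` are hypothetical stand-ins and do not reproduce the SDK's own classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreSummarySketch:
    avg: float
    min: float
    max: float
    count: int


def summarize(case_scores):
    """Aggregate a list of per-case score dicts into one summary per metric."""
    by_metric = {}
    for scores in case_scores:
        for name, value in scores.items():
            by_metric.setdefault(name, []).append(float(value))
    return {
        name: ScoreSummarySketch(
            avg=sum(values) / len(values),
            min=min(values),
            max=max(values),
            count=len(values),
        )
        for name, values in by_metric.items()
    }


cases = [
    {"faithfulness": 0.92, "relevance": 0.88},
    {"faithfulness": 0.95, "relevance": 0.91},
]
summary = summarize(cases)
print(summary["faithfulness"].count)  # 2
print(summary["faithfulness"].max)    # 0.95
```

This is the shape the pytest example relies on when it asserts on `result.summary.scores["accuracy"].avg`.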

docs/starlight-docs/src/content/docs/ai-observability/getting-started.mdx

Lines changed: 2 additions & 2 deletions
@@ -69,7 +69,7 @@ flowchart LR
 | **Auto-instrument LLMs** | `register(auto_instrument=True)` | OpenAI, Anthropic, Bedrock, LangChain, and 20+ libraries traced automatically |
 | **Score traces** | `score()` | Attaches evaluation scores to traces through the OTLP pipeline |
 | **Run experiments** | `evaluate()` | Runs a task against a dataset with scorer functions, records everything as OTel spans |
-| **Upload results** | `Experiment` | Uploads pre-computed eval results from RAGAS, DeepEval, pytest, or custom frameworks |
+| **Upload results** | `Benchmark` | Uploads pre-computed eval results from RAGAS, DeepEval, pytest, or custom frameworks |
 | **Query traces** | `OpenSearchTraceRetriever` | Retrieves stored traces from OpenSearch for evaluation pipelines |
 | **AWS production** | `AWSSigV4OTLPExporter` | SigV4-signed exports to OpenSearch Ingestion or OpenSearch Service |
 
@@ -196,6 +196,6 @@ flowchart LR
 ## What's next
 
 - [Python SDK reference](/docs/send-data/ai-agents/python/) - full API documentation for `register`, `observe`, `enrich`, and AWS auth
-- [Evaluation & Scoring](/docs/ai-observability/evaluation/) - `score()`, `evaluate()`, `Experiment`, and `OpenSearchTraceRetriever` in depth
+- [Evaluation & Scoring](/docs/ai-observability/evaluation/) - `score()`, `evaluate()`, `Benchmark`, and `OpenSearchTraceRetriever` in depth
 - [Agent Tracing UI](/docs/ai-observability/agent-tracing/) - explore traces, graphs, and timelines in OpenSearch Dashboards
 - [Agent Health](/docs/agent-health/) - evaluate agents with Golden Path comparison, LLM judges, and batch experiments
