Commit 4eafe85

adding online evaluation for custom code based evaluators and CLI examples (#1412)
1 parent 2c0fdfc commit 4eafe85

3 files changed

Lines changed: 806 additions & 56 deletions

File tree

01-tutorials/07-AgentCore-evaluations/06-programmatic_evaluators/README.md

Lines changed: 211 additions & 12 deletions
@@ -4,7 +4,136 @@
 
 This tutorial shows how to build and run **custom code-based evaluators** with Amazon Bedrock AgentCore Evaluations. Instead of relying on an LLM as the judge, code-based evaluators delegate scoring to an AWS Lambda function you write. This gives you deterministic, low-cost, fully customizable evaluation logic that can encode exact business rules, format constraints, or data validation requirements that an LLM might interpret loosely.
 
-The tutorial pairs code-based evaluators with the built-in LLM evaluators from the [groundtruth tutorial](../05-groundtruth-based-evalautions/) to show how both types work side-by-side in a mixed evaluation run.
+The tutorial demonstrates code-based evaluators in **both on-demand and online evaluation** modes, and pairs them with built-in LLM evaluators to show how both types work side-by-side in a mixed evaluation run.
+
+---
+
+## Setup with AgentCore CLI
+
+The fastest way to bootstrap and deploy the agent is with the [AgentCore CLI](https://github.com/aws/agentcore-cli) (`0.11.0`).
+
+### Prerequisites
+
+- **Node.js** 20.x or later
+- **uv** 0.4+ (Python package manager)
+- **AWS CLI** 2.x with credentials configured
+- **Docker** running locally (for agent container build)
+- **Git** 2.x
+
+### Install the CLI
+
+```bash
+npm install -g @aws/agentcore@0.11.0
+agentcore --version # should print 0.11.0
+```
+
+### Configure AWS credentials
+
+```bash
+aws configure
+aws sts get-caller-identity # verify credentials
+```
+
+Your IAM user/role needs permissions for: AgentCore Runtime, AgentCore Evaluations, Lambda,
+CloudWatch Logs, ECR, IAM, and Bedrock.
+
+### Create and deploy the agent
+
+```bash
+# Scaffold a new AgentCore project
+agentcore create --name HRAssistant --framework Strands --model-provider Bedrock --defaults
+
+# Copy the HR assistant implementation
+cp hr_assistant_agent.py app/HRAssistant/main.py
+
+# Test locally
+agentcore dev
+
+# Deploy to AWS (builds container, pushes to ECR, creates AgentCore Runtime)
+agentcore deploy
+```
+
+After `agentcore deploy` completes, note the **Runtime ID** and **ARN** from the output.
+
+### Register a code-based evaluator via CLI
+
+`agentcore add evaluator` registers the evaluator in your project's `agentcore.json`. The evaluator
+is created in AWS when you run `agentcore deploy`.
+
+```bash
+# Register a TRACE-level code-based evaluator
+agentcore add evaluator \
+  --name HRResponseLength \
+  --level TRACE \
+  --type code-based \
+  --lambda-arn arn:aws:lambda:<region>:<account-id>:function:hr-response-length \
+  --timeout 30
+
+# Register a SESSION-level code-based evaluator
+agentcore add evaluator \
+  --name HRFactChecker \
+  --level SESSION \
+  --type code-based \
+  --lambda-arn arn:aws:lambda:<region>:<account-id>:function:hr-fact-checker \
+  --timeout 60
+```
+
+### Run on-demand evaluation via CLI
+
+**Standalone mode** (no project needed) — use `--runtime-arn` and `--evaluator-arn` with the
+full ARNs of already-deployed resources. This works from any directory:
+
+```bash
+agentcore run eval \
+  --runtime-arn <agent-runtime-arn> \
+  --evaluator-arn <hr-response-length-evaluator-arn> \
+  --evaluator-arn <hr-fact-checker-evaluator-arn> \
+  --session-id <session-id> \
+  --region <aws-region>
+```
+
+Mix code-based (`--evaluator-arn`) with builtin (`--evaluator`) in one command:
+
+```bash
+agentcore run eval \
+  --runtime-arn <agent-runtime-arn> \
+  --evaluator-arn <hr-response-length-evaluator-arn> \
+  --evaluator-arn <hr-fact-checker-evaluator-arn> \
+  --evaluator Builtin.Correctness \
+  --evaluator Builtin.Helpfulness \
+  --session-id <session-id> \
+  --region <aws-region>
+```
+
+**Project mode** (inside a deployed project directory) — use evaluator names from `agentcore.json`.
+Requires `agentcore deploy` to have been run first:
+
+```bash
+agentcore run eval \
+  --runtime HRAssistant \
+  --evaluator HRResponseLength \
+  --evaluator HRFactChecker \
+  --session-id <session-id>
+```
+
+### Add online evaluation via CLI
+
+`agentcore add online-eval` adds the config to `agentcore.json`; it is created in AWS on
+`agentcore deploy`. Run from inside your project directory:
+
+```bash
+# sampling-rate is a percentage (0.01–100)
+agentcore add online-eval \
+  --name hr_online_eval \
+  --runtime HRAssistant \
+  --evaluator HRResponseLength \
+  --evaluator HRFactChecker \
+  --sampling-rate 100 \
+  --enable-on-create
+```
+
+> You can also use the notebook (Step 10) to create the online eval config programmatically
+> using the boto3 SDK, without needing a project directory.
 
 ---
 
@@ -56,6 +185,7 @@ def lambda_handler(input: EvaluatorInput, context) -> EvaluatorOutput:
 │ 2. Register evaluators via bedrock-agentcore-control │
 │ 3a. On-demand: EvaluationClient.run(session_id, evaluator_ids) │
 │ 3b. Dataset: OnDemandEvaluationDatasetRunner.run(dataset, agent_invoker) │
+│ 3c. Online: create_online_evaluation_config (auto-evaluates all sessions) │
 └────────────────┬────────────────────────────────────────────────────────────┘
 
 ┌───────────▼────────────┐ ┌──────────────────────────────┐
@@ -81,7 +211,8 @@ def lambda_handler(input: EvaluatorInput, context) -> EvaluatorOutput:
 1. Agent is invoked; OTel spans are written to CloudWatch
 2. `EvaluationClient` or `OnDemandEvaluationDatasetRunner` collects spans from CloudWatch
 3. The service calls each evaluator — builtin evaluators run LLM inference; code-based evaluators invoke your Lambda with the span payload
-4. All results are aggregated and returned
+4. For **online evaluation**, AgentCore continuously watches the log group and automatically evaluates new sessions without any explicit trigger
+5. All results are aggregated and returned (on-demand) or written to the online evaluation results log group
 
 ---
 
@@ -91,11 +222,11 @@ def lambda_handler(input: EvaluatorInput, context) -> EvaluatorOutput:
 - **Docker** running locally (for agent container image build)
 - **AWS credentials** with permissions for:
   - `bedrock-agentcore:*` — runtime and evaluations
-  - `bedrock-agentcore-control:*` — evaluator registration
+  - `bedrock-agentcore-control:*` — evaluator registration and online eval config management
   - `lambda:CreateFunction`, `lambda:UpdateFunctionCode`, `lambda:AddPermission`, `lambda:GetFunction`
   - `logs:FilterLogEvents`, `logs:DescribeLogGroups` — CloudWatch span collection
   - `ecr:*` — container image for the agent
-  - `iam:*`auto-creating the agent execution role
+  - `iam:*` — creating execution roles for the agent and online evaluation
 - **IAM role** named `AgentCoreLambdaExecutionRole` with `AWSLambdaBasicExecutionRole` attached
 - **bedrock-agentcore >= 1.6.0** installed in the notebook kernel
 
@@ -109,6 +240,7 @@ def lambda_handler(input: EvaluatorInput, context) -> EvaluatorOutput:
 |---|---|
 | `programmatic_evaluators.ipynb` | Main tutorial notebook (standalone, end-to-end) |
 | `hr_assistant_agent.py` | HR Assistant Strands agent (same as groundtruth tutorial) |
+| `Dockerfile` | Container definition for the agent (used by Step 3 fresh deploy and `agentcore deploy`) |
 | `requirements.txt` | Python dependencies (`bedrock-agentcore>=1.6.0`) |
 | `lambdas/hr_response_length/lambda_function.py` | Response length evaluator Lambda |
 | `lambdas/hr_fact_checker/lambda_function.py` | HR fact-checking evaluator Lambda |
@@ -124,6 +256,7 @@ Checks that each agent response is between 50 and 600 characters. Responses shor
 - **Level:** TRACE — evaluated once per agent response
 - **Lambda:** `hr-response-length`
 - **Returns:** `1.0` (PASS) if within range, `0.0` (FAIL) otherwise
+- **Used in:** On-demand evaluation (Steps 7 & 8) and Online evaluation (Step 10)
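
The length rule described for `HRResponseLength` can be sketched as a plain Python function. This is a minimal illustration, not the tutorial's actual Lambda handler: the function name `score_response_length` is hypothetical, and treating the 50 and 600 character bounds as inclusive is an assumption.

```python
MIN_CHARS = 50   # lower bound taken from the evaluator description
MAX_CHARS = 600  # upper bound taken from the evaluator description


def score_response_length(response_text: str) -> tuple[float, str]:
    """Return (score, label): (1.0, "PASS") when the length is in range, else (0.0, "FAIL").

    Assumption: both bounds are inclusive. The real Lambda may treat them differently.
    """
    length = len(response_text)
    if MIN_CHARS <= length <= MAX_CHARS:
        return 1.0, "PASS"
    return 0.0, "FAIL"
```

Keeping the rule in a pure function like this makes it easy to unit-test locally before wrapping it in a Lambda handler.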
 
 ### HRFactChecker (SESSION level)
 
@@ -137,6 +270,7 @@ Deterministically validates that the HR assistant's responses contain accurate f
 - PTO request ID format `PTO-2026-NNN`
 - Policy facts: 15-day PTO accrual, 2-day advance notice, 401k 4% match, 90% health coverage
 - **Returns:** fraction of applicable checks passed (0.0–1.0), labeled `PASS`, `PARTIAL`, `FAIL`, or `SKIP`
+- **Used in:** On-demand evaluation (Steps 7 & 8) and Online evaluation (Step 10)
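
The scoring scheme above (fraction of applicable checks passed, plus a label) might look roughly like the sketch below. The names `PTO_ID_PATTERN` and `summarize_checks` are hypothetical, and the label mapping is an assumption: `PASS` when every applicable check passes, `FAIL` when none do, `PARTIAL` in between, and `SKIP` with no score when no check applies. The tutorial's Lambda may map labels differently.

```python
import re
from typing import Optional

# Matches the PTO request ID format named above: "PTO-2026-" followed by exactly three digits.
PTO_ID_PATTERN = re.compile(r"\bPTO-2026-\d{3}\b")


def summarize_checks(results: list[Optional[bool]]) -> tuple[Optional[float], str]:
    """Turn per-check outcomes into (score, label).

    Each entry is True (passed), False (failed), or None (check not applicable).
    Assumption: SKIP carries no numeric score; the real evaluator may return one.
    """
    applicable = [r for r in results if r is not None]
    if not applicable:
        return None, "SKIP"
    score = sum(applicable) / len(applicable)  # True counts as 1, False as 0
    if score == 1.0:
        return score, "PASS"
    if score == 0.0:
        return score, "FAIL"
    return score, "PARTIAL"
```

For example, a session where the PTO ID check passes, the 401k check fails, and the health coverage check does not apply would be summarized from `[True, False, None]`.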
 
 ---
 
@@ -156,6 +290,60 @@ Results from all five evaluators are collected per scenario, letting you compare
 
 ---
 
+## Online Evaluation with Code-Based Evaluators
+
+Step 10 of the notebook demonstrates **online evaluation** — a continuous evaluation mode where
+AgentCore automatically evaluates every live agent session without explicit API calls per session.
+
+### How it works
+
+1. Register code-based evaluators (Steps 4–6, same as for on-demand)
+2. Create an online evaluation config via `create_online_evaluation_config`:
+   - Point it at the agent's CloudWatch log group
+   - Set a sampling rate (0–100%)
+   - List the evaluator IDs (code-based and/or builtin)
+   - Provide an IAM execution role the service can assume
+3. Enable the config — AgentCore starts watching the log group
+4. Every new agent session is automatically evaluated
+5. Results appear in the online evaluation results CloudWatch log group
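
The four inputs listed in step 2 can be gathered and validated before any API call is made. The sketch below is illustrative only: `OnlineEvalSettings` and its field names are hypothetical and are not the request shape of `create_online_evaluation_config`. The accepted sampling range (greater than 0, up to 100) is also an assumption based on the percentages quoted in this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OnlineEvalSettings:
    """Hypothetical container for the inputs an online evaluation config needs."""

    log_group: str                  # the agent's CloudWatch log group
    sampling_rate_percent: float    # percentage of sessions to evaluate
    evaluator_ids: tuple[str, ...]  # code-based and/or builtin evaluator IDs
    execution_role_arn: str         # IAM role the service can assume

    def __post_init__(self) -> None:
        # Assumption: a rate of 0 is rejected because it would evaluate nothing.
        if not 0 < self.sampling_rate_percent <= 100:
            raise ValueError("sampling_rate_percent must be in (0, 100]")
        if not self.evaluator_ids:
            raise ValueError("at least one evaluator ID is required")
```

Validating locally like this catches an out-of-range sampling rate or an empty evaluator list before the config reaches AWS.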
+
+### Evaluator locking
+
+When a code-based evaluator is referenced by an **enabled** online evaluation config, AgentCore
+**locks** it automatically. You cannot modify or delete a locked evaluator. To update it:
+
+```
+disable/delete online eval config
+
+update evaluator Lambda or re-register
+
+re-create online eval config
+```
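
The ordering constraint above can be captured in a small helper that runs the three steps in sequence. This is a sketch under assumptions: `update_locked_evaluator` is a hypothetical name, and the three callables stand in for whatever SDK or CLI calls you use. None of them is an AgentCore API.

```python
from typing import Callable


def update_locked_evaluator(
    disable_config: Callable[[], None],
    update_evaluator: Callable[[], None],
    recreate_config: Callable[[], None],
) -> None:
    """Run the unlock, update, re-lock sequence in the required order."""
    disable_config()    # 1. disable or delete the online eval config (unlocks the evaluator)
    update_evaluator()  # 2. update the Lambda or re-register the evaluator
    recreate_config()   # 3. re-create the online eval config
```

Passing the steps in as callables keeps the ordering testable without touching AWS. If step 2 raises, step 3 does not run, so online evaluation stays off until you re-create the config yourself.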
+
+### On-demand vs. online comparison
+
+| Dimension | On-demand | Online |
+|---|---|---|
+| Trigger | Explicit per session | Automatic on every invocation |
+| Setup | `EvaluationClient.run()` or `OnDemandEvaluationDatasetRunner` | `create_online_evaluation_config` once |
+| Code-based evaluators | ✅ Supported | ✅ Supported |
+| Evaluator locking | No | Yes — while config is enabled |
+| Best for | CI/CD, ad-hoc debugging | Continuous production monitoring |
+
+### AgentCore CLI shortcut
+
+```bash
+# sampling-rate is a percentage (0.01–100); 50 = evaluate 50% of sessions
+agentcore add online-eval \
+  --name my_online_eval \
+  --runtime MyAgent \
+  --evaluator MyCodeEvaluator \
+  --sampling-rate 50 \
+  --enable-on-create
+```
+
+---
+
 ## Sample Prompts
 
 The dataset includes five scenarios that exercise facts the `HRFactChecker` validates:
@@ -178,14 +366,15 @@ You can extend the dataset with additional scenarios to test more HR topics (rem
 |---|---|
 | 1 | Install dependencies (`bedrock-agentcore>=1.6.0`) |
 | 2 | Configure AWS session, region, and Lambda role ARN |
-| 3 | Agent setup — reload from `%store` (groundtruth notebook) or deploy fresh |
+| 3 | Agent setup — reload from `%store` (groundtruth notebook) or deploy fresh with boto3 |
 | 4 | Define Lambda evaluator functions using the `@custom_code_based_evaluator()` decorator |
 | 5 | Deploy Lambda functions (bundled with bedrock-agentcore SDK + pydantic) |
 | 6 | Register evaluators via `bedrock-agentcore-control` boto3 service |
 | 7 | On-demand evaluation with `EvaluationClient` (code-based + builtin evaluators) |
 | 8 | Dataset evaluation with `OnDemandEvaluationDatasetRunner` (mixed evaluator set) |
 | 9 | Inspect and compare results (per-scenario tables + aggregate score comparison) |
-| 10 | Cleanup — delete Lambda functions, evaluator records, and agent runtime |
+| **10** | **Online evaluation with `create_online_evaluation_config` (code-based evaluators, auto-triggered)** |
+| 11 | Cleanup — delete Lambda functions, evaluator records, online eval config, and agent runtime |
 
 ---
 
@@ -213,8 +402,10 @@ span.span_events[*]
 - **Business rule enforcement** — encode domain-specific rules that LLMs might interpret loosely
 - **High-volume evaluation** — reduce cost for evaluations that run on every production session
 - **Regulatory requirements** — verify that required disclosures or disclaimers are always present
+- **Continuous monitoring** — combine with online evaluation for zero-touch production quality gates
 
-> **Note:** Code-based evaluators are supported for **on-demand evaluation** (`EvaluationClient`, `OnDemandEvaluationDatasetRunner`) only. Online evaluation configs support built-in LLM evaluators only.
+Code-based evaluators are supported for **both on-demand** (`EvaluationClient`,
+`OnDemandEvaluationDatasetRunner`) and **online** (`create_online_evaluation_config`) evaluation.
 
 ---
 
@@ -223,20 +414,27 @@ span.span_events[*]
 To remove created AWS resources:
 
 ```python
-# Delete Lambda functions
+# 1. Disable online evaluation config first (unlocks evaluators)
+cp_client.update_online_evaluation_config(
+    onlineEvaluationConfigId=ONLINE_EVAL_CONFIG_ID,
+    enableOnCreate=False,
+)
+cp_client.delete_online_evaluation_config(onlineEvaluationConfigId=ONLINE_EVAL_CONFIG_ID)
+
+# 2. Delete Lambda functions
 for fn in ["hr-response-length", "hr-fact-checker"]:
     lambda_client.delete_function(FunctionName=fn)
 
-# Delete evaluator registrations
+# 3. Delete evaluator registrations (now unlocked)
 for name, eid in CODE_EVAL_IDS.items():
     cp_client.delete_evaluator(evaluatorId=eid)
 
-# Delete agent runtime (only if deployed in this notebook)
+# 4. Delete agent runtime (only if deployed in this notebook)
 if not _agent_loaded:
-    agent_runtime.delete()
+    agentcore_control.delete_agent_runtime(agentRuntimeId=AGENT_ID)
 ```
 
-Alternatively, run the cleanup cell (Step 10) in the notebook — it is commented out by default to prevent accidental deletion.
+Alternatively, run the cleanup cell (Step 11) in the notebook — it is commented out by default to prevent accidental deletion.
 
 ---
 
@@ -245,4 +443,5 @@ Alternatively, run the cleanup cell (Step 10) in the notebook — it is commente
 - Extend `HRFactChecker` with additional business rules as your agent and data model evolve
 - Combine code-based evaluators with `EvaluationClient` to validate specific production sessions
 - Add code-based evaluators to your CI/CD pipeline for zero-cost regression testing on every deployment
+- Use online evaluation with a lower sampling rate (e.g. 10%) to cost-effectively monitor high-traffic agents
 - Explore the [groundtruth tutorial](../05-groundtruth-based-evalautions/) for `EvaluationClient` and ground-truth-based evaluations with built-in evaluators
