This tutorial shows how to build and run **custom code-based evaluators** with Amazon Bedrock AgentCore Evaluations. Instead of relying on an LLM as the judge, code-based evaluators delegate scoring to an AWS Lambda function you write. This gives you deterministic, low-cost, fully customizable evaluation logic that can encode exact business rules, format constraints, or data validation requirements that an LLM might interpret loosely.

7- The tutorial pairs code-based evaluators with the built-in LLM evaluators from the [ groundtruth tutorial] ( ../05-groundtruth-based-evalautions/ ) to show how both types work side-by-side in a mixed evaluation run.
7+ The tutorial demonstrates code-based evaluators in ** both on-demand and online evaluation** modes, and pairs them with built-in LLM evaluators to show how both types work side-by-side in a mixed evaluation run.
8+
9+ ---
10+
11+ ## Setup with AgentCore CLI
12+
13+ The fastest way to bootstrap and deploy the agent is with the [ AgentCore CLI] ( https://github.com/aws/agentcore-cli ) (` 0.11.0 ` ).
14+
15+ ### Prerequisites
16+
17+ - ** Node.js** 20.x or later
18+ - ** uv** 0.4+ (Python package manager)
19+ - ** AWS CLI** 2.x with credentials configured
20+ - ** Docker** running locally (for agent container build)
21+ - ** Git** 2.x
22+
23+ ### Install the CLI
24+
25+ ``` bash
26+ npm install -g @aws/agentcore@0.11.0
27+ agentcore --version # should print 0.11.0
28+ ```
29+
30+ ### Configure AWS credentials
31+
32+ ``` bash
33+ aws configure
34+ aws sts get-caller-identity # verify credentials
35+ ```
36+
37+ Your IAM user/role needs permissions for: AgentCore Runtime, AgentCore Evaluations, Lambda,
38+ CloudWatch Logs, ECR, IAM, and Bedrock.
39+
40+ ### Create and deploy the agent
41+
42+ ``` bash
43+ # Scaffold a new AgentCore project
44+ agentcore create --name HRAssistant --framework Strands --model-provider Bedrock --defaults
45+
46+ # Copy the HR assistant implementation
47+ cp hr_assistant_agent.py app/HRAssistant/main.py
48+
49+ # Test locally
50+ agentcore dev
51+
52+ # Deploy to AWS (builds container, pushes to ECR, creates AgentCore Runtime)
53+ agentcore deploy
54+ ```
55+
56+ After ` agentcore deploy ` completes, note the ** Runtime ID** and ** ARN** from the output.
57+
58+ ### Register a code-based evaluator via CLI
59+
60+ ` agentcore add evaluator ` registers the evaluator in your project's ` agentcore.json ` . The evaluator
61+ is created in AWS when you run ` agentcore deploy ` .
62+
63+ ``` bash
64+ # Register a TRACE-level code-based evaluator
65+ agentcore add evaluator \
66+ --name HRResponseLength \
67+ --level TRACE \
68+ --type code-based \
69+ --lambda-arn arn:aws:lambda:< region> :< account-id> :function:hr-response-length \
70+ --timeout 30
71+
72+ # Register a SESSION-level code-based evaluator
73+ agentcore add evaluator \
74+ --name HRFactChecker \
75+ --level SESSION \
76+ --type code-based \
77+ --lambda-arn arn:aws:lambda:< region> :< account-id> :function:hr-fact-checker \
78+ --timeout 60
79+ ```
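
The `--lambda-arn` placeholders above follow the standard Lambda function ARN layout. As a minimal sketch, a small helper can assemble and sanity-check that string before it is passed to the CLI. The helper name `lambda_function_arn` is hypothetical, and the default `aws` partition is an assumption that does not hold for GovCloud or China regions.

```python
import re


def lambda_function_arn(region: str, account_id: str, function_name: str,
                        partition: str = "aws") -> str:
    """Build a Lambda function ARN from its parts (hypothetical helper).

    Assumption: the standard commercial "aws" partition unless told otherwise.
    """
    # AWS account IDs are always 12 digits.
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("account_id must be a 12-digit string")
    # Loose shape check only, e.g. us-east-1 or ap-southeast-2.
    if not re.fullmatch(r"[a-z]{2}(-[a-z]+)+-\d+", region):
        raise ValueError(f"unexpected region format: {region!r}")
    return f"arn:{partition}:lambda:{region}:{account_id}:function:{function_name}"


if __name__ == "__main__":
    # 123456789012 is a placeholder account ID, not a real account.
    print(lambda_function_arn("us-east-1", "123456789012", "hr-response-length"))
```

The region check is deliberately loose: it catches an unsubstituted `<region>` placeholder, not an invalid region name.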
80+
81+ ### Run on-demand evaluation via CLI
82+
83+ ** Standalone mode** (no project needed) — use ` --runtime-arn ` and ` --evaluator-arn ` with the
84+ full ARNs of already-deployed resources. This works from any directory:
85+
86+ ``` bash
87+ agentcore run eval \
88+ --runtime-arn < agent-runtime-arn> \
89+ --evaluator-arn < hr-response-length-evaluator-arn> \
90+ --evaluator-arn < hr-fact-checker-evaluator-arn> \
91+ --session-id < session-id> \
92+ --region < aws-region>
93+ ```
94+
95+ Mix code-based (` --evaluator-arn ` ) with builtin (` --evaluator ` ) in one command:
96+
97+ ``` bash
98+ agentcore run eval \
99+ --runtime-arn < agent-runtime-arn> \
100+ --evaluator-arn < hr-response-length-evaluator-arn> \
101+ --evaluator-arn < hr-fact-checker-evaluator-arn> \
102+ --evaluator Builtin.Correctness \
103+ --evaluator Builtin.Helpfulness \
104+ --session-id < session-id> \
105+ --region < aws-region>
106+ ```
107+
108+ ** Project mode** (inside a deployed project directory) — use evaluator names from ` agentcore.json ` .
109+ Requires ` agentcore deploy ` to have been run first:
110+
111+ ``` bash
112+ agentcore run eval \
113+ --runtime HRAssistant \
114+ --evaluator HRResponseLength \
115+ --evaluator HRFactChecker \
116+ --session-id < session-id>
117+ ```
118+
119+ ### Add online evaluation via CLI
120+
121+ ` agentcore add online-eval ` adds the config to ` agentcore.json ` ; it is created in AWS on
122+ ` agentcore deploy ` . Run from inside your project directory:
123+
124+ ``` bash
125+ # sampling-rate is a percentage (0.01–100)
126+ agentcore add online-eval \
127+ --name hr_online_eval \
128+ --runtime HRAssistant \
129+ --evaluator HRResponseLength \
130+ --evaluator HRFactChecker \
131+ --sampling-rate 100 \
132+ --enable-on-create
133+ ```
134+
135+ > You can also use the notebook (Step 10) to create the online eval config programmatically
136+ > using the boto3 SDK, without needing a project directory.
8137
9138---
10139
@@ -56,6 +185,7 @@ def lambda_handler(input: EvaluatorInput, context) -> EvaluatorOutput:
56185│ 2. Register evaluators via bedrock-agentcore-control │
57186│ 3a. On-demand: EvaluationClient.run(session_id, evaluator_ids) │
58187│ 3b. Dataset: OnDemandEvaluationDatasetRunner.run(dataset, agent_invoker) │
188+ │ 3c. Online: create_online_evaluation_config (auto-evaluates all sessions) │
59189└────────────────┬────────────────────────────────────────────────────────────┘
60190 │
61191 ┌───────────▼────────────┐ ┌──────────────────────────────┐
@@ -81,7 +211,8 @@ def lambda_handler(input: EvaluatorInput, context) -> EvaluatorOutput:
812111 . Agent is invoked; OTel spans are written to CloudWatch
822122 . ` EvaluationClient ` or ` OnDemandEvaluationDatasetRunner ` collects spans from CloudWatch
832133 . The service calls each evaluator — builtin evaluators run LLM inference; code-based evaluators invoke your Lambda with the span payload
84- 4 . All results are aggregated and returned
214+ 4 . For ** online evaluation** , AgentCore continuously watches the log group and automatically evaluates new sessions without any explicit trigger
215+ 5 . All results are aggregated and returned (on-demand) or written to the online evaluation results log group
85216
86217---
87218
@@ -91,11 +222,11 @@ def lambda_handler(input: EvaluatorInput, context) -> EvaluatorOutput:
91222- ** Docker** running locally (for agent container image build)
92223- ** AWS credentials** with permissions for:
93224 - ` bedrock-agentcore:* ` — runtime and evaluations
94- - ` bedrock-agentcore-control:* ` — evaluator registration
225+ - ` bedrock-agentcore-control:* ` — evaluator registration and online eval config management
95226 - ` lambda:CreateFunction ` , ` lambda:UpdateFunctionCode ` , ` lambda:AddPermission ` , ` lambda:GetFunction `
96227 - ` logs:FilterLogEvents ` , ` logs:DescribeLogGroups ` — CloudWatch span collection
97228 - ` ecr:* ` — container image for the agent
98- - ` iam:* ` — auto- creating the agent execution role
229+ - ` iam:* ` — creating execution roles for the agent and online evaluation
99230- ** IAM role** named ` AgentCoreLambdaExecutionRole ` with ` AWSLambdaBasicExecutionRole ` attached
100231- ** bedrock-agentcore >= 1.6.0** installed in the notebook kernel
101232
@@ -109,6 +240,7 @@ def lambda_handler(input: EvaluatorInput, context) -> EvaluatorOutput:
109240| ---| ---|
110241| ` programmatic_evaluators.ipynb ` | Main tutorial notebook (standalone, end-to-end) |
111242| ` hr_assistant_agent.py ` | HR Assistant Strands agent (same as groundtruth tutorial) |
243+ | ` Dockerfile ` | Container definition for the agent (used by Step 3 fresh deploy and ` agentcore deploy ` ) |
112244| ` requirements.txt ` | Python dependencies (` bedrock-agentcore>=1.6.0 ` ) |
113245| ` lambdas/hr_response_length/lambda_function.py ` | Response length evaluator Lambda |
114246| ` lambdas/hr_fact_checker/lambda_function.py ` | HR fact-checking evaluator Lambda |
@@ -124,6 +256,7 @@ Checks that each agent response is between 50 and 600 characters. Responses shor
124256- ** Level:** TRACE — evaluated once per agent response
125257- ** Lambda:** ` hr-response-length `
126258- ** Returns:** ` 1.0 ` (PASS) if within range, ` 0.0 ` (FAIL) otherwise
259+ - ** Used in:** On-demand evaluation (Steps 7 & 8) and Online evaluation (Step 10)
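
A minimal sketch of the length rule described above, assuming the 50 and 600 character bounds are inclusive and that the raw response text is counted without trimming. The function name and the returned dictionary shape are illustrative only. They are not the SDK's `EvaluatorOutput` type, and the Lambda in `lambdas/hr_response_length/` may differ in detail.

```python
MIN_CHARS = 50   # lower bound from the evaluator description
MAX_CHARS = 600  # upper bound from the evaluator description


def score_response_length(response_text: str) -> dict:
    """Return 1.0/PASS when the response length is within range, else 0.0/FAIL.

    Assumption: bounds are inclusive and whitespace counts toward the length.
    """
    length = len(response_text)
    passed = MIN_CHARS <= length <= MAX_CHARS
    return {
        "value": 1.0 if passed else 0.0,
        "label": "PASS" if passed else "FAIL",
        "explanation": f"response length {length} (allowed {MIN_CHARS}-{MAX_CHARS})",
    }
```

Because the rule is a pure function of the response text, the same input always produces the same score, which makes this kind of check easy to unit test outside Lambda.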
127260
128261### HRFactChecker (SESSION level)
129262
@@ -137,6 +270,7 @@ Deterministically validates that the HR assistant's responses contain accurate f
137270 - PTO request ID format ` PTO-2026-NNN `
138271 - Policy facts: 15-day PTO accrual, 2-day advance notice, 401k 4% match, 90% health coverage
139272- ** Returns:** fraction of applicable checks passed (0.0–1.0), labeled ` PASS ` , ` PARTIAL ` , ` FAIL ` , or ` SKIP `
273+ - ** Used in:** On-demand evaluation (Steps 7 & 8) and Online evaluation (Step 10)
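
The scoring rule above (fraction of applicable checks passed, plus a label) can be sketched as follows. This is an illustration under stated assumptions, not the code in `lambdas/hr_fact_checker/`: the helper names are hypothetical, a check returns `None` when it does not apply, `PASS` means every applicable check passed, `FAIL` means none did, and `SKIP` (with no numeric score) means no check applied.

```python
import re
from typing import Dict, Optional, Tuple

# Format from the evaluator description: PTO-2026-NNN, with NNN as three digits.
PTO_ID_FORMAT = re.compile(r"PTO-2026-\d{3}")
# Anything that looks like a PTO request ID, valid or not.
PTO_ID_CANDIDATE = re.compile(r"\bPTO-\d[\w-]*")


def check_pto_id(text: str) -> Optional[bool]:
    """True/False if the text contains PTO IDs, None if the check does not apply."""
    candidates = PTO_ID_CANDIDATE.findall(text)
    if not candidates:
        return None
    return all(PTO_ID_FORMAT.fullmatch(c) is not None for c in candidates)


def summarize(results: Dict[str, Optional[bool]]) -> Tuple[Optional[float], str]:
    """Collapse per-check results into (score, label)."""
    applicable = [passed for passed in results.values() if passed is not None]
    if not applicable:
        return None, "SKIP"
    passed_count = sum(applicable)
    score = passed_count / len(applicable)
    if passed_count == len(applicable):
        return score, "PASS"
    if passed_count == 0:
        return score, "FAIL"
    return score, "PARTIAL"
```

Excluding non-applicable checks from the denominator keeps a response about benefits from being penalized for not mentioning a PTO request ID.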
140274
141275---
142276
@@ -156,6 +290,60 @@ Results from all five evaluators are collected per scenario, letting you compare
156290
157291---
158292
293+ ## Online Evaluation with Code-Based Evaluators
294+
295+ Step 10 of the notebook demonstrates **online evaluation**, a continuous evaluation mode where
296+ AgentCore automatically evaluates live agent sessions, subject to the configured sampling rate, without explicit API calls per session.
297+
298+ ### How it works
299+
300+ 1 . Register code-based evaluators (Steps 4–6, same as for on-demand)
301+ 2 . Create an online evaluation config via ` create_online_evaluation_config ` :
302+ - Point it at the agent's CloudWatch log group
303+ - Set a sampling rate as a percentage (0.01–100, the range the CLI accepts)
304+ - List the evaluator IDs (code-based and/or builtin)
305+ - Provide an IAM execution role the service can assume
306+ 3 . Enable the config — AgentCore starts watching the log group
307+ 4. Each new agent session selected by the sampling rate is evaluated automatically
308+ 5 . Results appear in the online evaluation results CloudWatch log group
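
To make the sampling rate concrete, here is a small illustration of what a percentage-based rate means for individual sessions. It is a conceptual sketch only: this tutorial does not describe how AgentCore selects sessions internally, and the hash-bucket approach and the `is_sampled` name are assumptions chosen for the example. The 0.01 to 100 range is taken from the CLI comment in this tutorial.

```python
import hashlib


def is_sampled(session_id: str, sampling_rate_percent: float) -> bool:
    """Decide deterministically whether a session falls inside the sample.

    Illustration only; not AgentCore's actual selection logic.
    """
    if not 0.01 <= sampling_rate_percent <= 100:
        raise ValueError("sampling rate must be between 0.01 and 100")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Map the session to one of 10,000 buckets (0.01% resolution).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < round(sampling_rate_percent * 100)


if __name__ == "__main__":
    sessions = [f"session-{i}" for i in range(1000)]
    picked = sum(is_sampled(s, 10) for s in sessions)
    print(f"{picked} of {len(sessions)} sessions sampled at 10%")
```

At a rate of 100 every session is selected. At lower rates roughly that percentage of sessions is evaluated, which is what keeps online evaluation costs bounded on high-traffic agents.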
309+
310+ ### Evaluator locking
311+
312+ When a code-based evaluator is referenced by an ** enabled** online evaluation config, AgentCore
313+ ** locks** it automatically. You cannot modify or delete a locked evaluator. To update it:
314+
315+ ```
316+ disable/delete online eval config
317+ ↓
318+ update evaluator Lambda or re-register
319+ ↓
320+ re-create online eval config
321+ ```
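
The ordering above can be captured in a small wrapper so the three steps always run in sequence. This is a sketch under assumptions: `update_locked_evaluator` and its three callables are hypothetical names, and each callable stands in for whatever SDK or CLI call you use for that step. Re-creating the config even when the update fails is a design choice made here so that monitoring resumes with the previous evaluator.

```python
from typing import Callable


def update_locked_evaluator(
    disable_config: Callable[[], None],
    update_evaluator: Callable[[], None],
    recreate_config: Callable[[], None],
) -> None:
    """Run the unlock, update, re-create sequence in the required order."""
    # Step 1: disabling or deleting the online eval config unlocks the evaluator.
    disable_config()
    try:
        # Step 2: update the Lambda code or re-register the evaluator.
        update_evaluator()
    finally:
        # Step 3: restore online evaluation, even if the update raised.
        recreate_config()


if __name__ == "__main__":
    log = []
    update_locked_evaluator(
        lambda: log.append("disable config"),
        lambda: log.append("update evaluator"),
        lambda: log.append("re-create config"),
    )
    print(log)
```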
322+
323+ ### On-demand vs. online comparison
324+
325+ | Dimension | On-demand | Online |
326+ | ---| ---| ---|
327+ | Trigger | Explicit per session | Automatic for each sampled session |
328+ | Setup | ` EvaluationClient.run() ` or ` OnDemandEvaluationDatasetRunner ` | ` create_online_evaluation_config ` once |
329+ | Code-based evaluators | ✅ Supported | ✅ Supported |
330+ | Evaluator locking | No | Yes — while config is enabled |
331+ | Best for | CI/CD, ad-hoc debugging | Continuous production monitoring |
332+
333+ ### AgentCore CLI shortcut
334+
335+ ``` bash
336+ # sampling-rate is a percentage (0.01–100); 50 = evaluate 50% of sessions
337+ agentcore add online-eval \
338+ --name my_online_eval \
339+ --runtime MyAgent \
340+ --evaluator MyCodeEvaluator \
341+ --sampling-rate 50 \
342+ --enable-on-create
343+ ```
344+
345+ ---
346+
159347## Sample Prompts
160348
161349The dataset includes five scenarios that exercise facts the ` HRFactChecker ` validates:
@@ -178,14 +366,15 @@ You can extend the dataset with additional scenarios to test more HR topics (rem
178366| ---| ---|
179367| 1 | Install dependencies (` bedrock-agentcore>=1.6.0 ` ) |
180368| 2 | Configure AWS session, region, and Lambda role ARN |
181- | 3 | Agent setup — reload from ` %store ` (groundtruth notebook) or deploy fresh |
369+ | 3 | Agent setup — reload from ` %store ` (groundtruth notebook) or deploy fresh with boto3 |
182370| 4 | Define Lambda evaluator functions using the ` @custom_code_based_evaluator() ` decorator |
183371| 5 | Deploy Lambda functions (bundled with bedrock-agentcore SDK + pydantic) |
184372| 6 | Register evaluators via ` bedrock-agentcore-control ` boto3 service |
185373| 7 | On-demand evaluation with ` EvaluationClient ` (code-based + builtin evaluators) |
186374| 8 | Dataset evaluation with ` OnDemandEvaluationDatasetRunner ` (mixed evaluator set) |
187375| 9 | Inspect and compare results (per-scenario tables + aggregate score comparison) |
188- | 10 | Cleanup — delete Lambda functions, evaluator records, and agent runtime |
376+ | ** 10** | ** Online evaluation with ` create_online_evaluation_config ` (code-based evaluators, auto-triggered)** |
377+ | 11 | Cleanup — delete Lambda functions, evaluator records, online eval config, and agent runtime |
189378
190379---
191380
@@ -213,8 +402,10 @@ span.span_events[*]
213402- ** Business rule enforcement** — encode domain-specific rules that LLMs might interpret loosely
214403- ** High-volume evaluation** — reduce cost for evaluations that run on every production session
215404- ** Regulatory requirements** — verify that required disclosures or disclaimers are always present
405+ - ** Continuous monitoring** — combine with online evaluation for zero-touch production quality gates
216406
217- > ** Note:** Code-based evaluators are supported for ** on-demand evaluation** (` EvaluationClient ` , ` OnDemandEvaluationDatasetRunner ` ) only. Online evaluation configs support built-in LLM evaluators only.
407+ Code-based evaluators are supported for ** both on-demand** (` EvaluationClient ` ,
408+ ` OnDemandEvaluationDatasetRunner ` ) and ** online** (` create_online_evaluation_config ` ) evaluation.
218409
219410---
220411
@@ -223,20 +414,27 @@ span.span_events[*]
223414To remove created AWS resources:
224415
225416``` python
226- # Delete Lambda functions
417+ # 1. Disable online evaluation config first (unlocks evaluators)
418+ cp_client.update_online_evaluation_config(
419+ onlineEvaluationConfigId = ONLINE_EVAL_CONFIG_ID ,
420+ enableOnCreate = False ,
421+ )
422+ cp_client.delete_online_evaluation_config(onlineEvaluationConfigId = ONLINE_EVAL_CONFIG_ID )
423+
424+ # 2. Delete Lambda functions
227425for fn in [" hr-response-length" , " hr-fact-checker" ]:
228426 lambda_client.delete_function(FunctionName = fn)
229427
230- # Delete evaluator registrations
428+ # 3. Delete evaluator registrations (now unlocked)
231429for name, eid in CODE_EVAL_IDS .items():
232430 cp_client.delete_evaluator(evaluatorId = eid)
233431
234- # Delete agent runtime (only if deployed in this notebook)
432+ # 4. Delete agent runtime (only if deployed in this notebook)
235433if not _agent_loaded:
236- agent_runtime.delete( )
434+ agentcore_control.delete_agent_runtime( agentRuntimeId = AGENT_ID )
237435```
238436
239- Alternatively, run the cleanup cell (Step 10 ) in the notebook — it is commented out by default to prevent accidental deletion.
437+ Alternatively, run the cleanup cell (Step 11 ) in the notebook — it is commented out by default to prevent accidental deletion.
240438
241439---
242440
@@ -245,4 +443,5 @@ Alternatively, run the cleanup cell (Step 10) in the notebook — it is commente
245443- Extend ` HRFactChecker ` with additional business rules as your agent and data model evolve
246444- Combine code-based evaluators with ` EvaluationClient ` to validate specific production sessions
247445- Add code-based evaluators to your CI/CD pipeline for low-cost, deterministic regression testing on every deployment
446+ - Use online evaluation with a lower sampling rate (e.g. 10%) to cost-effectively monitor high-traffic agents
248447- Explore the [ groundtruth tutorial] ( ../05-groundtruth-based-evalautions/ ) for ` EvaluationClient ` and ground-truth-based evaluations with built-in evaluators