Commit b8a9cfc

Merge pull request awslabs#72 from abhilashbalachandran/docs/fix-custom-evaluator-api-params
Fix Custom Evaluators section in AGENTCORE_EVALUATIONS_GUIDE.md
2 parents 2af05bc + c8ff64a commit b8a9cfc

2 files changed

Lines changed: 151 additions & 20 deletions

CONTRIBUTORS.md

Lines changed: 2 additions & 1 deletion
@@ -16,4 +16,5 @@
 - Gauhar Bains
 - Meghana Ashok
 - Suren Gunturu
-- Samaneh Aminikhanghahi
+- Samaneh Aminikhanghahi
+- Abhilash Balachandran

docs/AGENTCORE_EVALUATIONS_GUIDE.md

Lines changed: 149 additions & 19 deletions
@@ -422,6 +422,26 @@ for evaluator in evaluators.get("evaluatorSummaries", []):
 
 Use custom evaluators when built-in ones don't cover your domain-specific quality criteria (e.g., financial accuracy, medical safety, brand voice compliance).
 
+### Placeholder Reference
+
+Each evaluation level supports a fixed set of placeholders (single braces) that get replaced with actual trace data:
+
+| Level | Placeholder | Description |
+|---|---|---|
+| SESSION | `{context}` | User prompts, assistant responses, and tool calls across all turns |
+| SESSION | `{available_tools}` | Available tool calls including ID, parameters, and description |
+| TRACE | `{context}` | Previous turns + current turn's user prompt and tool calls |
+| TRACE | `{assistant_turn}` | The assistant response for the current turn |
+| TOOL_CALL | `{context}` | Previous turns + current turn's user prompt and prior tool calls |
+| TOOL_CALL | `{tool_turn}` | The tool call under evaluation |
+| TOOL_CALL | `{available_tools}` | Available tool calls including ID, parameters, and description |
+
+> **Important:** Use single braces `{placeholder}`, not double braces `{{placeholder}}`. The instruction must include at least one placeholder.
+
+### Create a Custom Evaluator (AWS SDK)
+
+The `create_evaluator` API uses a nested `evaluatorConfig` structure with `llmAsAJudge` containing the instructions, rating scale, and model config:
+
 ```python
 import boto3
 
@@ -430,33 +450,143 @@ control_client = boto3.client("bedrock-agentcore-control")
 response = control_client.create_evaluator(
     evaluatorName="domain_accuracy",
     description="Evaluates domain-specific accuracy for financial queries",
-    evaluationLevel="TRACE",
-    inferenceConfig={
-        "modelId": "us.anthropic.claude-sonnet-4-5-20250929-v1:0",
-        "maxTokens": 500,
-        "temperature": 1.0
-    },
-    instructions="""Evaluate the agent's response for domain-specific accuracy
-in financial contexts. Consider:
-1. Are financial terms used correctly?
-2. Are calculations accurate?
-3. Are regulatory references correct?
-
-{{input}} {{output}}""",
-    ratingScale={
-        "type": "NUMERIC",
-        "min": 0.0,
-        "max": 1.0,
-        "description": "0 = completely inaccurate, 1 = fully accurate"
+    level="TRACE",
+    evaluatorConfig={
+        "llmAsAJudge": {
+            "instructions": (
+                "You are evaluating the domain-specific accuracy of the assistant's response "
+                "in financial contexts. Consider:\n"
+                "1. Are financial terms used correctly?\n"
+                "2. Are calculations accurate?\n"
+                "3. Are regulatory references correct?\n\n"
+                "Context: {context}\n"
+                "Candidate Response: {assistant_turn}"
+            ),
+            "ratingScale": {
+                "numerical": [
+                    {
+                        "value": 1.0,
+                        "label": "Very Good",
+                        "definition": "Completely accurate, all facts and calculations correct"
+                    },
+                    {
+                        "value": 0.75,
+                        "label": "Good",
+                        "definition": "Mostly accurate with minor issues"
+                    },
+                    {
+                        "value": 0.5,
+                        "label": "OK",
+                        "definition": "Partially correct with notable errors"
+                    },
+                    {
+                        "value": 0.25,
+                        "label": "Poor",
+                        "definition": "Significant errors or misconceptions"
+                    },
+                    {
+                        "value": 0.0,
+                        "label": "Very Poor",
+                        "definition": "Completely incorrect or irrelevant"
+                    }
+                ]
+            },
+            "modelConfig": {
+                "bedrockEvaluatorModelConfig": {
+                    "modelId": "us.anthropic.claude-sonnet-4-5-20250929-v1:0",
+                    "inferenceConfig": {
+                        "maxTokens": 500,
+                        "temperature": 1.0
+                    }
+                }
+            }
+        }
     }
 )
 
 evaluator_arn = response["evaluatorArn"]
-print(f"Created custom evaluator: {evaluator_arn}")
+evaluator_id = evaluator_arn.split("/")[-1]
+print(f"Created custom evaluator: {evaluator_id}")
+```
+
+### Create a Custom Evaluator (Starter Toolkit SDK)
+
+```python
+import json
+from bedrock_agentcore_starter_toolkit import Evaluation
+
+eval_client = Evaluation(region="us-east-1")
+
+# Load config from JSON file (see config format below)
+with open("custom_evaluator_config.json") as f:
+    evaluator_config = json.load(f)
+
+custom_evaluator = eval_client.create_evaluator(
+    name="domain_accuracy",
+    level="TRACE",
+    description="Evaluates domain-specific accuracy for financial queries",
+    config=evaluator_config
+)
+```
+
+### Custom Evaluator Config JSON Format
+
+Save this as `custom_evaluator_config.json` for use with the Starter Toolkit SDK or AWS CLI:
+
+```json
+{
+  "llmAsAJudge": {
+    "modelConfig": {
+      "bedrockEvaluatorModelConfig": {
+        "modelId": "us.anthropic.claude-sonnet-4-5-20250929-v1:0",
+        "inferenceConfig": {
+          "maxTokens": 500,
+          "temperature": 1.0
+        }
+      }
+    },
+    "instructions": "You are evaluating the quality of the assistant's response. Context: {context}\nCandidate Response: {assistant_turn}",
+    "ratingScale": {
+      "numerical": [
+        {"value": 1.0, "label": "Very Good", "definition": "Completely accurate"},
+        {"value": 0.75, "label": "Good", "definition": "Mostly accurate with minor issues"},
+        {"value": 0.5, "label": "OK", "definition": "Partially correct"},
+        {"value": 0.25, "label": "Poor", "definition": "Significant errors"},
+        {"value": 0.0, "label": "Very Poor", "definition": "Completely incorrect"}
+      ]
+    }
+  }
+}
+```
+
+### Add a Custom Evaluator to an Online Config
+
+After creating the evaluator, add it to your online evaluation config:
+
+```python
+import boto3
+
+control_client = boto3.client("bedrock-agentcore-control")
+
+# Get current evaluators
+config = control_client.get_online_evaluation_config(
+    onlineEvaluationConfigId="your-config-id"
+)
+current_evaluators = [e["evaluatorId"] for e in config.get("evaluators", [])]
+
+# Add the custom evaluator
+current_evaluators.append("domain_accuracy-XXXXXXXXXX")  # Use the ID from create_evaluator
+
+control_client.update_online_evaluation_config(
+    onlineEvaluationConfigId="your-config-id",
+    evaluators=[{"evaluatorId": eid} for eid in current_evaluators]
+)
 ```
 
 Custom evaluators can then be used in both online and on-demand evaluations just like built-in ones.
 
+> **Note:** The service automatically appends a standardization prompt to your instructions that enforces `reason` and `score` output fields. Do not include output formatting instructions in your evaluator instructions.
+
 ---
 
 ## Evaluation Results & Format
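
Review note on the placeholder rules this commit adds to the guide: the Placeholder Reference table and the **Important** callout can be checked mechanically before an instruction string is sent to `create_evaluator`. The following is a minimal sketch, not part of the commit or of any AgentCore SDK. `ALLOWED_PLACEHOLDERS` and `check_instructions` are hypothetical names, and the per-level sets are copied by hand from the table in the diff.

```python
import re

# Allowed placeholders per evaluation level, copied from the Placeholder
# Reference table in the diff. Illustrative only; this is not an SDK constant.
ALLOWED_PLACEHOLDERS = {
    "SESSION": {"context", "available_tools"},
    "TRACE": {"context", "assistant_turn"},
    "TOOL_CALL": {"context", "tool_turn", "available_tools"},
}

# {{name}} is the old, rejected form; {name} is the supported form.
_DOUBLE = re.compile(r"\{\{\s*(\w+)\s*\}\}")
_SINGLE = re.compile(r"(?<!\{)\{(\w+)\}(?!\})")


def check_instructions(level: str, instructions: str) -> list[str]:
    """Return a list of problems found; an empty list means the text passes."""
    allowed = ALLOWED_PLACEHOLDERS.get(level)
    if allowed is None:
        return [f"unknown level: {level}"]

    problems = []
    for name in _DOUBLE.findall(instructions):
        problems.append(f"double braces around '{name}'; use {{{name}}}")

    found = set(_SINGLE.findall(instructions))
    for name in sorted(found - allowed):
        problems.append(f"'{{{name}}}' is not a {level} placeholder")
    if not found & allowed:
        problems.append("no valid placeholder; at least one is required")
    return problems


if __name__ == "__main__":
    # The removed instructions ended in "{{input}} {{output}}"; the new ones
    # use {context} and {assistant_turn}.
    print(check_instructions("TRACE", "{{input}} {{output}}"))
    print(check_instructions("TRACE", "Context: {context}\nCandidate Response: {assistant_turn}"))
```

Run against the removed `{{input}} {{output}}` line, the check reports both double-brace uses and the missing valid placeholder. The replacement instructions in the diff pass it.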

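Review note on the "Add a Custom Evaluator to an Online Config" snippet: it appends the evaluator ID unconditionally, so running it twice would send the same ID twice. The source does not say how the service treats duplicates, so the sketch below simply avoids producing them. Both helper names are hypothetical, and the ARN in the usage lines is a made-up example whose shape is an assumption; only the `split("/")[-1]` step and the `{"evaluatorId": ...}` payload shape come from the diff.

```python
def evaluator_id_from_arn(evaluator_arn: str) -> str:
    """Take the text after the last '/' as the evaluator ID, as the diff does."""
    return evaluator_arn.split("/")[-1]


def merged_evaluators(current_ids: list[str], new_id: str) -> list[dict]:
    """Build the `evaluators` payload, adding new_id only if it is absent."""
    ids = list(current_ids)  # copy so the caller's list is not mutated
    if new_id not in ids:
        ids.append(new_id)
    return [{"evaluatorId": eid} for eid in ids]


if __name__ == "__main__":
    # Hypothetical ARN, used only to exercise the helpers.
    arn = "arn:aws:bedrock-agentcore:us-east-1:123456789012:evaluator/domain_accuracy-XXXXXXXXXX"
    new_id = evaluator_id_from_arn(arn)
    print(merged_evaluators(["existing-evaluator-id"], new_id))
```

The returned list can be passed as the `evaluators` argument of `update_online_evaluation_config` in place of the list comprehension in the diff.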