Commit 279cc2a

docs: add documentation for automated skill evaluations using EvalBench (#129)
1 parent c441803 commit 279cc2a

1 file changed

Lines changed: 25 additions & 0 deletions

DEVELOPER.md

@@ -49,6 +49,31 @@ Currently, there are no automated unit or integration test suites
within this repository. All functional testing must be performed manually. All skills
are currently tested in the [MCP Toolbox GitHub](https://github.com/googleapis/mcp-toolbox).

### Automated Skill Evaluations (EvalBench)

This repository uses the [EvalBench framework](https://github.com/GoogleCloudPlatform/evalbench) to automatically evaluate the extension's response quality, multi-turn conversational capabilities, and skill execution.

Evaluations run automatically via Cloud Build (`cloudbuild.yaml`) on pull requests when the `ci:run-evals` or `autorelease: pending` label is applied. Because the evaluations run against a live BigQuery dataset, Secret Manager securely injects the required credentials during CI.
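
The label rule above amounts to a set-membership check. The Python sketch below only illustrates that rule: the function name `should_run_evals` is hypothetical, and how the real pipeline inspects labels is not shown in this document.

```python
# Labels that trigger the evaluation pipeline, as named in this document.
EVAL_TRIGGER_LABELS = frozenset({"ci:run-evals", "autorelease: pending"})


def should_run_evals(pr_labels):
    """Return True if any label on the pull request triggers evaluations.

    Hypothetical helper for illustration; not part of this repository.
    """
    return any(label in EVAL_TRIGGER_LABELS for label in pr_labels)


if __name__ == "__main__":
    print(should_run_evals(["docs", "ci:run-evals"]))
    print(should_run_evals(["docs"]))
```

Under this reading, a pull request carrying neither label does not start an evaluation run.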

#### Understanding Evaluation Files

All evaluation configurations and datasets are located in the [`evals/`](evals/) directory:

* **Conversational Dataset (`dataset.json`):** Defines test scenarios for the model. Each scenario contains:
    * `starting_prompt`: The initial prompt sent to the agent.
    * `conversation_plan`: Instructions for the simulated user LLM to drive multi-turn interactions.
    * `expected_trajectory`: The sequence of tool/skill calls expected to successfully complete the task.
* **Run Configuration (`run_config.yaml`):** Configures the EvalBench orchestrator, target model configs, and qualitative/performance scorers (e.g., goal completion, behavioral metrics, latency, token consumption).
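
As a rough illustration of the scenario fields listed above, the Python sketch below builds one scenario and prints it as JSON. The field names come from this document; every value, the tool name `example_list_tables`, and the choice to represent `expected_trajectory` as a list of strings are assumptions. Check the existing entries in `evals/dataset.json` for the actual schema.

```python
import json

# One illustrative scenario. Field names follow this document; every value
# below is a made-up placeholder, not an entry from the real dataset.
scenario = {
    "id": "example-list-tables-001",
    "starting_prompt": "Which tables are in my sales dataset?",
    "conversation_plan": (
        "If the agent asks which project to use, answer 'my-project'. "
        "End the conversation once the table names are listed."
    ),
    # Assumed shape: an ordered list of tool names (placeholder name).
    "expected_trajectory": ["example_list_tables"],
}

print(json.dumps(scenario, indent=2))
```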

#### Maintaining and Adding Scenarios

When adding new skills or modifying existing behavior, add or update the corresponding scenarios in the dataset file:

1. Open `evals/dataset.json`.
2. Add a new scenario block with a unique `id`, a clear `starting_prompt`, a detailed `conversation_plan`, and the `expected_trajectory` of tool calls.
3. Apply the `ci:run-evals` label when creating your pull request to trigger the evaluation pipeline.
4. The evaluation pipeline runs securely via Cloud Build. A maintainer will review the internal logs and results to verify that your scenarios pass.
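
Before applying the label, a quick local check can catch duplicate `id` values or missing fields. The sketch below is a hypothetical helper, not part of this repository or of EvalBench. It assumes the dataset file is a top-level JSON array of scenario objects, which may not match the real file layout.

```python
import json
from collections import Counter

# Fields named in the steps above.
REQUIRED_FIELDS = ("id", "starting_prompt", "conversation_plan", "expected_trajectory")


def find_scenario_problems(scenarios):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for index, scenario in enumerate(scenarios):
        for field in REQUIRED_FIELDS:
            # Treat absent and empty values the same way.
            if not scenario.get(field):
                problems.append(f"scenario #{index}: missing or empty '{field}'")
    id_counts = Counter(s.get("id") for s in scenarios if s.get("id"))
    for scenario_id, count in id_counts.items():
        if count > 1:
            problems.append(f"duplicate id '{scenario_id}' appears {count} times")
    return problems


def check_dataset_file(path="evals/dataset.json"):
    """Load the dataset (assumed to be a top-level JSON array) and check it."""
    with open(path, encoding="utf-8") as handle:
        return find_scenario_problems(json.load(handle))
```

Under these assumptions, calling `check_dataset_file()` from the repository root returns an empty list when no problems are found.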

### Other GitHub Checks

* **License Header Check:** A workflow ensures all necessary files contain the
