DEVELOPER.md: 25 additions & 0 deletions
@@ -49,6 +49,31 @@ Currently, there are no automated unit or integration test suites
within this repository. All functional testing must be performed manually. All skills
are currently tested in the [MCP Toolbox GitHub](https://github.com/googleapis/mcp-toolbox).

### Automated Skill Evaluations (EvalBench)

This repository uses the [EvalBench framework](https://github.com/GoogleCloudPlatform/evalbench) to automatically evaluate the quality, multi-turn conversational capabilities, and skill execution of the extension.

Evaluations run automatically via Cloud Build (`cloudbuild.yaml`) on pull requests when the `ci:run-evals` or `autorelease: pending` label is applied. Because tests run against a live BigQuery dataset, credentials are securely injected by Secret Manager during CI.
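
The label rule above can be restated as a small predicate. This is only an illustrative sketch: the real trigger lives in the CI configuration, and the function and constant names here are hypothetical, not part of this repository.

```python
# Labels named in the paragraph above; either one is enough to trigger evals.
EVAL_TRIGGER_LABELS = frozenset({"ci:run-evals", "autorelease: pending"})


def should_run_evals(pr_labels):
    """Return True if any label on the pull request triggers the eval run.

    `pr_labels` is assumed to be an iterable of label-name strings.
    """
    return any(label in EVAL_TRIGGER_LABELS for label in pr_labels)
```

Labels are compared exactly, so a near miss such as `ci:run-eval` would not count as a trigger in this sketch.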

#### Understanding Evaluation Files

All evaluation configurations and datasets are located in the [`evals/`](evals/) directory:

* **Conversational Dataset (`dataset.json`):** Defines test scenarios for the model. Each scenario contains:
  * `starting_prompt`: The initial prompt sent to the agent.
  * `conversation_plan`: Instructions for the simulated user LLM to drive multi-turn interactions.
  * `expected_trajectory`: The sequence of tool/skill calls expected to successfully complete the task.
* **Run Configuration (`run_config.yaml`):** Configures the EvalBench orchestrator, target model configs, and qualitative/performance scorers (e.g., goal completion, behavioral metrics, latency, token consumption).
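
To make the scenario fields concrete, here is a rough sketch of what a single scenario might look like, built in Python and printed as JSON. Only the field names come from this document; the value types (plain strings, a list of tool-name strings), the tool name `list_tables`, and the prompt text are assumptions for illustration, so check existing entries in `evals/dataset.json` for the actual shape.

```python
import json

# Hypothetical scenario: field names are from this document, values are invented.
example_scenario = {
    "id": "list-tables-basic",
    "starting_prompt": "Which tables are in my sales dataset?",
    "conversation_plan": (
        "If the agent asks which project to use, say to use the default one. "
        "End the conversation once the tables have been listed."
    ),
    # Assumed to be an ordered list of tool names; "list_tables" is a placeholder.
    "expected_trajectory": ["list_tables"],
}

print(json.dumps(example_scenario, indent=2))
```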

#### Maintaining and Adding Scenarios

When adding new skills or modifying existing behavior, you should add or update corresponding scenarios in the dataset file:

1. Open `evals/dataset.json`.
2. Add a new scenario block with a unique `id`, a clear `starting_prompt`, a detailed `conversation_plan`, and the `expected_trajectory` of tool calls.
3. Apply the `ci:run-evals` label while creating your pull request to trigger the evaluation pipeline.
4. The evaluation pipeline runs securely via Cloud Build. A maintainer will review the internal logs and results to verify your scenarios pass successfully.
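
Since contributors cannot see the internal logs, a quick local check before opening the pull request may save a round trip. The sketch below is not part of this repository or of EvalBench. It assumes `evals/dataset.json` is a top-level JSON array of scenario objects, and the helper names are hypothetical.

```python
import json
from collections import Counter

# Fields that step 2 above asks every scenario to provide.
REQUIRED_FIELDS = ("id", "starting_prompt", "conversation_plan", "expected_trajectory")


def find_scenario_problems(scenarios):
    """Return human-readable problems: missing/empty fields and duplicate ids."""
    problems = []
    for index, scenario in enumerate(scenarios):
        for field in REQUIRED_FIELDS:
            if not scenario.get(field):
                problems.append(f"scenario #{index}: missing or empty '{field}'")
    # Count only scenarios that actually have an id; missing ids are reported above.
    id_counts = Counter(s.get("id") for s in scenarios if s.get("id"))
    for scenario_id, count in id_counts.items():
        if count > 1:
            problems.append(f"duplicate id '{scenario_id}' used {count} times")
    return problems


def load_scenarios(path="evals/dataset.json"):
    """Load scenarios, assuming the file is a JSON array of objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Calling `find_scenario_problems(load_scenarios())` from the repository root returns an empty list when every scenario has the four fields and all `id` values are distinct. It does not check whether the trajectory itself is correct; that still depends on the CI evaluation run.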

### Other GitHub Checks

* **License Header Check:** A workflow ensures all necessary files contain the