docs: add documentation for automated skill evaluations using EvalBench (#129)

omkargaikwad23 · web-flow · commit 279cc2a78b91 · 2026-05-11T10:45:44.000+05:30
diff --git a/DEVELOPER.md b/DEVELOPER.md
@@ -49,6 +49,31 @@ Currently, there are no automated unit or integration test suites
 within this repository. All functional testing must be performed manually. All skills
 are currently tested in the [MCP Toolbox GitHub](https://github.com/googleapis/mcp-toolbox).
 
+### Automated Skill Evaluations (EvalBench)
+
+This repository uses the [EvalBench framework](https://github.com/GoogleCloudPlatform/evalbench) to automatically evaluate the extension's quality, multi-turn conversational capabilities, and skill execution.
+
+Evaluations run in Cloud Build (`cloudbuild.yaml`) on pull requests that carry the `ci:run-evals` or `autorelease: pending` label. Because the tests run against a live BigQuery dataset, the required credentials are securely injected from Secret Manager during CI.
+
+#### Understanding Evaluation Files
+
+All evaluation configurations and datasets are located in the [`evals/`](evals/) directory:
+
+*   **Conversational Dataset (`dataset.json`):** Defines test scenarios for the model. Each scenario contains:
+    *   `starting_prompt`: The initial prompt sent to the agent.
+    *   `conversation_plan`: Instructions for the simulated user LLM to drive multi-turn interactions.
+    *   `expected_trajectory`: The sequence of tool/skill calls expected to successfully complete the task.
+*   **Run Configuration (`run_config.yaml`):** Configures the EvalBench orchestrator, the target model, and the qualitative and performance scorers (e.g., goal completion, behavioral metrics, latency, token consumption).
+
+#### Maintaining and Adding Scenarios
+
+When you add a skill or change existing behavior, add or update the corresponding scenarios in the dataset file:
+
+1.  Open `evals/dataset.json`.
+2.  Add a new scenario block with a unique `id`, a clear `starting_prompt`, a detailed `conversation_plan`, and the `expected_trajectory` of tool calls.
+3.  Apply the `ci:run-evals` label while creating your pull request to trigger the evaluation pipeline.
+4.  The evaluation pipeline runs securely in Cloud Build. A maintainer will review the internal logs and results to verify that your scenarios pass.
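+
+For orientation, a scenario block might look like the sketch below. Only the four field names come from this guide. The value types (for example, whether `expected_trajectory` is a list of strings or of objects), the angle-bracket placeholders, and any structure that wraps scenarios inside `dataset.json` are assumptions, so mirror an existing scenario in `evals/dataset.json` rather than this shape.
+
+```json
+{
+  "id": "<unique-scenario-id>",
+  "starting_prompt": "<initial prompt sent to the agent>",
+  "conversation_plan": "<instructions for the simulated user LLM>",
+  "expected_trajectory": ["<first-tool-or-skill>", "<second-tool-or-skill>"]
+}
+```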
+
 ### Other GitHub Checks
 
 *   **License Header Check:** A workflow ensures all necessary files contain the