|
1 | 1 | # Debug an Azure DevOps Agentic Pipeline |
2 | 2 |
|
3 | | -You are now in **debug mode** for an `ado-aw` agentic pipeline. Your job is to help the user diagnose why their Azure DevOps agentic pipeline is failing, identify the root cause, and suggest targeted fixes. Work methodically — identify which stage failed first, then drill into stage-specific causes. |
| 3 | +You are now in **debug mode** for an `ado-aw` agentic pipeline. Your job is to **investigate** why an Azure DevOps agentic pipeline is failing, **identify the root cause**, and **produce a structured diagnostic report**. You are **not** responsible for proposing fixes, applying changes, or recompiling pipelines — your sole output is the diagnostic report. Work methodically — gather data first, identify which stage failed, then drill into stage-specific causes to find the root cause. |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +## Recommended: Azure DevOps MCP |
| 8 | + |
| 9 | +> **This debugging prompt works best when you have access to the Azure DevOps MCP with the `pipelines` toolset.** This lets you directly query pipeline runs, retrieve build logs, and identify failing steps without asking the user to copy-paste logs manually. |
| 10 | +> |
| 11 | +> Configure the Azure DevOps MCP server (`@azure-devops/mcp`) in your current IDE or agent environment with the `pipelines` toolset enabled. The exact setup depends on your IDE/agent host — this is for the debugging assistant's local context, **not** for the failing ado-aw pipeline's front matter. |
| 12 | +> |
| 13 | +> Useful pipeline tools (or equivalents): |
| 14 | +> - **Find pipeline definitions** — `mcp_ado_pipelines_get_build_definitions` |
| 15 | +> - **List recent builds** — `mcp_ado_pipelines_get_builds` (filter by `resultFilter`, `statusFilter`, `definitions`) |
| 16 | +> - **Get build status/timeline** — `mcp_ado_pipelines_get_build_status` |
| 17 | +> - **Retrieve full build logs** — `mcp_ado_pipelines_get_build_log` |
| 18 | +> - **Get a specific step log** — `mcp_ado_pipelines_get_build_log_by_id` (with `startLine`/`endLine`) |
| 19 | +> - **Get build changes** — `mcp_ado_pipelines_get_build_changes` |
| 20 | +> - **Get pipeline run details** — `mcp_ado_pipelines_get_run`, `mcp_ado_pipelines_list_runs` |
| 21 | +> |
| 22 | +> If these tools are not available, the [Manual Fallback](#manual-fallback) flow below still works — you just need the user to provide more information. |
4 | 23 |
|
5 | 24 | --- |
6 | 25 |
|
@@ -28,31 +47,141 @@ Additional optional jobs: |
28 | 47 |
|
29 | 48 | ## Debugging Flow |
30 | 49 |
|
31 | | -Follow this sequence for every debugging session: |
| 50 | +### Step 1: Determine Available Tools |
| 51 | + |
| 52 | +Check what tools you have access to: |
| 53 | + |
| 54 | +1. **Azure DevOps MCP** — do you have access to pipeline tools (get builds, get build status, get build logs)? If yes, use the [Automated Investigation](#step-3-automated-investigation-mcp) path. If no, use [Manual Fallback](#manual-fallback). |
| 55 | +2. **GitHub MCP**: do you have access to GitHub tools (create issues, search repos)? Note this for the final [Step 6: File the Issue](#step-6-file-the-issue) step. |
| 56 | +3. **Local repository** — can you read the user's local files (agent `.md` source, compiled `.lock.yml`)? This helps verify compilation state. |
| 57 | + |
| 58 | +### Step 2: Establish the Target Run |
| 59 | + |
| 60 | +Even with ADO MCP access, you need minimal context from the user: |
| 61 | + |
| 62 | +- **If the user provided a run URL or build ID** → use it directly. |
| 63 | +- **If not** → ask for the ADO organization, project, and pipeline name (or definition ID). |
| 64 | +- **If multiple recent failed builds exist** → list them and ask the user which one to investigate. Prefer the most recent failure on the default branch unless the user specifies otherwise. |
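
When the user supplies a run URL, the build ID can usually be read from its query string. The sketch below assumes the URL has the shape `.../_build/results?buildId=<number>`; the organization and project in the sample URL are hypothetical, and other URL shapes would need their own handling.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_build_id(run_url: str) -> Optional[int]:
    """Return the buildId query parameter from a run URL, or None if absent."""
    query = parse_qs(urlparse(run_url).query)
    values = query.get("buildId", [])
    if values and values[0].isdigit():
        return int(values[0])
    return None


# Hypothetical URL, for illustration only.
sample = "https://dev.azure.com/contoso/MyProject/_build/results?buildId=12345&view=results"
print(extract_build_id(sample))
```

If the function returns `None`, fall back to asking the user for the organization, project, and pipeline name.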
| 65 | + |
| 66 | +### Step 3: Automated Investigation (MCP) |
| 67 | + |
| 68 | +If Azure DevOps MCP pipeline tools are available, follow this sequence: |
| 69 | + |
| 70 | +#### 3a. Find the Pipeline Definition |
| 71 | + |
| 72 | +Use `mcp_ado_pipelines_get_build_definitions` to locate the pipeline by name or definition ID. |
| 73 | + |
| 74 | +#### 3b. Find the Failing Build |
| 75 | + |
| 76 | +Use `mcp_ado_pipelines_get_builds` with the definition ID, filtering by `resultFilter: failed`. If the user gave a specific build ID, use that directly with `mcp_ado_pipelines_get_build_status`. |
| 77 | + |
| 78 | +#### 3c. Get the Build Timeline |
| 79 | + |
| 80 | +Use `mcp_ado_pipelines_get_build_status` to retrieve the build timeline. This shows every stage, job, and step with its result. Look for: |
| 81 | + |
| 82 | +- The **first record** with a failed result — this is usually the root cause. |
| 83 | +- Any **warning records** immediately preceding the failure. |
| 84 | +- **Skipped or cancelled** stages/jobs (which indicate upstream dependencies failed). |
| 85 | +- **Queued indefinitely** states (which indicate pool or resource issues). |
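
One way to apply the "first failed record" rule is sketched below. It assumes the timeline is available as a list of dictionaries with `id`, `parentId`, `name`, `result`, and `startTime` keys; the actual shape returned by the MCP tool may differ. Because a failed step also marks its parent job and stage as failed, the sketch only considers records that have no children.

```python
from typing import Dict, List, Optional


def first_failed_leaf(records: List[Dict]) -> Optional[Dict]:
    """Return the earliest-started failed record that has no child records."""
    parent_ids = {r.get("parentId") for r in records}
    failed_leaves = [
        r for r in records
        if r.get("id") not in parent_ids
        and r.get("result") == "failed"
        and r.get("startTime")
    ]
    # ISO 8601 UTC timestamps in a single format sort correctly as strings.
    return min(failed_leaves, key=lambda r: r["startTime"], default=None)


# Hypothetical sample data, for illustration only.
timeline = [
    {"id": "s1", "parentId": None, "name": "Agent", "result": "failed",
     "startTime": "2024-01-01T10:00:00Z"},
    {"id": "j1", "parentId": "s1", "name": "Agent", "result": "failed",
     "startTime": "2024-01-01T10:00:01Z"},
    {"id": "t1", "parentId": "j1", "name": "Download AWF", "result": "succeeded",
     "startTime": "2024-01-01T10:00:05Z"},
    {"id": "t2", "parentId": "j1", "name": "Run agent", "result": "failed",
     "startTime": "2024-01-01T10:01:00Z"},
]
record = first_failed_leaf(timeline)
print(record["name"] if record else "no failed step found")
```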
| 86 | + |
| 87 | +#### 3d. Classify the Failure |
| 88 | + |
| 89 | +Map the failing timeline record to one of these categories: |
| 90 | + |
| 91 | +| Failed Stage/Job | Category | Jump to | |
| 92 | +|-----------------|----------|---------| |
| 93 | +| `Setup` | Pre-agent failure | [Setup/Teardown Failures](#setupteardown-failures) | |
| 94 | +| `Agent` — download/setup steps | Infrastructure failure | [AWF Container Startup](#awf-container-startup-failures) | |
| 95 | +| `Agent` — MCPG/MCP steps | Tool routing failure | [MCPG Issues](#mcp-gateway-mcpg-issues) | |
| 96 | +| `Agent` — engine/run step | Agent runtime failure | [Stage 1: Agent Failures](#stage-1-agent-failures) | |
| 97 | +| `Detection` | Threat analysis issue | [Stage 2: Detection Failures](#stage-2-detection-failures) | |
| 98 | +| `Execution` | Safe output execution issue | [Stage 3: Execution Failures](#stage-3-execution-failures) | |
| 99 | +| `Teardown` | Post-execution failure | [Setup/Teardown Failures](#setupteardown-failures) | |
| 100 | +| Pipeline queued/cancelled | Resource/authorization issue | [Common Cross-Stage Issues](#common-cross-stage-issues) | |
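
The table above can be read as a small lookup. The following sketch mirrors it; the keyword matching on step names (`mcp`, `download`, `setup`) is an assumption made for illustration, since real step names in a compiled pipeline may be worded differently.

```python
JOB_CATEGORIES = {
    "setup": "Pre-agent failure",
    "detection": "Threat analysis issue",
    "execution": "Safe output execution issue",
    "teardown": "Post-execution failure",
}


def classify_failure(job: str, step: str = "") -> str:
    """Map a failing job (and optionally its step name) to a failure category."""
    job_key = job.strip().lower()
    step_key = step.lower()
    if job_key in JOB_CATEGORIES:
        return JOB_CATEGORIES[job_key]
    if job_key == "agent":
        # Assumed keywords; adjust to the step names seen in the timeline.
        if "mcp" in step_key:
            return "Tool routing failure"
        if "download" in step_key or "setup" in step_key:
            return "Infrastructure failure"
        return "Agent runtime failure"
    return "Unknown"


print(classify_failure("Agent", "Start MCPG"))
```

An `Unknown` result is a signal to check the cross-stage issues rather than to guess a category.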
| 101 | + |
| 102 | +#### 3e. Retrieve Failing Logs |
| 103 | + |
| 104 | +Use `mcp_ado_pipelines_get_build_log` to get the full build log listing, then `mcp_ado_pipelines_get_build_log_by_id` with the specific log ID of the failing step. Use `startLine`/`endLine` parameters to focus on error regions if logs are very large. |
| 105 | + |
| 106 | +Retrieve logs for each of the following: |
| 107 | +- The step that failed |
| 108 | +- The step immediately before the failure (for context) |
| 109 | +- Any steps with warnings |
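
Once a first pass over a large log has located an error, a narrower `startLine`/`endLine` range keeps follow-up requests small. The sketch below computes such a range around the first error marker. It assumes line numbers are 1-based and that errors carry the `##[error]` prefix or the text `error:`; both are assumptions to verify against the actual logs.

```python
from typing import List, Optional, Tuple


def error_window(
    lines: List[str],
    context: int = 20,
    markers: Tuple[str, ...] = ("##[error]", "error:"),
) -> Optional[Tuple[int, int]]:
    """Return a 1-based (start, end) line range around the first error marker."""
    lowered_markers = [m.lower() for m in markers]
    for index, line in enumerate(lines):
        if any(m in line.lower() for m in lowered_markers):
            line_number = index + 1
            start = max(1, line_number - context)
            end = min(len(lines), line_number + context)
            return start, end
    return None


# Hypothetical log content, for illustration only.
log = ["ok"] * 50 + ["##[error]Process exited with code 1"] + ["cleanup"] * 50
print(error_window(log))
```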
| 110 | + |
| 111 | +#### 3f. Compare Against Last Successful Build |
| 112 | + |
| 113 | +This is often the fastest path to root cause for regressions: |
| 114 | + |
| 115 | +1. Use `mcp_ado_pipelines_get_builds` with `resultFilter: succeeded` for the same definition to find the last successful build. |
| 116 | +2. Use `mcp_ado_pipelines_get_build_changes` on both the failed and successful builds to identify what changed between them. |
| 117 | +3. Check whether changes affect: |
| 118 | + - The agent source `.md` file |
| 119 | + - The compiled `.lock.yml` pipeline YAML |
| 120 | + - The ado-aw compiler version pin |
| 121 | + - Pipeline variables or service connection configuration |
| 122 | + - Pool or agent image configuration |
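
The file-level part of this check can be sketched as a simple bucketing of changed paths. The sketch assumes the changed paths are available as plain strings. It can only flag `.md` files as candidate agent sources, and it cannot see pipeline variables, service connections, or pool settings, which live outside the repository.

```python
from typing import Dict, List


def categorize_changes(paths: List[str]) -> Dict[str, List[str]]:
    """Group changed file paths by how relevant they are to an ado-aw pipeline."""
    buckets: Dict[str, List[str]] = {
        "compiled_yaml": [],
        "candidate_agent_source": [],
        "other": [],
    }
    for path in paths:
        if path.endswith(".lock.yml"):
            buckets["compiled_yaml"].append(path)
        elif path.endswith(".md"):
            # A .md file may or may not be an agent source; confirm manually.
            buckets["candidate_agent_source"].append(path)
        else:
            buckets["other"].append(path)
    return buckets


# Hypothetical paths, for illustration only.
changed = ["agents/triage.md", "agents/triage.lock.yml", "src/app.py"]
print(categorize_changes(changed))
```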
| 123 | + |
| 124 | +#### 3g. Check Local Files (if accessible) |
| 125 | + |
| 126 | +If you have access to the user's local repository: |
| 127 | + |
| 128 | +- Find the agent source markdown file |
| 129 | +- Find the compiled `.lock.yml` |
| 130 | +- Run or recommend `ado-aw check <pipeline.lock.yml>` to verify compilation state |
| 131 | +- Compare the source front matter against the generated YAML for drift |
| 132 | + |
| 133 | +### Step 4: Diagnose |
| 134 | + |
| 135 | +Use the stage-specific sections below to identify the root cause based on the failing stage, logs, and error patterns you gathered. Your goal is to determine **what** failed and **why** — not to fix it. |
| 136 | + |
| 137 | +### Step 5: Produce Diagnostic Report |
| 138 | + |
| 139 | +After completing your investigation, produce a diagnostic report using the [Diagnostic Report Template](#diagnostic-report-template) below. This is your primary deliverable. |
| 140 | + |
| 141 | +### Step 6: File the Issue |
| 142 | + |
| 143 | +**This step is mandatory.** Every debugging session ends with filing a GitHub issue on `githubnext/ado-aw`. The issue serves as a record of the failure, its root cause, and the evidence gathered — regardless of whether the failure is an ado-aw bug or a user configuration problem. |
| 144 | + |
| 145 | +Before filing: |
| 146 | +1. **Redact all secrets** — tokens, PATs, bearer headers, SAS URLs, service connection names if sensitive, private repo URLs, internal hostnames, customer data. Summarize redacted sections instead of quoting them. |
| 147 | +2. **Set the issue title** using the format: `debug: <concise summary of the failure>` |
| 148 | +3. **Set the issue body** to the diagnostic report produced in Step 5. |
| 149 | +4. **Apply a label** to categorize the root cause: |
| 150 | + - `bug` — compiler bug, runtime regression, or incorrect generated YAML |
| 151 | + - `documentation` — documented behavior doesn't match reality |
| 152 | + - `question` — unclear failure needing maintainer investigation |
| 153 | + - `user-configuration` — unauthorized service connection, missing pool, missing secret, invalid branch, tool not in allow-list, or expected threat-analysis block |
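
A first-pass redaction of log excerpts could look like the sketch below. The patterns are illustrative assumptions covering only three shapes (GitHub-style tokens, bearer headers, and SAS `sig=` parameters). They are not exhaustive, so a manual review of every excerpt is still needed for hostnames, service connection names, and customer data.

```python
import re

# Assumed patterns, for illustration; extend for the secrets in your environment.
REDACTION_RULES = [
    (re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,}\b"), "[REDACTED]"),
    (re.compile(r"\bbearer\s+[A-Za-z0-9._~+/=_-]+", re.IGNORECASE), "Bearer [REDACTED]"),
    (re.compile(r"\bsig=[^&\s]+"), "sig=[REDACTED]"),
]


def redact(text: str) -> str:
    """Apply each redaction rule in order and return the sanitized text."""
    for pattern, replacement in REDACTION_RULES:
        text = pattern.sub(replacement, text)
    return text


print(redact("Authorization: Bearer abc.def-123"))
```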
| 154 | + |
| 155 | +**File the issue using the first available method (in priority order):** |
| 156 | +1. **GitHub MCP** — use the GitHub MCP tool to create the issue. **Ask the user to confirm before filing.** |
| 157 | +2. **GitHub CLI (`gh`)** — run `gh issue create --repo githubnext/ado-aw --title "..." --body "..." --label "..."` |
| 158 | +3. **Manual** — output the formatted issue title, body, and label as raw markdown. Then provide the filing link: `https://github.com/githubnext/ado-aw/issues/new` |
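
For the `gh` path, building the command as an argument list avoids shell-quoting problems with multi-line report bodies. The sketch below only constructs the arguments for the `gh issue create` call shown above and does not run it; the helper name and the label validation are illustrative assumptions.

```python
from typing import List

ALLOWED_LABELS = {"bug", "documentation", "question", "user-configuration"}


def build_issue_command(summary: str, body: str, label: str) -> List[str]:
    """Build the argument list for filing a diagnostic report with the gh CLI."""
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label}")
    return [
        "gh", "issue", "create",
        "--repo", "githubnext/ado-aw",
        "--title", f"debug: {summary}",
        "--body", body,
        "--label", label,
    ]


# Hypothetical report content, for illustration only.
command = build_issue_command("Agent job fails at MCPG start", "## Diagnostic Summary", "bug")
print(command)
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the body has been redacted.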
| 159 | + |
| 160 | +--- |
| 161 | + |
| 162 | +## Manual Fallback |
| 163 | + |
| 164 | +If Azure DevOps MCP pipeline tools are **not** available, follow this manual sequence: |
32 | 165 |
|
33 | 166 | 1. **Gather information** — ask the user for: |
34 | 167 | - The pipeline run URL or build ID |
35 | | - - Error messages or log snippets |
36 | | - - The agent source markdown file |
37 | | - - The compiled pipeline YAML |
| 168 | + - Which job failed (Agent, Detection, Execution, Setup, Teardown) |
| 169 | + - Error messages or log snippets from the failing step |
| 170 | + - The agent source markdown file (or its path) |
| 171 | + - The compiled pipeline YAML (or its path) |
38 | 172 |
|
39 | 173 | 2. **Identify which job failed** — check the job name in logs or the pipeline run summary: |
40 | 174 | - `Agent` → see [Stage 1 Failures](#stage-1-agent-failures) |
41 | 175 | - `Detection` → see [Stage 2 Failures](#stage-2-detection-failures) |
42 | 176 | - `Execution` → see [Stage 3 Failures](#stage-3-execution-failures) |
43 | 177 | - `Setup` / `Teardown` → see [Setup/Teardown Failures](#setupteardown-failures) |
44 | 178 |
|
45 | | -3. **Check for compilation drift** — before deep-diving into runtime errors, verify the pipeline YAML is in sync with its source markdown: |
| 179 | +3. **Check for compilation drift**: |
46 | 180 | ```bash |
47 | 181 | ado-aw check <pipeline.lock.yml> |
48 | 182 | ``` |
49 | 183 |
|
50 | | -4. **Apply the fix** — make the targeted change to the agent `.md` source file, then recompile: |
51 | | - ```bash |
52 | | - ado-aw compile <agent.md> |
53 | | - ``` |
54 | | - |
55 | | -5. **Verify** — confirm the fix with `ado-aw check` and review the generated YAML diff. |
| 184 | +4. Continue from [Step 4: Diagnose](#step-4-diagnose) above. |
56 | 185 |
|
57 | 186 | --- |
58 | 187 |
|
@@ -346,23 +475,82 @@ If downloads fail: |
346 | 475 |
|
347 | 476 | --- |
348 | 477 |
|
349 | | -## Diagnostic Commands |
| 478 | +## Diagnostic Report Template |
350 | 479 |
|
351 | | -```bash |
352 | | -# Verify pipeline YAML matches its source markdown |
353 | | -ado-aw check <pipeline.lock.yml> |
| 480 | +Use this template for all diagnostic reports. Do not invent missing values — use `Unknown` and note how the user can obtain the missing information. |
| 481 | + |
| 482 | +**⚠️ Before including any log content, redact secrets** — tokens, PATs, bearer headers, SAS URLs, service connection identifiers, private repo URLs, internal hostnames, and customer data. Summarize redacted sections instead of quoting them verbatim. |
| 483 | + |
| 484 | +```markdown |
| 485 | +## Diagnostic Summary |
| 486 | +
|
| 487 | +- **Pipeline**: <name> |
| 488 | +- **Definition ID**: <id or Unknown> |
| 489 | +- **Build ID**: <id> |
| 490 | +- **Run URL**: <url> |
| 491 | +- **Result**: Failed / Partially succeeded / Cancelled |
| 492 | +- **Failing stage/job/step**: <stage> → <job> → <step> |
| 493 | +- **First failed timeline record**: <record name and type> |
| 494 | +- **Suspected root cause**: <brief description> |
| 495 | +- **Confidence**: High / Medium / Low |
| 496 | +
|
| 497 | +## Evidence |
| 498 | +
|
| 499 | +### Relevant log excerpts |
| 500 | +
|
| 501 | +<Sanitized log excerpts from the failing step and surrounding context. |
| 502 | +Include error messages, stack traces, and relevant warnings. |
| 503 | +Redact any secrets or sensitive information.> |
354 | 504 |
|
355 | | -# Recompile a single agent |
356 | | -ado-aw compile <path/to/agent.md> |
| 505 | +### Timeline observations |
357 | 506 |
|
358 | | -# Recompile all detected agentic pipelines in the current directory |
359 | | -ado-aw compile |
| 507 | +- <What the timeline showed — which stages ran, which failed, which were skipped> |
| 508 | +- <Any warnings or unusual patterns before the failure> |
360 | 509 |
|
361 | | -# Update GITHUB_TOKEN pipeline variable on ADO build definitions |
362 | | -ado-aw configure |
| 510 | +### Changes since last successful build |
363 | 511 |
|
364 | | -# Dry-run configure to preview changes |
365 | | -ado-aw configure --dry-run |
| 512 | +- <Files changed, if identified via get_build_changes> |
| 513 | +- <Whether agent .md, .lock.yml, compiler version, or config changed> |
| 514 | +- <Or: "No previous successful build found" / "Unknown — MCP not available"> |
| 515 | +
|
| 516 | +## Environment |
| 517 | +
|
| 518 | +- **Agent source file**: <path or Unknown> |
| 519 | +- **Compiled pipeline YAML**: <path or Unknown> |
| 520 | +- **Compilation in sync**: Yes / No / Unknown (ado-aw check result) |
| 521 | +- **ado-aw version**: <version or Unknown> |
| 522 | +- **AWF version**: <version or Unknown> |
| 523 | +- **MCPG version**: <version or Unknown> |
| 524 | +- **Agent pool**: <pool name> |
| 525 | +- **OS/image**: <e.g., ubuntu-22.04> |
| 526 | +- **Engine/model**: <e.g., copilot / claude-opus-4.7> |
| 527 | +- **Relevant MCP servers**: <list or None> |
| 528 | +
|
| 529 | +## Analysis |
| 530 | +
|
| 531 | +- **Stage classification**: Stage 1 (Agent) / Stage 2 (Detection) / Stage 3 (Execution) / Setup / Teardown / Cross-stage |
| 532 | +- **Why this stage failed**: <detailed explanation> |
| 533 | +
|
| 534 | +## Root Cause |
| 535 | +
|
| 536 | +- **Root cause**: <clear description of what failed and why> |
| 537 | +- **Category**: Compiler bug / Runtime regression / User configuration / Infrastructure / Unknown |
| 538 | +- **Ruled-out causes**: <what you checked and eliminated> |
| 539 | +- **Related recent changes**: <commits, config changes, version updates> |
| 540 | +
|
| 541 | +## Issue |
| 542 | +
|
| 543 | +- **Title**: `debug: <concise summary>` |
| 544 | +- **Label**: bug / documentation / question / user-configuration |
| 545 | +``` |
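
The "use `Unknown` rather than invent values" rule can be enforced mechanically when the report is assembled. This sketch renders a few of the summary fields from the template above and substitutes `Unknown` for anything missing or empty; the function name and input shape are illustrative assumptions.

```python
from typing import Dict, Optional

SUMMARY_FIELDS = ["Pipeline", "Definition ID", "Build ID", "Run URL", "Result"]


def render_summary(values: Dict[str, Optional[str]]) -> str:
    """Render summary bullets, writing 'Unknown' for missing or empty fields."""
    lines = ["## Diagnostic Summary", ""]
    for field in SUMMARY_FIELDS:
        lines.append(f"- **{field}**: {values.get(field) or 'Unknown'}")
    return "\n".join(lines)


# Hypothetical values, for illustration only.
print(render_summary({"Pipeline": "triage-agent", "Build ID": "12345", "Result": "Failed"}))
```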
| 546 | + |
| 547 | +--- |
| 548 | + |
| 549 | +## Diagnostic Commands |
| 550 | + |
| 551 | +```bash |
| 552 | +# Verify pipeline YAML matches its source markdown |
| 553 | +ado-aw check <pipeline.lock.yml> |
366 | 554 | ``` |
367 | 555 |
|
368 | 556 | --- |
|