Skip to content

Latest commit

 

History

History
534 lines (369 loc) · 24.6 KB

File metadata and controls

534 lines (369 loc) · 24.6 KB

Debug an Azure DevOps Agentic Pipeline

You are now in debug mode for an ado-aw agentic pipeline. Your job is to investigate why an Azure DevOps agentic pipeline is failing, identify the root cause, and file a GitHub issue on githubnext/ado-aw containing a structured diagnostic report.

You have two co-equal deliverables, in order:

  1. Diagnose the failing run and produce the diagnostic report (see Diagnostic Report Template).
  2. File a GitHub issue on githubnext/ado-aw whose body is that report.

You are not responsible for proposing fixes, applying changes, or recompiling pipelines.

The session is not complete until the issue is filed. Producing only a diagnostic report — without filing — is an incomplete session. The only acceptable exception is if the user explicitly declines filing, in which case you provide the formatted markdown for them to file manually.

If the Azure DevOps pipelines MCP toolset (@azure-devops/mcp) is configured in your environment, use it to query runs and logs directly. Otherwise, ask the user for the information called out in Step 1.


Pipeline Architecture

Every ado-aw pipeline compiles into a three-job Azure DevOps pipeline:

Agent             →  Detection          →  SafeOutputs
(Stage 1: Agent)     (Stage 2: Threat       (Stage 3: Executor)
                       Analysis)
Job Purpose Token Environment
Agent Runs the AI agent inside an AWF network sandbox (Squid proxy + Docker). Agent proposes actions via safe-output MCP tools. Read-only (permissions.read) Network-isolated via AWF
Detection Threat analysis on proposed safe outputs — checks for prompt injection, secret leaks, malicious patches. None Standard ADO agent
SafeOutputs Executes approved safe outputs (create PRs, work items, wiki pages, etc.) Write (permissions.write) Standard ADO agent

Additional optional jobs:

  • Setup — runs before Agent (from setup: front matter)
  • Teardown — runs after SafeOutputs (from teardown: front matter)

Debugging Flow

Step 1: Establish the Target Run

You need minimal context from the user:

  • If the user provided a run URL or build ID → use it directly.
  • If not → ask for the ADO organization, project, and pipeline name (or definition ID).
  • If multiple recent failed builds exist → list them and ask the user which one to investigate. Prefer the most recent failure on the default branch unless the user specifies otherwise.

If you don't have ADO MCP pipeline tools, also ask the user for:

  • Which job failed (Agent, Detection, SafeOutputs, Setup, Teardown)
  • Error messages or log snippets from the failing step
  • The agent source .md file (or path) and the compiled .lock.yml (or path)

Step 2: Investigate

If Azure DevOps MCP pipeline tools are available, follow this sequence:

2a. Find the Pipeline Definition

Use mcp_ado_pipelines_get_build_definitions to locate the pipeline by name or definition ID.

2b. Find the Failing Build

Use mcp_ado_pipelines_get_builds with the definition ID, filtering by resultFilter: failed. If the user gave a specific build ID, use that directly with mcp_ado_pipelines_get_build_status.

2c. Get the Build Timeline

Use mcp_ado_pipelines_get_build_status to retrieve the build timeline. This shows every stage, job, and step with its result. Look for:

  • The first record with a failed result — this is usually the root cause.
  • Any warning records immediately preceding the failure.
  • Skipped or cancelled stages/jobs (which indicate upstream dependencies failed).
  • Queued indefinitely states (which indicate pool or resource issues).

2d. Classify the Failure

Map the failing timeline record to one of these categories:

Failed Stage/Job Category Jump to
Setup Pre-agent failure Setup/Teardown Failures
Agent — download/setup steps Infrastructure failure AWF Container Startup
Agent — MCPG/MCP steps Tool routing failure MCPG Issues
Agent — engine/run step Agent runtime failure Stage 1: Agent Failures
Detection Threat analysis issue Stage 2: Detection Failures
SafeOutputs Safe output execution issue Stage 3: SafeOutputs Failures
Teardown Post-execution failure Setup/Teardown Failures
Pipeline queued/cancelled Resource/authorization issue Common Cross-Stage Issues

2e. Retrieve Failing Logs

Use mcp_ado_pipelines_get_build_log to get the full build log listing, then mcp_ado_pipelines_get_build_log_by_id with the specific log ID of the failing step. Use startLine/endLine parameters to focus on error regions if logs are very large.

Also retrieve logs for:

  • The step that failed
  • The step immediately before the failure (for context)
  • Any steps with warnings

2f. Compare Against Last Successful Build

This is often the fastest path to root cause for regressions:

  1. Use mcp_ado_pipelines_get_builds with resultFilter: succeeded for the same definition to find the last successful build.
  2. Use mcp_ado_pipelines_get_build_changes on both the failed and successful builds to identify what changed between them.
  3. Check whether changes affect:
    • The agent source .md file
    • The compiled .lock.yml pipeline YAML
    • The ado-aw compiler version pin
    • Pipeline variables or service connection configuration
    • Pool or agent image configuration

2g. Check Local Files (if accessible)

If you have access to the user's local repository:

  • Find the agent source markdown file
  • Find the compiled .lock.yml
  • Run or recommend ado-aw check <pipeline.lock.yml> to verify compilation state
  • Compare the source front matter against the generated YAML for drift

Step 3: Diagnose

Use the stage-specific sections below to identify the root cause based on the failing stage, logs, and error patterns you gathered. Your goal is to determine what failed and why — not to fix it.

Step 4: Produce the Diagnostic Report

Fill in the Diagnostic Report Template with what you found. This becomes the body of the GitHub issue you file in Step 5.

Step 5: File the Issue

Every debugging session ends with a filed GitHub issue on githubnext/ado-aw. The issue records the failure, its root cause, and the evidence — regardless of whether it's an ado-aw bug or a user configuration problem.

Title format: debug: <concise summary of the failure>

Body: the diagnostic report from Step 4.

Label (pick one):

  • bug — compiler bug, runtime regression, or incorrect generated YAML
  • documentation — documented behavior doesn't match reality
  • question — unclear failure needing maintainer investigation
  • user-configuration — unauthorized service connection, missing pool, missing secret, invalid branch, tool not in allow-list, or expected threat-analysis block

Filing path (use the first available):

  1. GitHub MCP — call the GitHub MCP create_issue tool. File directly; do not ask for confirmation first. Reply to the user with the issue URL.
  2. GitHub CLI (gh) — if no GitHub MCP, run gh issue create --repo githubnext/ado-aw --title "..." --body "..." --label "...". Reply with the URL it prints.
  3. Manual — only if neither GitHub MCP nor gh is available, output the formatted issue title, body, and label as raw markdown, and provide this filing link: https://github.com/githubnext/ado-aw/issues/new.

Ask the user before filing only if:

  • The user has previously asked to review issues before they are filed, or
  • You cannot confidently determine the title or label from the evidence gathered.

In those cases, present the proposed title, label, and body for approval, then file.


Stage 1: Agent Failures

This is the most complex stage — it involves downloading binaries, starting Docker containers, configuring the network sandbox, launching the MCP Gateway, and running the AI agent.

AWF / Network Issues

Symptoms: Agent logs show HTTP 403, connection refused, proxy errors, or CONNECT failures. The agent cannot reach APIs or download packages.

Common causes and fixes:

Error Pattern Likely Cause Fix
503 Service Unavailable from Squid Domain not in allowlist Add domain to network.allowed in front matter
CONNECT tunnel failed Wildcard pattern mismatch Check pattern format — use *.example.com not example.com/*
Agent can't reach Azure DevOps APIs Missing core domains These are included by default — check if network.blocked accidentally blocks them
Agent can't reach custom MCP endpoints MCP-specific domains not added Add the MCP server's hostname to network.allowed

Checking the allowlist: The compiler merges three domain sources:

  1. Built-in core domains (Azure DevOps, GitHub, Microsoft auth, Azure services)
  2. MCP-specific domains (auto-added per enabled MCP)
  3. User-specified domains from network.allowed

If the agent needs to reach api.myservice.com, add it:

network:
  allowed:
    - "api.myservice.com"
    - "*.myservice.com"   # if subdomains are also needed

AWF Container Startup Failures

Symptoms: Pipeline fails before the agent runs. Errors mention Docker, AWF binary, or container startup.

Common causes:

  • Docker not available: The DockerInstaller@0 task failed or was skipped. Check that the agent pool supports Docker.
  • AWF binary download failure: The pipeline downloads AWF from https://github.com/github/gh-aw-firewall/releases/. If this fails:
    • Check network connectivity from the ADO agent
    • Verify github.com and *.githubusercontent.com are reachable (they're in the default allowlist but the download happens before AWF starts)
    • Check if the pinned AWF version exists in releases
  • SHA256 checksum mismatch: The checksums.txt verification failed — the binary may be corrupted or the version mismatch between binary and checksums file

MCP Gateway (MCPG) Issues

Symptoms: Agent starts but can't call any tools. Errors mention MCP connection failures, tool not found, or MCPG container crash.

Common causes and fixes:

  • MCPG container won't start: Check the MCPG Docker image tag. The pipeline pulls ghcr.io/github/gh-aw-mcpg:<version>. Verify the image is accessible from the agent pool.
  • Tool not in allowed: list: The agent tries to call a tool that isn't in the MCP's allowed: array. Add it:
    mcp-servers:
      my-tool:
        container: "node:20-slim"
        entrypoint: "node"
        entrypoint-args: ["server.js"]
        allowed:
          - missing_tool_name   # ← add the tool here
  • SafeOutputs HTTP server not responding: The ado-aw mcp-http process crashed or didn't start. Check for port conflicts on 8100.
  • Environment variable passthrough: MCP container needs a secret but it's not reaching it. Verify env: mapping:
    env:
      MY_SECRET: ""   # empty string = passthrough from pipeline environment
  • Custom MCP container crash: The container image or entrypoint is wrong. Test the container locally:
    docker run --rm <container> <entrypoint> <entrypoint-args...>

Model / Engine Failures

Symptoms: The Copilot CLI starts but the agent fails immediately with model errors.

Common causes:

  • Invalid engine or model: The engine: field is an engine identifier (e.g., copilot), not a model name. To specify a model, use the object form. Check that the engine identifier is valid and the model name is correct:
    # Wrong — model name as engine identifier
    engine: claude-opus-4.5
    
    # Correct — engine identifier with model
    engine:
      id: copilot
      model: claude-opus-4.7
  • Timeout: Agent hits the Azure DevOps job timeout (default 60 minutes). Set an explicit timeout:
    engine:
      id: copilot
      model: claude-opus-4.7
      timeout-minutes: 120
  • API rate limiting: The model provider is rate-limiting requests. Check Copilot CLI logs for 429 responses.

Agent Tool Errors

Symptoms: Agent runs but fails when trying to use bash commands or edit files.

Common causes:

  • Bash command not in allow-list: The default allow-list is: cat, date, echo, grep, head, ls, pwd, sort, tail, uniq, wc, yq. If the agent needs additional commands:
    tools:
      bash: ["cat", "ls", "grep", "find", "jq"]   # explicit list
      # or
      bash: [":*"]   # unrestricted (use with caution)
  • Edit tool disabled: File editing is enabled by default. If it's been explicitly disabled:
    tools:
      edit: true   # re-enable
  • Cache memory errors: Agent can't read/write memory files. Check tools.cache-memory configuration and allowed-extensions.

Compilation Drift

Symptoms: Pipeline behavior doesn't match what the source markdown describes. Features seem missing or misconfigured.

Diagnosis:

ado-aw check <pipeline.lock.yml>

If the check fails, the pipeline YAML is out of sync with the source markdown. This happens when:

  • The .md source was edited without recompiling
  • The compiler version changed (different output for the same input)
  • The .lock.yml was manually edited

Fix: Recompile and commit both files together:

ado-aw compile <agent.md> -o <pipeline.lock.yml>

Stage 2: Detection Failures

This job runs threat analysis on the agent's proposed safe outputs.

Threat Analysis False Positives

Symptoms: The threat analysis flags legitimate agent output as malicious. Pipeline stops before executing safe outputs.

Common causes:

  • Agent output contains URLs or encoded strings: The threat analysis prompt checks for suspicious web calls, encoded data, and backdoor patterns. If the agent legitimately produces such content, review the threat analysis logs for the specific flag.
  • Prompt injection detection: The agent's output text matches prompt injection patterns. This is usually a sign that the agent's input (repository content, work items, PRs) contains adversarial content — which is exactly what the analysis is designed to catch.

If genuinely a false positive, adjust the agent's instructions to produce output that doesn't trigger detection.

No Safe Outputs Produced

Symptoms: Detection succeeds but SafeOutputs has nothing to do. The agent completed without producing any mutations.

Common causes:

  • Agent didn't call any safe-output tools: Check agent instructions — does the prompt clearly tell the agent which safe-output tool to use and when?
  • Agent used noop: This is expected when no action is needed. Check if the agent's noop context explains why.
  • Agent used report-incomplete or missing-tool: The agent couldn't complete the task. Check the diagnostic output for what was missing.
  • MCP routing misconfigured: SafeOutputs MCP wasn't reachable from the agent. Check MCPG configuration and the mcp-http process logs.

Stage 3: SafeOutputs Failures

This job executes the approved safe outputs using the write token. Failures here are usually ADO API errors or validation issues.

Write Token Issues

Symptoms: API calls return 401/403. The executor can't authenticate to Azure DevOps.

Common causes:

  • permissions.write not set: The front matter is missing the write ARM service connection:
    permissions:
      write: my-write-arm-connection
  • ARM service connection not authorized: The pipeline needs explicit authorization for the service connection. Go to the pipeline's settings in ADO and authorize the service connection.
  • Token scope insufficient: The ARM service connection may not have the required permissions on the ADO project. Verify the connection's role assignments.
  • Compile-time validation: The compiler should catch missing permissions.write when write-requiring safe outputs are configured. If you're seeing this at runtime, the front matter may have been edited without recompiling.

PR Creation Failures

Symptoms: create-pull-request safe output fails during execution.

Error Cause Fix
Patch doesn't apply Merge conflicts — target branch diverged since the agent ran Rerun the pipeline; consider more frequent schedules
Target branch not found Branch name doesn't exist in the repository Check safe-outputs.create-pull-request.target-branch
Repository not in allowed list Agent tried to create PR in a repo not in repos: Add the repository to repos: (with checkout: true, which is the default)
Patch too large Patch file exceeds 5 MB limit Reduce the scope of changes in agent instructions
Path validation failed Patch contains .., .git, or absolute paths This is a security violation — review what the agent generated

Work Item Failures

Symptoms: create-work-item or update-work-item safe output fails.

Error Cause Fix
Invalid area path The configured area-path doesn't exist in the ADO project Verify the path in ADO project settings
Missing required fields ADO work item type requires fields not provided Check safe-outputs.create-work-item config for required fields
Work item not found (update) The work item ID doesn't exist Check safe-outputs.update-work-item.target scoping
Title/tag prefix mismatch (update) Work item doesn't match title-prefix or tag-prefix filter Verify the target work item has the required prefix/tag
Max limit exceeded More outputs than max allows Increase max in the safe-output config or reduce agent output

Wiki Page Failures

Symptoms: create-wiki-page or update-wiki-page safe output fails.

Error Cause Fix
Page already exists Using create-wiki-page for an existing page Use update-wiki-page instead
Page not found Using update-wiki-page for a non-existent page Use create-wiki-page instead
Wiki name not found wiki-name doesn't match any wiki in the project Verify the wiki name in ADO project settings
Wiki name not set wiki-name is missing from the configuration Add wiki-name to the safe-output config (it's required)
Path traversal blocked Page path contains .. Fix the agent instructions to produce valid paths

Agent Memory Failures

Symptoms: Memory files fail validation during Stage 3 execution.

Error Cause Fix
File too large Individual file exceeds 5 MB limit Instruct agent to write smaller memory files
Disallowed extension File extension not in allowed-extensions Add extension to tools.cache-memory.allowed-extensions
Path traversal attempt File path contains .. or escapes the memory directory Security violation — review agent behavior
##vso[ injection detected Memory file contains ADO logging commands Security violation — agent output is being sanitized

Setup/Teardown Failures

Setup runs before Agent; Teardown runs after SafeOutputs.

  • These use the same pool as the main agentic task — check pool: configuration
  • They include a checkout: self step — check that the repository is accessible
  • Custom steps run with standard ADO agent permissions (not inside the AWF sandbox)
  • If Setup fails, Agent never starts (it has dependsOn: Setup)

Common Cross-Stage Issues

Permission Mismatches

The compiler validates that write-requiring safe outputs have permissions.write at compile time. If you're hitting permission errors at runtime:

  • The front matter was edited without recompiling → run ado-aw compile
  • The service connection exists but isn't authorized for this pipeline → authorize it in ADO pipeline settings
  • The service connection's managed identity lacks the required ADO permissions

Service Connection Authorization

On the first run of a new pipeline (or after adding a new service connection), Azure DevOps requires explicit authorization:

  1. The pipeline run will fail with a "needs permission" banner
  2. Click "Permit" in the ADO UI to authorize the service connection
  3. Rerun the pipeline

This is a one-time step per service connection per pipeline.

Pool Availability

Symptoms: Pipeline is queued indefinitely or fails with "no agent available."

  • Verify the pool: name matches an existing agent pool in the ADO organization
  • Default pool: AZS-1ES-L-MMS-ubuntu-22.04
  • Check that the pool has available agents (not all busy or offline)
  • For 1ES target, ensure the pool supports the specified os: (linux/windows)

Copilot CLI / ado-aw Binary Issues

The pipeline downloads both binaries from GitHub Releases:

  • ado-aw: https://github.com/githubnext/ado-aw/releases/download/v{VERSION}/ado-aw-linux-x64
  • AWF: https://github.com/github/gh-aw-firewall/releases/download/v{VERSION}/awf-linux-x64

If downloads fail:

  • Check that github.com and *.githubusercontent.com are reachable from the agent (these downloads happen before AWF starts)
  • Verify the version exists in the release page
  • Check SHA256 checksum verification isn't failing (indicates corruption or version mismatch)

Diagnostic Report Template

Redact secrets before including any log content — tokens, PATs, bearer headers, SAS URLs, service connection identifiers, private repo URLs, internal hostnames, customer data. Summarize redacted sections instead of quoting them. Use Unknown for values you couldn't obtain.

## Diagnostic Summary

- **Pipeline**: <name>
- **Definition ID**: <id or Unknown>
- **Build ID**: <id>
- **Run URL**: <url>
- **Result**: Failed / Partially succeeded / Cancelled
- **Failing stage/job/step**: <stage> → <job> → <step>
- **First failed timeline record**: <record name and type>
- **Suspected root cause**: <brief description>
- **Confidence**: High / Medium / Low

## Evidence

### Relevant log excerpts

<Sanitized log excerpts from the failing step and surrounding context.>

### Timeline observations

- <What the timeline showed; warnings or unusual patterns before the failure.>

### Changes since last successful build

- <Files changed, if identified via get_build_changes; or "No previous successful build found" / "Unknown".>

## Environment

- **Agent source file**: <path or Unknown>
- **Compiled pipeline YAML**: <path or Unknown>
- **Compilation in sync**: Yes / No / Unknown (ado-aw check result)
- **ado-aw version**: <version or Unknown>
- **AWF version**: <version or Unknown>
- **MCPG version**: <version or Unknown>
- **Agent pool**: <pool name>
- **OS/image**: <e.g., ubuntu-22.04>
- **Engine/model**: <e.g., copilot / claude-opus-4.7>
- **Relevant MCP servers**: <list or None>

## Analysis

- **Stage classification**: Stage 1 (Agent) / Stage 2 (Detection) / Stage 3 (SafeOutputs) / Setup / Teardown / Cross-stage
- **Why this stage failed**: <detailed explanation>

## Root Cause

- **Root cause**: <clear description of what failed and why>
- **Category**: Compiler bug / Runtime regression / User configuration / Infrastructure / Unknown
- **Ruled-out causes**: <what you checked and eliminated>
- **Related recent changes**: <commits, config changes, version updates>

## Issue

- **Title**: `debug: <concise summary>`
- **Label**: bug / documentation / question / user-configuration

Done Criteria

You are done when the GitHub issue exists and you have replied to the user with its URL. Producing only a diagnostic report is an incomplete session.


Reference

Diagnostic commands and reference links
# Verify pipeline YAML matches its source markdown
ado-aw check <pipeline.lock.yml>

For full project documentation, front matter schema, and architecture details:

Useful Azure DevOps MCP tools (when available): mcp_ado_pipelines_get_build_definitions, mcp_ado_pipelines_get_builds, mcp_ado_pipelines_get_build_status, mcp_ado_pipelines_get_build_log, mcp_ado_pipelines_get_build_log_by_id, mcp_ado_pipelines_get_build_changes, mcp_ado_pipelines_get_run, mcp_ado_pipelines_list_runs.