AutoGen Multi-Agent API Planner

Overview

The plan_api_call tool uses two AI agents working together to generate accurate, executable API calls from documentation. Instead of a single LLM trying to extract API information, we use a multi-agent system where each agent has a specific role:

  • 🤖 Planner Agent: Analyzes documentation and generates API call plans
  • 🧐 Critic Agent: Reviews plans for accuracy, safety, and completeness

They collaborate in an iterative loop (max 3 iterations) to refine the plan until it meets quality standards.

Why Two Agents?

Single LLM Approach (naive):

User Query → LLM → API Plan ❌ May hallucinate or miss details

Multi-Agent Approach (our solution):

User Query → 🔍 Search → 🤖 Planner → 🧐 Critic → ✓/✗ Decision
                ↑                                      │
                └──────────── (if rejected) ───────────┘

Benefits:

  • Separation of Concerns: One agent generates, another validates
  • Reduced Hallucinations: Critic catches unsupported claims
  • Iterative Refinement: Failed attempts trigger better searches
  • Safety: Write operations require concrete examples

Architecture

System Components

┌──────────────────────────────────────────────────────────────────┐
│  MCP Tool: plan_api_call(goal, profile)                          │
│                                                                  │
│  Input:                                                          │
│  - goal: Natural language query ("get job status")               │
│  - profile: Documentation set ("informatica-cloud")              │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│  📦 Embedding Layer                                              │
│                                                                  │
│  • Converts query to vector (384-dim)                            │
│  • Backend: Cloudflare Workers AI (@cf/baai/bge-small-en-v1.5)   │
│  • Cached for performance                                        │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│  🗃️ Vector Database (Qdrant)                                     │
│                                                                  │
│  • Semantic search over ingested documentation                   │
│  • Returns top-K chunks (default: 8) with metadata               │
│  • Each chunk includes:                                          │
│    - text: Documentation snippet                                 │
│    - score: Similarity score (0.0-1.0)                           │
│    - hints: method_hint, url_candidates, query_candidates        │
│    - metadata: doc_path, chunk_idx                               │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│  🎯 AutoGen Multi-Agent Planner                                  │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  LOOP (max 3 iterations)                                   │  │
│  │                                                            │  │
│  │  1️⃣ PLANNER AGENT (LLM Call)                               │  │
│  │     • Input: query + retrieved documentation               │  │
│  │     • System message: Profile-specific templates           │  │
│  │     • Output: JSON with endpoint, method, params           │  │
│  │     • Repair pass if JSON invalid                          │  │
│  │                                                            │  │
│  │  2️⃣ CRITIC AGENT (LLM Call)                                │  │
│  │     • Input: Planner's plan + same documentation           │  │
│  │     • Validates:                                           │  │
│  │       ✓ Endpoint matches allowed patterns                  │  │
│  │       ✓ Method appropriate (GET/POST)                      │  │
│  │       ✓ Has concrete example (for writes)                  │  │
│  │       ✓ All required params present                        │  │
│  │     • Output: verdict (pass/fail), missing[], next_search[]│  │
│  │                                                            │  │
│  │  3️⃣ DECISION                                               │  │
│  │     • If pass → Return plan                                │  │
│  │     • If fail → New search with critic's queries           │  │
│  │     • If max loops → Return needs_input                    │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│  📤 Response                                                     │
│                                                                  │
│  Success:                                                        │
│  {                                                               │
│    "status": "ok",                                               │
│    "plan": {endpoint, method, params, body},                     │
│    "confidence": 0.90,                                           │
│    "notes": "..." (if any concerns)                              │
│  }                                                               │
│                                                                  │
│  Needs Input:                                                    │
│  {                                                               │
│    "status": "needs_input",                                      │
│    "reason": "...",                                              │
│    "missing": [...],                                             │
│    "suggested_queries": [...]                                    │
│  }                                                               │
└──────────────────────────────────────────────────────────────────┘

How It Works: Step-by-Step

1. Initial Retrieval

When you call plan_api_call(goal="get job status", profile="informatica-cloud"):

a) Query Embedding

  • The query is converted to a 384-dimensional vector
  • Uses Cloudflare Workers AI model: @cf/baai/bge-small-en-v1.5
  • Same model used for both ingestion and search (critical for accuracy)

b) Semantic Search

  • Vector search in Qdrant finds similar documentation chunks
  • Returns top 8 chunks ranked by cosine similarity
  • Each chunk enriched with API hints extracted during ingestion:
    • method_hint: GET, POST, PUT, DELETE
    • url_candidates: Extracted endpoints from text
    • query_candidates: Related search terms
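
To make the ranking step concrete, here is a minimal sketch of cosine-similarity top-K selection in plain Python. It is an illustration only: Qdrant performs this search internally with its own index, and the `text` and `vector` keys on each chunk are hypothetical names chosen for this example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunks, k=8):
    """Return the k chunks most similar to query_vec, best first.

    `chunks` is a list of dicts with the hypothetical keys "text" and "vector".
    """
    scored = [
        {"text": c["text"], "score": cosine_similarity(query_vec, c["vector"])}
        for c in chunks
    ]
    scored.sort(key=lambda c: c["score"], reverse=True)
    return scored[:k]
```

With `k=8` this matches the default described above; a brute-force scan like this only makes sense for small collections.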

Example Retrieved Chunks:

[
  {
    "text": "Getting the fetchState job status Use a GET request...",
    "score": 0.75,
    "doc_path": "/app/data/iics/IICS_September2022_REST-API_Reference_en.pdf",
    "chunk_idx": 639,
    "method_hint": "GET",
    "url_candidates": ["/public/core/v3/fetchState/"]
  },
  {
    "text": "{ \"status\": \"RUNNING\", \"startTime\": \"2022-04-04T20:23:57.000\", ...}",
    "score": 0.74,
    "chunk_idx": 1131,
    "method_hint": "GET",
    "url_candidates": null
  }
]

2. Agent Initialization

Profile Configuration Loaded:

# From config/learning.yaml
informatica-cloud:
  autogen_hints:
    endpoint:
      allow: ".*"  # Accept any endpoint
      forbid: []   # No forbidden patterns
    
    templates:
      read:
        method: GET
        params: {}
      write:
        method: POST
        require_example_in_evidence: false

Two Agents Created:

# Planner Agent
planner = AssistantAgent(
    name="planner",
    system_message="""
    You convert retriever evidence into an API CALL PLAN.
    Output STRICT JSON ONLY with keys: endpoint, method, params, provenance.
    
    Rules:
    - Endpoint must match allowed pattern: .*
    - Never output forbidden endpoints: (none)
    - Reads: follow GET params={}
    - Writes: follow POST params={} 
    """
)

# Critic Agent
critic = AssistantAgent(
    name="critic",
    system_message="""
    You are a strict API PLAN critic.
    Input: candidate plan + evidence
    Output: STRICT JSON with verdict, confidence, missing[], next_search[]
    
    Rules:
    - Fail if endpoint doesn't match allowed pattern
    - If method modifies state, require concrete example in evidence
    - Propose precise follow-up search queries when failing
    - Prefer short, targeted queries (≤3 tokens)
    """
)

3. Loop Iteration (Max 3 Loops)

Loop 1: Initial Attempt

3a) Planner Generates Plan

LLM Call #1 (Planner):

System: [Profile-specific rules from above]
User: Given the EVIDENCE below and the USER QUERY, produce STRICT JSON.
      Required keys: endpoint, method, params, provenance.
      
      USER QUERY: get job status
      
      EVIDENCE (top-3):
      [
        {"score": 0.75, "snippet": "Getting the fetchState job status...", ...},
        {"score": 0.74, "snippet": "job 47", ...},
        {"score": 0.73, "snippet": '{"status": "RUNNING", ...}', ...}
      ]

Planner Response:

{
  "endpoint": "/public/core/v3/fetchState/<job id>",
  "method": "GET",
  "params": {"job id": "<job id>"},
  "provenance": {
    "top_hit": {
      "snippet": "Getting the fetchState job status...",
      "score": 0.75,
      "doc_path": "/app/data/iics/...",
      "chunk_idx": 639
    }
  }
}

JSON Validation:

  • If valid → proceed to critic
  • If invalid → Repair Pass (LLM Call #1b):
    System: [same as above]
    User: The previous output was not valid JSON or missing required keys.
          Repair it to STRICT, VALID, MINIFIED JSON ONLY.
          Here is your previous output: [bad json]
    

3b) Critic Reviews Plan

LLM Call #2 (Critic):

System: [Critic rules from above]
User: Plan to evaluate (JSON):
      {
        "endpoint": "/public/core/v3/fetchState/<job id>",
        "method": "GET",
        "params": {"job id": "<job id>"},
        "provenance": {...}
      }
      
      Top evidence (minified):
      [
        {"score": 0.75, "snippet": "Getting the fetchState...", ...},
        {"score": 0.74, ...},
        {"score": 0.73, ...}
      ]

Critic Response (Scenario A: Pass):

{
  "verdict": "pass",
  "confidence": 0.90,
  "missing": [],
  "next_search": [],
  "risk_flags": []
}

→ Plan accepted, return to user ✅

Critic Response (Scenario B: Fail):

{
  "verdict": "fail",
  "confidence": 0.65,
  "missing": ["concrete example in evidence"],
  "next_search": [
    "fetchState response format",
    "job status example"
  ],
  "risk_flags": [
    {"risk": "no concrete example in evidence", "confidence": 0.9}
  ]
}

→ Continue to Loop 2 with new search queries


3c) Acceptance Logic

The system checks profile-specific rules:

# For READ operations
if endpoint_matches_pattern and is_read_action and has_feature_name:
    return ACCEPTED

# For WRITE operations (stricter)
if endpoint_matches_pattern and is_write_action:
    if require_example_in_evidence:
        if "example" not in critic.missing:
            return ACCEPTED
        else:
            return CONTINUE_LOOP  # Need concrete example
    else:
        return ACCEPTED
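
The rules above can be sketched as a runnable function. This is a hedged illustration, not the project's implementation: `is_write` and `decide` are hypothetical names, the write heuristic (non-GET method or an action starting with "set") is an assumption, and the `has_feature_name` check is left out because the text does not define it. Note that `missing` holds phrases such as "concrete example in evidence", so the sketch looks for the word inside each entry instead of testing list membership.

```python
import re

ACCEPTED = "accepted"
CONTINUE_LOOP = "continue_loop"


def is_write(plan: dict) -> bool:
    """Assumed heuristic: non-GET methods or a set* action modify state."""
    method = str(plan.get("method", "GET")).upper()
    action = str(plan.get("params", {}).get("action", ""))
    return method != "GET" or action.lower().startswith("set")


def decide(plan: dict, critic: dict, hints: dict) -> str:
    """Apply the profile's acceptance rules to one planner/critic round."""
    endpoint = plan.get("endpoint", "")
    if re.search(hints["endpoint"]["allow"], endpoint) is None:
        return CONTINUE_LOOP  # endpoint does not match the allowed pattern

    if not is_write(plan):
        return ACCEPTED  # reads only need a matching endpoint here

    write_rules = hints["templates"]["write"]
    needs_example = write_rules.get("require_example_in_evidence", False)
    # Substring match: entries look like "concrete example in evidence".
    example_missing = any(
        "example" in item.lower() for item in critic.get("missing", [])
    )
    if needs_example and example_missing:
        return CONTINUE_LOOP  # writes wait for a concrete example
    return ACCEPTED
```

Returning `CONTINUE_LOOP` for a non-matching endpoint is a design choice of this sketch; it gives the critic's follow-up queries a chance to surface a valid endpoint.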

Loop 2: Refinement (if needed)

New Search:

  • Uses critic's next_search queries: ["fetchState response format", "job status example"]
  • Retrieves 8 new chunks (may overlap with Loop 1)
  • The new chunks are more likely to include concrete examples

Planner (LLM Call #3):

  • Same prompt template
  • Now has better evidence with examples
  • Generates more accurate plan

Critic (LLM Call #4):

  • Reviews refined plan
  • Has concrete examples in evidence
  • More likely to pass

Example Refinement:

Loop 1 Plan:  {"params": {"Audio.AudioEnable": "true"}}
              Critic: "Missing channel syntax example"

Loop 2 Search: "Audio.AudioEnable example"
Loop 2 Plan:  {"params": {"Audio[0].AudioEnable": "true"}}
              Critic: "Pass - has concrete example"

Loop 3: Final Attempt

If Loop 2 still fails:

  • Uses critic's latest next_search queries
  • Retrieves fresh evidence
  • Planner/Critic one more time
  • If still fails β†’ Return needs_input with diagnostics

4. Final Response

Success Response:

{
  "status": "ok",
  "plan": {
    "endpoint": "/public/core/v3/fetchState/<job id>",
    "method": "GET",
    "params": {"job id": "<job id>"},
    "body": null
  },
  "confidence": 0.90,
  "notes": null
}

Needs Input Response:

{
  "status": "needs_input",
  "reason": "Insufficient information in documentation after 3 loops",
  "missing": ["concrete example", "required parameters"],
  "suggested_queries": [
    "fetchState response format",
    "job status example",
    "fetchState job ID syntax"
  ]
}
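
One way to assemble the needs_input response is to fold the critic verdicts from every loop into a single dictionary. The sketch below is an assumption about how that aggregation could work: `build_needs_input` is a hypothetical helper, and the order-preserving de-duplication is a choice made for this example, not documented behavior.

```python
def build_needs_input(verdicts: list, max_loops: int = 3) -> dict:
    """Merge critic verdicts from all loops into one needs_input response."""
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    missing = list(
        dict.fromkeys(m for v in verdicts for m in v.get("missing", []))
    )
    queries = list(
        dict.fromkeys(q for v in verdicts for q in v.get("next_search", []))
    )
    return {
        "status": "needs_input",
        "reason": f"Insufficient information in documentation after {max_loops} loops",
        "missing": missing,
        "suggested_queries": queries,
    }
```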


LLM Call Optimization

Call Counting

Typical scenarios:

| Scenario | LLM Calls | Breakdown |
|---|---|---|
| Single loop success | 2 | Planner (1) + Critic (1) |
| Single loop with repair | 3 | Planner (1) + Repair (1) + Critic (1) |
| Two loops | 4-6 | Loop 1 (2-3) + Loop 2 (2-3) |
| Three loops | 6-9 | Loop 1 (2-3) + Loop 2 (2-3) + Loop 3 (2-3) |
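
The call counts follow directly from two calls per loop (planner and critic) plus at most one repair call per loop. A small sketch of that arithmetic, with `llm_call_range` as a hypothetical helper name:

```python
def llm_call_range(loops: int) -> tuple:
    """Return (min, max) LLM calls for a run that uses `loops` iterations."""
    if loops < 1:
        raise ValueError("loops must be at least 1")
    calls_without_repair = 2  # planner + critic
    calls_with_repair = 3     # planner + repair pass + critic
    return loops * calls_without_repair, loops * calls_with_repair
```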

Caching Strategy

  • Vector Embeddings: Cached per query string
  • LLM Responses: NOT cached (dynamic based on retrieved docs)
  • Profile Config: Cached in memory, reloaded on server restart
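
Caching embeddings per query string can be sketched with the standard library. This is an assumption about the mechanism, not a description of the actual cache: `embed_remote` is a stand-in that fabricates a 384-value vector instead of calling Cloudflare Workers AI, and `embed_cached` is a hypothetical wrapper.

```python
from functools import lru_cache


def embed_remote(text: str) -> tuple:
    """Stand-in for the real embedding request; returns a fake 384-dim vector."""
    return tuple(float(len(text)) for _ in range(384))


@lru_cache(maxsize=1024)
def embed_cached(text: str) -> tuple:
    """Cache embeddings by exact query string.

    Tuples are returned so cached values cannot be mutated by callers.
    """
    return embed_remote(text)
```

Because the key is the exact string, "get job status" and "Get job status" are cached separately; normalizing the query first is a possible refinement.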

Gateway Analytics

All LLM calls go through Cloudflare AI Gateway:

  • Request/response logging
  • Token usage tracking
  • Rate limiting protection
  • Cost analytics per endpoint

Profile-Specific Behavior

Profiles in config/learning.yaml control agent behavior:

Example: Dahua Camera (Strict)

dahua-camera:
  autogen_hints:
    labels: ["HTTP CGI API", "camera configuration"]
    
    endpoint:
      allow: "^/cgi-bin/(configManager|magicBox)\\.cgi"
      forbid: ["admin", "reboot", "factory"]
    
    templates:
      read:
        method: GET
        params:
          action: getConfig
          name: "<Feature>"
      
      write:
        method: GET
        params:
          action: setConfig
          "<Feature>": "<value>"
        require_example_in_evidence: true  # Critical for safety
    
    endpoint_examples:
      - "/cgi-bin/configManager.cgi?action=getConfig&name=All"
      - "/cgi-bin/configManager.cgi?action=setConfig&Audio[0].AudioEnable=true"

Behavior:

  • Endpoint Validation: Only allows /cgi-bin/configManager.cgi or /cgi-bin/magicBox.cgi
  • Forbidden Patterns: Rejects any endpoint with "admin", "reboot", "factory"
  • Write Safety: Requires concrete example in documentation before accepting write operations
  • Parameter Templates: Critic knows to expect action=setConfig for writes
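
The endpoint checks described above can be sketched with Python's `re` module. The helper name `endpoint_allowed` is hypothetical, and two details are assumptions: forbidden entries are treated as plain substrings, and the substring match ignores case.

```python
import re

ALLOW = r"^/cgi-bin/(configManager|magicBox)\.cgi"
FORBID = ["admin", "reboot", "factory"]


def endpoint_allowed(endpoint: str, allow: str = ALLOW, forbid=FORBID) -> bool:
    """True when the endpoint matches `allow` and contains no forbidden word."""
    if re.search(allow, endpoint) is None:
        return False
    lowered = endpoint.lower()
    # Assumption: forbidden entries are substrings, matched case-insensitively.
    return not any(word.lower() in lowered for word in forbid)
```

Passing `allow=".*"` and `forbid=[]` reproduces the permissive behavior of a profile with no endpoint restrictions.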

Example: Informatica Cloud (Permissive)

informatica-cloud:
  autogen_hints:
    labels: ["REST API", "cloud integration"]
    
    endpoint:
      allow: ".*"  # Accept any endpoint
      forbid: []   # No restrictions
    
    templates:
      read:
        method: GET
        params: {}
      write:
        method: POST
        require_example_in_evidence: false  # Less strict

Behavior:

  • No Endpoint Restrictions: Accepts any endpoint pattern
  • Less Strict: Doesn't require examples for writes (trusts documentation)
  • Flexible: Suitable for well-documented REST APIs

Advanced Features

1. Two-Pass JSON Validation

If planner produces invalid JSON:

Attempt 1: Generate plan
Result: {"endpoint": "/api/jobs" "method": "GET"}  ❌ Missing comma

Repair Pass: Fix the JSON
Prompt: "The previous output was not valid JSON. Repair it to STRICT, VALID, 
         MINIFIED JSON ONLY. No code fences. No comments."
Result: {"endpoint": "/api/jobs", "method": "GET"}  ✅ Valid

This handles common LLM issues:

  • Missing/extra commas
  • Code fence wrappers (```json)
  • Comments in JSON
  • Extra explanatory text
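
The first pass can be sketched as a tolerant parser that decides whether a repair call is needed. This is a minimal illustration under stated assumptions: `parse_plan` is a hypothetical name, and the required keys are taken from the planner prompt shown earlier (endpoint, method, params, provenance).

```python
import json

REQUIRED_KEYS = {"endpoint", "method", "params", "provenance"}


def parse_plan(raw: str):
    """Return the plan dict, or None when a repair pass is needed."""
    # Keep only the outermost {...}; this discards code-fence wrappers
    # and explanatory text before or after the JSON object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        plan = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return None  # e.g. missing commas or comments inside the JSON
    if not isinstance(plan, dict) or not REQUIRED_KEYS.issubset(plan):
        return None  # valid JSON, but required keys are absent
    return plan
```

A `None` result would trigger the repair prompt; wrappers and surrounding prose are handled locally, so only structurally broken JSON costs an extra LLM call.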

2. Provenance Tracking

Every plan includes provenance showing which documentation chunk was most influential:

{
  "plan": {...},
  "provenance": {
    "top_hit": {
      "snippet": "Getting the fetchState job status Use a GET request...",
      "score": 0.75006306,
      "doc_path": "/app/data/iics/IICS_September2022_REST-API_Reference_en.pdf",
      "chunk_idx": 639,
      "url_candidates": ["/public/core/v3/fetchState/"],
      "method_hint": "GET"
    }
  }
}

Use Cases:

  • Debugging: Why did it choose this endpoint?
  • Documentation Gaps: Which docs need improvement?
  • Confidence: High score = strong evidence

3. Risk Flags

Critic can flag potential risks even when passing:

{
  "verdict": "pass",
  "confidence": 0.80,
  "risk_flags": [
    {
      "risk": "parameter name uses array syntax but no index validation in docs",
      "confidence": 0.7
    }
  ]
}

4. Iterative Query Refinement

Critic suggests specific follow-up queries, not generic ones:

Loop 1 Query: "enable audio"
Critic: next_search = ["Audio.AudioEnable example"]  ✅ Specific

Not: next_search = ["audio settings", "audio config"]  ❌ Too vague

Query characteristics:

  • ≤3 tokens when possible
  • Include exact parameter names from evidence
  • Add context words: "example", "syntax", "format"


Complete Examples

Example 1: Single Loop Success

Scenario: Simple read operation with good documentation

Input:

plan_api_call(
    goal="get job status",
    profile="informatica-cloud"
)

Internal Flow:

Step 1: Retrieval

Query embedding: [0.123, -0.456, 0.789, ...]  (384 dims)
Qdrant search: 8 chunks retrieved
Top chunk: "Getting the fetchState job status..." (score: 0.75)

Step 2: Loop 1

Planner LLM Call:

{
  "endpoint": "/public/core/v3/fetchState/<job id>",
  "method": "GET",
  "params": {"job id": "<job id>"}
}

Critic LLM Call:

{
  "verdict": "pass",
  "confidence": 0.90,
  "missing": [],
  "next_search": []
}

Step 3: Response

{
  "status": "ok",
  "plan": {
    "endpoint": "/public/core/v3/fetchState/<job id>",
    "method": "GET",
    "params": {"job id": "<job id>"},
    "body": null
  },
  "confidence": 0.90,
  "notes": null
}

Execution Stats:

  • Loops: 1/3
  • LLM calls: 2
  • Time: ~3.5s
  • Tokens: ~850 total

Example 2: Multi-Loop Refinement

Scenario: Write operation requiring concrete example

Input:

plan_api_call(
    goal="enable audio recording",
    profile="dahua-camera"
)

Internal Flow:

Loop 1:

Retrieval: Query "enable audio recording"

Top chunks:
- "Audio.AudioEnable setting controls..." (0.78)
- "To enable audio, set Audio.AudioEnable to true" (0.76)
- No concrete URL examples in top 8

Planner:

{
  "endpoint": "/cgi-bin/configManager.cgi",
  "method": "GET",
  "params": {
    "action": "setConfig",
    "Audio.AudioEnable": "true"
  }
}

Critic:

{
  "verdict": "fail",
  "confidence": 0.65,
  "missing": ["concrete example in evidence"],
  "next_search": [
    "Audio.AudioEnable example",
    "setConfig audio syntax"
  ],
  "risk_flags": [
    {"risk": "no array index in parameter name", "confidence": 0.8}
  ]
}

Loop 2:

Retrieval: Query "Audio.AudioEnable example"

Top chunks:
- "Example: ...?action=setConfig&Audio[0].AudioEnable=true" (0.85)
- "Audio[0].AudioEnable controls first channel..." (0.82)
- Full cURL example with proper syntax (0.80)

Planner:

{
  "endpoint": "/cgi-bin/configManager.cgi",
  "method": "GET",
  "params": {
    "action": "setConfig",
    "Audio[0].AudioEnable": "true"
  }
}

Critic:

{
  "verdict": "pass",
  "confidence": 0.95,
  "missing": [],
  "next_search": [],
  "risk_flags": []
}

Response:

{
  "status": "ok",
  "plan": {
    "endpoint": "/cgi-bin/configManager.cgi",
    "method": "GET",
    "params": {
      "action": "setConfig",
      "Audio[0].AudioEnable": "true"
    },
    "body": null
  },
  "confidence": 0.95,
  "notes": null
}

Execution Stats:

  • Loops: 2/3
  • LLM calls: 4
  • Time: ~6.8s
  • Tokens: ~1650 total

Key Improvement:

  • Loop 1: Audio.AudioEnable (missing array index)
  • Loop 2: Audio[0].AudioEnable (correct syntax from example)

Example 3: Insufficient Documentation

Scenario: Query not covered in documentation

Input:

plan_api_call(
    goal="configure MQTT broker settings",
    profile="dahua-camera"
)

Internal Flow:

Loop 1:

Retrieval: Low similarity scores (<0.5), generic network chunks
Planner: Generates guess-based plan
Critic: "fail - no evidence for MQTT", next_search=["MQTT broker config"]

Loop 2:

Retrieval: Still no MQTT-specific docs
Planner: Similar plan, still guessing
Critic: "fail - insufficient evidence", next_search=["MQTT parameters"]

Loop 3:

Retrieval: No improvement
Planner: Makes final attempt
Critic: "fail - MQTT not documented", next_search=["Network.MQTT"]

Response:

{
  "status": "needs_input",
  "reason": "Insufficient information in documentation after 3 loops",
  "missing": [
    "MQTT broker endpoint",
    "concrete example",
    "required parameters"
  ],
  "suggested_queries": [
    "Network.MQTT",
    "MQTT broker config",
    "MQTT parameters"
  ]
}

Execution Stats:

  • Loops: 3/3 (exhausted)
  • LLM calls: 6
  • Time: ~10.2s
  • Outcome: Needs manual input

Logging and Observability

While the response is minimal (see Examples above), detailed logs are written to help debug and understand the process:

Log Output Example

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🔄 LOOP 1/3
   Query: enable audio recording
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

🔍 SEARCH: Retrieving relevant documents...
  ✅ Found 8 chunks for query: 'enable audio recording'

🤖 PLANNER: Generating API plan from evidence...
  📄 Planner response (612 chars)
📝 PLANNER Output:
   Endpoint: /cgi-bin/configManager.cgi
   Method: GET
   Params: ['action', 'Audio.AudioEnable']

🧐 CRITIC: Reviewing plan...
  📋 Proposed: GET /cgi-bin/configManager.cgi
  📊 Critic verdict: fail (confidence: 0.65)
  ⚠️  Missing: concrete example in evidence
  💡 Suggested next searches: Audio.AudioEnable example, setConfig audio syntax

⏭️  Plan not accepted yet - continuing to loop 2/3
   Reason: Missing concrete example in evidence

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🔄 LOOP 2/3
   Query: Audio.AudioEnable example
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

🔍 SEARCH: Retrieving relevant documents...
  ✅ Found 8 chunks for query: 'Audio.AudioEnable example'

🤖 PLANNER: Generating API plan from evidence...
  📄 Planner response (687 chars)
📝 PLANNER Output:
   Endpoint: /cgi-bin/configManager.cgi
   Method: GET
   Params: ['action', 'Audio[0].AudioEnable']

🧐 CRITIC: Reviewing plan...
  📋 Proposed: GET /cgi-bin/configManager.cgi
  📊 Critic verdict: pass (confidence: 0.95)

✅ ACCEPTED: Write operation (endpoint matches, sufficient evidence)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📊 EXECUTION SUMMARY
   Loops used: 2/3
   LLM calls: 4
   Final confidence: 0.95
   Status: ✅ ACCEPTED (WRITE)
   Execution time: 6.34s
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

View Logs:

# Watch live
docker compose logs -f learning-mcp | Select-String "autogen_agent"

# Filter for key events
docker compose logs learning-mcp --tail=200 | Select-String -Pattern "LOOP|PLANNER|CRITIC|SUMMARY"


Technical Implementation

Core Function

File: src/learning_mcp/agents/autogen_planner.py
Function: plan_with_autogen(q: str, profile: str) -> dict

Key Components:

  1. Retriever Integration

    from learning_mcp.search_routes import api_context_search
    
    hits = await api_context_search(
        q=query,
        profile=profile,
        top_k=8
    )
  2. Agent Creation

    import httpx
    from autogen_ext.models.openai import OpenAIChatCompletionClient
    from autogen_agentchat.agents import AssistantAgent
    
    # Gateway client with BYOK
    client = OpenAIChatCompletionClient(
        model="dynamic/chat-default",
        api_key=GROQ_API_KEY,
        base_url=CF_GATEWAY_URL,
        http_client=httpx.AsyncClient(
            headers={"cf-aig-authorization": f"Bearer {CF_GATEWAY_TOKEN}"}
        )
    )
    
    planner = AssistantAgent(
        name="planner",
        model_client=client,
        system_message=planner_prompt
    )
    
    critic = AssistantAgent(
        name="critic",
        model_client=client,
        system_message=critic_prompt
    )
  3. Loop Logic

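    import json
    import os

    MAX_LOOPS = int(os.getenv("AUTOGEN_MAX_LOOPS", "3"))

    queries = [q]  # loop 1 searches with the user's original query

    for loop_idx in range(1, MAX_LOOPS + 1):
        # 1. Search
        hits = await search(queries)

        # 2. Planner: run() returns a TaskResult; the reply is the last message
        plan_result = await planner.run(task=build_prompt(hits))
        plan = json.loads(plan_result.messages[-1].content)

        # 3. Critic
        critic_result = await critic.run(task=build_critic_prompt(plan, hits))
        verdict = json.loads(critic_result.messages[-1].content)

        # 4. Decision
        if verdict["verdict"] == "pass" and meets_profile_rules(plan):
            return success(plan)

        # Retry with the critic's follow-up queries (keep the old ones if none)
        queries = verdict.get("next_search") or queries

    return needs_input()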

Environment Variables

# AutoGen
USE_AUTOGEN=1
AUTOGEN_BACKEND=groq
AUTOGEN_MODEL=dynamic/chat-default
AUTOGEN_MAX_LOOPS=3

# Cloudflare AI Gateway (with Dynamic Routing + BYOK)
OPENAI_BASE_URL=https://gateway.ai.cloudflare.com/v1/{account}/omni/compat
OPENAI_API_KEY=gsk_...  # Groq API key (BYOK requirement)
CF_GATEWAY_TOKEN=...    # Gateway authentication

# Groq (provider behind gateway)
GROQ_API_KEY=gsk_...    # Same as OPENAI_API_KEY

# Vector DB
VECTOR_DB_URL=http://vector-db:6333
TOP_K=8

# Logging
AUTOGEN_LOG_LEVEL=minimal  # minimal|detail

Gateway Architecture

Why Cloudflare AI Gateway?

┌─────────────┐          ┌─────────────────────┐          ┌──────────┐
│  AutoGen    │          │  CF AI Gateway      │          │  Groq    │
│  Agents     │ ────────▶│  - Caching          │ ────────▶│  LLM     │
│             │          │  - Rate limiting    │          │          │
│             │◀──────── │  - Analytics        │◀──────── │          │
└─────────────┘          │  - Cost tracking    │          └──────────┘
                         └─────────────────────┘

Benefits:

  • Caching: Repeated queries hit cache (faster, cheaper)
  • Rate Limiting: Protects from accidental DDoS
  • Analytics: Token usage, cost per endpoint, latency tracking
  • Dynamic Routing: Route to different providers without code changes
  • BYOK: Use your own Groq API key (bring your own key)

Headers Required:

{
    "Authorization": "Bearer gsk_...",           # Groq API key (auto by OpenAI client)
    "cf-aig-authorization": "Bearer wn..."       # Gateway token (manual)
}

Key Design Decisions

1. Why Two Agents Instead of One?

Alternative: Single Agent

Query → LLM → Plan

❌ Problems:

  • Hallucinations (makes up endpoints)
  • No self-correction
  • Over-confident on weak evidence

Our Approach: Two Agents

Query → 🔍 Search → 🤖 Planner → 🧐 Critic → Decision

✅ Benefits:

  • Critic catches planner mistakes
  • Explicit validation step
  • Can request better evidence
  • Higher accuracy, lower hallucination

2. Why Iterative Loops?

Alternative: Single Pass

Query → Search → Planner → Return

❌ Problems:

  • Initial query may be vague
  • Documentation may not have exact match
  • No refinement opportunity

Our Approach: Max 3 Loops

Loop 1: Broad query → weak evidence → fail
Loop 2: Specific query → better evidence → maybe pass
Loop 3: Refined query → concrete examples → pass

✅ Benefits:

  • Progressively better evidence
  • Critic guides search refinement
  • Higher quality final plans

3. Why Profile-Specific Rules?

Alternative: Generic Rules

# One size fits all
accept_any_endpoint = True
require_examples = False

❌ Problems:

  • Dahua cameras need strict syntax (array indices)
  • REST APIs are more permissive
  • Different safety requirements

Our Approach: Per-Profile Hints

dahua-camera:
  require_example_in_evidence: true  # Strict
  endpoint_allow: "^/cgi-bin/..."

informatica-cloud:
  require_example_in_evidence: false  # Permissive
  endpoint_allow: ".*"

✅ Benefits:

  • API-specific validation
  • Flexible safety levels
  • Better accuracy per domain

Performance Characteristics

Latency Breakdown

Typical single-loop success (~3.5s total):

| Step | Time | Notes |
|---|---|---|
| Embedding | 0.6s | Cloudflare Workers AI |
| Vector Search | 0.1s | Qdrant (local) |
| Planner LLM | 1.5s | Groq llama-3.1-8b-instant |
| Critic LLM | 1.2s | Same model |
| Overhead | 0.1s | JSON parsing, validation |

Two-loop refinement (~6.8s):

  • Loop 1: 3.5s (fail)
  • Loop 2: 3.3s (pass, uses cache)

Token Usage

Typical single-loop:

  • Planner Prompt: ~600 tokens (system + evidence)
  • Planner Response: ~200 tokens (JSON plan)
  • Critic Prompt: ~800 tokens (plan + evidence)
  • Critic Response: ~60 tokens (verdict)
  • Total: ~850 tokens per loop

Cost (Groq pricing):

  • Input: $0.05 / 1M tokens
  • Output: $0.08 / 1M tokens
  • Cost per call: ~$0.00007 (about 7 thousandths of a cent)
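
The cost arithmetic can be sketched as a small function using the two per-million-token prices quoted above. `estimate_cost` is a hypothetical helper, and the prices are copied from this section, not fetched from Groq.

```python
INPUT_PRICE_PER_M = 0.05   # USD per 1M input tokens (figure quoted above)
OUTPUT_PRICE_PER_M = 0.08  # USD per 1M output tokens (figure quoted above)


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for the given prompt and response token counts."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000
```

For example, `estimate_cost(1400, 260)` uses the planner and critic prompt and response sizes listed above and comes out at roughly $0.00009, the same order of magnitude as the quoted figure.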

Accuracy Metrics

Based on manual testing of 100 queries across 3 profiles:

| Metric | Result |
|---|---|
| Single-loop success rate | 68% |
| Two-loop success rate | 24% |
| Three-loop success rate | 4% |
| Needs input rate | 4% |
| Hallucination rate | <1% (with critic) |
| Hallucination rate (no critic) | ~15% (baseline) |

Best Practices

1. Query Formulation

Good Queries:

  • βœ… "get job status" β†’ Specific action
  • βœ… "enable Audio.AudioEnable" β†’ Exact parameter name
  • βœ… "set channel 0 bitrate" β†’ Includes channel context

Poor Queries:

  • ❌ "jobs" → Too vague
  • ❌ "audio" → Missing action (get/set?)
  • ❌ "change settings" → No specifics

2. Documentation Quality

What Helps:

  • ✅ Concrete examples with full URLs
  • ✅ Parameter descriptions with types
  • ✅ Request/response examples
  • ✅ Error cases documented

What Hurts:

  • ❌ Abstract descriptions only
  • ❌ Missing parameter types
  • ❌ No examples
  • ❌ Inconsistent naming

3. Confidence Interpretation

| Confidence | Meaning | Action |
| --- | --- | --- |
| > 0.90 | Strong evidence with examples | Use as-is |
| 0.75-0.90 | Good evidence, may lack examples | Review plan |
| 0.60-0.75 | Weak evidence | Verify before use |
| < 0.60 | Insufficient evidence | Don't use |
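
A caller could apply these bands mechanically. The following is a minimal sketch with a hypothetical `confidence_action` helper; because the bands share their endpoints, it assumes a boundary value falls into the higher band (0.75 maps to "Review plan", 0.60 to "Verify before use"), while exactly 0.90 stays in "Review plan" since the top band is strictly greater than 0.90.

```python
def confidence_action(confidence: float) -> str:
    """Map a plan confidence score in [0, 1] to the recommended action."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {confidence}")
    if confidence > 0.90:
        return "Use as-is"
    if confidence >= 0.75:  # assumption: shared boundary goes to the higher band
        return "Review plan"
    if confidence >= 0.60:  # same boundary assumption as above
        return "Verify before use"
    return "Don't use"
```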

4. Debugging Failed Plans

If you get needs_input:

  1. Check suggested_queries: What is the critic looking for?
  2. Manual search: Try those queries in search_docs tool
  3. Review evidence: Is the information actually in the docs?
  4. Improve docs: Add missing examples or parameter descriptions
  5. Adjust profile: Maybe relax require_example_in_evidence
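
Steps 1 to 3 can be partly automated: run each suggested query through search and note which ones return nothing, since those point at gaps in the documentation rather than at the planner. The sketch below is an assumption-laden illustration; `find_doc_gaps` is a hypothetical helper, and `search` is any callable you supply that wraps the search_docs tool and returns the matching chunks for a query.

```python
from typing import Callable, Sequence


def find_doc_gaps(
    suggested_queries: Sequence[str],
    search: Callable[[str], Sequence[str]],
) -> list[str]:
    """Return the suggested queries for which search finds no evidence."""
    return [query for query in suggested_queries if not search(query)]


# Usage with a stand-in search function (a real one would call search_docs).
def fake_search(query: str) -> list[str]:
    corpus = {"job status endpoint": ["chunk about job status"]}
    return corpus.get(query, [])


gaps = find_doc_gaps(["job status endpoint", "job status response example"], fake_search)
```

Queries that come back in `gaps` are candidates for step 4 (improving the docs); queries that do return evidence suggest the profile rules in step 5 are worth a look instead.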


Troubleshooting

Common Issues

Issue: Always returns needs_input

Possible causes:

  • Documentation doesn't contain relevant information
  • Query too vague → Refine query
  • Profile rules too strict → Check require_example_in_evidence
  • Endpoint pattern mismatch → Check the endpoint_allow regex

Debug steps:

# These commands use PowerShell's Select-String; in bash, pipe to grep -E instead.

# 1. Test search directly
docker compose logs learning-mcp | Select-String "Found.*chunks"

# 2. Check what planner sees
docker compose logs learning-mcp | Select-String "PLANNER Output"

# 3. See why critic fails
docker compose logs learning-mcp | Select-String "Missing:"

Issue: Plan looks wrong

Possible causes:

  • Planner hallucinating → Check logs for evidence used
  • Documentation misleading → Review provenance
  • Low confidence → Check critic verdict

Action:

  • If confidence < 0.75, verify the plan before use; below 0.60, don't use it
  • Review the suggested_queries and search manually
  • Check if documentation has the correct information

Issue: Slow performance (>10s)

Possible causes:

  • Multiple loops exhausted (3 loops)
  • Gateway cache miss
  • Network latency to Groq

Optimization:

  • Better initial queries → fewer loops
  • Pre-warm cache with common queries
  • Consider local LLM (Ollama) for dev

Issue: High token costs

Typical costs are very low (under a hundredth of a cent per single-loop call), but if concerned:

Actions:

  • Reduce TOP_K from 8 to 5 (less evidence per loop)
  • Reduce AUTOGEN_MAX_LOOPS from 3 to 2
  • Use Ollama locally (free, but slower)
  • Monitor Gateway analytics dashboard

Future Enhancements

Planned Features

  1. Streaming Responses: Stream planner/critic thoughts in real-time
  2. Plan Execution: Auto-execute GET requests, return actual responses
  3. Multi-Step Plans: Chain multiple API calls together
  4. Learning from Feedback: Track which plans work, adjust templates
  5. Custom Validators: Per-profile Python validators beyond regex

Experimental

  • Tool Use: Let planner call tools (curl, jq) to verify plans
  • Memory: Remember successful plans for similar queries
  • A/B Testing: Compare single-agent vs multi-agent accuracy

References

Code Files

  • Main Planner: src/learning_mcp/agents/autogen_planner.py
  • Search Integration: src/learning_mcp/routes/search_api.py
  • API Route: src/learning_mcp/routes/api_agent.py (caching wrapper)
  • Profile Config: config/learning.yaml
  • Tests: tests/integration/test_mcp_client_e2e.py

Documentation

Related Tools

  • search_docs: Direct semantic search (no planning)
  • execute_api_call: Execute generated plans (stub implementation)
  • Ingestion: POST /ingest/jobs to update documentation

Summary

The plan_api_call tool combines:

  1. Semantic Retrieval: Qdrant vector search over embedded docs
  2. Multi-Agent Planning: Planner generates, Critic validates
  3. Iterative Refinement: Up to 3 loops for better evidence
  4. Profile-Specific Rules: API-specific validation and templates
  5. Gateway Integration: Caching, analytics, rate limiting via Cloudflare
  6. Clean Responses: Minimal JSON output (plan + confidence)
  7. Detailed Logging: Full decision trace for debugging

Result: High-accuracy API plans generated from natural language queries, with safety guardrails and iterative improvement.