Tool Error Recovery

Tools fail at runtime — network errors, timeouts, flaky APIs. Agents.KT lets each tool declare its own recovery strategy, and the fixer is always an agent. No special parser class, no lambda callbacks — a regular Agent<String, String> with the same composition, telemetry, and budget tracking as everything else. Deterministic agents (implementedBy) cost zero LLM calls.

`onError` inside the tool block

Error handling lives where the tool lives:

tools {
    tool("fetch") {
        description("Fetch a URL")
        executor { args -> httpGet(args["url"].toString()) }
        onError {
            executionError { _ -> retry(maxAttempts = 3) }
        }
    }
}
// Tool throws → retries up to 3 times → succeeds or throws ToolExecutionException
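
The retry behaviour in the comment above can be pictured with a small standalone sketch. It does not use Agents.KT: `retrying` and the local `ToolExecutionException` class are hypothetical stand-ins, and counting the first call as attempt 1 is an assumption borrowed from how this document describes `LlmErrorDecision.Retry`, not something confirmed for tools.

```kotlin
// Hypothetical stand-in for the framework's exception; not the real class.
class ToolExecutionException(message: String, cause: Throwable? = null) :
    RuntimeException(message, cause)

// Runs `call` up to maxAttempts times (assumption: the first call counts as attempt 1).
// Returns the first successful result, or wraps the last failure once attempts run out.
fun <T> retrying(maxAttempts: Int, call: (attempt: Int) -> T): T {
    require(maxAttempts >= 1) { "maxAttempts must be at least 1" }
    var last: Exception? = null
    for (attempt in 1..maxAttempts) {
        try {
            return call(attempt)
        } catch (e: Exception) {
            last = e // remember the failure and try again
        }
    }
    throw ToolExecutionException("Tool failed after $maxAttempts attempts", last)
}

fun main() {
    // A flaky call that fails twice, then succeeds on the third attempt.
    val result = retrying(maxAttempts = 3) { attempt ->
        if (attempt < 3) throw java.io.IOException("connection reset") else "ok on attempt $attempt"
    }
    println(result) // ok on attempt 3
}
```

The sketch retries immediately; whether the real `retry(...)` waits between attempts is not stated in this section.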

Agent-based repair — the fixer is an Agent<String, String>:

val jsonFixer = agent<String, String>("json-fixer") {
    skills {
        skill<String, String>("cleanup", "Fixes common JSON issues") {
            implementedBy { input -> input.replace(",}", "}").replace(",]", "]") }
        }
    }
}

tools {
    tool("parse") {
        description("Parse JSON input")
        executor { args -> parseJson(args["json"].toString()) }
        onError {
            invalidArgs { _, _ -> fix(agent = jsonFixer) }
            executionError { _ -> fix(agent = jsonFixer, retries = 3) }
        }
    }
}
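
To make the repair idea concrete, here is a minimal standalone sketch of a "fix, then retry" loop. It assumes the fixer receives the failing input string and returns a repaired one, which this section does not spell out. `callWithFixer` and `parseStrict` are hypothetical helpers, and treating `retries` as the number of repair rounds after the first failure is also an assumption.

```kotlin
// Deterministic fixer with the same body as the jsonFixer skill above.
val cleanup: (String) -> String = { input -> input.replace(",}", "}").replace(",]", "]") }

// Toy stand-in for parseJson: rejects trailing commas, otherwise returns the input.
fun parseStrict(json: String): String {
    if (",}" in json || ",]" in json) throw IllegalArgumentException("Trailing comma in: $json")
    return json
}

// Hypothetical repair loop: on failure, hand the input to the fixer and retry with its output.
// Makes at most retries + 1 calls to the tool.
fun <T> callWithFixer(input: String, retries: Int, fixer: (String) -> String, tool: (String) -> T): T {
    var current = input
    for (round in 1..retries) {
        try {
            return tool(current)
        } catch (e: IllegalArgumentException) {
            current = fixer(current) // repair the input, then try again
        }
    }
    return tool(current) // final attempt: a failure here propagates to the caller
}

fun main() {
    println(callWithFixer("""{"a":[1,2,],}""", retries = 3, fixer = cleanup, tool = ::parseStrict))
    // {"a":[1,2]}
}
```

Because the fixer here is plain string replacement, the whole recovery path runs without a model call, which is the point the introduction makes about deterministic agents.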

Shorthand form also works — onError as a named parameter:

tools {
    tool("fetch", "Fetch a URL", onError = {
        executionError { _ -> retry(maxAttempts = 3) }
    }) { args -> httpGet(args["url"].toString()) }
}

Or at the agent level via onToolError:

onToolError("fetch") {
    executionError { _ -> retry(maxAttempts = 3) }
}

Defaults and priority

Set defaults once, override per tool:

tools {
    defaults {
        onError {
            executionError { _ -> retry(maxAttempts = 3) }
        }
    }
    tool("fetch", "Fetch URL") { _ -> httpGet() }      // inherits retry(3)
    tool("compile") {
        description("Compile code")
        executor { _ -> compile() }
        onError { executionError { _ -> retry(maxAttempts = 1) } }  // overrides
    }
}

Resolution priority: tool block onError > agent-level onToolError > defaults.
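
The priority rule reads naturally as a chain of null-coalescing lookups. The sketch below is only an illustration of that ordering: `Handler` and `resolveHandler` are hypothetical names, not part of the Agents.KT API.

```kotlin
// Hypothetical model of a handler: just a label, so the chosen source is visible.
data class Handler(val source: String)

// Picks the most specific handler: tool block > agent-level onToolError > defaults.
// Returns null when nothing is registered at any level.
fun resolveHandler(toolBlock: Handler?, agentLevel: Handler?, defaults: Handler?): Handler? =
    toolBlock ?: agentLevel ?: defaults

fun main() {
    val defaults = Handler("defaults")
    val agentLevel = Handler("onToolError")
    val toolBlock = Handler("tool block")

    println(resolveHandler(null, null, defaults)?.source)            // defaults
    println(resolveHandler(null, agentLevel, defaults)?.source)      // onToolError
    println(resolveHandler(toolBlock, agentLevel, defaults)?.source) // tool block
}
```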

Built-in tools: `escalate` and `throwException`

Every agent has two framework-provided tools — escalate and throwException. They exist in every agent's toolMap but are inactive by default. A skill activates them by referencing them in tools(...):

val fixer = agent<String, String>("json-fixer") {
    prompt("Fix malformed JSON. If structural error, call escalate. If binary garbage, call throwException.")
    model { ollama("gpt-4o-mini"); temperature = 0.0 }
    skills {
        skill<String, String>("fix", "Fix JSON") {
            tools("escalate", "throwException")   // activates built-in tools
        }
    }
}

escalate — soft failure. The error is fed back to the parent LLM as a tool result, giving it a chance to retry with corrected arguments:

LLM calls parseJson(json = "{name: world}")  →  tool throws (unquoted keys)
  → fixer agent tries to fix
  → fixer LLM calls escalate(reason = "Unquoted keys. Corrected: {\"name\":\"world\"}")
    → error fed back to parent LLM: "ERROR: Tool 'parseJson' failed: Unquoted keys..."
      → parent LLM retries: parseJson(json = '{"name":"world"}')  →  succeeds

throwException — hard failure. ToolExecutionException propagates immediately through the pipeline. No retries.

Deterministic agents can also escalate by throwing directly:

implementedBy { _ ->
    throw EscalationException("Schema mismatch", Severity.HIGH)  // soft
    // or
    throw ToolExecutionException("Binary data, not JSON")         // hard
}
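
The difference between the two failure modes can be sketched without the framework. In this standalone example the exception classes are local stand-ins named after the ones above (the real constructors may differ), `toolResultFor` is a hypothetical helper, and the error text copies the format shown in the trace earlier in this section.

```kotlin
enum class Severity { LOW, MEDIUM, HIGH, CRITICAL }

// Local stand-ins named after the framework's exceptions; the real classes may differ.
class EscalationException(val reason: String, val severity: Severity) : RuntimeException(reason)
class ToolExecutionException(message: String) : RuntimeException(message)

// Runs a tool and returns the text that would be handed back to the parent LLM.
// Soft failure: the escalation reason becomes an ERROR tool result.
// Hard failure: ToolExecutionException is not caught here, so it propagates to the caller.
fun toolResultFor(toolName: String, call: () -> String): String =
    try {
        call()
    } catch (e: EscalationException) {
        "ERROR: Tool '$toolName' failed: ${e.reason}"
    }

fun main() {
    // Soft: the caller gets a result string and can retry with corrected arguments.
    println(toolResultFor("parseJson") { throw EscalationException("Unquoted keys", Severity.HIGH) })
    // ERROR: Tool 'parseJson' failed: Unquoted keys

    // Hard: the exception escapes toolResultFor untouched.
    val hard = runCatching { toolResultFor("parseJson") { throw ToolExecutionException("Binary data, not JSON") } }
    println(hard.exceptionOrNull()?.message) // Binary data, not JSON
}
```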

Full example: JSON key counter with escalation recovery

A complete working example — agent parses malformed JSON via a tool, fixer agent escalates with corrected data, the LLM retries and succeeds:

// Fixer agent: LLM-driven, uses the built-in escalate tool.
// Analyzes the parse error and suggests corrected JSON in the escalation reason.
val fixer = agent<String, String>("json-fixer") {
    prompt(
        "You receive a string that failed to parse as JSON. " +
        "Call the escalate tool with a reason that includes the corrected valid JSON."
    )
    model { ollama("gpt-4o-mini"); temperature = 0.0 }
    budget { maxTurns = 3 }
    skills {
        skill<String, String>("fix", "Analyze and escalate JSON errors") {
            tools("escalate")   // activates the built-in escalate tool
        }
    }
}

// Main agent: uses calculateNumberOfKeys tool with onError inside the tool block.
val agent = agent<String, String>("json-analyst") {
    prompt(
        "Use the calculateNumberOfKeys tool to count keys in JSON objects. " +
        "If a tool returns an ERROR, read it carefully — it contains corrected JSON. " +
        "Retry the tool with the corrected JSON. Reply with ONLY the number."
    )
    model { ollama("llama3"); temperature = 0.0 }
    budget { maxTurns = 10 }
    tools {
        tool("calculateNumberOfKeys") {
            description("Count top-level keys in a JSON object. Args: json (valid JSON string)")
            executor { args ->
                val json = args["json"]?.toString()
                    ?: throw IllegalArgumentException("Missing 'json' argument")
                val keys = Regex(""""([^"]+)"\s*:""").findAll(json).toList()
                if (keys.isEmpty()) throw IllegalArgumentException("No valid keys — unquoted keys?")
                keys.size
            }
            onError {
                executionError { _ -> fix(agent = fixer, retries = 2) }
            }
        }
    }
    skills {
        skill<String, String>("solve", "Analyze JSON using tools") {
            tools("calculateNumberOfKeys")
        }
    }
    onToolUse { name, args, result ->
        println("  $name(${args["json"].toString().take(60)}) = $result")
    }
}

agent("How many keys? {name: world, age: 30, active: true}")
//   calculateNumberOfKeys({name: world, age: 30, active: true}) =
//       ERROR: Tool 'calculateNumberOfKeys' failed: No valid keys — unquoted keys? ...
//   calculateNumberOfKeys({"name":"world","age":30,"active":true}) = 3
// → "3"

The flow:

1. LLM calls calculateNumberOfKeys(json="{name: world, ...}") (malformed, unquoted keys)
2. Tool throws → onError invokes fixer agent
3. Fixer LLM analyzes the error, calls escalate(reason="...corrected: {\"name\":\"world\",...}")
4. Escalation error fed back to main LLM as tool result
5. Main LLM reads the corrected JSON from the error, retries with valid JSON
6. Tool succeeds → returns 3
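
The first and last steps of the flow depend only on the executor's regex, so they can be checked in isolation. The sketch below reuses the pattern from the executor above; `countKeys` is a hypothetical wrapper around it, and no model or framework is involved.

```kotlin
// Same pattern as the executor above: a quoted key, optional whitespace, then a colon.
val keyPattern = Regex(""""([^"]+)"\s*:""")

// Counts quoted keys, failing the same way the tool does when none are found.
fun countKeys(json: String): Int {
    val keys = keyPattern.findAll(json).toList()
    if (keys.isEmpty()) throw IllegalArgumentException("No valid keys (unquoted keys?)")
    return keys.size
}

fun main() {
    // Step 1: the malformed input has no quoted keys, so the tool throws.
    println(runCatching { countKeys("{name: world, age: 30, active: true}") }.isFailure) // true

    // Step 6: the corrected JSON from the escalation yields three keys.
    println(countKeys("""{"name":"world","age":30,"active":true}""")) // 3
}
```

Note that the pattern counts every quoted key it finds, nested ones included, so it matches the tool's "top-level keys" description only for flat objects like the one in this example.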

Error types

ToolError is a sealed hierarchy for programmatic handling:

sealed interface ToolError {
    data class InvalidArgs(val rawArgs: String, val parseError: String, ...)
    data class DeserializationError(val rawValue: String, val targetType: KType, ...)
    data class ExecutionError(val args: Map<String, Any?>, val cause: Throwable)
    data class EscalationError(val source: String, val reason: String, val severity: Severity, ...)
}
// Severity: LOW, MEDIUM, HIGH, CRITICAL
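
A sealed hierarchy lets a `when` expression cover every case without an `else` branch. The following standalone sketch uses a simplified local mirror of `ToolError` that declares only the fields shown above (the elided ones are left out), so it illustrates the pattern and is not the library's actual type. `describe` is a hypothetical helper.

```kotlin
import kotlin.reflect.KType

enum class Severity { LOW, MEDIUM, HIGH, CRITICAL }

// Simplified local mirror of ToolError: only the fields listed above.
sealed interface ToolError {
    data class InvalidArgs(val rawArgs: String, val parseError: String) : ToolError
    data class DeserializationError(val rawValue: String, val targetType: KType) : ToolError
    data class ExecutionError(val args: Map<String, Any?>, val cause: Throwable) : ToolError
    data class EscalationError(val source: String, val reason: String, val severity: Severity) : ToolError
}

// Exhaustive `when`: the compiler rejects this if a subtype is added and not handled.
fun describe(error: ToolError): String = when (error) {
    is ToolError.InvalidArgs -> "invalid arguments: ${error.parseError}"
    is ToolError.DeserializationError -> "cannot deserialize '${error.rawValue}'"
    is ToolError.ExecutionError -> "execution failed: ${error.cause.message}"
    is ToolError.EscalationError -> "escalated by ${error.source} (${error.severity}): ${error.reason}"
}

fun main() {
    println(describe(ToolError.InvalidArgs("{", "unexpected end of input")))
    println(describe(ToolError.ExecutionError(mapOf("url" to "http://example.com"), RuntimeException("timeout"))))
    println(describe(ToolError.EscalationError("json-fixer", "Schema mismatch", Severity.HIGH)))
}
```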

No tool has error handling by default. When no handler is set and a tool throws, the exception propagates normally — zero overhead on the happy path.

Model-call error recovery — `onLLMError`

Tool recovery (above) handles a tool that fails; onLLMError (#3508) handles the model call itself failing — a down provider (raw ConnectException), a 5xx, a malformed response. The default with no handler is fail fast and loud: the original exception propagates, identity preserved. A registered handler returns an LlmErrorDecision per failed attempt:

agent<String, Report>("analyst") {
    model { openai("gpt-5") }
    onLLMError { e ->
        when {
            e is ConnectException -> LlmErrorDecision.Retry(maxAttempts = 3, initialBackoffMillis = 500)
            else                  -> LlmErrorDecision.RespondWith(Report.empty())
        }
    }
}

- Rethrow (the default): fail fast and loud; the original exception propagates.
- RespondWith(output): recover with a typed fallback. The value must be assignable to the agent's OUT (a wrong type fails fast with ClassCastException).
- Retry(maxAttempts, initialBackoffMillis) (#4495): re-run the call with exponential backoff (500ms → 1s → 2s …, by default). maxAttempts counts the original call. The handler is consulted again on every failure, so it can switch to RespondWith or Rethrow mid-schedule. The attempt budget is per model turn, so one flaky turn can't starve a multi-turn run. Exhaustion rethrows the original error, exactly as Rethrow would.
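
The Retry schedule can be worked out by hand. The sketch below computes the waits between attempts under two assumptions taken from the description above: the delay doubles at each step (read off "500ms → 1s → 2s"), and because maxAttempts counts the original call there are maxAttempts - 1 waits. `backoffDelays` is a hypothetical helper, not a library function.

```kotlin
// Delays (in milliseconds) slept between attempts.
// maxAttempts includes the original call, so a schedule has maxAttempts - 1 waits.
// Doubling per step is an assumption based on the documented 500ms → 1s → 2s sequence.
fun backoffDelays(maxAttempts: Int, initialBackoffMillis: Long): List<Long> =
    List((maxAttempts - 1).coerceAtLeast(0)) { step -> initialBackoffMillis shl step }

fun main() {
    println(backoffDelays(maxAttempts = 3, initialBackoffMillis = 500)) // [500, 1000]
    println(backoffDelays(maxAttempts = 4, initialBackoffMillis = 500)) // [500, 1000, 2000]
    println(backoffDelays(maxAttempts = 1, initialBackoffMillis = 500)) // []
}
```

Under these assumptions, `Retry(maxAttempts = 3, initialBackoffMillis = 500)` from the example above adds at most 1.5 seconds of waiting to a turn before the original error is rethrown.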

The handler does not fire for budget caps (onBudgetExceeded owns those) or cancellation, and v1 scopes recovery to the agentic loop — a model failure during multi-skill LLM routing still propagates loud. See production-hardening.md for the deployment checklist entry.

Transport-layer retry (automatic, below `onLLMError`) — #4560

Before a failure ever reaches onLLMError, the shared HTTP transport (HttpModelClientSupport.sendBounded, used by every provider's non-streaming call — Claude, OpenAI + DeepSeek/Kimi/OpenRouter/Perplexity, Gemini, Ollama) already retries transient failures by default, no opt-in:

- connection-level exceptions: an IOException from the send (connection reset, refused, no-route, unexpected EOF), and
- transient HTTP statuses: 408 / 429 / 500 / 502 / 503 / 504.

Up to 3 attempts, exponential backoff (250ms → 500ms). This matches what official SDKs (e.g. OpenAI) do by default — a dropped connection or a 503 is the textbook retryable case, so you don't have to write a handler for it.

Two deliberate exclusions: HttpTimeoutException is not retried — the per-request timeout is your total budget, and retrying would silently multiply it, so a timeout surfaces immediately. And the original exception type is preserved on exhaustion (rethrown as-is, not wrapped) — that's what lets the onLLMError handler above match e is ConnectException.
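
The retry rules above amount to a small classification step. The sketch below is one way to write it using only JDK types; `isTransientStatus` and `isRetryableFailure` are hypothetical names, and treating every non-timeout IOException as connection-level is a simplifying assumption rather than a description of what `sendBounded` actually does.

```kotlin
import java.io.IOException
import java.net.ConnectException
import java.net.http.HttpTimeoutException

// The transient statuses listed above.
val transientStatuses = setOf(408, 429, 500, 502, 503, 504)

fun isTransientStatus(status: Int): Boolean = status in transientStatuses

// HttpTimeoutException is itself an IOException subclass, so it must be ruled out
// before the general IOException branch; otherwise timeouts would be retried too.
fun isRetryableFailure(e: Throwable): Boolean = when (e) {
    is HttpTimeoutException -> false
    is IOException -> true
    else -> false
}

fun main() {
    println(isRetryableFailure(ConnectException("Connection refused")))  // true
    println(isRetryableFailure(HttpTimeoutException("request timed out"))) // false
    println(isTransientStatus(503)) // true
    println(isTransientStatus(404)) // false
}
```

The branch order matters: swapping the first two cases would silently make timeouts retryable, which is exactly the budget multiplication the exclusion is meant to prevent.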

The two layers compose, transport first:

http.send → [transport retry: conn-level IOException / transient status, ×3] → raw exception → [onLLMError: your policy] → loop

So onLLMError sees only what survives the transport retries. Use it for semantic recovery (a RespondWith fallback, a longer Retry schedule, escalation, or retrying an HttpTimeoutException), not for plain connection blips. Streaming (sendChatStream) is not transport-retried, because re-issuing mid-stream would duplicate delivered tokens; wrap a streaming call in onLLMError { Retry() } or a firstOf(...) fallback if you need connect-phase resilience there. For higher-level recovery (falling over to another provider or model, or taking the first of N samples), use firstOf(a, b) / agent.speculative(n); to re-run until the output is valid, use loopUntil { … } (see composition.md).
