[FEATURE] Agent-level error handling hooks (rescue_from)

### Scope check

- [x] This is **core LLM communication** (not application logic)
- [x] This **benefits most users** (not just my use case)
- [x] This **can't be solved in application code** with current RubyLLM
- [x] I read the [Contributing Guide](https://github.com/crmne/ruby_llm/blob/main/CONTRIBUTING.md)

### Due diligence

- [x] I searched existing issues
- [x] I checked the documentation

### What problem does this solve?

In production, we need to handle LLM errors differently from typical application errors:

1. **Suppress transient errors from error reporting** — Rate limits, timeouts, and 5xx errors are expected in LLM workloads. We don't want these sent to our error tracker since they're noise, not bugs. Only `BadRequestError` (which indicates a defect in our pipeline) should be reported.
2. **Instrument errors for alerting** — We increment counters in our metrics provider on every LLM error so we can alert on spikes (e.g. sudden increase in rate limits), without flooding our error tracker.
3. **Log with context** — Log the error class and message for debugging.

Today the only way to do this is monkeypatching `Chat#complete` via `prepend`:

```ruby
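module RubyLLMErrorHandling
  # Wraps Chat#complete to instrument and log API errors, then re-raises.
  # `metrics` and `error_type` are application-defined helpers (not shown).
  def complete(...)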
    super
  rescue RubyLLM::Error, Faraday::TimeoutError => e
    metrics.increment("llm.api_error", type: error_type(e))
    Rails.logger.error("#{e.class}: #{e.message}")
    raise
  end
end

RubyLLM::Chat.prepend(RubyLLMErrorHandling)
```

This works but is fragile — it couples to internal implementation details of `Chat#complete` and has no access to agent context (which agent failed, what inputs were used).

### What does it look like?

A declarative `rescue_from` on `RubyLLM::Agent` — similar to what [ActiveAgents](https://docs.activeagents.ai/agents/error_handling#rescue-handlers) provides — would let us handle this cleanly with full agent context:

```ruby
class ApplicationAgent < RubyLLM::Agent
  rescue_from RubyLLM::RateLimitError, with: :handle_transient
  rescue_from RubyLLM::ServerError, with: :handle_transient
  rescue_from RubyLLM::ServiceUnavailableError, with: :handle_transient
  rescue_from RubyLLM::OverloadedError, with: :handle_transient
  rescue_from Faraday::TimeoutError, with: :handle_transient
  rescue_from RubyLLM::BadRequestError, with: :handle_bad_request

  private

  def handle_transient(exception)
    metrics.increment("llm.api_error", type: "transient")
    logger.error("#{exception.class}: #{exception.message}")
    raise exception # re-raise after instrumentation; callers suppress transient errors from the error tracker
  end

  def handle_bad_request(exception)
    metrics.increment("llm.api_error", type: "bad_request")
    logger.error("#{self.class.name}: #{exception.message}")
    error_tracker.notify(exception) # this one IS a bug — report it
    raise exception
  end
end
```

The handler has access to `self` (agent instance, class name, inputs, chat state), which is exactly the context needed for useful instrumentation.
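
To make the requested semantics concrete, here is a minimal plain-Ruby sketch of how such a hook could dispatch to handlers. Everything in it is hypothetical: `RescueHooks`, `with_rescue_handlers`, `DemoAgent`, and `DemoRateLimitError` are illustration names, not part of RubyLLM. It assumes the same lookup order Rails uses for `rescue_from` (via `ActiveSupport::Rescuable`): the most recently declared matching handler wins, and subclasses inherit their parents' handlers. A real implementation could likely just include `ActiveSupport::Rescuable` instead.

```ruby
# Hypothetical sketch of rescue_from-style hooks; not RubyLLM's API.
module RescueHooks
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Handlers declared directly on this class, as [exception_class, method_name] pairs.
    def own_rescue_handlers
      @own_rescue_handlers ||= []
    end

    # Inherited handlers first, then this class's own, in declaration order.
    def rescue_handlers
      inherited = superclass.respond_to?(:rescue_handlers) ? superclass.rescue_handlers : []
      inherited + own_rescue_handlers
    end

    def rescue_from(*exception_classes, with:)
      exception_classes.each { |klass| own_rescue_handlers << [klass, with] }
    end
  end

  # Runs the block; on error, calls the most recently declared matching
  # handler on the instance, or re-raises if no handler matches.
  def with_rescue_handlers
    yield
  rescue StandardError => e
    _, handler = self.class.rescue_handlers.reverse.find { |klass, _| e.is_a?(klass) }
    raise unless handler

    send(handler, e) # handlers may be private; they run with agent context via self
  end
end

# Stand-in error and agent classes, used only to exercise the sketch.
class DemoRateLimitError < StandardError; end

class DemoAgent
  include RescueHooks
  rescue_from DemoRateLimitError, with: :handle_transient

  attr_reader :handled

  def ask(question)
    with_rescue_handlers do
      raise DemoRateLimitError, "rate limited while asking #{question.inspect}"
    end
  end

  private

  def handle_transient(exception)
    @handled = "#{self.class.name}: #{exception.class}"
    raise exception # re-raise after instrumentation
  end
end
```

In this sketch the agent would wrap its own completion call in `with_rescue_handlers`, so callers would not need to change anything. Where that wrapping should live inside `RubyLLM::Agent` is left open here.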

### Why can't this be solved in application code?

- `Chat#complete` has no error hooks or callbacks
- `RubyLLM::Agent` has no error hooks or callbacks
- The only option is `prepend` on `Chat#complete`, which is a monkeypatch with no agent context
- Rescuing at every call site works but duplicates logic across every place an agent is used

### References

- [ActiveAgents rescue_from](https://docs.activeagents.ai/agents/error_handling#rescue-handlers) — the API pattern we'd love to see
- Related: #341, #688, #621
