[FEATURE] Agent-level error handling hooks (rescue_from) #708

@skovy

Scope check

  • This is core LLM communication (not application logic)
  • This benefits most users (not just my use case)
  • This can't be solved in application code with current RubyLLM
  • I read the Contributing Guide

Due diligence

  • I searched existing issues
  • I checked the documentation

What problem does this solve?

In production we need to handle LLM errors differently than typical application errors:

  1. Suppress transient errors from error reporting — Rate limits, timeouts, and 5xx errors are expected in LLM workloads. We don't want these sent to our error tracker since they're noise, not bugs. Only BadRequestError (which indicates a defect in our pipeline) should be reported.
  2. Instrument errors for alerting — We increment counters in our metrics provider on every LLM error so we can alert on spikes (e.g. sudden increase in rate limits), without flooding our error tracker.
  3. Log with context — Log the error class and message for debugging.

Today the only way to do this is monkeypatching Chat#complete via prepend:

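# Wraps Chat#complete to instrument errors, then re-raises them unchanged.
# `metrics` and `error_type` are assumed to be application-level helpers.
module RubyLLMErrorHandling
  def complete(...)
    super
  rescue RubyLLM::Error, Faraday::TimeoutError => e
    metrics.increment("llm.api_error", type: error_type(e))
    Rails.logger.error("#{e.class}: #{e.message}")
    raise # propagate the original error to the caller
  end
end

RubyLLM::Chat.prepend(RubyLLMErrorHandling)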

This works but is fragile — it couples to internal implementation details of Chat#complete and has no access to agent context (which agent failed, what inputs were used).
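The monkeypatch above calls an error_type helper that isn't shown. A minimal sketch of what such a classifier might look like follows. It uses stand-in error classes so it runs without any gems; the Demo module, error_type, reportable?, and REPORTABLE are hypothetical names for illustration, not part of RubyLLM.

```ruby
# Stand-in error hierarchy so the sketch runs without RubyLLM or Faraday.
# In a real app these would be the library's own error classes.
module Demo
  class Error < StandardError; end
  class RateLimitError < Error; end
  class ServerError < Error; end
  class BadRequestError < Error; end
  class TimeoutError < StandardError; end
end

# Errors that point at a defect in our own pipeline and should be reported.
REPORTABLE = [Demo::BadRequestError].freeze

# Hypothetical helper: maps an exception to a metrics tag.
def error_type(exception)
  case exception
  when Demo::BadRequestError then "bad_request"
  when Demo::RateLimitError then "rate_limit"
  when Demo::TimeoutError then "timeout"
  when Demo::ServerError then "server"
  else "other"
  end
end

# True only for errors that should reach the error tracker.
def reportable?(exception)
  REPORTABLE.any? { |klass| exception.is_a?(klass) }
end
```

Keeping the classification in one place means the metrics tag and the report-or-suppress decision stay consistent wherever the error is handled.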

What does it look like?

A declarative rescue_from on RubyLLM::Agent — similar to what ActiveAgents provides — would let us handle this cleanly with full agent context:

class ApplicationAgent < RubyLLM::Agent
  rescue_from RubyLLM::RateLimitError, with: :handle_transient
  rescue_from RubyLLM::ServerError, with: :handle_transient
  rescue_from RubyLLM::ServiceUnavailableError, with: :handle_transient
  rescue_from RubyLLM::OverloadedError, with: :handle_transient
  rescue_from Faraday::TimeoutError, with: :handle_transient
  rescue_from RubyLLM::BadRequestError, with: :handle_bad_request

  private

  def handle_transient(exception)
    metrics.increment("llm.api_error", type: "transient")
    logger.error("#{exception.class}: #{exception.message}")
    raise # re-raise after instrumentation; callers are expected to keep transient errors out of the error tracker
  end

  def handle_bad_request(exception)
    metrics.increment("llm.api_error", type: "bad_request")
    logger.error("#{self.class.name}: #{exception.message}")
    error_tracker.notify(exception) # this one IS a bug — report it
    raise
  end
end

The handler has access to self (agent instance, class name, inputs, chat state), which is exactly the context needed for useful instrumentation.
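For discussion, here is a rough plain-Ruby sketch of how such a hook could dispatch to instance-level handlers. It is not RubyLLM's implementation or API: RescueHooks, with_rescue_hooks, handler_for, and DemoAgent are all hypothetical names, and the raise inside ask stands in for a real LLM call.

```ruby
# Hypothetical mixin sketching a rescue_from DSL in plain Ruby.
module RescueHooks
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Handlers declared on this class, in declaration order.
    def rescue_handlers
      @rescue_handlers ||= []
    end

    def rescue_from(*error_classes, with:)
      error_classes.each { |klass| rescue_handlers << [klass, with] }
    end

    # Walks up the superclass chain; within a class, the most recently
    # declared matching handler wins.
    def handler_for(exception)
      klass = self
      while klass.respond_to?(:rescue_handlers)
        match = klass.rescue_handlers.reverse.find { |error_class, _| exception.is_a?(error_class) }
        return match.last if match
        klass = klass.superclass
      end
      nil
    end
  end

  # Runs the block; on a matching error, calls the handler on this instance
  # so it can use agent state. The handler decides whether to re-raise.
  def with_rescue_hooks
    yield
  rescue StandardError => e
    handler = self.class.handler_for(e)
    raise unless handler
    send(handler, e)
  end
end

class DemoAgent
  include RescueHooks

  class TransientError < StandardError; end

  rescue_from TransientError, with: :handle_transient

  attr_reader :events

  def initialize
    @events = []
  end

  # Stand-in for the real LLM call.
  def ask(prompt)
    with_rescue_hooks { raise TransientError, "rate limited on #{prompt}" }
  end

  private

  def handle_transient(exception)
    @events << "#{self.class.name}: #{exception.message}"
    raise exception
  end
end
```

Because the handler is invoked with send on the instance, it can read whatever the agent knows about itself. That context is what the prepend approach lacks.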

Why can't this be solved in application code?

  • Chat#complete has no error hooks or callbacks
  • RubyLLM::Agent has no error hooks or callbacks
  • The only option is prepend on Chat#complete, which is a monkeypatch with no agent context
  • Rescuing at every call site works but duplicates logic across every place an agent is used
