Scope check
Due diligence
What problem does this solve?
In production we need to handle LLM errors differently than typical application errors:
- Suppress transient errors from error reporting: Rate limits, timeouts, and 5xx errors are expected in LLM workloads. We don't want these sent to our error tracker since they're noise, not bugs. Only BadRequestError (which indicates a defect in our pipeline) should be reported.
- Instrument errors for alerting: We increment counters in our metrics provider on every LLM error so we can alert on spikes (e.g. a sudden increase in rate limits) without flooding our error tracker.
- Log with context: Log the error class and message for debugging.
Today the only way to do this is monkeypatching Chat#complete via prepend:
    module RubyLLMErrorHandling
      def complete(...)
        super
      rescue RubyLLM::Error, Faraday::TimeoutError => e
        metrics.increment("llm.api_error", type: error_type(e))
        Rails.logger.error("#{e.class}: #{e.message}")
        raise
      end
    end

    RubyLLM::Chat.prepend(RubyLLMErrorHandling)
This works but is fragile — it couples to internal implementation details of Chat#complete and has no access to agent context (which agent failed, what inputs were used).
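The snippet above calls an error_type helper that is not shown. As a rough sketch only, the classification it implies (transient errors are counted but not reported, BadRequestError is reported) might look like the following. The StubLLM classes and StubTimeoutError are stand-ins so the example runs without the ruby_llm or faraday gems, and error_type and report_to_tracker? are hypothetical names, not part of either library.

```ruby
# Stand-in error classes so the sketch runs without ruby_llm or faraday.
# In a real app these would be the RubyLLM::* errors and Faraday::TimeoutError.
module StubLLM
  class Error < StandardError; end
  class RateLimitError < Error; end
  class ServerError < Error; end
  class BadRequestError < Error; end
end
class StubTimeoutError < StandardError; end

# Errors treated as expected noise in LLM workloads.
TRANSIENT_ERRORS = [
  StubLLM::RateLimitError,
  StubLLM::ServerError,
  StubTimeoutError
].freeze

# Hypothetical helper: maps an exception to the tag used on the metric.
def error_type(exception)
  case exception
  when StubLLM::BadRequestError then "bad_request"
  when *TRANSIENT_ERRORS then "transient"
  else "unknown"
  end
end

# Hypothetical helper: only pipeline defects go to the error tracker.
def report_to_tracker?(exception)
  error_type(exception) == "bad_request"
end
```

With a helper like this, the metric tag and the decision to notify the error tracker come from one place, but the agent context is still missing.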
What does it look like?
A declarative rescue_from on RubyLLM::Agent — similar to what ActiveAgents provides — would let us handle this cleanly with full agent context:
    class ApplicationAgent < RubyLLM::Agent
      rescue_from RubyLLM::RateLimitError, with: :handle_transient
      rescue_from RubyLLM::ServerError, with: :handle_transient
      rescue_from RubyLLM::ServiceUnavailableError, with: :handle_transient
      rescue_from RubyLLM::OverloadedError, with: :handle_transient
      rescue_from Faraday::TimeoutError, with: :handle_transient
      rescue_from RubyLLM::BadRequestError, with: :handle_bad_request

      private

      def handle_transient(exception)
        metrics.increment("llm.api_error", type: "transient")
        logger.error("#{exception.class}: #{exception.message}")
        raise # re-raise after instrumentation, but caller knows to suppress from error tracker
      end

      def handle_bad_request(exception)
        metrics.increment("llm.api_error", type: "bad_request")
        logger.error("#{self.class.name}: #{exception.message}")
        error_tracker.notify(exception) # this one IS a bug, so report it
        raise
      end
    end
The handler has access to self (agent instance, class name, inputs, chat state), which is exactly the context needed for useful instrumentation.
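To make the requested behavior concrete, here is a dependency-free sketch of how a declarative rescue_from could dispatch to instance methods. It is not RubyLLM's API and not a proposed implementation: SketchRescuable, with_rescue_handlers, DemoAgent, and DemoRateLimitError are all hypothetical names. Rails offers a comparable mechanism in ActiveSupport::Rescuable, which a real implementation might reuse instead.

```ruby
# Hypothetical mixin sketching a declarative rescue_from.
# None of these names come from RubyLLM.
module SketchRescuable
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Handlers are stored as [exception_class, method_name] pairs.
    def rescue_handlers
      @rescue_handlers ||= []
    end

    def rescue_from(*klasses, with:)
      klasses.each { |klass| rescue_handlers << [klass, with] }
    end

    # Subclasses start with a copy of the parent's handlers.
    def inherited(subclass)
      super
      subclass.instance_variable_set(:@rescue_handlers, rescue_handlers.dup)
    end
  end

  # Runs the block; on error, calls the most recently declared matching handler.
  def with_rescue_handlers
    yield
  rescue StandardError => e
    _, handler = self.class.rescue_handlers.reverse.find { |klass, _| e.is_a?(klass) }
    raise unless handler # no handler registered: propagate unchanged
    send(handler, e)
  end
end

class DemoRateLimitError < StandardError; end # stand-in for RubyLLM::RateLimitError

class DemoAgent
  include SketchRescuable
  rescue_from DemoRateLimitError, with: :handle_transient

  attr_reader :events

  def initialize
    @events = []
  end

  # Stand-in for an agent call that hits a rate limit.
  def ask(_prompt)
    with_rescue_handlers { raise DemoRateLimitError, "429" }
  end

  private

  def handle_transient(exception)
    # self is the agent instance, so agent context is available here.
    @events << [self.class.name, exception.class.name]
    raise exception # re-raise after instrumentation
  end
end
```

In this sketch, calling DemoAgent.new.ask("hi") records the agent class and error class in events and then re-raises DemoRateLimitError to the caller. The handlers are copied to subclasses, which is what lets a base class like ApplicationAgent declare them once.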
Why can't this be solved in application code?
- Chat#complete has no error hooks or callbacks
- RubyLLM::Agent has no error hooks or callbacks
- The only option is prepend on Chat#complete, which is a monkeypatch with no agent context
- Rescuing at every call site works but duplicates logic across every place an agent is used
References