Skip to content

fix(deriver): surface save_representation errors to queue manager#660

Open
mrlufepines wants to merge 1 commit into
plastic-labs:mainfrom
mrlufepines:fix/deriver-silent-save
Open

fix(deriver): surface save_representation errors to queue manager#660
mrlufepines wants to merge 1 commit into
plastic-labs:mainfrom
mrlufepines:fix/deriver-silent-save

Conversation

@mrlufepines
Copy link
Copy Markdown

@mrlufepines mrlufepines commented May 7, 2026

Surface save_representation failures so queue rows are not silently marked processed

Target repo: plastic-labs/honcho
Suggested branch name: fix/deriver-silent-save-failure

Problem

When save_representation raises an exception (e.g. due to an embedding quota exhaustion,
a transient database error, or a network failure to the vector store), the deriver catches
the error, logs it, and continues. The queue row is then marked processed=true with
error=null. From the queue manager's perspective the task succeeded.

The practical result: zero observations are saved for that batch, but the queue row is
consumed and will never be retried. Any data in the message window is silently lost. This
was discovered in production after a Gemini embedding API quota event: docs=0 for
multiple peers despite clean-looking deriver logs.

Root cause

In process_representation_tasks_batch, the save loop catches Exception per observer,
appends to a save_errors list, and logs at ERROR level. However, the function returns
normally after the loop, so the exception never propagates to the queue manager. The queue
manager marks the row processed=true based on a clean return, regardless of whether any
data was actually persisted.

Fix

After the save loop, checks if save_errors is non-empty and raises a RuntimeError
aggregating all per-observer failures:

if save_errors:
    raise RuntimeError(
        f"save_representation failed for {len(save_errors)} observer(s): "
        + "; ".join(save_errors)
    )

This propagates the error to the queue manager, which then sets processed=false and
populates the error column so the row can be inspected and retried.

Files touched

  • src/deriver/deriver.py

Test plan

  • Mock save_representation to raise for one observer; assert the function raises
    RuntimeError containing the observer name and exception class.
  • Mock save_representation to raise for all observers; assert all errors appear in
    the single RuntimeError message.
  • Mock save_representation to succeed; assert the function returns normally
    (no regression).
  • End-to-end with a forced save failure: confirm the queue row has processed=false
    and error is populated after the deriver processes the batch.

Notes

  • Self-healing in Lufe's deployment via cleo ensure-patches until merged.
  • See full context in cleo's STATE.md and patches dir.
  • Pre-fix behavior: transient backend errors (embedding quota, DB hiccup) cause silent
    data loss with no indication in the queue. Post-fix: failures surface, rows stay
    retryable, and the error column provides a debugging trail.
  • This is a pure error-propagation fix with no logic changes to the save path itself.

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced error reporting for data persistence failures. Operations that fail to save data now properly report errors instead of completing silently, improving visibility into system issues.

The deriver's process_representation_tasks_batch had a try/except around
save_representation that swallowed every exception with only a logger.error.
The caller (queue_manager.process_work_unit) had no signal that anything
went wrong, so it called mark_queue_items_as_processed with processed=true
and the queue.error column stayed null. When Gemini embedding quota
exhausted, every save failed but the deriver kept "processing" rows
silently. Lufe got bit by this once already.

Now the per-observer try/except still logs and continues (so partial saves
aren't lost), but errors are collected in save_errors. After the metrics
emit, if save_errors is non-empty the function raises a RuntimeError
aggregating every observer that failed. queue_manager catches that,
calls _handle_processing_error -> mark_queue_item_as_errored, and the
queue.error column finally tells the truth.

Notes:
- Errored rows still get processed=true (queue_manager design choice to
  avoid busy-loop on permanent failures). Recovery is manual: query rows
  WHERE error LIKE '%quota%' or similar, set processed=false, restart
  deriver.
- Metrics still emit on partial failure so we don't lose duration data.

This fork file is also mirrored into cleo/patches/honcho_deriver/deriver.py
on the host so cleo ensure-patches keeps the running container in sync.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 7, 2026

Review Change Stack

Walkthrough

The process_representation_tasks_batch function in src/deriver/deriver.py now explicitly propagates representation save failures. Previously, save failures were only logged; now they accumulate in a save_errors list and trigger a RuntimeError after telemetry emission, surfacing failures to upstream queue processing.

Changes

Error Propagation in Batch Representation Saving

Layer / File(s) Summary
Error Collection Setup
src/deriver/deriver.py
A save_errors list collects exceptions raised during observer-specific RepresentationManager.save_representation calls.
Preserved Logic & Telemetry
src/deriver/deriver.py
Surrounding batch processing logic and telemetry event emission remain unchanged in structure and content.
Error Propagation
src/deriver/deriver.py
After telemetry is emitted, the function raises RuntimeError if save_errors is non-empty, including failure count and exception details.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

🐰 A rabbit observed with delight,
Save failures now brought to light!
No silent defeats in the queue,
Errors surfaced bright and true—
Exception justice, oh what a sight! 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: surfacing save_representation errors from the deriver to the queue manager instead of silently swallowing them.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/deriver/deriver.py`:
- Around line 264-268: Replace the generic RuntimeError raised when save_errors
is non-empty with a project-specific exception from src.exceptions.py: import
the appropriate exception (e.g., DeriverProcessingError or another existing
deriver/processing failure type) and raise that instead with the same aggregated
message built from save_errors; update the raise in deriver.py (the block
referencing save_errors and the f"save_representation failed..." message) to use
the imported custom exception so the code follows the repo rule for typed
exceptions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e51d0a58-f29d-4397-bdad-e31dcfd9e6e0

📥 Commits

Reviewing files that changed from the base of the PR and between a4ae372 and a874b63.

📒 Files selected for processing (1)
  • src/deriver/deriver.py

Comment thread src/deriver/deriver.py
Comment on lines +264 to +268
if save_errors:
raise RuntimeError(
f"save_representation failed for {len(save_errors)} observer(s): "
+ "; ".join(save_errors)
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use a project-specific exception instead of RuntimeError.

Raising a generic RuntimeError here loses domain semantics and violates the repo rule for typed exceptions in src/**. Please raise an appropriate custom exception from src/exceptions.py (e.g., a deriver/processing failure type) with the same aggregated message.

Suggested change
-    if save_errors:
-        raise RuntimeError(
-            f"save_representation failed for {len(save_errors)} observer(s): "
-            + "; ".join(save_errors)
-        )
+    if save_errors:
+        raise DeriverProcessingException(
+            f"save_representation failed for {len(save_errors)} observer(s): "
+            + "; ".join(save_errors)
+        )

As per coding guidelines: src/**/*.py: “Use custom exceptions defined in src/exceptions.py with specific exception types.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/deriver/deriver.py` around lines 264 - 268, Replace the generic
RuntimeError raised when save_errors is non-empty with a project-specific
exception from src.exceptions.py: import the appropriate exception (e.g.,
DeriverProcessingError or another existing deriver/processing failure type) and
raise that instead with the same aggregated message built from save_errors;
update the raise in deriver.py (the block referencing save_errors and the
f"save_representation failed..." message) to use the imported custom exception
so the code follows the repo rule for typed exceptions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant