fix(deriver): surface save_representation errors to queue manager by mrlufepines · Pull Request #660 · plastic-labs/honcho

mrlufepines · 2026-05-07T22:25:54Z

Surface save_representation failures so queue rows are not silently marked processed

Target repo: plastic-labs/honcho
Suggested branch name: fix/deriver-silent-save-failure

Problem

When save_representation raises an exception (e.g. due to an embedding quota exhaustion,
a transient database error, or a network failure to the vector store), the deriver catches
the error, logs it, and continues. The queue row is then marked processed=true with
error=null. From the queue manager's perspective the task succeeded.

The practical result: zero observations are saved for that batch, but the queue row is
consumed and will never be retried. Any data in the message window is silently lost. This
was discovered in production after a Gemini embedding API quota event: docs=0 for
multiple peers despite clean-looking deriver logs.

Root cause

In process_representation_tasks_batch, the save loop catches Exception per observer,
appends to a save_errors list, and logs at ERROR level. However, the function returns
normally after the loop, so the exception never propagates to the queue manager. The queue
manager marks the row processed=true based on a clean return, regardless of whether any
data was actually persisted.

Fix

After the save loop, checks if save_errors is non-empty and raises a RuntimeError
aggregating all per-observer failures:

if save_errors:
    raise RuntimeError(
        f"save_representation failed for {len(save_errors)} observer(s): "
        + "; ".join(save_errors)
    )

This propagates the error to the queue manager, which then sets processed=false and
populates the error column so the row can be inspected and retried.

Files touched

src/deriver/deriver.py

Test plan

Mock save_representation to raise for one observer; assert the function raises
RuntimeError containing the observer name and exception class.
Mock save_representation to raise for all observers; assert all errors appear in
the single RuntimeError message.
Mock save_representation to succeed; assert the function returns normally
(no regression).
End-to-end with a forced save failure: confirm the queue row has processed=false
and error is populated after the deriver processes the batch.

Notes

Self-healing in Lufe's deployment via cleo ensure-patches until merged.
See full context in cleo's STATE.md and patches dir.
Pre-fix behavior: transient backend errors (embedding quota, DB hiccup) cause silent
data loss with no indication in the queue. Post-fix: failures surface, rows stay
retryable, and the error column provides a debugging trail.
This is a pure error-propagation fix with no logic changes to the save path itself.

Summary by CodeRabbit

Bug Fixes
- Enhanced error reporting for data persistence failures. Operations that fail to save data now properly report errors instead of completing silently, improving visibility into system issues.

The deriver's process_representation_tasks_batch had a try/except around save_representation that swallowed every exception with only a logger.error. The caller (queue_manager.process_work_unit) had no signal that anything went wrong, so it called mark_queue_items_as_processed with processed=true and the queue.error column stayed null. When Gemini embedding quota exhausted, every save failed but the deriver kept "processing" rows silently. Lufe got bit by this once already. Now the per-observer try/except still logs and continues (so partial saves aren't lost), but errors are collected in save_errors. After the metrics emit, if save_errors is non-empty the function raises a RuntimeError aggregating every observer that failed. queue_manager catches that, calls _handle_processing_error -> mark_queue_item_as_errored, and the queue.error column finally tells the truth. Notes: - Errored rows still get processed=true (queue_manager design choice to avoid busy-loop on permanent failures). Recovery is manual: query rows WHERE error LIKE '%quota%' or similar, set processed=false, restart deriver. - Metrics still emit on partial failure so we don't lose duration data. This fork file is also mirrored into cleo/patches/honcho_deriver/deriver.py on the host so cleo ensure-patches keeps the running container in sync. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

coderabbitai · 2026-05-07T22:26:09Z

Walkthrough

The process_representation_tasks_batch function in src/deriver/deriver.py now explicitly propagates representation save failures. Previously, save failures were only logged; now they accumulate in a save_errors list and trigger a RuntimeError after telemetry emission, surfacing failures to upstream queue processing.

Changes

Error Propagation in Batch Representation Saving

Layer / File(s)	Summary
Error Collection Setup `src/deriver/deriver.py`	A `save_errors` list collects exceptions raised during observer-specific `RepresentationManager.save_representation` calls.
Preserved Logic & Telemetry `src/deriver/deriver.py`	Surrounding batch processing logic and telemetry event emission remain unchanged in structure and content.
Error Propagation `src/deriver/deriver.py`	After telemetry is emitted, the function raises `RuntimeError` if `save_errors` is non-empty, including failure count and exception details.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

🐰 A rabbit observed with delight,
Save failures now brought to light!
No silent defeats in the queue,
Errors surfaced bright and true—
Exception justice, oh what a sight! 🌟

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: surfacing save_representation errors from the deriver to the queue manager instead of silently swallowing them.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/deriver/deriver.py`:
- Around line 264-268: Replace the generic RuntimeError raised when save_errors
is non-empty with a project-specific exception from src.exceptions.py: import
the appropriate exception (e.g., DeriverProcessingError or another existing
deriver/processing failure type) and raise that instead with the same aggregated
message built from save_errors; update the raise in deriver.py (the block
referencing save_errors and the f"save_representation failed..." message) to use
the imported custom exception so the code follows the repo rule for typed
exceptions.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e51d0a58-f29d-4397-bdad-e31dcfd9e6e0

📥 Commits

Reviewing files that changed from the base of the PR and between a4ae372 and a874b63.

📒 Files selected for processing (1)

src/deriver/deriver.py

coderabbitai · 2026-05-07T22:27:57Z

+    if save_errors:
+        raise RuntimeError(
+            f"save_representation failed for {len(save_errors)} observer(s): "
+            + "; ".join(save_errors)
+        )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use a project-specific exception instead of RuntimeError.

Raising a generic RuntimeError here loses domain semantics and violates the repo rule for typed exceptions in src/**. Please raise an appropriate custom exception from src/exceptions.py (e.g., a deriver/processing failure type) with the same aggregated message.

Suggested change

- if save_errors: - raise RuntimeError( - f"save_representation failed for {len(save_errors)} observer(s): " - + "; ".join(save_errors) - ) + if save_errors: + raise DeriverProcessingException( + f"save_representation failed for {len(save_errors)} observer(s): " + + "; ".join(save_errors) + )

As per coding guidelines: src/**/*.py: “Use custom exceptions defined in src/exceptions.py with specific exception types.”

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/deriver/deriver.py` around lines 264 - 268, Replace the generic RuntimeError raised when save_errors is non-empty with a project-specific exception from src.exceptions.py: import the appropriate exception (e.g., DeriverProcessingError or another existing deriver/processing failure type) and raise that instead with the same aggregated message built from save_errors; update the raise in deriver.py (the block referencing save_errors and the f"save_representation failed..." message) to use the imported custom exception so the code follows the repo rule for typed exceptions.

coderabbitai Bot reviewed May 7, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(deriver): surface save_representation errors to queue manager#660

fix(deriver): surface save_representation errors to queue manager#660
mrlufepines wants to merge 1 commit into
plastic-labs:mainfrom
mrlufepines:fix/deriver-silent-save

mrlufepines commented May 7, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 7, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mrlufepines commented May 7, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Surface save_representation failures so queue rows are not silently marked processed

Problem

Root cause

Fix

Files touched

Test plan

Notes

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mrlufepines commented May 7, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 7, 2026 •

edited

Loading