Skip to content

fix: keep agent snapshots resumable when runtime-only inputs cannot be serialized#11127

Open
shaun0927 wants to merge 2 commits intodeepset-ai:mainfrom
shaun0927:fix/agent-snapshot-resume-fallback
Open

fix: keep agent snapshots resumable when runtime-only inputs cannot be serialized#11127
shaun0927 wants to merge 2 commits intodeepset-ai:mainfrom
shaun0927:fix/agent-snapshot-resume-fallback

Conversation

@shaun0927
Copy link
Copy Markdown

@shaun0927 shaun0927 commented Apr 16, 2026

Related Issues

Proposed Changes:

The robustness change in #11108 prevents agent snapshot serialization errors from masking the real runtime error, but it can also replace the entire chat_generator / tool_invoker payload with {} when only one runtime-only field is non-serializable (for example a lambda streaming_callback).

That makes the saved snapshot non-resumable even though the essential payload (messages, state, tools, etc.) is still serializable.

This PR keeps the scope narrow:

How did you test it?

  • hatch -e test run pytest test/components/agents/test_agent_breakpoints.py -k 'non_serializable_runtime_callback' -q
  • hatch -e test run pytest test/core/pipeline/test_breakpoint.py -k 'create_agent_snapshot' -q
  • hatch -e test run pytest test/components/agents/test_agent_breakpoints.py -k 'resume_from_tool_invoker and not new_breakpoint' -q
  • hatch run fmt-check haystack/core/pipeline/breakpoint.py test/components/agents/test_agent_breakpoints.py

Notes for the reviewer

I intentionally kept this fix local to agent snapshot serialization instead of introducing a broader “non-resumable snapshot” concept.

The regression test covers a concrete user-visible case where a non-serializable runtime callback causes the fallback path to trigger, but the snapshot should still remain resumable because the essential state is serializable.

Checklist

  • I have read the contributors guidelines and the code of conduct.
  • I have updated the related issue with new insights and changes.
  • I have added unit tests and updated the docstrings.
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I have documented my code.
  • I have added a release note file, following the contributors guidelines.
  • I have run pre-commit hooks and fixed any issue.

…ialized

The fallback added for agent snapshot serialization errors preserved the
original runtime failure, but it could also replace the entire
chat_generator or tool_invoker payload with an empty dict. That made the
saved snapshot impossible to resume even when only a runtime-only field
such as a streaming callback was non-serializable.

This change narrows the fallback behavior: Haystack now retries those
component inputs field-by-field and omits only the fields that cannot be
serialized, preserving resumable fields like messages, state, and tools.
A regression test covers resuming from a tool-invoker snapshot created
with a non-serializable runtime callback.

Constraint: Must preserve the original deepset-ai#11108 goal of not masking the real runtime error
Rejected: Keep saving `{}` and document snapshots as non-resumable | breaks the existing resume contract more than necessary
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: If agent snapshot fallback behavior changes again, verify both error preservation and snapshot resumability
Tested: hatch -e test run pytest test/components/agents/test_agent_breakpoints.py -k 'resume_from_tool_invoker' -q
Tested: hatch -e test run pytest test/core/pipeline/test_breakpoint.py -k 'create_agent_snapshot' -q
Tested: hatch run fmt-check haystack/core/pipeline/breakpoint.py test/components/agents/test_agent_breakpoints.py
Not-tested: Full unit suite and integration suite
Related: deepset-ai#11126
@shaun0927 shaun0927 requested a review from a team as a code owner April 16, 2026 15:41
@shaun0927 shaun0927 requested review from bogdankostic and removed request for a team April 16, 2026 15:41
@vercel
Copy link
Copy Markdown

vercel Bot commented Apr 16, 2026

@shaun0927 is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Apr 16, 2026

CLA assistant check
All committers have signed the CLA.

@shaun0927
Copy link
Copy Markdown
Author

I re-validated this fix locally against both current upstream main and this branch.

Reproduced on current upstream main

Audited main commit:

  • 8dd36b366a5974c41eed4bc9be937fa7d2bcd9b9

Repro command:

cd /Users/jh0927/Workspace/haystack-autopilot-audit
hatch -e test run python /Users/jh0927/Workspace/haystack_snapshot_resume_probe.py

Observed on current main:

  • the snapshot is saved at the tool-invoker breakpoint
  • tool_invoker falls back to {} when the runtime-only lambda streaming_callback is not serializable
  • resume then fails with DeserializationError

Validated on this branch

Branch head validated locally:

  • 98e8317aa2a0bc6bd3d55d5bb1aba9a1d0939b47

Commands:

cd /Users/jh0927/Workspace/haystack-diff-audit
hatch -e test run python /Users/jh0927/Workspace/haystack_snapshot_resume_probe.py
hatch -e test run pytest test/components/agents/test_agent_breakpoints.py -k 'non_serializable_runtime_callback' -q
hatch -e test run pytest test/core/pipeline/test_breakpoint.py -k 'create_agent_snapshot' -q

Observed on this branch:

  • only the non-serializable streaming_callback field is omitted
  • tool_invoker.state is preserved in the snapshot payload
  • resume succeeds
  • targeted regression coverage passes

This scope looks right to me for mergeability as well: it stays local to agent snapshot serialization, preserves the #11108 non-masking behavior, and fixes the concrete resumability regression without introducing a broader new snapshot state model.

@shaun0927
Copy link
Copy Markdown
Author

Quick follow-up on the current PR state:

  • license/cla is now green
  • maintainer edit access is enabled on the PR
  • the only remaining failing check is the Vercel authorization step, which looks like a deepset team-side approval rather than a contributor-side issue

So at this point I do not see any remaining contributor-side blocker on my end. If you want any change in scope or implementation, I’m happy to adjust, but otherwise this should be ready for normal review once the maintainer-side Vercel step is handled.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves agent snapshot serialization during breakpoint snapshotting so that snapshots remain resumable when only some runtime-only inputs (e.g., callbacks) can’t be serialized, without masking the original runtime error.

Changes:

  • Replaces the {} fallback for failed chat_generator / tool_invoker input serialization with a field-by-field retry that omits only the failing fields.
  • Adds a regression test ensuring non-serializable runtime callbacks are omitted while resumable fields (notably state) remain.
  • Adds a release note documenting the resumability fix.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
haystack/core/pipeline/breakpoint.py Introduces _serialize_agent_component_inputs to preserve serializable agent inputs field-by-field when full serialization fails.
test/components/agents/test_agent_breakpoints.py Adds regression test asserting streaming_callback is omitted but snapshot remains resumable.
releasenotes/notes/fix-agent-snapshot-resume-after-fallback-7fd7ff9a0f8f8b87.yaml Release note for snapshot resumability improvement.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread haystack/core/pipeline/breakpoint.py Outdated
Comment on lines +434 to +440
if not serialized_properties:
return {}

return {
"serialization_schema": {"type": "object", "properties": serialized_properties},
"serialized_data": serialized_data,
}
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_serialize_agent_component_inputs can still return a bare {} when no fields serialize, but _deserialize_value_with_schema treats {} as an invalid payload and will raise DeserializationError. Consider always returning a structurally valid payload (e.g., empty object schema + empty serialized_data) so snapshots don’t become invalid-by-format even when all fields are omitted; this can matter if the corresponding component inputs are optional for resume (e.g., resuming from a ToolBreakpoint where chat_generator inputs may not be needed).

Copilot uses AI. Check for mistakes.
…puts fail to serialize

Address the Copilot review on deepset-ai#11127: when every field of a
chat_generator or tool_invoker input fails to serialize,
_serialize_agent_component_inputs previously returned a bare `{}`. The
downstream `_deserialize_value_with_schema` rejects `{}` with
DeserializationError, which would silently re-introduce the exact
non-resumable snapshot behavior the fix was meant to prevent (e.g. when
resuming from a ToolBreakpoint where the sub-component's inputs are
not strictly required).

The helper now always returns a structurally valid
`{"serialization_schema", "serialized_data"}` pair. When all fields
are omitted the payload degrades to the schema-valid empty object
(`{"type": "object", "properties": {}}` + `{}`), which deserializes
back to `{}` without raising.

Existing unit tests in TestCreateAgentSnapshot are updated to assert
the new empty-but-valid payload shape, and a new test verifies that
the all-fields-fail payload round-trips through
`_deserialize_value_with_schema` without raising. The release note is
extended to describe the empty-payload edge case.

Constraint: Must preserve the original deepset-ai#11108 goal of not masking the real runtime error
Constraint: Must keep the narrower field-by-field fallback from the previous commit intact
Rejected: Keep returning bare `{}` and document snapshots as non-resumable in this edge case | regresses the snapshot resume contract in the very scenario the previous commit promised to preserve
Rejected: Fall back to a different marker payload (e.g. string sentinel) | breaks downstream deserializer's object/properties contract and would require changes outside the fallback helper
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: If agent snapshot fallback behavior changes again, verify both DeserializationError is not raised on the empty-fields case and that the helper never returns a bare `{}`
Tested: hatch -e test run pytest test/core/pipeline/test_breakpoint.py -k 'create_agent_snapshot' -q
Tested: hatch -e test run pytest test/components/agents/test_agent_breakpoints.py -k 'non_serializable_runtime_callback' -q
Tested: hatch -e test run pytest test/components/agents/test_agent_breakpoints.py -k 'resume_from_tool_invoker and not new_breakpoint' -q
Tested: hatch run fmt-check haystack/core/pipeline/breakpoint.py test/core/pipeline/test_breakpoint.py test/components/agents/test_agent_breakpoints.py
Tested: hatch run test:types haystack/core/pipeline/breakpoint.py
Not-tested: Full unit suite and integration suite
Related: deepset-ai#11126
@shaun0927
Copy link
Copy Markdown
Author

Addressed the Copilot review point on _serialize_agent_component_inputs (line 440 of haystack/core/pipeline/breakpoint.py).

Problem

When every field of a sub-component input failed to serialize, the helper still returned a bare {}. _deserialize_value_with_schema rejects {} with DeserializationError, which silently re-introduced the non-resumable snapshot regression in the all-fields-fail case — exactly the scenario Copilot flagged for resumes from a ToolBreakpoint where the sub-component inputs are not strictly required.

Fix (commit e896522)

  • _serialize_agent_component_inputs now always returns a structurally valid {"serialization_schema", "serialized_data"} pair. The all-fields-fail branch degrades to {"serialization_schema": {"type": "object", "properties": {}}, "serialized_data": {}}, which _deserialize_value_with_schema loads back as {} without raising.
  • The feat: Enhance agent snapshot serialization with error handling for non-serializable inputs #11108 non-masking behavior is preserved — the helper still warns and the original runtime error is not swallowed.
  • Existing TestCreateAgentSnapshot assertions that compared against {} are updated to assert the empty-but-valid payload shape.
  • Added test_create_agent_snapshot_all_fields_non_serializable_payload_is_deserializable which triggers the all-fields-fail path and asserts the payload round-trips through _deserialize_value_with_schema.
  • Release note extended to describe the empty-payload edge case.

Local verification on this branch head

  • hatch -e test run pytest test/core/pipeline/test_breakpoint.py -k 'create_agent_snapshot' -q — 4 passed, 28 deselected
  • hatch -e test run pytest test/components/agents/test_agent_breakpoints.py -k 'non_serializable_runtime_callback' -q — 1 passed, 22 deselected
  • hatch -e test run pytest test/components/agents/test_agent_breakpoints.py -k 'resume_from_tool_invoker and not new_breakpoint' -q — 3 passed, 1 skipped, 19 deselected
  • hatch run fmt-check haystack/core/pipeline/breakpoint.py test/core/pipeline/test_breakpoint.py test/components/agents/test_agent_breakpoints.py — all checks passed, 3 files already formatted
  • hatch run test:types haystack/core/pipeline/breakpoint.py — no issues found

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Agent snapshots can become non-resumable when snapshot fallback stores empty tool_invoker/chat_generator payloads

3 participants