Add type-aware custom object serialization #154
Rewrite the JSON codec in shared.py to emit plain JSON (no internal type marker) and add type-directed deserialization via an optional expected_type. Custom objects round-trip everywhere:

- call_activity/call_sub_orchestrator/call_entity gain return_type; wait_for_external_event gains data_type; these also refine the returned task's static type via overloads.
- Inbound payloads (orchestrator/activity/entity inputs) and call_activity results are reconstructed from function type annotations (new internal type_discovery module), best-effort and conservative.
- Entity get_state and new client OrchestrationState.get_input/get_output/get_custom_status accessors route through the shared codec.
- Fix nested-dataclass round-trip bug; chain serialization errors with the original cause.

Legacy AUTO_SERIALIZED payloads still deserialize for in-flight replay.
…m-type-serialization # Conflicts: # CHANGELOG.md
Expanding on the

The gap today

Serialization is currently hardcoded to

A Python-equivalent shape

    # durabletask/serialization.py (new public API)
    from abc import ABC, abstractmethod
    from typing import Any

    class DataConverter(ABC):
        @abstractmethod
        def serialize(self, value: Any) -> str | None:
            ...

        @abstractmethod
        def deserialize(self, data: str | None, target_type: type | None = None) -> Any:
            ...

The default preserves today's behavior, so it's a non-breaking refactor:

    from durabletask.internal import shared

    class JsonDataConverter(DataConverter):
        def serialize(self, value):
            return None if value is None else shared.to_json(value)

        def deserialize(self, data, target_type=None):
            return None if data is None else shared.from_json(data, target_type)

Then wire it through the worker/client once (a constructor arg defaulting to

Pydantic example

A team using pydantic models could drop in a converter that gives full validation on the way in and pydantic's JSON encoding on the way out, while still falling back to stdlib JSON for plain dicts/dataclasses:

    import pydantic
    from durabletask.internal import shared
    from durabletask.serialization import DataConverter

    class PydanticDataConverter(DataConverter):
        """Round-trips pydantic models; falls back to stdlib JSON for everything else."""

        def serialize(self, value):
            if value is None:
                return None
            if isinstance(value, pydantic.BaseModel):
                return value.model_dump_json()  # pydantic v2
            return shared.to_json(value)  # dataclasses, dicts, ...

        def deserialize(self, data, target_type=None):
            if data is None:
                return None
            if (
                isinstance(target_type, type)
                and issubclass(target_type, pydantic.BaseModel)
            ):
                return target_type.model_validate_json(data)  # pydantic v2, validates
            return shared.from_json(data, target_type)

Register it once on each end:

    converter = PydanticDataConverter()
    worker = TaskHubGrpcWorker(data_converter=converter)
    client = TaskHubGrpcClient(data_converter=converter)

…and pydantic models round-trip everywhere through the same type-directed flow already in this PR:

    class Order(pydantic.BaseModel):
        customer: str
        total: float

    class Receipt(pydantic.BaseModel):
        confirmation_id: str

    def orchestrator(ctx, order: Order):  # discovery -> Order (validated)
        receipt = yield ctx.call_activity(charge, input=order, return_type=Receipt)
        return receipt.confirmation_id

Here

Why this is worth considering over the duck-typed hooks

This is complementary to the PR, not a blocker: the

Note

One open question for the design: whether the converter is global (one per worker/client) like .NET, or can also be overridden per call (e.g.
…m-type-serialization # Conflicts: # durabletask/worker.py
One follow-up on the breaking-change cleanup:

That example predates this PR and is unchanged by it, but it still relies on the old "top-level object →

    def process_order(ctx, order):  # no annotation
        yield ctx.call_activity(validate_order, input=order)
        total = yield ctx.call_activity(calculate_total, input=order.items)  # <-- here
        ...

    def validate_order(ctx, order):  # no annotation
        if not order.items:  # <-- here
            ...

After this PR, an unannotated payload decodes to a plain dict:

    >>> wire = json_codec.to_json(Order("Alice", [OrderItem("Widget", 2, 10.0)]))
    >>> got = json_codec.from_json(wire)  # no target type -> dict
    >>> type(got).__name__
    'dict'
    >>> got.items
    <built-in method items of dict object at 0x...>
    >>> for item in got.items: ...
    TypeError: 'builtin_function_or_method' object is not iterable

A few things worth noting:

- The fix needs to be coherent rather than one line, because the example currently mixes attribute access at the top level with dict access on nested items (
- Not a blocker, but since this PR is explicitly cleaning up the breaking-change surface and already fixed
Pull request overview
This PR modernizes Durable Task Python’s payload handling by introducing secure, type-directed JSON deserialization (caller/annotation supplied types, not payload-driven), while keeping the wire format as plain JSON and maintaining back-compat for legacy marker payloads.
Changes:
- Added a pluggable `DataConverter` abstraction (default `JsonDataConverter`) and routed all payload boundaries through it (worker + client + entities).
- Implemented type-directed reconstruction via explicit `return_type`/`data_type` parameters and best-effort annotation-based type discovery for inbound payloads.
- Replaced legacy marker-based JSON codec with a plain-JSON codec that still recognizes legacy payloads for replay compatibility, plus expanded tests/docs/examples/changelog.
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| durabletask/internal/json_codec.py | New plain-JSON codec with type-directed coercion and legacy-marker back-compat. |
| durabletask/serialization.py | Introduces DataConverter + default JsonDataConverter used across SDK boundaries. |
| durabletask/internal/type_discovery.py | Adds conservative type-hint discovery for orchestrator/activity/entity inputs and activity outputs. |
| durabletask/internal/shared.py | Keeps older imports working via re-exports to the new JSON codec. |
| durabletask/worker.py | Routes orchestration/activity/entity execution through DataConverter; adds typed task completion + event payload typing. |
| durabletask/task.py | Updates public abstract context APIs with overloads and typed return_type / data_type parameters. |
| durabletask/client.py | Threads DataConverter into clients and adds typed OrchestrationState accessors. |
| durabletask/internal/client_helpers.py | Uses the converter for request payload serialization (start/terminate/event/entity signal). |
| durabletask/internal/entity_state_shim.py | Routes entity runtime state coercion through the converter while preserving strictness expectations. |
| durabletask/entities/entity_context.py | Uses converter for entity-side scheduling/signaling payloads. |
| durabletask/entities/entity_metadata.py | Adds get_typed_state() for typed state deserialization via the converter. |
| durabletask/extensions/history_export/client.py | Switches history-export entity state parsing to typed state access. |
| tests/durabletask/extensions/history_export/test_entity.py | Aligns tests with typed state access for history-export entities. |
| tests/durabletask/test_serialization.py | New unit tests for plain-JSON codec, hooks, coercion, and legacy-marker behavior. |
| tests/durabletask/test_type_discovery.py | New tests for annotation discovery + inbound coercion paths. |
| tests/durabletask/test_orchestration_state.py | New tests for typed OrchestrationState accessors. |
| tests/durabletask/test_orchestration_executor.py | Adds executor coverage for return_type and annotation-discovered activity results. |
| tests/durabletask/test_entity_executor.py | Adds coverage for StateShim coercion behavior via the converter. |
| tests/durabletask/test_data_converter.py | New tests for the DataConverter abstraction and default implementation behavior. |
| examples/in_memory_backend_example/src/workflows.py | Updates example to rely on annotated dataclass reconstruction instead of dict access. |
| examples/human_interaction.py | Updates example to use data_type for external-event payload reconstruction. |
| docs/supported-patterns.md | Documents using data_type / return_type and annotations to reconstruct original types. |
| CHANGELOG.md | Documents new converter + typed APIs and the behavioral shift toward plain dict/list when no type is supplied. |
…m-type-serialization # Conflicts: # durabletask/internal/entity_state_shim.py # examples/human_interaction.py # examples/in_memory_backend_example/src/workflows.py
* Fix custom serialization gaps from #154

  Close several round-tripping gaps left by the type-aware custom serialization work in #154, without introducing new breaking changes versus 1.5.0 or any serialization-related security concerns.

  Serialize side:
  - Prefer a to_json() hook over the built-in dataclass / SimpleNamespace handling so a dataclass (or namespace) with a non-serializable field can opt in, mirroring the decode side which already prefers from_json().
  - Encode dataclasses via a shallow field mapping instead of dataclasses.asdict(), so nested to_json() hooks are honored and leaf values are not deep-copied.
  - Serialize enum.Enum values to their underlying .value so non-int enums round-trip (IntEnum already serialized as integers).

  Deserialize side:
  - Recurse type-directed reconstruction into dict/Mapping values and tuple elements, in addition to the existing list / Optional / Union / dataclass recursion.
  - Optionally pass the active DataConverter to a from_json(cls, value, converter) hook so it can rebuild nested typed values the built-in recursion does not cover.

  Entity state:
  - Defer deserialization of an entity's wire state until get_state() is called, so the caller's requested type reaches the converter together with the raw payload. Track whether the held value is still the raw serialized string and pass it back through unchanged on persist to avoid double-encoding.
  - Replace a redundant serialize/deserialize round-trip in the legacy entity event path with converter.coerce().

  Module structure / deprecation:
  - Merge the internal json_codec module into durabletask.serialization and make the codec functions private; the supported surface is the pluggable DataConverter.
  - Deprecate durabletask.internal.shared.to_json / from_json with a DeprecationWarning; they continue to work for backwards compatibility.

  Adds a comprehensive JsonDataConverter round-trip test suite plus targeted tests for each fix, and documents intentional limitations (multi-member Union, types needing a custom converter such as datetime/Decimal/set).

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address PR feedback
* Fix annotation discovery in 3.10
* Add pydantic example, fix reconstructibility concern
* More fixes:
  - Rename is_reconstructible to can_reconstruct
  - Correct ownership of _can_reconstruct
  - Required DataConverter for internal classes
* CHANGELOG summarization
* No more silent fallbacks to JsonDataConverter
* PR feedback
* Final CHANGELOG tuneups

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
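The serialize-side changes listed in the commit message above (hook precedence, shallow dataclass field mapping, enum values) can be illustrated with a small standard-library sketch. The `_default` and `encode` helpers and the `Money`/`Status`/`Invoice` classes are hypothetical names chosen for this illustration, not the SDK's private codec functions; the sketch only assumes `json.dumps` with a `default` callback.

```python
import dataclasses
import enum
import json
from typing import Any


def _default(obj: Any) -> Any:
    """Fallback used by json.dumps for values it cannot encode natively."""
    # 1. A to_json() hook wins over the built-in dataclass handling.
    to_json = getattr(obj, "to_json", None)
    if callable(to_json):
        return to_json()
    # 2. Dataclass instances: shallow field mapping, so nested values come
    #    back through this function and their own hooks are honored.
    if dataclasses.is_dataclass(obj) and not isinstance(obj, type):
        return {f.name: getattr(obj, f.name) for f in dataclasses.fields(obj)}
    # 3. Plain enums serialize as their underlying value.
    if isinstance(obj, enum.Enum):
        return obj.value
    raise TypeError(f"no JSON encoding for {type(obj).__name__}")


def encode(value: Any) -> str:
    return json.dumps(value, default=_default)


@dataclasses.dataclass
class Money:
    cents: int

    def to_json(self) -> str:
        # Custom wire form: a decimal string instead of {"cents": ...}.
        return f"{self.cents / 100:.2f}"


class Status(enum.Enum):
    PAID = "paid"


@dataclasses.dataclass
class Invoice:
    total: Money
    status: Status


invoice = Invoice(Money(1250), Status.PAID)

# Shallow mapping: the nested Money.to_json() hook is used.
shallow = encode(invoice)

# dataclasses.asdict() flattens Money to a dict first, bypassing its hook.
deep = json.dumps(dataclasses.asdict(invoice), default=_default)
```

With the shallow mapping the nested `Money` value is encoded through its own hook, whereas the `asdict()` route has already turned it into `{"cents": 1250}` before the encoder ever sees it.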
Summary
Reworks user-payload JSON serialization to be secure and type-aware while
keeping the on-the-wire format stable. Custom objects (dataclasses and
`from_json()`-capable types) now round-trip everywhere in the programming model,
and deserialization is driven by caller-supplied or annotation-discovered types
rather than by the payload itself.
Motivation
The previous codec (`shared.to_json`/`from_json`) had real problems:
- `AUTO_SERIALIZED` marker, so nested dataclasses/namedtuples decoded inconsistently, and everything came back as a `SimpleNamespace` rather than the original type.
- Reconstructing arbitrary classes named in the payload is classic insecure
  deserialization (OWASP A08). The secure model is type-directed decoding,
  where the destination type comes from the caller/annotations, never the wire.
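To make the distinction concrete, here is a minimal standard-library sketch of type-directed decoding. The `decode` and `_coerce` helpers and the `Order`/`OrderItem` classes are assumptions made for illustration and are not the SDK's codec. The point is only that the class to build comes from the caller's `expected_type`, so nothing in the payload can select one. It assumes Python 3.9+ for `list[...]` annotations and handles `Optional[T]` but not general unions.

```python
import dataclasses
import json
import typing
from typing import Any, Optional


def _coerce(value: Any, expected_type: Any) -> Any:
    """Rebuild `value` as `expected_type`; the payload never picks the class."""
    origin = typing.get_origin(expected_type)
    args = typing.get_args(expected_type)
    # Optional[T]: pass None through, otherwise coerce to T.
    if origin is typing.Union and len(args) == 2 and type(None) in args:
        if value is None:
            return None
        inner = args[0] if args[1] is type(None) else args[1]
        return _coerce(value, inner)
    # list[T]: coerce each element.
    if origin is list and args and isinstance(value, list):
        return [_coerce(item, args[0]) for item in value]
    # Dataclass: build it from the fields it declares; unknown keys are ignored.
    if (
        isinstance(expected_type, type)
        and dataclasses.is_dataclass(expected_type)
        and isinstance(value, dict)
    ):
        hints = typing.get_type_hints(expected_type)
        kwargs = {
            f.name: _coerce(value[f.name], hints[f.name])
            for f in dataclasses.fields(expected_type)
            if f.name in value
        }
        return expected_type(**kwargs)
    # Builtins and anything unrecognized pass through unchanged.
    return value


def decode(data: str, expected_type: Any = None) -> Any:
    raw = json.loads(data)
    return raw if expected_type is None else _coerce(raw, expected_type)


@dataclasses.dataclass
class OrderItem:
    name: str
    quantity: int


@dataclasses.dataclass
class Order:
    customer: str
    items: list[OrderItem]
    note: Optional[str] = None


# A payload that tries to name a class is just data; the extra key is ignored.
wire = (
    '{"customer": "Alice", "items": [{"name": "Widget", "quantity": 2}],'
    ' "__class__": "os.system"}'
)
untyped = decode(wire)       # plain dict
typed = decode(wire, Order)  # Order with a nested OrderItem
```

Because `_coerce` only ever instantiates types reachable from `expected_type`, the set of constructible classes is fixed by the calling code rather than by whoever produced the payload.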
This also aligns the SDK with the direction of `azure-functions-durable` and the
.NET `DataConverter` (plain JSON + caller-supplied target type).
What changed
Type-directed deserialization
- `call_activity`, `call_sub_orchestrator`, and `call_entity` gain an optional `return_type`; `wait_for_external_event` gains `data_type`. When provided, the result/event is coerced to that type (dataclasses incl. nested / `Optional` / `list` fields, and `from_json()`-capable types). These also refine the static type of the returned task via `@overload` (e.g. `call_activity(..., return_type=Foo) -> CompletableTask[Foo]`).
- `client.OrchestrationState`: `get_input()`, `get_output()`, `get_custom_status()`; each takes an optional `expected_type` and is overloaded to return `T | None`. Raw `serialized_*` fields are retained.
- Entity `get_state(intended_type=...)` now routes through the shared codec (dataclass + `from_json` support).
Annotation-based discovery (new `durabletask/internal/type_discovery.py`)
- Inbound payloads (orchestrator/activity/entity inputs) and `call_activity` results are reconstructed from the function's type annotations when no explicit type is supplied. Discovery is best-effort and conservative: builtins and unannotated/unknown types pass through unchanged, and a payload that fails to coerce falls back to the raw value.
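As a rough illustration of what best-effort, conservative discovery can look like, the sketch below reads the payload parameter's annotation with `typing.get_type_hints` and falls back to the raw value whenever anything is missing or fails. `discover_input_type` and `coerce_or_raw` are hypothetical helpers, not the contents of `type_discovery.py`, and the coercion here is flat (no nested fields) to keep the example short.

```python
import dataclasses
import inspect
import typing
from typing import Any, Callable, Optional


def discover_input_type(fn: Callable[..., Any]) -> Optional[type]:
    """Return the annotation of the second parameter (the payload), if usable."""
    try:
        hints = typing.get_type_hints(fn)
        params = list(inspect.signature(fn).parameters)
    except Exception:
        return None  # unresolvable annotations, builtins, etc.: give up quietly
    if len(params) < 2:
        return None
    annotation = hints.get(params[1])
    # Conservative: in this sketch only dataclasses count as reconstructible.
    if isinstance(annotation, type) and dataclasses.is_dataclass(annotation):
        return annotation
    return None


def coerce_or_raw(value: Any, target: Optional[type]) -> Any:
    """Build `target` from a decoded dict; fall back to the raw value on failure."""
    if target is None or not isinstance(value, dict):
        return value
    try:
        return target(**value)
    except TypeError:
        return value  # missing or unexpected fields: keep the raw payload


@dataclasses.dataclass
class Order:
    customer: str
    total: float


def annotated(ctx, order: Order):
    return order.total


def unannotated(ctx, order):
    return order


payload = {"customer": "Alice", "total": 20.0}
as_order = coerce_or_raw(payload, discover_input_type(annotated))
as_dict = coerce_or_raw(payload, discover_input_type(unannotated))
mismatch = coerce_or_raw({"unexpected": 1}, Order)
```

The annotated function receives an `Order`, the unannotated one keeps the plain `dict`, and a payload that does not fit the dataclass is handed over unchanged instead of raising.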
Codec / behavior
- `to_json` now emits plain JSON (no internal type marker). Objects exposing `to_json()` are serializable.
- Serialization failures raise a `TypeError` that chains the original cause and names the offending type.
- Legacy payloads carrying the `AUTO_SERIALIZED` marker still deserialize: into a `SimpleNamespace` when no type is supplied, or stripped and coerced when an `expected_type` is given. In-flight orchestrations therefore replay cleanly across the upgrade.
Behavior change to note
Decoding without a type now yields a plain `dict`/`list` (previously a
`SimpleNamespace` for marked payloads). Callers that want the original type
should pass `return_type`/`data_type`/`expected_type`, or rely on annotation
discovery. Documented under `## Unreleased` in `CHANGELOG.md`.
Tests & validation
- New `test_serialization.py`, `test_type_discovery.py`, `test_orchestration_state.py`, plus additions to the entity/orchestration executor tests (codec round-trips, nested dataclasses, `expected_type` precedence, legacy-marker back-compat, error chaining, annotation discovery, static-type inference).
- `pyright` (0 errors), `flake8` (source + tests), `pytest` (non-e2e), and `pymarkdown` all green.
- `durabletask-azuremanaged` unit tests pass (it reuses the core client/worker).
Note
Opening as a draft for review. Static-type inference is refined on the task
object; the `result = yield ctx.call_activity(...)` pattern still yields `Any`
due to Python generator send-type limitations (would require an await-style API).
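The limitation in the note comes from how generators are typed: `Generator[YieldType, SendType, ReturnType]` has a single send type for the whole function, so a checker cannot link the value of one particular `yield` to the task that was yielded. The sketch below uses a hypothetical `Task` class and `run` driver (neither is the SDK's API) to show the effect and two manual ways of narrowing the result.

```python
from dataclasses import dataclass
from typing import Any, Generator, Generic, TypeVar, cast

T = TypeVar("T")


@dataclass
class Task(Generic[T]):
    """Hypothetical stand-in for a typed task handle."""
    result: T


def orchestrator() -> Generator[Task[Any], Any, str]:
    # The send type (second parameter) is fixed for the whole generator, so a
    # checker sees every `yield` expression as Any, whatever Task[T] was yielded.
    name_task: Task[str] = Task("Alice")
    count_task: Task[int] = Task(3)
    name = cast(str, (yield name_task))  # narrow with an explicit cast
    count: int = yield count_task        # or annotate the local variable
    return f"{name}:{count}"


def run(gen: Generator[Task[Any], Any, T]) -> T:
    """Drive the generator, sending each task's result back in."""
    try:
        task = next(gen)
        while True:
            task = gen.send(task.result)
    except StopIteration as stop:
        return stop.value


outcome = run(orchestrator())
```

Both `cast` and the annotated local only inform the type checker; the runtime value is whatever the driver sends, which is why the typed `return_type` on the task itself remains the source of truth.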