Skip to content

persist: Harden durable-state and binary decoding against malformed input#36985

Merged
def- merged 12 commits into
MaterializeInc:mainfrom
def-:fuzz-persist
Jun 25, 2026
Merged

persist: Harden durable-state and binary decoding against malformed input#36985
def- merged 12 commits into
MaterializeInc:mainfrom
def-:fuzz-persist

Conversation

@def-

@def- def- commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Hardens decode paths in persist, persist-client, persist-types, pgrepr, and postgres-util against malformed/untrusted input: reject rollup-less/empty-frontier/hollow-only/invalid StateDiff state instead of panicking or debug-asserting, guard UUID parsing, fix i16 overflow in pgrepr numeric-scale binary decode, and propagate u16-conversion errors in the postgres table-desc protos.

Found by the cargo-fuzz suite (separate infra PR). Each fix has a regression test.

🤖 Generated with Claude Code

@def- def- requested a review from DAlperin June 12, 2026 10:40
@def- def- marked this pull request as ready for review June 12, 2026 10:41
@def- def- requested review from a team as code owners June 12, 2026 10:41
@def-

def- commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Ready for review
Edit: Last blocker for #36982

@def- def- force-pushed the fuzz-persist branch 3 times, most recently from dfa0401 to c351cc6 Compare June 23, 2026 11:26

@petrosagg petrosagg left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall. The comments could be more laconic and still be to the point

Comment thread src/persist-client/src/internal/encoding.rs Outdated
Comment thread src/persist-client/src/internal/encoding.rs Outdated
@def- def- requested a review from petrosagg June 24, 2026 10:19

@DAlperin DAlperin left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Thanks for the fixes!

def- and others added 12 commits June 25, 2026 10:57
`PostgresKeyDesc::cols` is `Vec<u16>` but the proto carries it as
`Vec<u32>`; the decode path did `c.try_into().expect("values roundtrip")`,
which panicked on any value above 65535. That is reachable from untrusted
proto bytes (the fuzz target found this from many distinct inputs).
Propagate the conversion error via `TryFromProtoError` instead.

Also fixes a build error found while bringing up new fuzz crates: pgwire's
`Codec` was a fully-private struct, blocking the `fuzz_exports` re-export.
Make it `pub` so the same-crate `pub use` in lib.rs works under
`#[cfg(feature = "fuzz")]`.
…nDesc

Same shape as the `ProtoPostgresKeyDesc.cols` fix: `col_num` is `u16`
in Rust but `u32` on the wire, and the decode path used
`.expect("u16 must roundtrip")`. That is reachable from untrusted proto
bytes (the fuzz target found this from many distinct inputs once the
key-desc panic was out of the way).
`Numeric`'s binary `FromSql` computed `(units - weight - 1) * 4` in i16,
but `units`/`weight` are attacker-controlled i16 header fields from a
client bind parameter. Extreme `weight` values overflow i16. That is a
panic under overflow checks (a client-triggerable crash) and a silent
wraparound to a wrong scale otherwise. Compute the scale (and the
`in_scale` comparison) in i32; the values are well within range.

Found by the new value_decode_binary cargo-fuzz target.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…panicking

`from_proto_with_type` zipped the proto's `children` against the fields
its declared data type expects using `zip_eq`, which panics when the
counts disagree. `children` is reconstructed from (untrusted-on-disk)
Parquet part bytes, so a corrupted or crafted part panicked the reader.
Check the lengths and return a `TryFromProtoError` on mismatch.

Found by an exploratory cargo-fuzz target over the `ProtoArrayData ->
ArrayData` decode path. (That target is not committed: adversarial-but-
build-valid layouts also panic inside arrow's own ArrayData build/align/
equal routines, which is an upstream concern, so the target can't yet run
green in CI.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-asserting

`StateDiff`'s proto decode `debug_assert_eq!(field_diffs.validate(), Ok(()))`
on field diffs reconstructed from an untrusted consensus blob. Under debug
assertions (and fuzzing) a corrupt/crafted diff panics; in release the
assert is compiled out and the unvalidated diffs flow through. Validate
and return a `TryFromProtoError::InvalidPersistState` on failure so the
behaviour is consistent and a bad blob is a decode error, not a panic.

Found by the new state_diff_proto_roundtrip cargo-fuzz target.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tead of asserting

`BatchPart::from_proto`'s inline branch asserted (`assert_eq!`/`assert_none!`,
not `debug_assert`, so it panicked even in release) that the hollow-only
fields (`encoded_size_bytes`, `key_lower`, `key_stats`, `diffs_sum`) were
unset. These are decoded from an untrusted blob (read on every rollup/state
load), so a crafted/corrupted inline `ProtoHollowBatchPart` panicked the
reader. Validate and return a `TryFromProtoError` instead.

Found by the rollup_proto_roundtrip cargo-fuzz target.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`Description::from_proto` passed the decoded `lower` antichain straight to
`Description::new`, which asserts a non-empty lower frontier. `lower` comes
from an untrusted blob, so a crafted/corrupted `ProtoHollowBatch` with an
empty lower frontier panicked on rollup/state load. Validate it and return
a `TryFromProtoError` instead.

Found by the rollup_proto_roundtrip cargo-fuzz target (a second, distinct
rollup decode panic after the inline-batch-part fix).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A rollup proto's diff bounds are validated against the latest rollup, which
State::latest_rollup .expect()s to exist. ProtoRollup::into_rust reached
latest_rollup while decoding, so an untrusted/corrupt rollup proto that
carried diffs but whose state had no rollups panicked instead of being
rejected. That is a rollup_proto_roundtrip fuzz crash, and a decode-time
availability risk for a corrupt blob.

Validate the invariant in from_proto, alongside the diff-bounds check that
needs it, and return TryFromProtoError::InvalidPersistState when a proto has
diffs but no rollups, before latest_rollup is reached (and before the decoded
Rollup is later re-encoded). A rollup-less state without diffs is legitimate
(applier_version_state decodes a freshly initialized state that has no rollups
yet) and still decodes. Adds a regression test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A rollup's trace is decoded from an untrusted blob, but Trace::unflatten
reconstructed the spine through code whose invariants are enforced by panics,
so a corrupted or crafted blob could kill the decoder instead of failing with
a decode error. The rollup_proto_roundtrip fuzz target hit the first of these
in release qualification: legacy batches that don't tile the timeline
contiguously trip Spine::insert's (always-on) contiguity assert. Probing the
neighboring reconstruction code found the rest of the class, all reachable
from crafted bytes (cargo-fuzz builds enable debug assertions and overflow
checks, so the debug_asserts and arithmetic count too):

* Spine::insert also asserts a non-empty time range per legacy batch.
* An absurd batch len overflows the spine's maintenance arithmetic
  (len.next_power_of_two(), summing lens of merged batches).
* An absurd spine batch level overflows `level + 1` and sizes a vec![]
  allocation; a level past the first batch's indexes out of bounds.
* More descs than parts underflows the padding subtraction.
* A spine batch whose parts don't tile its id range trips SpineBatch::id's
  debug_asserts.
* MergeState::push_batch asserts id/desc chaining within a level, expects at
  most two batches per level, and that no incomplete merge is present.
* Anything else invalid (e.g. a batch since past the trace since) only failed
  Trace::validate, which ran under debug_assert: a panic in fuzz/dev builds,
  silent acceptance of corrupted state in release.

Validate all of this in unflatten (plus a fallible try_push_batch that
push_batch now wraps) and run Trace::validate unconditionally at the end,
returning decode errors. For real blobs nothing changes: every state a
correct writer flattens satisfies these checks (the flatten/unflatten
roundtrip proptest exercises both the structured and legacy paths), and prod
treats a decode error the same as the old panic (UntypedState::decode expects
rollups to decode and State::apply_diffs panics on apply errors), just with a
diagnosable message instead of an assert deep in spine internals.

Adds decode-rejection regression tests for each vector alongside the
existing rollup proto tests, named after what the crafted proto violates.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Trace::unflatten's check that a spine batch's reconstructed parts cover its
id range only verified the endpoints: the first part starts at the batch
id's lower and the last ends at its upper. A crafted rollup could encode an
outer id [0, 3) tiled by non-contiguous parts [0, 2) and [1, 3): the endpoint
check passes, and Trace::validate also passes because it inspects the outer
SpineBatch ids and descriptions, never the adjacency of a SpineBatch's parts.

That malformed trace would be accepted at decode and panic later in normal
maintenance instead. fueled_merge_reqs_before_ms emits a FueledMergeReq from
the accepted parts, and Compactor::chunk_runs / compact_all (plus the id-range
merge path in apply_merge_res_checked) call id_range on those ids, whose
adjacency check is an assert_eq!. So the hardening still converted a corrupted
durable rollup into a process panic for this invariant.

Validate the full tiling in unflatten: require adjacent parts to be contiguous
in addition to the existing endpoint check, so non-adjacent parts fail with a
decode error. Adds a decode-rejection regression test for the overlapping-parts
vector alongside the existing partless-spine-batch one.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…panic

`strconv::parse_uuid` (MaterializeInc#36984) rejected inputs shorter than 32 chars or
non-ASCII before handing them to the `uuid` crate, on the premise that
`Uuid::parse_str` panics (rather than erroring) while building its parse error
for some malformed inputs by slicing on a non-char boundary. That premise is
false for the version we lock (uuid 1.19.0) and for every release back to
0.8.2: `Uuid::parse_str` returns a clean `Err` for all such inputs (`{}`,
`{\0}`, non-ASCII, etc.). `InvalidUuid::into_err` rejects non-UTF8 up front,
strips the `{}`/`urn:uuid:` wrappers on raw bytes (always standalone ASCII, so
char-safe), and its one subtraction can't underflow.

So the guard only suppressed the `uuid` crate's own (more detailed) error
message for short/non-ASCII input, and the `value_decode_text` regression test
asserted an invariant the crate already upholds. No saved cargo-fuzz artifact
reaches a uuid panic. Restore the original parser and drop the vacuous test.
Mirrors removing the equivalent persist `parse_id`/`ShardId::from_str` guard.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@def- def- requested a review from a team as a code owner June 25, 2026 10:57
@def- def- merged commit 25802a5 into MaterializeInc:main Jun 25, 2026
121 checks passed
@def- def- deleted the fuzz-persist branch June 25, 2026 11:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants