persist: Harden durable-state and binary decoding against malformed input#36985
Merged
Conversation
13 tasks
Contributor
Author
|
Ready for review |
dfa0401 to
c351cc6
Compare
petrosagg
reviewed
Jun 24, 2026
petrosagg
left a comment
Contributor
There was a problem hiding this comment.
Looks good overall. The comments could be more laconic and still be to the point
DAlperin
approved these changes
Jun 24, 2026
DAlperin
approved these changes
Jun 24, 2026
DAlperin
left a comment
Member
There was a problem hiding this comment.
Looks good to me! Thanks for the fixes!
`PostgresKeyDesc::cols` is `Vec<u16>` but the proto carries it as
`Vec<u32>`; the decode path did `c.try_into().expect("values roundtrip")`,
which panicked on any value above 65535. That is reachable from untrusted
proto bytes (the fuzz target found this from many distinct inputs).
Propagate the conversion error via `TryFromProtoError` instead.
Also fixes a build error found while bringing up new fuzz crates: pgwire's
`Codec` was a fully-private struct, blocking the `fuzz_exports` re-export.
Make it `pub` so the same-crate `pub use` in lib.rs works under
`#[cfg(feature = "fuzz")]`.
…nDesc
Same shape as the `ProtoPostgresKeyDesc.cols` fix: `col_num` is `u16`
in Rust but `u32` on the wire, and the decode path used
`.expect("u16 must roundtrip")`. That is reachable from untrusted proto
bytes (the fuzz target found this from many distinct inputs once the
key-desc panic was out of the way).
`Numeric`'s binary `FromSql` computed `(units - weight - 1) * 4` in i16, but `units`/`weight` are attacker-controlled i16 header fields from a client bind parameter. Extreme `weight` values overflow i16. That is a panic under overflow checks (a client-triggerable crash) and a silent wraparound to a wrong scale otherwise. Compute the scale (and the `in_scale` comparison) in i32; the values are well within range. Found by the new value_decode_binary cargo-fuzz target. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…panicking `from_proto_with_type` zipped the proto's `children` against the fields its declared data type expects using `zip_eq`, which panics when the counts disagree. `children` is reconstructed from (untrusted-on-disk) Parquet part bytes, so a corrupted or crafted part panicked the reader. Check the lengths and return a `TryFromProtoError` on mismatch. Found by an exploratory cargo-fuzz target over the `ProtoArrayData -> ArrayData` decode path. (That target is not committed: adversarial-but- build-valid layouts also panic inside arrow's own ArrayData build/align/ equal routines, which is an upstream concern, so the target can't yet run green in CI.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-asserting `StateDiff`'s proto decode `debug_assert_eq!(field_diffs.validate(), Ok(()))` on field diffs reconstructed from an untrusted consensus blob. Under debug assertions (and fuzzing) a corrupt/crafted diff panics; in release the assert is compiled out and the unvalidated diffs flow through. Validate and return a `TryFromProtoError::InvalidPersistState` on failure so the behaviour is consistent and a bad blob is a decode error, not a panic. Found by the new state_diff_proto_roundtrip cargo-fuzz target. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tead of asserting `BatchPart::from_proto`'s inline branch asserted (`assert_eq!`/`assert_none!`, not `debug_assert`, so it panicked even in release) that the hollow-only fields (`encoded_size_bytes`, `key_lower`, `key_stats`, `diffs_sum`) were unset. These are decoded from an untrusted blob (read on every rollup/state load), so a crafted/corrupted inline `ProtoHollowBatchPart` panicked the reader. Validate and return a `TryFromProtoError` instead. Found by the rollup_proto_roundtrip cargo-fuzz target. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`Description::from_proto` passed the decoded `lower` antichain straight to `Description::new`, which asserts a non-empty lower frontier. `lower` comes from an untrusted blob, so a crafted/corrupted `ProtoHollowBatch` with an empty lower frontier panicked on rollup/state load. Validate it and return a `TryFromProtoError` instead. Found by the rollup_proto_roundtrip cargo-fuzz target (a second, distinct rollup decode panic after the inline-batch-part fix). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A rollup proto's diff bounds are validated against the latest rollup, which State::latest_rollup .expect()s to exist. ProtoRollup::into_rust reached latest_rollup while decoding, so an untrusted/corrupt rollup proto that carried diffs but whose state had no rollups panicked instead of being rejected. That is a rollup_proto_roundtrip fuzz crash, and a decode-time availability risk for a corrupt blob. Validate the invariant in from_proto, alongside the diff-bounds check that needs it, and return TryFromProtoError::InvalidPersistState when a proto has diffs but no rollups, before latest_rollup is reached (and before the decoded Rollup is later re-encoded). A rollup-less state without diffs is legitimate (applier_version_state decodes a freshly initialized state that has no rollups yet) and still decodes. Adds a regression test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A rollup's trace is decoded from an untrusted blob, but Trace::unflatten reconstructed the spine through code whose invariants are enforced by panics, so a corrupted or crafted blob could kill the decoder instead of failing with a decode error. The rollup_proto_roundtrip fuzz target hit the first of these in release qualification: legacy batches that don't tile the timeline contiguously trip Spine::insert's (always-on) contiguity assert. Probing the neighboring reconstruction code found the rest of the class, all reachable from crafted bytes (cargo-fuzz builds enable debug assertions and overflow checks, so the debug_asserts and arithmetic count too): * Spine::insert also asserts a non-empty time range per legacy batch. * An absurd batch len overflows the spine's maintenance arithmetic (len.next_power_of_two(), summing lens of merged batches). * An absurd spine batch level overflows `level + 1` and sizes a vec![] allocation; a level past the first batch's indexes out of bounds. * More descs than parts underflows the padding subtraction. * A spine batch whose parts don't tile its id range trips SpineBatch::id's debug_asserts. * MergeState::push_batch asserts id/desc chaining within a level, expects at most two batches per level, and that no incomplete merge is present. * Anything else invalid (e.g. a batch since past the trace since) only failed Trace::validate, which ran under debug_assert: a panic in fuzz/dev builds, silent acceptance of corrupted state in release. Validate all of this in unflatten (plus a fallible try_push_batch that push_batch now wraps) and run Trace::validate unconditionally at the end, returning decode errors. For real blobs nothing changes: every state a correct writer flattens satisfies these checks (the flatten/unflatten roundtrip proptest exercises both the structured and legacy paths), and prod treats a decode error the same as the old panic (UntypedState::decode expects rollups to decode and State::apply_diffs panics on apply errors), just with a diagnosable message instead of an assert deep in spine internals. Adds decode-rejection regression tests for each vector alongside the existing rollup proto tests, named after what the crafted proto violates. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Trace::unflatten's check that a spine batch's reconstructed parts cover its id range only verified the endpoints: the first part starts at the batch id's lower and the last ends at its upper. A crafted rollup could encode an outer id [0, 3) tiled by non-contiguous parts [0, 2) and [1, 3): the endpoint check passes, and Trace::validate also passes because it inspects the outer SpineBatch ids and descriptions, never the adjacency of a SpineBatch's parts. That malformed trace would be accepted at decode and panic later in normal maintenance instead. fueled_merge_reqs_before_ms emits a FueledMergeReq from the accepted parts, and Compactor::chunk_runs / compact_all (plus the id-range merge path in apply_merge_res_checked) call id_range on those ids, whose adjacency check is an assert_eq!. So the hardening still converted a corrupted durable rollup into a process panic for this invariant. Validate the full tiling in unflatten: require adjacent parts to be contiguous in addition to the existing endpoint check, so non-adjacent parts fail with a decode error. Adds a decode-rejection regression test for the overlapping-parts vector alongside the existing partless-spine-batch one. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…panic `strconv::parse_uuid` (MaterializeInc#36984) rejected inputs shorter than 32 chars or non-ASCII before handing them to the `uuid` crate, on the premise that `Uuid::parse_str` panics (rather than erroring) while building its parse error for some malformed inputs by slicing on a non-char boundary. That premise is false for the version we lock (uuid 1.19.0) and for every release back to 0.8.2: `Uuid::parse_str` returns a clean `Err` for all such inputs (`{}`, `{\0}`, non-ASCII, etc.). `InvalidUuid::into_err` rejects non-UTF8 up front, strips the `{}`/`urn:uuid:` wrappers on raw bytes (always standalone ASCII, so char-safe), and its one subtraction can't underflow. So the guard only suppressed the `uuid` crate's own (more detailed) error message for short/non-ASCII input, and the `value_decode_text` regression test asserted an invariant the crate already upholds. No saved cargo-fuzz artifact reaches a uuid panic. Restore the original parser and drop the vacuous test. Mirrors removing the equivalent persist `parse_id`/`ShardId::from_str` guard. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Hardens decode paths in
persist,persist-client,persist-types,pgrepr, andpostgres-utilagainst malformed/untrusted input: reject rollup-less/empty-frontier/hollow-only/invalid StateDiff state instead of panicking or debug-asserting, guard UUID parsing, fix i16 overflow in pgrepr numeric-scale binary decode, and propagate u16-conversion errors in the postgres table-desc protos.Found by the cargo-fuzz suite (separate infra PR). Each fix has a regression test.
🤖 Generated with Claude Code