Skip to content

fix(sqle,admin): preserve legacy encoding on ALTER + add safe schema-encoding-drift heal toolkit#11126

Open
realies wants to merge 1 commit into
dolthub:mainfrom
realies:fix/adaptive-encoding-invalid-hash-length
Open

fix(sqle,admin): preserve legacy encoding on ALTER + add safe schema-encoding-drift heal toolkit#11126
realies wants to merge 1 commit into
dolthub:mainfrom
realies:fix/adaptive-encoding-invalid-hash-length

Conversation

@realies
Copy link
Copy Markdown

@realies realies commented May 29, 2026

Fix adaptive-encoding drift: preserve legacy encoding on ALTER, and add a safe schema-encoding-drift heal toolkit

Summary

UseAdaptiveEncoding became the default in v2.0.7 (#11017). This surfaced a class of
encoding-drift bugs where a TEXT/BLOB/JSON column's persisted encoding tag and its
on-disk row data disagree, producing panic: invalid hash length: N on read and on write.

This PR:

  1. Preserves the legacy storage encoding across ALTER TABLE ... MODIFY so a 1.x-written
    StringAddr/BytesAddr column is not silently re-tagged adaptive while its rows remain
    raw 20-byte hashes.
  2. Adds a dolt admin schema-encoding-drift toolkit to detect and safely heal columns
    that already drifted, including a forward migrate-adaptive heal.
  3. Closes a data-corruption hole in the heal path itself: the schema-only repair flip
    could strand genuine adaptive rows under a legacy schema tag. repair/recover-rows/
    check are hardened, and migrate-adaptive is the safe forward path.

Background

In v2.0.7+, blobStringType.Encoding() (TEXT) / varBinaryType.Encoding() (BLOB) consult the
global UseAdaptiveEncoding flag when a column's per-field enc is the zero value. A column whose
in-memory TypeInfo was rebuilt from a SQL type (rather than deserialized from disk, which
pins enc) therefore serialized as the adaptive variant even though its rows were still in
legacy raw-hash (StringAddrEnc/BytesAddrEnc) form. The adaptive reader then parses a raw
20-byte hash as [varint length][address] and crashes in hash.New.

The reverse mismatch is just as damaging and is the focus of most of this PR (see Root cause).

Root cause

1. ALTER MODIFY re-tagged surviving columns (encoder side is fine)

ToDoltCol / ToDoltSchema rebuild columns via FromSqlType, which returns a TypeInfo
with enc == 0. On an in-place MODIFY (or a rewrite ALTER), the surviving TEXT/BLOB
columns were re-serialized with enc == 0 → the UseAdaptiveEncoding fallback flipped their
persisted tag to adaptive while their rows were untouched.

Note: the steady-state INSERT/UPDATE value encoder is not implicated — it builds the
value TupleDescriptor from the deserialized schema, whose enc is pinned by
deserializeColumns via WithEncoding. The fix belongs at the schema-write boundary (ALTER),
not the writer.

2. The heal command could cause corruption: repair's schema-only flip

The drift toolkit's repair subcommand flips a column's persisted tag from its adaptive
variant back to the legacy sibling without rewriting row data ("data unchanged"). That is
only safe if every row is already in legacy form. On a column that still holds genuine
adaptive content
, the flip strands every adaptive row under a legacy schema tag, producing
two failure modes:

  • Read panic. GetStringAddr → GetAddr → hash.New runs on the adaptive-inline payload
    (e.g. a 59-byte [0x00]<utf8> value) and panics invalid hash length: 59.
  • Write/flush panic. Any insert/update flushes a prolly node; countAddresses walks every
    legacy-AddrEnc field of every row in the node and calls hash.New on each. A stranded
    non-20-byte adaptive field panics there — so a single stranded row poisons all writes to
    the table.

The gap that let repair flip such a column: its scan classified a non-empty
adaptive-inline
value (small [0x00]<content>, not spilled out-of-band) as merely
"ambiguous" rather than "definitive adaptive". A column with both a legacy witness and inline
adaptive content therefore fell through to the flip branch. A [0x00]-leading 20-byte
value (inline content of 19 bytes) is additionally ambiguous with a legacy hash that happens
to start 0x00, and was silently misread as a bogus address — a quiet corruption, not a
loud panic.

3. dolt_ignore'd tables: out-of-band adaptive values dangle

A dolt_ignore'd table's data lives only in the working set; its content chunks are never
rooted in a commit, and GC can reclaim never-committed chunks. So an out-of-band (addressed)
adaptive value on such a table references a chunk that isn't durable: persisting the working
set dangling-faults on the un-rooted chunk, and GC would reclaim it even if it persisted.
Forward-migrating these tables therefore has to avoid out-of-line refs entirely.

Changes

Preserve legacy encoding across ALTER

  • In-place MODIFY: inherit the existing column's encoding onto the freshly-built TypeInfo
    when the column kind is unchanged.
  • Rewrite/ADD/DROP path: preserveSurvivingColumnEncodings re-pins every surviving TEXT/BLOB
    column's encoding after ToDoltSchema. Extended to JSON + the geometry families.

dolt admin schema-encoding-drift toolkit

  • check — multi-database, full-payload scan; classifies each row's field bytes
    (legacy raw-hash / adaptive-inline / adaptive-addressed / empty); reports drift and hints the
    correct heal command.
  • repair — schema-only tag flip, now guarded by a full-payload scan: a non-empty
    non-20-byte adaptive-inline value is counted as definitive adaptive; repair refuses any
    column with both legacy and adaptive rows (heterogeneous), any genuine-adaptive column, and
    any dolt_ignore'd table. It only flips a column proven homogeneously legacy (or, with
    --include-empty, the all-empty bucket).
  • recover-rows — row-by-row migration to canonical legacy form for committed tables; refuses
    dolt_ignore'd tables.
  • migrate-adaptive (new) — the forward heal: rewrites legacy rows to canonical adaptive,
    keeps already-adaptive rows, flips the schema to the adaptive sibling. Row data + schema move
    together (a commit for committed tables; a working-set persist for dolt_ignore'd ones).
    Content is preserved; supports --dry-run.

Force-inline for dolt_ignore'd tables (the persist fix)

  • When migrate-adaptive targets a dolt_ignore'd table it force-inlines every value of
    every out-of-band-capable column (String/Bytes/JSON/geometry) in the table — zero out-of-line refs — so the values are
    self-contained, GC-immune, and persistable without a dangling ref. An in-code invariant
    asserts zero addressed refs before persisting (fail-fast), and the command prints a
    storage report: storage: N values inline, M addressed (M MUST be 0 for a dolt_ignore'd table).

Authoritative reads fail loud on transient errors

  • The 0x00-leading-20-byte disambiguation reads the candidate chunk to decide
    legacy-vs-inline. readContentAuthoritative now returns (content, present, err) and
    distinguishes a genuine absence (recovered empty-chunk → treat as inline) from a
    transient read error (propagated). The migrate path no longer risks inlining a present
    legacy row's 20 address bytes as content because a read momentarily failed.
  • The forward migrate-adaptive path uses this ReadBytes-attempt; the legacy-direction
    recover-rows resolver intentionally keeps the chunkStore.Has check, since switching it is
    not behavior-equivalent there — Has propagates a transient chunkstore error that the
    ReadBytes-attempt would swallow, and on the table's own (complete) chunkstore
    HasReadBytes-success are consistent.

Why these choices

  • Don't remove the UseAdaptiveEncoding fallback in the type getter — that would regress
    the v2.0.7 product default for newly-created tables. The encoding decision belongs at the
    schema-write boundary.
  • Force-inline only for dolt_ignore'd tables — committed tables can hold out-of-band
    adaptive values safely (their chunks are rooted). Inlining is required precisely where chunks
    are never committed.
  • repair refuses rather than guesses — a schema-only flip cannot be safe on a column with
    real content; migrate-adaptive/recover-rows (which rewrite rows) are the safe heals.

Testing

Go regressions (each fails on the pre-fix code, passes on this PR):

  • repair must refuse a legacy-witness + adaptive-inline heterogeneous column (the
    inline-stranding shape that the first-witness scan missed).
  • repair must refuse a 20-byte 0x00-leading inline value coexisting with a legacy witness
    (the quiet-misread shape).
  • migrate-adaptive heals a stranded column byte-identically; reading the previously-panicking
    rows returns their exact content.
  • migrate-adaptive on a dolt_ignore'd table reports 0 addressed (force-inline invariant).
  • resolveFieldToAdaptive propagates a transient read error during 0x00-lead disambiguation
    instead of swallowing it as inline.
  • Plus the existing ALTER MODIFY preserves encoding and recover-rows heterogeneous /
    sibling-AddrEnc cases.

bats (cross-version compat suite): a panic-shape assertion (invalid hash length / panic recovered / runtime error must not leak into output) on the legacy-TEXT/BLOB read path; a
CLI migrate-adaptive heal + repair refusal; and the dolt_ignore'd-table force-inline
storage-report assertion (0 addressed).

Validated end-to-end on a production dolt database: every affected column read back
byte-identical after the heal (content hashes diffed pre/post), zero genuinely-missing chunks,
zero out-of-line refs on the ignored tables, and both the read and the write/flush panics
resolved.

Minor

dolt admin schema-encoding-drift --help does not list its subcommands (check, repair,
recover-rows, migrate-adaptive); they work when invoked but aren't discoverable. Worth a
one-line registration fix so the subcommand list renders.

@realies realies force-pushed the fix/adaptive-encoding-invalid-hash-length branch from 7dd44f4 to b54cd10 Compare May 29, 2026 21:54
…a safe schema-encoding-drift heal toolkit

UseAdaptiveEncoding became the default in v2.0.7 (dolthub#11017), surfacing a class of
encoding-drift bugs where a TEXT/BLOB/JSON column's persisted encoding tag and its
on-disk row data disagree, causing `panic: invalid hash length: N` on read and write.

- Preserve the legacy storage encoding across ALTER TABLE ... MODIFY (and the
  rewrite/ADD/DROP path), extended to JSON + geometry families, so a 1.x-written
  StringAddr/BytesAddr column is not silently re-tagged adaptive.
- Add `dolt admin schema-encoding-drift` (check/repair/recover-rows/migrate-adaptive)
  to detect and safely heal already-drifted columns. `repair` is hardened to refuse
  heterogeneous/genuine-adaptive/dolt_ignore'd columns; `migrate-adaptive` is the
  forward heal; dolt_ignore'd tables are force-inlined (zero out-of-line refs) so the
  working-set persist cannot dangle; the ambiguous 0x00-leading-20-byte decision uses
  an authoritative read that fails loud on a read error.

See the PR description for the full root-cause analysis, design rationale, and test plan.
@realies realies force-pushed the fix/adaptive-encoding-invalid-hash-length branch from b54cd10 to 02564b0 Compare May 30, 2026 09:20
@realies realies changed the title fix(val): handle legacy raw-hash StringAddrEnc/BytesAddrEnc fields under adaptive dispatch fix(sqle,admin): preserve legacy encoding on ALTER + add safe schema-encoding-drift heal toolkit May 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants