sql: dedup primary index value encoding#169061
Closed
mw5h wants to merge 4 commits into
Closed
Conversation
Contributor
|
Merging to
After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here |
|
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Member
CockroachDB has accumulated four independent implementations of "encode the values for a primary index given a row's datums" - prepareInsertOrUpdateBatch, Deleter.encodeValueForPrimaryIndexFamily, EncodePrimaryIndexWithKeyPrefix, and BatchEncoder.encodePK. The first three operate on []tree.Datum and reimplement the same per-family value encoding skeleton (single-default-column legacy encoding vs. multi- column tuple encoding, family-0 sentinel rules, composite-aware skip checks). This commit adds a single PrimaryIndexEncoder type in rowenc that captures those rules. The encoder caches the table-derived state used by every call (key column set, stored column set, sorted family column IDs), so callers pass only what genuinely varies per row: the family, the column-to-position map, the row's datums, and a reusable scratch buffer. PrimaryIndexFamilyValue distinguishes "Skipped" (no work to do; do not issue a delete) from "Value not present" (no encoded columns; callers in overwrite paths should issue a delete) so the three call sites can share the same per-family decision logic without paying for a delete in the wrong cases. This commit is purely additive. Subsequent commits route the existing encoders through the new helper. Informs cockroachdb#131589. Release note: None Epic: none
…oder Replace the inlined per-family encoding logic in EncodePrimaryIndexWithKeyPrefix with a call to PrimaryIndexEncoder (added in the previous commit). This removes the duplicated single-default-column / multi-column / family-0 / composite-skip branching, and lets EncodePrimaryIndexWithKeyPrefix focus on what only it cares about: building the index key, applying includeEmpty, and wrapping entries for delete-preserving encoding. getStoredColumnsForPrimaryIndex moves into PrimaryIndexEncoder construction (where it belongs with the other table-derived caches). The unused free function is deleted. Behavior is unchanged in every case observed by the rowenc tests and by the colenc/row equality cross-check (TestEncoderEqualityDatums). There is one rare edge case where behavior is technically improved: when includeEmpty=true and a primary key column lives in a non-zero single-default-column family with a CompositeDatum type whose value is not composite, the new path emits an empty entry where the old path silently skipped. Existing callers either ignore the Value (deleter) or treat an extra empty entry as a no-op delete on a possibly absent key, so the change is harmless in practice and is the more defensible behavior anyway (it lets deleters clean up potentially- stale composite-encoded KVs from prior writes). Informs cockroachdb#131589. Release note: None Epic: none
…ncoder RowHelper now embeds a rowenc.PrimaryIndexEncoder, initialized in RowHelper.Init alongside PrimaryIndexKeyPrefix. The three private fields it replaces (primaryIndexKeyCols, primaryIndexValueCols, sortedColumnFamilies) are gone; the encoder owns that state. The two RowHelper public methods that those fields backed - SkipColumnNotInPrimaryIndexValue and SortedColumnFamily - become thin delegations and keep their existing signatures, so colenc and the in-package writer/inserter call sites are unaffected. Deleter.encodeValueForPrimaryIndexFamily collapses to a single delegating call to encoder.EncodeFamily. The "TODO(ssd): Lots of duplication ... rather unfortunate." note that this function was born with goes away too. Note: today's deleter returned an empty Value for the rare "single-default-column composite-but-not-composite-value" case where the writer would have produced an empty Value (and then issued a delete in overwrite mode). Both produce the same caller-visible expected-bytes (nil), so this is a behavior preservation, not a change. Informs cockroachdb#131589. Release note: None Epic: none
Route prepareInsertOrUpdateBatch's per-family value encoding through RowHelper.PrimaryIndexEncoder().EncodeFamily, the same helper used by the Deleter and by rowenc.EncodePrimaryIndexWithKeyPrefix. The two parallel branches that the function carried for years - one for single-default-column families using legacy direct value encoding, one for multi-column families using tuple encoding - collapse into a single body. Both paths previously reimplemented the family-0 sentinel, the composite-key skip rule, the empty-family delete, and the column-id-delta tuple write; now those rules live in one place. The fetchedCols parameter falls out of the call signature; the encoder looks up types from the table descriptor when it needs them (single-default-column legacy encoding only). The two remaining callers - Inserter.InsertRow and Updater.UpdateRow - pass one fewer argument. The redundant RowHelper.encodePrimaryIndexValuesToBuf method (and the last in-package use of pkg/sql/rowenc/valueside) is removed. Behavior is unchanged for the cases covered by existing tests: TestEncoderEqualityDatums (which compares the row-side write path to the unmodified vector path byte-for-byte), the rowenc/colenc/row unit suites, and the LDR delete tests. Closes cockroachdb#131589. Release note: None Epic: none
340868d to
225d397
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Four independent implementations of "encode the values for a primary index given a row's datums" had grown up over the years across
pkg/sql/row(insert/update + delete) andpkg/sql/rowenc(the pure encoder), each reimplementing the single-default-column legacy encoding, the multi-column tuple encoding, the family-0 sentinel rules, and the composite-aware skip checks. The most recent of these (Sep 2024) shipped with an explicitTODO(ssd): Lots of duplication ... rather unfortunate.next to it.This PR introduces
rowenc.PrimaryIndexEncoderas the canonical owner of those rules, then routes each of the three row-oriented call sites through it. The encoder caches everything that depends only on the table descriptor / index (key column set, stored column set, sorted family column IDs); per-call it takes only what genuinely varies (family, column-to-position map, datums, scratch buffer).The vectorized fourth encoder (
colenc.BatchEncoder.encodePK) is intentionally untouched — its inner loops are vectorized over N rows in a way that doesn't fit the row-oriented helper without paying performance cost. A future PR can extract per-family static metadata so colenc consumes the same plan without sharing the inner loops.The four commits stage the change for review:
rowenc: add PrimaryIndexEncoderis purely additive (helper + 11 unit-test cases, including a byte-for-byte cross-check againstEncodePrimaryIndexWithKeyPrefix).rowenc.EncodePrimaryIndexWithKeyPrefixandDeleter.encodeValueForPrimaryIndexFamilythrough the new helper.prepareInsertOrUpdateBatch, drops the now-redundantfetchedColsarg from its callers, and removesRowHelper.encodePrimaryIndexValuesToBuf.Net delta: +609 / −393 across 7 files; the three private cache fields on
RowHelper(primaryIndexKeyCols,primaryIndexValueCols,sortedColumnFamilies) and one helper method also go away.Closes #131589.
Release note: None
Epic: CRDB-48845