Skip to content

sql: dedup primary index value encoding#169061

Closed
mw5h wants to merge 4 commits into
cockroachdb:masterfrom
mw5h:mw5h/dedup-pk-encoding-131589
Closed

sql: dedup primary index value encoding#169061
mw5h wants to merge 4 commits into
cockroachdb:masterfrom
mw5h:mw5h/dedup-pk-encoding-131589

Conversation

@mw5h

@mw5h mw5h commented Apr 24, 2026

Copy link
Copy Markdown
Contributor

Four independent implementations of "encode the values for a primary index given a row's datums" had grown up over the years across pkg/sql/row (insert/update + delete) and pkg/sql/rowenc (the pure encoder), each reimplementing the single-default-column legacy encoding, the multi-column tuple encoding, the family-0 sentinel rules, and the composite-aware skip checks. The most recent of these (Sep 2024) shipped with an explicit TODO(ssd): Lots of duplication ... rather unfortunate. next to it.

This PR introduces rowenc.PrimaryIndexEncoder as the canonical owner of those rules, then routes each of the three row-oriented call sites through it. The encoder caches everything that depends only on the table descriptor / index (key column set, stored column set, sorted family column IDs); per-call it takes only what genuinely varies (family, column-to-position map, datums, scratch buffer).

The vectorized fourth encoder (colenc.BatchEncoder.encodePK) is intentionally untouched — its inner loops are vectorized over N rows in a way that doesn't fit the row-oriented helper without paying performance cost. A future PR can extract per-family static metadata so colenc consumes the same plan without sharing the inner loops.

The four commits stage the change for review:

  • rowenc: add PrimaryIndexEncoder is purely additive (helper + 11 unit-test cases, including a byte-for-byte cross-check against EncodePrimaryIndexWithKeyPrefix).
  • The next two commits route rowenc.EncodePrimaryIndexWithKeyPrefix and Deleter.encodeValueForPrimaryIndexFamily through the new helper.
  • The last commit collapses the two parallel branches in prepareInsertOrUpdateBatch, drops the now-redundant fetchedCols arg from its callers, and removes RowHelper.encodePrimaryIndexValuesToBuf.

Net delta: +609 / −393 across 7 files; the three private cache fields on RowHelper (primaryIndexKeyCols, primaryIndexValueCols, sortedColumnFamilies) and one helper method also go away.

Closes #131589.

Release note: None
Epic: CRDB-48845

@trunk-io

trunk-io Bot commented Apr 24, 2026

Copy link
Copy Markdown
Contributor

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

@blathers-crl

blathers-crl Bot commented Apr 24, 2026

Copy link
Copy Markdown

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity

Copy link
Copy Markdown
Member

This change is Reviewable

mw5h added 4 commits June 4, 2026 16:26
CockroachDB has accumulated four independent implementations of "encode
the values for a primary index given a row's datums" -
prepareInsertOrUpdateBatch, Deleter.encodeValueForPrimaryIndexFamily,
EncodePrimaryIndexWithKeyPrefix, and BatchEncoder.encodePK. The first
three operate on []tree.Datum and reimplement the same per-family value
encoding skeleton (single-default-column legacy encoding vs. multi-
column tuple encoding, family-0 sentinel rules, composite-aware skip
checks).

This commit adds a single PrimaryIndexEncoder type in rowenc that
captures those rules. The encoder caches the table-derived state used
by every call (key column set, stored column set, sorted family column
IDs), so callers pass only what genuinely varies per row: the family,
the column-to-position map, the row's datums, and a reusable scratch
buffer.

PrimaryIndexFamilyValue distinguishes "Skipped" (no work to do; do not
issue a delete) from "Value not present" (no encoded columns; callers
in overwrite paths should issue a delete) so the three call sites can
share the same per-family decision logic without paying for a delete in
the wrong cases.

This commit is purely additive. Subsequent commits route the existing
encoders through the new helper.

Informs cockroachdb#131589.

Release note: None
Epic: none
…oder

Replace the inlined per-family encoding logic in
EncodePrimaryIndexWithKeyPrefix with a call to PrimaryIndexEncoder
(added in the previous commit). This removes the duplicated
single-default-column / multi-column / family-0 / composite-skip
branching, and lets EncodePrimaryIndexWithKeyPrefix focus on what only
it cares about: building the index key, applying includeEmpty, and
wrapping entries for delete-preserving encoding.

getStoredColumnsForPrimaryIndex moves into PrimaryIndexEncoder
construction (where it belongs with the other table-derived caches).
The unused free function is deleted.

Behavior is unchanged in every case observed by the rowenc tests and
by the colenc/row equality cross-check (TestEncoderEqualityDatums).
There is one rare edge case where behavior is technically improved:
when includeEmpty=true and a primary key column lives in a non-zero
single-default-column family with a CompositeDatum type whose value is
not composite, the new path emits an empty entry where the old path
silently skipped. Existing callers either ignore the Value (deleter)
or treat an extra empty entry as a no-op delete on a possibly absent
key, so the change is harmless in practice and is the more
defensible behavior anyway (it lets deleters clean up potentially-
stale composite-encoded KVs from prior writes).

Informs cockroachdb#131589.

Release note: None
Epic: none
…ncoder

RowHelper now embeds a rowenc.PrimaryIndexEncoder, initialized in
RowHelper.Init alongside PrimaryIndexKeyPrefix. The three private
fields it replaces (primaryIndexKeyCols, primaryIndexValueCols,
sortedColumnFamilies) are gone; the encoder owns that state.

The two RowHelper public methods that those fields backed -
SkipColumnNotInPrimaryIndexValue and SortedColumnFamily - become thin
delegations and keep their existing signatures, so colenc and the
in-package writer/inserter call sites are unaffected.

Deleter.encodeValueForPrimaryIndexFamily collapses to a single
delegating call to encoder.EncodeFamily. The "TODO(ssd): Lots of
duplication ... rather unfortunate." note that this function was born
with goes away too.

Note: today's deleter returned an empty Value for the rare
"single-default-column composite-but-not-composite-value" case where
the writer would have produced an empty Value (and then issued a
delete in overwrite mode). Both produce the same caller-visible
expected-bytes (nil), so this is a behavior preservation, not a
change.

Informs cockroachdb#131589.

Release note: None
Epic: none
Route prepareInsertOrUpdateBatch's per-family value encoding through
RowHelper.PrimaryIndexEncoder().EncodeFamily, the same helper used by
the Deleter and by rowenc.EncodePrimaryIndexWithKeyPrefix. The two
parallel branches that the function carried for years - one for
single-default-column families using legacy direct value encoding, one
for multi-column families using tuple encoding - collapse into a
single body. Both paths previously reimplemented the family-0
sentinel, the composite-key skip rule, the empty-family delete, and
the column-id-delta tuple write; now those rules live in one place.

The fetchedCols parameter falls out of the call signature; the encoder
looks up types from the table descriptor when it needs them
(single-default-column legacy encoding only). The two remaining
callers - Inserter.InsertRow and Updater.UpdateRow - pass one fewer
argument.

The redundant RowHelper.encodePrimaryIndexValuesToBuf method (and the
last in-package use of pkg/sql/rowenc/valueside) is removed.

Behavior is unchanged for the cases covered by existing tests:
TestEncoderEqualityDatums (which compares the row-side write path to
the unmodified vector path byte-for-byte), the rowenc/colenc/row unit
suites, and the LDR delete tests.

Closes cockroachdb#131589.

Release note: None
Epic: none
@mw5h mw5h force-pushed the mw5h/dedup-pk-encoding-131589 branch from 340868d to 225d397 Compare June 4, 2026 23:28
@mw5h mw5h closed this Jun 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sql: code for encoding primary index entries is duplicated in multiple locations

2 participants