Skip to content

perf: short-circuit hash_encrypted on root-level hm (Phase 2) #202

Description

@coderdan

Summary

The eql_v2.hash_encrypted chain underpins GROUP BY / JOIN / DISTINCT on eql_v2_encrypted columns. PR #196 wired up the hash operator class correctly — the planner picks HashAggregate / engages bench_text_hmac_idx on equality joins — but per-row hashing is dominated by eql_v2.to_ste_vec_value's JSONB inspection and reconstruction. Even fully inlined (LANGUAGE sql IMMUTABLE), to_ste_vec_value's body parses the JSONB once to check val ? 'sv' AND jsonb_array_length(val->'sv') = 1, and on the THEN branch builds a new JSONB via meta_data(val) || (val->'sv'->0). That's the residual cost.

Fix is a fast-path read of root-level hm with to_ste_vec_value reserved for the rare single-element ste_vec-wrapped case.

Measured impact

10K-row bench fixture, PG 17, EXPLAIN ANALYZE. Each value is a fresh-DB run with the patch applied via CREATE OR REPLACE:

Query current main (plpgsql + SET search_path) naive inline (plpgsql → SQL, body unchanged) fast-path (coalesce(val.data ->> 'hm', …))
SELECT count(*) FROM bench GROUP BY encrypted_text ~309 ms ~336 ms (no improvement) ~73 ms
SELECT count(*) FROM bench a JOIN bench b ON a.encrypted_text = b.encrypted_text ~308 ms ~285 ms (marginal) ~185 ms
SELECT DISTINCT encrypted_text FROM bench ~515 ms ~315 ms (some win from removing plpgsql call overhead, but cost is now dominated by HashAggregate output materialization × to_ste_vec_value) ~72 ms

Plan shapes don't change in any column — both naive-inline and fast-path produce the same HashAggregate / Nested Loop + Memoize structure as current main. The win is purely per-row CPU.

Fix shape

Same pattern as #189 / #201 plus a body change to hash_encrypted:

  1. Rewrite eql_v2.hash_encrypted(eql_v2_encrypted) to short-circuit on root-level hm:

    CREATE OR REPLACE FUNCTION eql_v2.hash_encrypted(val eql_v2_encrypted)
      RETURNS integer
      LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE
    AS $$
      SELECT pg_catalog.hashtext(
        coalesce(
          val.data ->> 'hm',
          eql_v2.hmac_256(eql_v2.to_ste_vec_value(val))::text
        )
      )
    $$;

    The coalesce is short-circuit-evaluating — when hm is present at root (the common case, including everything the bench fixture emits and everything @cipherstash/protect ships today), the second arm is never evaluated and to_ste_vec_value is never called. When hm is absent at root (single-element ste_vec wrapping), the unwrap kicks in and the existing semantics are preserved.

    If hm is missing entirely, the function returns the hash of NULL rather than raising. Three options for that case, in increasing strictness:

    • Accept it (NULL hashes to a stable bucket; equality will sort it out).
    • Raise an explicit error inside the coalesce via a CASE WHEN … THEN … ELSE raise_exception() END shim.
    • Add a separate CHECK constraint enforcing hm presence at config time.

    Worth deciding as part of this issue.

  2. Convert helpers to inlinable SQL so the planner can splat the chain into the calling query. These are no-op semantically (function bodies stay the same):

    • eql_v2.hmac_256(jsonb) and eql_v2.hmac_256(eql_v2_encrypted)
    • eql_v2.to_ste_vec_value(jsonb) and eql_v2.to_ste_vec_value(eql_v2_encrypted)
  3. Allowlist hash_encrypted + the helpers in tasks/pin_search_path.sql so the post-install loop doesn't re-pin them.

  4. Add splinter justifications in tasks/test/splinter.sh (same shape as the entries added in #201).

bloom_filter extractors are in the same plpgsql + SET search_path shape — worth doing in the same pass since they're called per row by inlined ~~ and the GIN functional index path. Probably negligible at the index-bound baseline (~0.20 ms for bare LIKE) but no reason not to bundle.

Out of scope (separate Phase 3)

Verification

Bench plan + regression assertions covering GROUP BY / JOIN / DISTINCT already exist in #203 (in-repo SQLx tests), #[ignore]'d pending this issue with thresholds set to the measured fast-path numbers + ~2x CI headroom. Remove the #[ignore] markers when this lands.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Fields

No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions