Skip to content

refactor: [OM-387] governance entitlement performance improvement#4635

Open
gergely-kurucz-konghq wants to merge 8 commits into
mainfrom
refactor/OM-387-governance-entitlement-performance-improvement
Open

refactor: [OM-387] governance entitlement performance improvement#4635
gergely-kurucz-konghq wants to merge 8 commits into
mainfrom
refactor/OM-387-governance-entitlement-performance-improvement

Conversation

@gergely-kurucz-konghq

@gergely-kurucz-konghq gergely-kurucz-konghq commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Governance query performance improvements

What

Speeds up POST /governance/query for metered entitlements (~40% lower latency; 100×100 ≈25s → ≈12.5s) and extends the benchmark to cover the metered path.

Why

The endpoint resolves access per customer, and each metered entitlement's GetAccess did far more Postgres work than necessary: the same entitlement row was re-fetched ~6× per balance calc, the customer loop ran serially, and every read took an advisory-lock transaction even when no snapshot was written. The existing benchmark only exercised boolean entitlements, so none of this was visible.

How

  • Benchmark: added an entitlement-kind axis (GOV_BENCH_KIND=boolean|metered|mixed|all) plus a bench-governance-trace target for capturing a single small cell's trace safely.
  • Fewer queries (metered/grant_owner_adapter.go): memoize the owner-entitlement lookup within one balance calculation via a context-scoped cache, collapsing ~6 GetEntitlement fetches to 1 (~23 → ~13 SQL/entitlement). Installer is unexported and read-only-scoped by design (documented invariant).
  • Parallel resolution (governance/service/service.go): resolve a page's customers concurrently with a bounded errgroup, limit max(2, GOMAXPROCS/2) — scales with the box and the ~NumCPU pgx pool, halved because GetAccess already fans out ~10-wide internally. Results written by index to preserve sort order.
  • No lock on empty reads (credit/helper.go): check whether a snapshot segment actually qualifies before taking the owner lock, skipping the lock transaction (and its held connection) on steady-state metered reads.

Behavior-preserving and verified via traces. The credit/entitlement changes also benefit other metered balance reads (e.g. the balance worker), not just governance.

Summary by CodeRabbit

  • Tests
    • Expanded the governance end-to-end benchmark to cover multiple entitlement resolution patterns (boolean, metered, and mixed).
    • Added a pre-benchmark validation step to verify governance query results match expected counts for each tested scenario.
    • Enhanced benchmark data setup to generate the appropriate entitlement mix per feature and to create required metered entitlements for the metered cases.

Add GOV_BENCH_KIND (boolean|metered|mixed|all) so the benchmark can
exercise the metered GetAccess balance path, not just the boolean
short-circuit. Default stays boolean (baselines unchanged). Metered
features share one SUM meter; no events ingested (the per-entitlement
ClickHouse query fires regardless). Add bench-governance-metered and a
bench-governance-trace target scoped to one small cell.
The metered grant owner adapter re-fetched the same immutable
entitlement row 6 times per balance calculation (DescribeOwner,
GetStartOfMeasurement, GetUsagePeriodStartAt x3, GetResetTimelineInclusive),
each issuing ~2 SQL queries. Add an opt-in, context-scoped memo
(WithEntitlementCache) installed at GetEntitlementBalance so those
methods share a single GetEntitlement fetch. Callers that do not
install the cache fall back to a direct fetch, so the balance worker
and one-off value reads are unchanged.

Cuts ~43% of the Postgres queries per metered entitlement in the
governance access path (~23 -> ~13 SQL/ent; verified via traces on a
10x10 metered benchmark).
Replace the serial per-customer GetAccess loop in resolveAccess with a
bounded errgroup (maxCustomerConcurrency=5), writing results into
pre-sized indexed slices to preserve page sort order. Whole-request
error semantics are unchanged.

Conservative bound: GetAccess already fans a customer's entitlements
out ~10-wide, so peak concurrent DB ops is ~K x 10 against a
~NumCPU-sized pgx pool; a higher value mostly queues on connection
acquire. Cuts metered latency ~40% (100x100 ~25s -> ~14s), pool-bound.
snapshotEngineResult took an owner advisory-lock transaction before
checking whether any history segment qualified for a snapshot, so every
metered balance read paid a begin/lock/commit even when nothing was
written. Move the LATEST check and segment scan (both in-memory) ahead
of the lock; take it only when a snapshot will be saved.

Save-path behavior is unchanged (the lock was already released before
saveSnapshot). Verified via traces: on a 10x10 metered governance query
LockOwnerForTx/begin_tx/commit drop from 100 to 0.
Rename WithEntitlementCache to withEntitlementCache so the memo can only
be installed from within the metered package, at vetted read-only entry
points. Document the invariant: cached entries are not invalidated
within scope, so installing it around a stack that mutates the
entitlement would serve stale reads. Safe today — the sole installer is
GetEntitlementBalance, whose call tree performs no entitlement writes.
Replace the fixed maxCustomerConcurrency=5 with max(2, GOMAXPROCS/2) so
the bound tracks the machine (and the ~NumCPU pgx pool). Halved rather
than full GOMAXPROCS because GetAccess already fans out ~10-wide per
customer, so peak DB demand is ~this×10; halving keeps pool headroom
for concurrent requests.
@gergely-kurucz-konghq gergely-kurucz-konghq requested a review from a team as a code owner July 2, 2026 12:36
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c8a8f0e5-d696-48ca-856e-02cc1f177ec6

📥 Commits

Reviewing files that changed from the base of the PR and between 27d8700 and 656773e.

📒 Files selected for processing (1)
  • e2e/governance_bench_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • e2e/governance_bench_test.go

📝 Walkthrough

Walkthrough

The governance benchmark now supports selectable entitlement kinds through GOV_BENCH_KIND. It runs the benchmark per selected kind and seeds boolean or metered entitlements accordingly, including shared meter creation for metered features.

Changes

Governance benchmark entitlement kinds

Layer / File(s) Summary
Kind selection and benchmark notes
e2e/governance_bench_test.go
Adds entitlement-kind variants, environment-based selection, and benchmark documentation for the different resolution paths.
Benchmark loop per kind
e2e/governance_bench_test.go
Runs the benchmark matrix for each selected kind and keeps the warm-up, correctness, and timed query phases aligned with the seeded data.
Kind-aware fixture seeding
e2e/governance_bench_test.go
Seeds metered features with a shared meter when needed and grants either boolean or metered customer entitlements.

Estimated code review effort: 3 (Moderate) | ~20 minutes

Suggested reviewers: borbelyr-kong

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title matches the main change: a governance entitlement performance refactor focused on improving query behavior.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch refactor/OM-387-governance-entitlement-performance-improvement

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@gergely-kurucz-konghq gergely-kurucz-konghq added release-note/misc Miscellaneous changes kind/refactor Code refactor, cleanup or minor improvement labels Jul 2, 2026
@greptile-apps

greptile-apps Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR improves POST /governance/query latency for metered entitlements (~40%) through three coordinated changes: skipping the advisory-lock transaction when no snapshot qualifies, memoizing the repeated GetEntitlement DB fetch within one balance calculation, and resolving page customers concurrently via a bounded errgroup.

  • credit/helper.go: snapshotEngineResult now scans segments (pure in-memory) before attempting the advisory lock, so steady-state metered reads that produce no new snapshot skip the lock transaction entirely. The save-outside-lock semantics that existed in the original code are preserved.
  • grant_owner_adapter.go / balance.go: A sync.Map-backed, sync.Once-per-owner cache is installed at the GetEntitlementBalance entry point, collapsing ~6 redundant GetEntitlement queries per entitlement to one. The cache is deliberately unexported and installed only on the read-only balance path.
  • governance/service/service.go: Customer access resolution within a page is parallelised up to max(2, GOMAXPROCS/2) concurrent goroutines; results are written by index into a pre-allocated slice to preserve sort order without shared-state mutation.

Confidence Score: 5/5

All three production changes are semantically correct and behaviour-preserving; the benchmark improvements address the previously flagged anchor-date drift.

The advisory-lock refactoring preserves the original save-outside-lock semantics that existed before this PR. The entitlement cache is correctly scoped per GetEntitlementBalance call via a fresh sync.Map, uses sync.Once for safe concurrent initialisation, and is deliberately unexported to prevent installation around write paths. The errgroup parallelism writes results by index into pre-allocated slices (no shared-state mutation), and loop variable capture is safe under Go 1.25. No correctness regressions were identified.

No files require special attention.

Important Files Changed

Filename Overview
openmeter/credit/helper.go Advisory lock now skipped when no snapshot segment qualifies; LATEST aggregation early-return also moved before the lock. Save-outside-lock semantics preserved from original code.
openmeter/entitlement/metered/grant_owner_adapter.go Per-balance sync.Map / sync.Once cache for owner entitlement lookups correctly scoped and unexported; invariant that it wraps only read-only paths is documented and enforced.
openmeter/governance/service/service.go errgroup-based customer parallelism is correctly structured; index-keyed writes into pre-sized slices avoid shared state; loop variable capture is safe under Go 1.22+ semantics (project uses 1.25).
openmeter/entitlement/metered/balance.go Single withEntitlementCache(ctx) call installs a fresh per-call cache scope; nesting guard ensures re-entrancy is a no-op.
e2e/governance_bench_test.go Adds metered/mixed/all benchmark dimensions; anchor date correctly uses time.Now().UTC() to prevent benchmark drift; meter creation and entitlement seeding look correct.
e2e/Makefile New GOV_BENCH_KIND variable and convenience targets cleanly extend the existing Makefile pattern.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant S as governance/service
    participant EG as errgroup (bounded N)
    participant ES as entitlementService
    participant GEB as GetEntitlementBalance
    participant Cache as entitlementCache (sync.Map)
    participant DB as Postgres

    S->>EG: SetLimit(max(2, GOMAXPROCS/2))
    loop for each customer[i] (parallel)
        EG->>ES: GetAccess(ctx, customerID)
        loop for each entitlement
            ES->>GEB: GetEntitlementBalance(ctx, ownerID, at)
            GEB->>Cache: withEntitlementCache(ctx) — install fresh scope
            GEB->>Cache: getEntitlement(owner) — once.Do
            Cache->>DB: GetEntitlement (1 query, shared by ~6 callers)
            DB-->>Cache: entitlement row
            Cache-->>GEB: cached result
            GEB->>GEB: snapshotEngineResult
            Note over GEB: scan segs (in-memory), skip lock if nothing qualifies
        end
        ES-->>EG: access result
        EG->>S: "results[i] = CustomerAccess"
    end
    S->>EG: g.Wait()
    EG-->>S: []CustomerAccess (sorted by original index)
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant S as governance/service
    participant EG as errgroup (bounded N)
    participant ES as entitlementService
    participant GEB as GetEntitlementBalance
    participant Cache as entitlementCache (sync.Map)
    participant DB as Postgres

    S->>EG: SetLimit(max(2, GOMAXPROCS/2))
    loop for each customer[i] (parallel)
        EG->>ES: GetAccess(ctx, customerID)
        loop for each entitlement
            ES->>GEB: GetEntitlementBalance(ctx, ownerID, at)
            GEB->>Cache: withEntitlementCache(ctx) — install fresh scope
            GEB->>Cache: getEntitlement(owner) — once.Do
            Cache->>DB: GetEntitlement (1 query, shared by ~6 callers)
            DB-->>Cache: entitlement row
            Cache-->>GEB: cached result
            GEB->>GEB: snapshotEngineResult
            Note over GEB: scan segs (in-memory), skip lock if nothing qualifies
        end
        ES-->>EG: access result
        EG->>S: "results[i] = CustomerAccess"
    end
    S->>EG: g.Wait()
    EG-->>S: []CustomerAccess (sorted by original index)
Loading

Reviews (3): Last reviewed commit: "test(e2e): anchor metered bench usage pe..." | Re-trigger Greptile

Comment thread e2e/governance_bench_test.go
A fixed 2024 anchor let the balance engine accrue a monthly reset
period per elapsed month, drifting benchmark cost over time. Anchor at
time.Now() so only the current period is evaluated and runs stay
comparable.

@chrisgacsal chrisgacsal left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I'd wait with merge until @GAlexIHU also reviewed it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/refactor Code refactor, cleanup or minor improvement release-note/misc Miscellaneous changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants